Different communication libraries, including MPI, have
been studied and implemented for the SCC [11], [12],
[13]. Porting CHARM++ on top of them is left for future work
and may result in performance improvements. Many other
SCC-related studies are ongoing in Intel's Many-core
Applications Research Community
(http://communities.intel.com/community/marc).
Esmaeilzadeh et al. [14] provide an extensive report and
analysis of chip power and performance across five generations
of Intel processors, using a large and diverse set of
benchmarks. Their work, however, does not consider
many-cores or GPUs, which are promising architectures for
the future of parallel computing.
VII. CONCLUSION
Large increases in the number of transistors, accompanied
by power and energy limitations, introduce new challenges
for the architectural design of future processors. There are several
alternatives to consider, such as heavy-weight multi-cores,
light-weight many-cores, low-power designs and SIMD-like
(GPGPU) architectures. In choosing among them, several
possibly conflicting goals must be kept in mind, such as
speed, power, energy, programmability and portability. In
this work, we evaluated platforms representing the above-
mentioned design alternatives using five scalable CHARM++
and MPI applications: Jacobi, NAMD, NQueens, CG and Sort.
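To illustrate the kind of kernel involved, the sketch below shows a
minimal 1D MPI Jacobi sweep with halo exchange. The block size,
iteration count, and initialization are illustrative placeholders,
not the configuration of the benchmark used in our experiments.

/* 1D Jacobi relaxation with halo exchange: a simplified sketch of
 * the kernel pattern; sizes and iteration count are illustrative. */
#include <mpi.h>
#include <string.h>

#define N 1024      /* local block size (illustrative) */
#define ITERS 100   /* fixed iteration count (illustrative) */

int main(int argc, char **argv) {
    int rank, size;
    double cur[N + 2] = {0}, next[N + 2] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int i = 1; i <= N; ++i)
        cur[i] = (double)rank;             /* arbitrary initial data */

    for (int it = 0; it < ITERS; ++it) {
        /* exchange boundary cells with both neighbors */
        MPI_Sendrecv(&cur[1], 1, MPI_DOUBLE, left, 0,
                     &cur[N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cur[N], 1, MPI_DOUBLE, right, 1,
                     &cur[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* local relaxation sweep */
        for (int i = 1; i <= N; ++i)
            next[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
        memcpy(cur, next, sizeof(cur));
    }
    MPI_Finalize();
    return 0;
}

The nearest-neighbor communication pattern is what lets such kernels
scale well on mesh interconnects like the SCC's.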
The Intel SCC is a research chip using a many-core ar-
chitecture. Many-cores like the SCC offer an opportunity to
build future machines that consume low power and can run
CHARM++ and MPI code fast. They represent an interesting
and balanced design point, as they consume lower power
than heavy-weight multi-cores but are faster than low-power
processors and do not have the generality or portability issues
of GPGPU architectures. In our analysis of the SCC, we
suggested improvements in sequential performance, especially
in floating-point speed, as well as the addition of a global
collectives network.
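As context for that suggestion, the sketch below shows the log-depth,
point-to-point software reduction that a hardware collectives network
could collapse into a single chip-wide operation. It is a generic
binomial-tree reduction written against MPI, not the implementation
inside our runtime.

/* Binomial-tree sum reduction over point-to-point messages: the
 * log(P)-depth software pattern that dedicated collectives hardware
 * would replace. The result is valid only on rank 0. */
#include <mpi.h>

double tree_reduce_sum(double local, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int step = 1; step < size; step <<= 1) {
        if (rank & step) {                /* sender leaves the tree */
            MPI_Send(&local, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;
        } else if (rank + step < size) {  /* receiver accumulates */
            double val;
            MPI_Recv(&val, 1, MPI_DOUBLE, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
            local += val;
        }
    }
    return local;
}

Each of the log2(P) rounds crosses the mesh, so even a fast network
pays the tree depth in latency; a dedicated collectives network would
avoid this cost entirely.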
We showed that heavy-weight multi-cores are still an ef-
fective solution for dynamic and complicated applications, as
well as for those with irregular accesses and communications.
In addition, GPGPUs are exceptionally powerful in speed,
power and energy for many applications. However, they lack
the architectural sophistication to execute complex and irregular
applications efficiently. They also require high programming
effort for new code, and are unable to run legacy codes.
Finally, as seen from the Intel Atom experiments, we observe
that low-power designs do not necessarily result in low energy
consumption, since they may increase the execution time
significantly. Therefore, there is no single best solution that
fits all applications and goals.
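The arithmetic behind the Atom observation is simple: for roughly
constant power draw, energy is power integrated over time, so a design
that (with purely illustrative numbers) halves power while tripling
execution time still consumes 50% more energy:

\[ E = P \cdot t, \qquad E_{\text{low-power}} = \frac{P}{2} \cdot 3t = 1.5\,Pt > Pt. \]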
ACKNOWLEDGMENTS
We thank Intel for providing us with an SCC system, and the
anonymous reviewers for their comments. We also thank Laxmikant
Kale, Marc Snir and David Padua for their comments and
support. This work was supported in part by Intel under the
Illinois-Intel Parallelism Center (I2PC).
REFERENCES
[1] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy,
J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe, “The 48-core
SCC processor: The programmer’s view,” in International Conference
for High Performance Computing, Networking, Storage and Analysis,
2010, pp. 1–11.
[2] L. V. Kale and G. Zheng, “Charm++ and AMPI: Adaptive runtime strate-
gies via migratable objects,” in Advanced Computational Infrastructures
for Parallel and Distributed Applications, M. Parashar, Ed. Wiley-
Interscience, 2009, pp. 265–282.
[3] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl,
D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain,
T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow,
M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss,
T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and
T. Mattson, “A 48-core IA-32 message-passing processor with DVFS in
45 nm CMOS,” in Solid-State Circuits Conference Digest of Technical
Papers, 2010, pp. 108–109.
[4] R. F. van der Wijngaart, T. G. Mattson, and W. Haas, “Light-weight com-
munications on Intel’s single-chip cloud computer processor,” SIGOPS
Oper. Syst. Rev., vol. 45, pp. 73–83, February 2011.
[5] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa,
C. Chipot, R. D. Skeel, L. Kalé, and K. Schulten, “Scalable molecular
dynamics with NAMD,” Journal of Computational Chemistry, vol. 26,
no. 16, pp. 1781–1802, 2005.
[6] D. Bailey, E. Barszcz, L. Dagum, and H. Simon, “NAS parallel bench-
mark results,” in Proc. Supercomputing, Nov. 1992.
[7] L. V. Kale, G. Zheng, C. W. Lee, and S. Kumar, “Scaling applications
to massively parallel machines using Projections performance analysis
tool,” in Future Generation Computer Systems Special Issue on: Large-
Scale System Performance Modeling and Analysis, vol. 22, no. 3,
February 2006, pp. 347–358.
[8] B. Marker, E. Chan, J. Poulson, R. van de Geijn, R. F. Van der
Wijngaart, T. G. Mattson, and T. E. Kubaska, “Programming many-core
architectures - a case study: Dense matrix computations on the Intel
Single-chip Cloud Computer processor,” Concurrency and Computation:
Practice and Experience, 2011.
[9] R. David, P. Bogdan, R. Marculescu, and U. Ogras, “Dynamic power
management of voltage-frequency island partitioned networks-on-chip
using Intel Single-chip Cloud Computer,” in International Symposium
on Networks-on-Chip, 2011, pp. 257–258.
[10] P. Salihundam, S. Jain, T. Jacob, S. Kumar, V. Erraguntla, Y. Hoskote,
S. Vangal, G. Ruhl, and N. Borkar, “A 2 Tb/s 6 × 4 mesh network for a
single-chip cloud computer with DVFS in 45 nm CMOS,” IEEE Journal
of Solid-State Circuits, vol. 46, no. 4, pp. 757–766, April 2011.
[11] C. Clauss, S. Lankes, P. Reble, and T. Bemmerl, “Evaluation and
improvements of programming models for the Intel SCC many-core pro-
cessor,” in International Conference on High Performance Computing
and Simulation (HPCS), 2011, pp. 525–532.
[12] I. Ureña, M. Riepen, and M. Konow, “RCKMPI – Lightweight MPI im-
plementation for Intel’s Single-chip Cloud Computer (SCC),” in Recent
Advances in the Message Passing Interface: 18th European MPI Users’
Group Meeting. Springer-Verlag, 2011, p. 208.
[13] C. Clauss, S. Lankes, and T. Bemmerl, “Performance tuning of SCC-
MPICH by means of the proposed MPI-3.0 tool interface,” Recent
Advances in the Message Passing Interface, pp. 318–320, 2011.
[14] H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley,
“Looking back on the language and hardware revolutions: Measured
power, performance, and scaling,” in International Conference on Ar-
chitectural Support for Programming Languages and Operating Systems,
2011, pp. 319–332.