Different communication libraries, including MPI, have been studied and implemented for the SCC [11], [12], [13]. Porting CHARM++ on top of them is future work that may yield performance improvements. Many other SCC-related studies are under way in Intel's Many-core Applications Research Community (http://communities.intel.com/community/marc). Esmaeilzadeh et al. [14] provide an extensive report and analysis of chip power and performance across five generations of Intel processors, using a large and diverse set of benchmarks. Such work, however, does not consider many-cores or GPUs, which are promising architectures for the future of parallel computing.

VII. CONCLUSION

Large increases in the number of transistors, accompanied by power and energy limitations, introduce new challenges for the architectural design of new processors. There are several alternatives to consider, such as heavy-weight multi-cores, light-weight many-cores, low-power designs and SIMD-like (GPGPU) architectures. In choosing among them, several possibly conflicting goals must be kept in mind, such as speed, power, energy, programmability and portability. In this work, we evaluated platforms representing the above-mentioned design alternatives using five scalable CHARM++ and MPI applications: Jacobi, NAMD, NQueens, CG and Sort.

The Intel SCC is a research chip with a many-core architecture. Many-cores like the SCC offer an opportunity to build future machines that consume low power and can run CHARM++ and MPI code fast. They represent an interesting and balanced design point: they consume less power than heavy-weight multi-cores, yet are faster than low-power processors and do not have the generality and portability issues of GPGPU architectures. In our analysis of the SCC, we suggested improvements to sequential performance, especially floating-point speed, as well as the addition of a global collectives network.

We showed that heavy-weight multi-cores are still an effective solution for dynamic and complicated applications, as well as for those with irregular accesses and communications. In addition, GPGPUs are exceptionally powerful for many applications in terms of speed, power and energy. However, they lack the sophisticated architecture needed to execute complex and irregular applications efficiently. They also require high programming effort for new code and are unable to run legacy codes. Finally, as seen in the Intel Atom experiments, low-power designs do not necessarily result in low energy consumption, since they may increase execution time significantly. Therefore, there is no single best solution that fits all applications and goals.

ACKNOWLEDGMENTS

We thank Intel for giving us an SCC system, and the anonymous reviewers for their comments. We also thank Laxmikant Kale, Marc Snir and David Padua for their comments and support. This work was supported in part by Intel under the Illinois-Intel Parallelism Center (I2PC).

REFERENCES

[1] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe, “The 48-core SCC processor: The programmer’s view,” in International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, pp. 1–11.
[2] L. V. Kale and G. Zheng, “Charm++ and AMPI: Adaptive runtime strategies via migratable objects,” in Advanced Computational Infrastructures for Parallel and Distributed Applications, M. Parashar, Ed. Wiley-Interscience, 2009, pp. 265–282.
[3] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson, “A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,” in Solid-State Circuits Conference Digest of Technical Papers, 2010, pp. 108–109.
[4] R. F. van der Wijngaart, T. G. Mattson, and W. Haas, “Light-weight communications on Intel’s single-chip cloud computer processor,” SIGOPS Oper. Syst. Rev., vol. 45, pp. 73–83, February 2011.
[5] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kalé, and K. Schulten, “Scalable molecular dynamics with NAMD,” Journal of Computational Chemistry, vol. 26, no. 16, pp. 1781–1802, 2005.
[6] D. Bailey, E. Barszcz, L. Dagum, and H. Simon, “NAS parallel benchmark results,” in Proc. Supercomputing, Nov. 1992.
[7] L. V. Kale, G. Zheng, C. W. Lee, and S. Kumar, “Scaling applications to massively parallel machines using Projections performance analysis tool,” in Future Generation Computer Systems Special Issue on: Large-Scale System Performance Modeling and Analysis, vol. 22, no. 3, February 2006, pp. 347–358.
[8] B. Marker, E. Chan, J. Poulson, R. van de Geijn, R. F. Van der Wijngaart, T. G. Mattson, and T. E. Kubaska, “Programming many-core architectures - a case study: Dense matrix computations on the Intel Single-chip Cloud Computer processor,” Concurrency and Computation: Practice and Experience, 2011.
[9] R. David, P. Bogdan, R. Marculescu, and U. Ogras, “Dynamic power management of voltage-frequency island partitioned networks-on-chip using Intel Single-chip Cloud Computer,” in International Symposium on Networks-on-Chip, 2011, pp. 257–258.
[10] P. Salihundam, S. Jain, T. Jacob, S. Kumar, V. Erraguntla, Y. Hoskote, S. Vangal, G. Ruhl, and N. Borkar, “A 2 Tb/s 6 × 4 mesh network for a single-chip cloud computer with DVFS in 45 nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 46, no. 4, pp. 757–766, April 2011.
[11] C. Clauss, S. Lankes, P. Reble, and T. Bemmerl, “Evaluation and improvements of programming models for the Intel SCC many-core processor,” in International Conference on High Performance Computing and Simulation (HPCS), 2011, pp. 525–532.
[12] I. Ureña, M. Riepen, and M. Konow, “RCKMPI – lightweight MPI implementation for Intel’s Single-chip Cloud Computer (SCC),” in Recent Advances in the Message Passing Interface: 18th European MPI Users’ Group Meeting. Springer-Verlag New York Inc, 2011, p. 208.
[13] C. Clauss, S. Lankes, and T. Bemmerl, “Performance tuning of SCC-MPICH by means of the proposed MPI-3.0 tool interface,” Recent Advances in the Message Passing Interface, pp. 318–320, 2011.
[14] H. Esmaeilzadeh, T. Cao, Y. Xi, S. M. Blackburn, and K. S. McKinley, “Looking back on the language and hardware revolutions: Measured power, performance, and scaling,” in International Conference on Architectural Support for Programming Languages and Operating Systems, 2011, pp. 319–332.