White Papers

What Does Energy Efficiency Really Mean?

Every well-known MCU manufacturer has one or more product families focused on so-called ultra-low-power applications. Their marketing messages promise the lowest possible energy consumption, but how much of this is based on reality? To set the record straight, we have made the measurements and present them in this article. By Dr. Claus Kühnel and Frank Riemenschneider

Exploring CoreMark™ – A Benchmark Maximizing Simplicity and Efficacy

There have been many attempts to provide a single number that fully quantifies the ability of a CPU. Be it MHz, MOPS, or MFLOPS, all are simple to derive but misleading when looking at actual performance potential. Dhrystone was the first attempt to tie a performance indicator, namely DMIPS, to the execution of real code. It was a good attempt that long served the industry, but it is no longer meaningful.

Make Accurate Power Measurements with NI Tools

This white paper from National Instruments discusses methodologies and solutions for measuring the power/energy consumption of embedded systems, including use of the EEMBC EnergyBench tool, which allows

Limits on Thread-Level Speculative Parallelism in Embedded Applications

As multicore microprocessors become widely adopted, the need to extract thread-level parallelism from sequential single-threaded applications in a seamless fashion increases. In "Limits on Thread-Level Speculative Parallelism in Embedded Applications," teams from Chalmers University of Technology and the University of Southern California use EEMBC benchmarks to examine the limits of performance speedup for embedded applications using parallelizing compilers on platforms with and without support for thread-level speculation. Their systematic study provides new insights into the importance of having a thread-level speculation substrate and a low overhead for thread management. On an eight-way multicore system, they find that a speedup of four, on average, is achievable for six of the ten EEMBC applications they analyzed. Authors: Mafijul Md. Islam, Alexander Busck, Mikael Engbom, and Per Stenström (Chalmers); Simji Lee and Michel Dubois (USC). Presented at INTERACT-11: Eleventh Workshop on Interaction between Compilers and Computer Architectures, part of HPCA-13, the 13th International Symposium on High-Performance Computer Architecture (Phoenix, February 10-14, 2007).

Challenges in Exploitation of Loop Parallelism in Embedded Applications

Embedded processors have been increasingly exploiting hardware parallelism. Vector units, multiple processors or cores, hyper-threading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above, have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. How this hardware parallelism can be exploited by applications is directly related to the amount of parallelism inherent in a target application. In this paper, the authors evaluate the performance potential of different types of parallelism, including true thread-level parallelism, speculative thread-level parallelism, and vector parallelism, when executing loops. Applications from the industry-standard EEMBC 1.1, EEMBC 2.0, and MiBench embedded benchmark suites are analyzed using the Intel C compiler. The results show what can be achieved today, provide upper bounds on the performance potential of different types of thread parallelism, and point out a number of issues that need to be addressed to improve performance. The results also point to the need for developing new benchmark suites more suitable to parallel compilation and execution. Authors: Arun Kejariwal (UC Irvine), Alexander V. Veidenbaum (UC Irvine), Alexandru Nicolau (UC Irvine), Milind Girkar (Intel), Xinmin Tian (Intel), Hideki Saito (Intel). Presented at the 4th International Conference on Hardware/Software Codesign and System Synthesis, Seoul, Korea, October 22-25, 2006.

Execution Schemes for Dynamically Reconfigurable Architectures

Mapping applications onto reconfigurable architectures can be done in many different ways, but the features of the target architecture significantly constrain the way an application can be mapped and executed. In this paper presented at the 2006 Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI) in Nagoya, Japan, the authors show how execution schemes can be generated as an intermediate format in an approach to application mapping and how they constitute a useful level at which to compare features of different architectures. This paper describes how the authors have established execution schemes for coarse-grained dynamically reconfigurable architectures, including use of EEMBC benchmarks to exemplify their basic flow. It presents area, timing, and gate-level power estimations derived from a synthesized architecture model. Authors: T. Oppold, T. Schweizer, J. Oliveira Filho, S. Eisenhardt, T. Kuhn, and W. Rosenstiel from the University of Tuebingen, Wilhelm-Schickard-Institute, Computer Engineering in Germany.

Scalable Vector Media-Processors for Embedded Systems

This insightful paper focuses on the development of efficient architectures for embedded multimedia systems. The author argues that it is possible to design processors that deliver high performance, have low energy consumption, and are simple to implement. The basis for the argument is the ability of vector architectures to efficiently exploit the data-level parallelism in multimedia applications. EEMBC benchmarks were utilized to derive data points to prove his theory. Paper author: Christoforos Kozyrakis, Doctor of Philosophy in Computer Science, University of California at Berkeley.

Understanding EEMBC Networking V2.0 Benchmarks

This September 2004 application note from IBM Microelectronics describes the results of the EEMBC benchmarks performed on IBM's PPC750GX microprocessor. These industry-standard benchmarks make it easier for the networking industry to choose processors for routers and other communications products.

VIRAM1: A Media-Oriented Vector Processor with Embedded DRAM

Processors for mobile multimedia devices must be low power while having excellent performance on media applications. The VIRAM1 processor accomplishes this by combining vector processing with embedded DRAM. VIRAM1 includes a scalar core, 13 megabytes (104 megabits) of DRAM, and four vector datapaths. It consumes 2 W at 200 MHz and executes up to 9.6 giga-ops (16 bit) per second. VIRAM1 is compared with a representative variety of embedded processors using EEMBC benchmarks. Presented at DAC 2004, June 7-11, 2004, San Diego, by Joseph Gebis, Sam Williams, and David Patterson, Computer Science Division, University of California, Berkeley; and Christos Kozyrakis, Electrical Engineering Department, Stanford University.

Overcoming the Limitations of Conventional Vector Processors

Despite their superior performance for multimedia applications, vector processors have three limitations that hinder their widespread acceptance. First, the complexity and size of the centralized vector register file limits the number of functional units. Second, precise exceptions for vector instructions are difficult to implement. Third, vector processors require an expensive on-chip memory system that supports high bandwidth at low access latency. This paper introduces CODE, a scalable vector microarchitecture that addresses these three shortcomings. To demonstrate the potential of CODE, we use the VIRAM instruction set for multimedia processing and compare it to the VIRAM media-processor, a multi-lane vector design with a centralized VRF. For the EEMBC benchmarks and assuming equal die area, CODE is 26% faster than VIRAM. Presented at the 30th International Symposium on Computer Architecture (ISCA), San Diego, June 2003 by Christos Kozyrakis (Electrical Engineering Department, Stanford University) and David Patterson (Computer Science Division, University of California at Berkeley).

Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks

Multimedia processing on embedded devices requires an architecture that leads to high performance, low power consumption, reduced design complexity, and small code size. In this paper, Christoforos Kozyrakis (Stanford University) and David Patterson (University of California at Berkeley) use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW processors for embedded multimedia applications. The comparison covers the VIRAM instruction set, the vectorizing compiler, and the prototype chip that integrates a vector processor with DRAM main memory. Originally presented at the 35th Annual ACM/IEEE International Symposium on Microarchitecture, Istanbul, 2002.

Dhrystone Benchmark: History, Analysis, Scores and Recommendations

This 2002 white paper explains what "benchmarking" is, how it is utilized, and offers a set of intended uses. Dhrystone benchmarking is explained, including its creation, evolution and intended purpose, technical details of how it works, and what it measures. The paper then distills a reasonable set of run-rules consistent with the author's intent, reports some interesting scores, and explores how Dhrystone is being used - and misused - by many in the industry. By Alan R. Weiss.