The Embedded Industry's First Floating Point Benchmark Software Suite.
General FPMark Features
Both single- and double-precision workloads
Broad applicability - small, medium, and large data sets
- Small useful for low-end microcontrollers and emulation/simulation platforms
- Large useful for high-end processors
Multicore support – ability to launch multiple contexts
53 workloads test FP performance in a balanced way
- Very wide range of workloads
- Not overly dependent on specific operations
- Minimal requirement for FP library support
Comprises pre-existing benchmarks and ‘home-grown’ workloads
All workloads include self-verification
The Need for FPMark
Floating-point arithmetic appears in many embedded applications, such as audio, DSP/math, graphics, automotive, and motor control
In the same way that CoreMark® was intended to be a “better Dhrystone”, FPMark provides something better than the “somewhat quirky” Whetstone and Linpack
Several FP benchmarks are already in general use (e.g. Linpack, Nbench, Livermore loops)
- Each with multiple versions
- No standardized way of running them or reporting results
FPMark Introduction (pdf)
License the EEMBC FPMark benchmark suite to evaluate and compare processor floating-point capabilities.
|How do I calculate the Marks|
|Has EEMBC tested the FPMark on a system/architecture which supports fused-multiply-[add,sub] floating-point instructions?|
Q: Regarding the precision/accuracy of the expected results, has EEMBC tested the FPMark on a system/architecture which supports fused-multiply-[add,sub] floating-point instructions (assuming the binary was compiled with the fused multiply instructions enabled), and does it pass? The release notes list the following with regards to testing: "Quad core Linux 64b (Ubuntu) /x86", but it is not clear if this x86 system supported fused instructions. Though, the FPMarkIntroduction.pdf does mention "FMA (Intel)".
A: FPMark was tested on devices with multiply-accumulate instructions and devices with multiple accumulators, on both ARM and Intel. To allow such devices to use those instructions and still pass verification, we had to take care in selecting our datasets and algorithms. The reference data, however, is generated with operations performed in order, using neither multiply-accumulate instructions nor multiple accumulators.
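The effect that makes this care necessary can be shown in a few lines. The sketch below is illustrative only and is not part of FPMark: the operand values are chosen for the demonstration, and `fused_multiply_add` is a hypothetical helper that emulates a single-rounding multiply-add with exact rational arithmetic.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once.

    Hypothetical helper for illustration; hardware does this in one instruction.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding to the nearest double


# Operands chosen so that the exact product, 1 - 2**-54, is not a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

in_order = a * b + c                 # product is rounded first, then the sum
fused = fused_multiply_add(a, b, c)  # only the final result is rounded

print("in order:", in_order)
print("fused:   ", fused)
print("bitwise equal:", in_order == fused)
```

On these inputs the in-order result and the fused result differ, so a check that demanded bit-exact agreement with in-order reference data would reject a correct fused implementation. Choosing datasets and algorithms carefully is one way to keep such differences within what verification accepts.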
|Are floating-point operations poorly defined because they differ from one architecture to another?|
No. Instructions differ from one architecture to another, but floating-point operations are one of the few well-defined terms in benchmarking. These are the basic computational operations found in IEEE-754-2008 section 5.4, “formatOf general-computational operations”, and they include addition, subtraction, multiplication, division, square root, conversions, comparisons and so on. A fused multiply-add, like a chained multiply-add, has always been treated as two operations for benchmarking purposes. Floating-point operations do not include loads, stores, moves, or higher-level instructions such as sin or tan. If an architecture has an instruction that performs, for example, 500 multiply-adds and 1500 loads, it executes 1000 floating-point operations: two per multiply-add, with the loads counting for nothing.
|How do you determine the number of floating-point operations that are run in an iteration of a benchmark test?|
Run the test on a simulator and count the number of floating-point operations architecturally (as opposed to speculatively) executed. The easiest way to do this is to count how many instructions of each floating-point type are executed and weight each count by the number of floating-point operations that instruction performs.
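As a minimal sketch of that weighting, assume the simulator reports a count per instruction mnemonic. The mnemonics and the weight table below are hypothetical and would have to be filled in for a real architecture.

```python
from typing import Mapping

# Floating-point operations performed by each instruction type.
# The mnemonics are hypothetical; loads, stores, and moves are weighted zero.
OPS_PER_INSTRUCTION = {
    "fadd": 1,
    "fmul": 1,
    "fdiv": 1,
    "fsqrt": 1,
    "fmadd": 2,  # a multiply-add counts as two operations
    "fload": 0,
    "fstore": 0,
    "fmove": 0,
}


def count_fp_operations(executed: Mapping[str, int]) -> int:
    """Weight each architecturally executed instruction count by its operation count."""
    return sum(OPS_PER_INSTRUCTION[name] * count for name, count in executed.items())


# The earlier example: 500 multiply-adds and 1500 loads.
print(count_fp_operations({"fmadd": 500, "fload": 1500}))  # 1000
```

An unknown mnemonic raises `KeyError` rather than being silently ignored, which seems the safer choice when the total feeds a published score.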
|That seems straightforward, but different compilers will produce different code, some more optimal than others. How do we then determine the number of FP operations?|
Since we do not control what the compilers produce, we need to create a workload equivalent. Basically, we pick an architecture (not an implementation) and a good compiler and use that code as a baseline. We then calculate how many operations are performed in an iteration. This should be a reasonable estimate of what other good compilers would produce. If another compiler can do the same work with fewer operations, then that will show up in the score. That only makes sense, as we are determining the performance of the system, which goes well beyond the floating-point computational unit; it includes a variety of factors including compilers, operating systems, and the memory subsystem.
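One way to read this is that every run is credited with the baseline operation count, whatever code its own compiler emitted. The sketch below assumes that reading; the formula, the function name, and all numbers are hypothetical and are not the published FPMark scoring method.

```python
def workload_score(baseline_ops_per_iteration: int, iterations: int, seconds: float) -> float:
    """Operations per second credited to a run, using the baseline operation count."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return baseline_ops_per_iteration * iterations / seconds


# Hypothetical count taken from the chosen baseline architecture and compiler.
BASELINE_OPS = 1_000_000

# Two hypothetical compilers on the same hardware, each running 100 iterations.
# Compiler B does the same work with fewer operations and so finishes sooner.
score_a = workload_score(BASELINE_OPS, iterations=100, seconds=2.0)
score_b = workload_score(BASELINE_OPS, iterations=100, seconds=1.6)

print(f"compiler A: {score_a:.3e} ops/s")
print(f"compiler B: {score_b:.3e} ops/s")
```

Because both runs are credited with the same work, the better compiler's advantage appears directly as a higher score.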
|What about library calls or complex instructions, such as transcendentals; how many floating-point operations are those?|
This is not much different from the issue with different compilers. However, some computers do implement instructions such as sine and cosine, and it becomes difficult to count the floating-point operations executed internally to the CPU. The best solution may be to choose a mutually agreed-upon target architecture that doesn’t include such instructions. If the computer with complex instructions can perform these operations more efficiently internally, that will be reflected in the score, as it should.
|Why don't we just scale each score by that of a reference platform?|
The problem with this approach is that the reference is arbitrary, and so the scores are arbitrary (much more so than with an approximated, workload-based approach). For example, if the reference platform excels at medium workloads but falls on its face for large or small workloads, the relative scores will not allow consumers to see how a different processor scales with different-sized workloads relative to itself. Using the same example, a different design that has even performance for all three workloads would get very high scores for the large and small workloads but a middling score for the medium workloads, because it is compared to an arbitrary reference with its own idiosyncrasies. However, if the scores were normalized to the workload, such a design should get fairly consistent scores for the different-sized workloads.
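A small numerical illustration of this argument follows. Every figure in it is invented for the example: the operation counts, the run times, and both platforms are hypothetical.

```python
# Hypothetical operation counts per iteration for each data-set size.
WORKLOAD_OPS = {"small": 1e6, "medium": 1e7, "large": 1e8}

# Hypothetical run times in seconds. The reference excels at medium only;
# the candidate sustains the same rate on all three sizes.
REFERENCE_SECONDS = {"small": 0.05, "medium": 0.05, "large": 5.0}
CANDIDATE_SECONDS = {"small": 0.01, "medium": 0.1, "large": 1.0}


def reference_relative(size: str) -> float:
    """Score scaled by the reference platform: reference time / candidate time."""
    return REFERENCE_SECONDS[size] / CANDIDATE_SECONDS[size]


def workload_normalized(size: str) -> float:
    """Score normalized to the workload: operations per second."""
    return WORKLOAD_OPS[size] / CANDIDATE_SECONDS[size]


for size in WORKLOAD_OPS:
    print(f"{size:6s}  relative={reference_relative(size):5.2f}  "
          f"ops/s={workload_normalized(size):.3e}")
```

With these numbers the reference-relative scores come out at roughly 5, 0.5, and 5, which reflects the reference platform's weaknesses rather than anything about the candidate. The workload-normalized scores are about the same for all three sizes.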