FPMark™ FAQs
How do I calculate the marks?
The official ‘FPMark’ score is calculated by taking the geometric mean of the six workload scores (DL, DM, DS, SS, SM, SL) and multiplying the result by 100 for scaling. The official ‘MicroFPMark’ score is calculated by taking the geometric mean of all single-precision, small-data workloads (i.e., the SpS mark).
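As a minimal sketch of that calculation (the function name and the submark values below are illustrative, not part of the official kit):

```python
import math

def fpmark(scores):
    """Geometric mean of the six workload scores, scaled by 100."""
    assert len(scores) == 6
    return 100 * math.prod(scores) ** (1 / len(scores))

# Illustrative submark scores for DL, DM, DS, SS, SM, SL (made-up values)
submarks = [1.2, 0.9, 1.1, 1.4, 1.0, 0.8]
print(fpmark(submarks))  # roughly 104.9
```

Because the geometric mean is used, no single workload size can dominate the composite score the way it could with an arithmetic mean.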
Other Submarks

Has EEMBC tested the FPMark on a system/architecture which supports fused multiply-[add,sub] floating-point instructions?
Q: Regarding the precision/accuracy of the expected results, has EEMBC tested the FPMark on a system/architecture which supports fused multiply-[add,sub] floating-point instructions (assuming the binary was compiled with the fused multiply instructions enabled), and does it pass? The release notes list the following with regard to testing: "Quad core Linux 64b (Ubuntu) / x86", but it is not clear if this x86 system supported fused instructions, though the FPMarkIntroduction.pdf does mention "FMA (Intel)".

A: FPMark was tested on devices with multiply-and-accumulate instructions and on devices with multiple accumulators, both ARM and Intel. To allow such devices to use those instructions and still pass verification, we had to take care in the selection of our datasets and algorithms. For generation of the reference data, however, operations are done in order; multiply-accumulate is not used, nor are multiple accumulators.
Are floating-point operations poorly defined because they differ from one architecture to another?
No. Instructions differ from one architecture to another, but floating-point operations are one of the few well-defined terms in benchmarking. These are the basic computational operations as found in IEEE 754-2008 section 5.4, “format of general-computational operations”, and they include addition, subtraction, multiplication, division, square root, conversions, comparisons, and so on. Fused multiply-add, like chained multiply-add, has always been treated as two operations for benchmarking purposes. Floating-point operations do not include loads, stores, moves, or higher-level functions such as sin or tan. If an architecture runs a code sequence that performs, for example, 500 multiply-adds and 1500 loads, it executes 1000 floating-point operations.
How do you determine the number of floating-point operations that are run in an iteration of a benchmark test?
Run the test on a simulator and count the number of floating-point operations architecturally (as opposed to speculatively) executed. The easiest way to do this is to count how many of each type of floating-point instruction is executed and weight those counts by how many floating-point operations each instruction performs.
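A minimal sketch of that weighting step (the instruction mnemonics and counts below are illustrative; the weights follow the convention above, where a fused multiply-add counts as two operations and memory traffic counts as zero):

```python
# Floating-point operations per retired instruction (illustrative mnemonics).
OPS_PER_INSTRUCTION = {
    "fadd": 1,
    "fmul": 1,
    "fdiv": 1,
    "fsqrt": 1,
    "fmadd": 2,   # fused multiply-add = multiply + add, counted as 2 ops
    "fload": 0,   # loads/stores are not floating-point operations
    "fstore": 0,
}

def count_flops(instruction_counts):
    """Weight per-instruction retirement counts by the FP ops each performs."""
    return sum(OPS_PER_INSTRUCTION[name] * n
               for name, n in instruction_counts.items())

# Example from the text: 500 multiply-adds and 1500 loads -> 1000 FP operations
print(count_flops({"fmadd": 500, "fload": 1500}))  # prints 1000
```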
That seems straightforward, but different compilers will produce different code, some more optimal than others. How do we then determine the number of FP operations?
Since we do not control what the compilers produce, we need to create a workload equivalent. Basically, we pick an architecture (not an implementation) and a good compiler, and use that code as a baseline. We then calculate how many operations are performed in an iteration. This should be a reasonable estimate of what other good compilers would produce. If another compiler can do the same work with fewer operations, that will show up in the score. That only makes sense, as we are determining the performance of the system, which goes well beyond the floating-point computational unit; it includes a variety of factors, including compilers, operating systems, and the memory subsystem.
What about library calls or complex instructions, such as transcendentals; how many floating-point operations are those?
This is not much different from the issue with the different compilers. However, some processors do implement instructions such as sine and cosine, and it becomes difficult to count the floating-point operations executed internally by the CPU. The best solution may be to choose a mutually agreed-upon target architecture that doesn’t include such instructions. If the processor with complex instructions can perform these operations more efficiently internally, that will be reflected in the score, as it should.
Why don't we just scale each score by that of a reference platform? 
The problem with this approach is that the reference is arbitrary, and so the scores are arbitrary (much more so than with an approximated-workload-based approach). For example, if the reference platform excels at medium workloads but falls on its face for large or small workloads, the relative scores will not allow consumers to see how a different processor scales across workload sizes relative to itself. Using the same example, a different design that has even performance across all three workload sizes would get very high scores for the large and small workloads but a middling score for the medium workloads, because it is compared against an arbitrary reference with its own idiosyncrasies. However, if the scores were normalized to the workload, such a design would get fairly consistent scores across the different workload sizes.
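A toy numeric sketch of that distortion (all throughput and operation counts below are made up purely to illustrate the argument, not taken from FPMark):

```python
# Made-up iterations/second for three workload sizes.
reference = {"small": 10.0, "medium": 100.0, "large": 10.0}  # strong only at medium
even_cpu  = {"small": 50.0, "medium": 50.0,  "large": 50.0}  # even across sizes

# Scheme 1: scale by an arbitrary reference platform. The even design
# appears to swing 10x between workload sizes, reflecting the reference's
# idiosyncrasies rather than its own behavior.
reference_relative = {w: even_cpu[w] / reference[w] for w in even_cpu}
print(reference_relative)  # {'small': 5.0, 'medium': 0.5, 'large': 5.0}

# Scheme 2: normalize by the work in each workload (made-up FP operation
# counts per iteration), independent of any reference machine.
fp_ops_per_iteration = {"small": 1.0e6, "medium": 1.0e6, "large": 1.0e6}
workload_normalized = {w: even_cpu[w] * fp_ops_per_iteration[w]
                       for w in even_cpu}  # FP operations per second
print(workload_normalized)  # consistent across all three sizes
```

Under the workload-normalized scheme, a design with even performance gets even scores, which is exactly the property the reference-relative scheme loses.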