If your company isn’t already grappling with the complexities of audio processing, it probably will be before too long. Voice interaction has grown, in the last few years, from a specialized interface mostly associated with Alexa and Siri to something we encounter every day. In addition to the ever-expanding range of smart speakers out there, everything from laptops and smartphones to earbuds and smart TVs now takes voice commands, demanding specialized processors to pick up keywords and turn them into actions. Together, these developments make audio processing one of the fastest-growing fields in the microprocessor industry.
Supporting our members as they adapt to these kinds of technology and market shifts has always been the main driver of new benchmarks at EEMBC. So it should come as no surprise to learn that we’ve already started development of an end-to-end benchmark called AudioMark, which measures performance across the entire audio-processing pipeline. Slated for release in mid-2023, AudioMark will focus on the most common audio-processing tasks, including voice analysis, word- and phrase-spotting, and speaker identification and amplification.
In one way, AudioMark represents a departure from our benchmarks of recent years, which have focused primarily on measuring performance on specific tasks. These task-specific benchmarks are still crucial, of course; concurrent with AudioMark, we’re also developing the next-generation CoreMark benchmark with a broad offering of modern compute workloads.
But besides becoming more common, audio processing is also an extremely variable computing task, with performance dependent on far more than just the speed and power consumption of individual steps in the pipeline. Listening for a single word or phrase (“lights on!”) is a vastly different task from understanding a vocabulary of 20 words, for example. Whether that language is processed locally or sent to the cloud adds another set of variables. So does the decision to use floating- or fixed-point arithmetic, and how large a cache to use, if one is employed at all. And if you’re working with multiple audio streams, for example when identifying speakers, performance can vary even further. More broadly, the right balance of accuracy and speed against hardware investment is unique to every device and application. The result is a ubiquitous processing task whose overall performance is nearly impossible to predict from its component processes.
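To make the floating- versus fixed-point trade-off concrete, here is a minimal sketch comparing a Q15 fixed-point multiply (a common 16-bit format in embedded audio) with its floating-point equivalent. All names here are my own illustrations, not part of AudioMark:

```python
# Illustrative sketch of the fixed- vs floating-point design choice.
# Q15 is a common 16-bit fixed-point format: 1 sign bit, 15 fractional bits.
# These helper names are hypothetical, not AudioMark APIs.

Q15_ONE = 1 << 15  # 32768 represents 1.0

def to_q15(x: float) -> int:
    """Quantize a float in [-1.0, 1.0) to Q15, saturating at the limits."""
    return max(-Q15_ONE, min(Q15_ONE - 1, round(x * Q15_ONE)))

def q15_mul(a: int, b: int) -> int:
    """Q15 multiply: take the 32-bit product, round, shift right by 15."""
    return (a * b + (1 << 14)) >> 15

a, b = 0.5, 0.25
fixed = q15_mul(to_q15(a), to_q15(b))   # 4096, i.e. 0.125 in Q15
floating = a * b                        # 0.125

print(fixed / Q15_ONE, floating)        # both 0.125
```

The fixed-point path uses only integer operations, which is why it appeals on cacheless, FPU-less microcontrollers; the cost is quantization error and saturation handling that a float pipeline avoids.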
Fortunately, the structure of the audio processing pipeline is fairly consistent, with a few key variations. Nearly all of these pipelines start with a microphone producing 16-bit output at 44.1 kHz, which then undergoes spectral analysis via a Fourier transform. If the direction of the signal is important (to identify who’s speaking, for example), then a beam-forming process comes next. Echo and noise cancellation are all but required. Beyond that, there’s a clear split between language-analysis tasks (smart speakers and other voice-controlled devices) and hearing assistance, but within these two broad groups, a lot of the processing is predictable.
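The front end of that pipeline can be sketched in a few lines. This is a toy illustration under stated assumptions (a simulated 16-bit capture of a 440 Hz tone, analyzed with a naive DFT standing in for the FFT stage), not AudioMark code:

```python
import cmath
import math

SAMPLE_RATE = 44100   # Hz, matching the typical pipeline described above
N = 1024              # analysis window length

# Simulate a 16-bit microphone capture of a 440 Hz tone at half amplitude.
freq = 440.0
samples = [int(32767 * 0.5 * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
           for n in range(N)]

def dft_bin(x, k):
    """Naive DFT of one bin -- stands in for the FFT spectral-analysis stage."""
    return sum(s * cmath.exp(-2j * math.pi * k * n / len(x))
               for n, s in enumerate(x))

# Magnitude spectrum over the first half of the bins (the rest mirrors it).
mags = [abs(dft_bin(samples, k)) for k in range(N // 2)]
peak = max(range(N // 2), key=lambda k: mags[k])
print(peak * SAMPLE_RATE / N)   # the bin center nearest 440 Hz
```

With a 1024-point window at 44.1 kHz, each bin spans about 43 Hz, so the 440 Hz tone lands in bin 10; a real implementation would use an FFT rather than this O(N²) loop.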
Taking these consistencies and variables into account, what’s clearly needed is a customizable end-to-end benchmark with a few specific options. Providing the right amount of flexibility—without creating a benchmark so variable that it becomes meaningless—means working with experts, and we’ve recruited some good ones. Intel, Arm, onsemi, Renesas, Infineon, STMicroelectronics, Synopsys, and Texas Instruments have already signed on to assist in AudioMark’s development and testing, and their help will be invaluable.
But to make AudioMark as useful as possible, we need your input too. The development of this benchmark is just getting started, and there are still dozens of small decisions to make as we progress. You can help provide the information that will lead us to the right ones. Becoming an EEMBC member brings a wide range of benefits, but one of the most valuable in the long term is the ability to inform and ultimately help shape the benchmarks that go on to shape the industry. So if audio processing is a big part of your world—or you anticipate that it’s going to be—there’s never been a better time to join.
The typical audio pipeline of today combines technologies that date back to early-20th-century radar and RF broadcasting, such as beamforming and direction of arrival, with more modern filters like acoustic echo cancellation and noise suppression. In keeping with recent tech trends, we've added a neural net to perform keyword-spotting or wake-word classification. AudioMark will exercise different data formats, increase instruction-cache demand, and even allow integration of accelerators such as DSPs or other dedicated audio hardware. However, it will be sufficiently balanced that no one technology can dominate.
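One way to picture how these stages compose is as a chain of frame processors ending in a classifier. Every function below is a deliberately crude stub of my own invention (averaging for beamforming, reference subtraction for echo cancellation, a noise gate, and frame energy in place of the neural net); none of these names reflect AudioMark's actual API:

```python
# Illustrative sketch of the stage ordering described above.
# All names and implementations are hypothetical stand-ins.

def beamform(frames):
    """Combine multi-microphone frames into one steered signal (stub: average)."""
    return [sum(chans) / len(chans) for chans in zip(*frames)]

def echo_cancel(frame, reference):
    """Acoustic echo cancellation (stub: subtract a scaled far-end reference)."""
    return [s - 0.5 * r for s, r in zip(frame, reference)]

def noise_suppress(frame, floor=0.01):
    """Noise suppression (stub: gate out samples below a noise floor)."""
    return [s if abs(s) >= floor else 0.0 for s in frame]

def keyword_spot(frame, threshold=0.05):
    """Stand-in for the neural-net keyword spotter: fire on frame energy."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

# Two-microphone capture of near-end speech plus a quiet far-end echo.
mic1 = [0.4, -0.4, 0.4, -0.4]
mic2 = [0.4, -0.4, 0.4, -0.4]
far_end = [0.2, -0.2, 0.2, -0.2]

frame = beamform([mic1, mic2])
frame = echo_cancel(frame, far_end)
frame = noise_suppress(frame)
print(keyword_spot(frame))   # True: the residual near-end speech still has energy
```

The point of the chain structure is the benchmark-relevant one from the text: each stage has a different compute profile (filtering, adaptive estimation, inference), so end-to-end performance depends on their interaction, not any single stage.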
The diagram below illustrates a rough outline of the benchmark's pipeline (excluding physical transducers):
The key components will consist of:

- Beamforming with direction-of-arrival estimation
- Acoustic echo cancellation
- Noise suppression
- A neural net performing keyword-spotting or wake-word classification
The working group is actively developing this benchmark. It is currently in the planning phase: the basic architecture has been outlined, and the specific behavioral characteristics, parameterization, and API are being defined.
Join the EEMBC AudioMark benchmark working group to:

- Help define the benchmark's behavioral characteristics, parameterization, and API
- Assist in AudioMark's development and testing alongside members like Intel, Arm, onsemi, Renesas, Infineon, STMicroelectronics, Synopsys, and Texas Instruments
- Inform and ultimately help shape the benchmarks that go on to shape the industry