The Unterminated String

Embedded Things and Software Stuff

Comparing CMSIS FIR Filter Variants

Posted at — Mar 6, 2016


Recently I’ve been playing about with an STM32F4DISCOVERY project which requires some DSP - nothing special, just a low-pass filter. My DSP is pretty rusty, however, and it probably could have been described that way even when I was studying it at uni. I was always able to recall the various formulae and when to use them, but the fundamental understanding of just what the maths was doing never materialised.

So when I found myself needing to implement a low-pass filter “in real life”, and not as a question on some test paper or coursework, I did what any sane person would do: rather than dusting off the textbook, I went looking for suitable libraries I could use. After a brief search, I decided that the FIR functions offered in ARM’s CMSIS package seemed like a pretty solid bet.

As part of the CMSIS library, ARM offers several different FIR filtering functions. Some versions operate on different datatypes, while others are “fast” variants which typically sacrifice some accuracy for speed. Being curious about this plethora of FIR filtering functions and how they perform, I put together a crude test to compare them. The results of this can be seen below.


All the code I used can be found in this GitHub repository, at the tag TEST_CMSIS_FIR_1. If you are interested in reproducing the test, I have included some basic instructions on how to do this. They can be found in test/cmsis_fir_filters/

An overview of the process is as follows:

  1. A Python script in test/cmsis_fir_filters/utils/ creates the C files containing the FIR filter coefficients and a test signal. It also computes the expected filtered signal.

  2. The generated coefficients and test signal are compiled along with the rest of test/cmsis_fir_filters/src/, flashed to the MCU and executed. The filtered output from the CMSIS arm_fir_*() functions and the timing data can then be retrieved with GDB.

    I was somewhat curious to see how well GCC optimised the CMSIS functions, so this step is repeated twice, once with a binary compiled with the option -O0 and again with -O2.

  3. The second Python script, test/cmsis_fir_filters/utils/, parses the output from GDB and test/cmsis_fir_filters/utils/. Using this information, it graphs the deviation between the expected and calculated filtered signals, as well as the CPU cycles consumed by the filtering functions.
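To give a flavour of the first step, here is a minimal sketch of how a Python script might turn a list of coefficients into a C source fragment. The function name format_c_array() and the array name fir_coeffs_f32 are hypothetical and are not taken from the repository; float32_t is the CMSIS typedef for float.

```python
def format_c_array(name, values, c_type="float32_t", per_line=4):
    """Render a list of floats as a C array definition (illustrative only)."""
    lines = []
    for i in range(0, len(values), per_line):
        chunk = values[i:i + per_line]
        # Ten significant digits are enough to round-trip a 32-bit float.
        lines.append("    " + ", ".join(f"{v:.9e}f" for v in chunk) + ",")
    body = "\n".join(lines)
    return f"const {c_type} {name}[{len(values)}] = {{\n{body}\n}};\n"


# Hypothetical three-tap example, just to show the generated text.
print(format_c_array("fir_coeffs_f32", [0.25, 0.5, 0.25], per_line=2))
```

The same helper could emit the test signal by passing a different array name.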


The generated test signal to be filtered is 100 ms (1600 samples) of a signal sampled at 16 kHz. It consists of 4 individual sinusoids as well as a small amount of noise, which has been thrown in for good measure. If anyone is curious, the signal and its FFT can be seen below.
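A signal of this shape can be sketched in a few lines of plain Python. The tone frequencies, amplitudes and noise level below are assumptions picked purely for illustration; the values actually used are in the repository's script.

```python
import math
import random

FS_HZ = 16_000      # sampling frequency from the post
N_SAMPLES = 1_600   # 100 ms at 16 kHz

# Hypothetical (frequency in Hz, amplitude) pairs; NOT the values from the repository.
TONES = [(500.0, 0.3), (2_000.0, 0.2), (6_500.0, 0.2), (7_500.0, 0.1)]
NOISE_STD = 0.01    # assumed noise level


def make_test_signal(seed=0):
    """Sum four sinusoids and add a little Gaussian noise."""
    rng = random.Random(seed)  # seeded so the signal is reproducible
    signal = []
    for n in range(N_SAMPLES):
        t = n / FS_HZ
        sample = sum(a * math.sin(2 * math.pi * f * t) for f, a in TONES)
        sample += rng.gauss(0.0, NOISE_STD)
        signal.append(sample)
    return signal


signal = make_test_signal()
print(len(signal), "samples,", 1000 * len(signal) / FS_HZ, "ms")
```

Keeping the summed amplitudes below 1.0 means the same signal can later be represented in the fractional fixed point formats without clipping.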

The 16 kHz test signal

FFT of the 16 kHz test signal

FIR Design

The FIR filter is a 128-tap (order 127) low-pass filter, designed for a cut-off frequency of 6 kHz at a sampling frequency of 16 kHz.
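For anyone who wants to see where such coefficients come from, below is a minimal windowed-sinc design in plain Python. This is an assumption on my part about one reasonable way to do it (a Hamming window, normalised to unity gain at DC); it is not necessarily the method used by the script in the repository, and the names design_lowpass() and gain_at() are hypothetical.

```python
import math


def design_lowpass(num_taps, cutoff_hz, fs_hz):
    """Windowed-sinc low-pass design with a Hamming window."""
    fc = cutoff_hz / fs_hz              # cut-off in cycles per sample
    centre = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - centre
        # Ideal low-pass impulse response (sinc), with the x == 0 limit handled.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    dc_gain = sum(taps)
    return [t / dc_gain for t in taps]  # normalise to unity gain at 0 Hz


def gain_at(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at one frequency."""
    w = 2 * math.pi * freq_hz / fs_hz
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)


taps = design_lowpass(128, 6_000.0, 16_000.0)
for f in (1_000.0, 6_000.0, 7_500.0):
    print(f"{f:6.0f} Hz: gain {gain_at(taps, f, 16_000.0):.4f}")
```

The resulting taps are symmetric, so the time-reversed coefficient order that the CMSIS documentation describes for the arm_fir_init_*() functions makes no difference for a filter like this.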

A graph of the filter’s response can be seen below.

FIR Filter Response

In the next graph, you can see the anticipated effect of filtering the test signal through this.

FFT of the Filtered Test Signal



The CMSIS FIR functions are designed to work on continuous data (as opposed to the single call of scipy.signal.lfilter() in the Python script). They require an argument containing an array (“block”) of newly acquired samples to be processed. The 100 ms test signal was (somewhat arbitrarily) divided into 25 blocks of 64 samples for processing. This equates to one block being provided to the CPU every 4 ms. During this test, the STM32F407 was running at its maximum clock speed of 168 MHz, at which 4 ms is equivalent to 672000 clock cycles.
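The difference between block-wise and single-call filtering is easier to see in code. The sketch below is a plain Python model of a direct-form FIR that carries its history from one block to the next; it is not the CMSIS implementation, and the class name BlockFir and the three-tap filter are purely illustrative. The CMSIS functions keep a comparable state buffer between calls.

```python
class BlockFir:
    """Direct-form FIR that remembers the last (num_taps - 1) input samples."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.history = [0.0] * (len(self.taps) - 1)

    def process(self, block):
        padded = self.history + list(block)
        n_hist = len(self.history)
        out = []
        for i in range(len(block)):
            newest = n_hist + i
            acc = 0.0
            for k, tap in enumerate(self.taps):
                acc += tap * padded[newest - k]   # y[n] = sum of b[k] * x[n - k]
            out.append(acc)
        # Keep the tail of the input as the state for the next block.
        self.history = padded[len(padded) - n_hist:]
        return out


def filter_all(taps, signal):
    """Filter the whole signal in one call, starting from zero state."""
    return BlockFir(taps).process(signal)


taps = [0.25, 0.5, 0.25]                          # illustrative three-tap filter
signal = [float(n % 7) for n in range(1600)]      # stand-in for the test signal

fir = BlockFir(taps)
blockwise = []
for start in range(0, len(signal), 64):           # 25 blocks of 64 samples
    blockwise.extend(fir.process(signal[start:start + 64]))

print(blockwise == filter_all(taps, signal))
```

Because the state is carried across calls, feeding the signal in 64-sample blocks gives the same output as filtering it all at once, which is what makes the comparison against a single lfilter() call meaningful.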

If filtering the signal was the only processing required, this value would be the hard limit on the number of cycles available to filter one block of samples. The actual limit will be significantly less than this, and will vary with the program’s intended application. Typically the signal will have been filtered for some purpose, so extra overhead will exist in the form of additional signal processing, communication with another device, etc.

ChibiOS’s Time Measurement functionality was used to measure the (approximate) number of cycles required by the filtering functions to process a block of data. On the STM32F4 this value is retrieved from the DWT unit’s CYCCNT register, so it should hopefully be pretty accurate.

The table below shows the measured cycles per block calculation, and the equivalent percentage load this had on the CPU.

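| Filter   | Unoptimised (-O0) Cycles | Unoptimised (-O0) Load (%) | Optimised (-O2) Cycles | Optimised (-O2) Load (%) |
|----------|--------------------------|----------------------------|------------------------|--------------------------|
| Q15      | 256424                   | 38.2                       | 83846                  | 12.5                     |
| Q15 Fast | 155267                   | 23.1                       | 14382                  | 2.1                      |
| Q31      | 386913                   | 57.6                       | 33838                  | 5.0                      |
| Q31 Fast | 239335                   | 35.6                       | 81469                  | 12.1                     |
| Float    | 135114                   | 20.1                       | 28862                  | 4.3                      |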
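For reference, the load percentages follow directly from the cycle budget per block. A small sketch of the arithmetic, using the -O2 cycle counts from the table (the constant and function names are my own):

```python
CPU_HZ = 168_000_000   # STM32F407 clock during the test
FS_HZ = 16_000         # sampling frequency
BLOCK_SIZE = 64        # samples per block

# Cycles that elapse while one block of samples arrives (4 ms at 168 MHz).
CYCLES_PER_BLOCK_PERIOD = CPU_HZ * BLOCK_SIZE // FS_HZ


def load_percent(cycles):
    """CPU load if the filter takes `cycles` to process every block."""
    return 100.0 * cycles / CYCLES_PER_BLOCK_PERIOD


measured_o2 = {
    "Q15": 83846,
    "Q15 Fast": 14382,
    "Q31": 33838,
    "Q31 Fast": 81469,
    "Float": 28862,
}

print("budget:", CYCLES_PER_BLOCK_PERIOD, "cycles per block")
for name, cycles in measured_o2.items():
    print(f"{name:8s} {load_percent(cycles):5.1f} %")
```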

Same results, but a little prettier:

CMSIS FIR Block Processing Time

Some observations on these results:


As mentioned above, the Python script not only produced the filter coefficients but also calculated the “expected” filtered signal. This was then compared against each output of the various CMSIS functions.

These measurements should be taken with a pinch of salt since floating point numbers were being passed around in string form and I couldn’t find how to tweak GDB’s floating point representation. In hindsight I probably should have considered passing about the hexadecimal variants, but then of course I would have had to worry about the conversion back to float.

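To compare the accuracy of the results, I picked two simple metrics: the maximum deviation and the root mean square deviation between the expected and calculated signals.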

These two measurements can be seen in the graphs below for each filter. Note that the scale is logarithmic to ensure that Q15 results can exist on the same scale as the other functions. As you would expect, the Q15 functions suffer greatly from only having half the number of bits available to store the samples.
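Both measurements are simple to compute. The sketch below assumes the usual definitions of maximum absolute deviation and root mean square deviation, which may differ in detail from what the plotting script does; the function names are my own.

```python
import math


def max_deviation(expected, actual):
    """Largest absolute difference between two equal-length signals."""
    if len(expected) != len(actual):
        raise ValueError("signals must have the same length")
    return max(abs(e - a) for e, a in zip(expected, actual))


def rms_deviation(expected, actual):
    """Root mean square of the sample-by-sample differences."""
    if len(expected) != len(actual):
        raise ValueError("signals must have the same length")
    squared = [(e - a) ** 2 for e, a in zip(expected, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Tiny made-up example: only the last sample differs, by 2.0.
expected = [1.0, 2.0, 3.0, 4.0]
actual = [1.0, 2.0, 3.0, 2.0]
print(max_deviation(expected, actual), rms_deviation(expected, actual))
```

The maximum deviation flags the single worst sample, while the RMS deviation gives a feel for the error across the whole signal.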

Max Deviation against calculated signal

Root Mean Square Deviation against calculated signal

Below is a table with the results. For all but float, the calculated output was unchanged with optimisation so the results have not been duplicated.

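| Filter   | Max Deviation (-O0) | Root Mean Square Deviation (-O0) |
|----------|---------------------|----------------------------------|
| Q15      | 0.000246477676267   | 0.000115541718721                |
| Q15 Fast | 0.000246477676267   | 0.000115541718721                |
| Q31      | 6.42909881998e-08   | 1.59465738538e-08                |
| Q31 Fast | 1.16589762678e-07   | 6.04082295228e-08                |
| Float    | 2.39170974359e-07   | 4.87415531383e-08                |

| Filter | Max Deviation (-O2) | Root Mean Square Deviation (-O2) |
|--------|---------------------|----------------------------------|
| Float  | 2.86760586443e-07   | 4.83186829429e-08                |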

Some observations:


I originally attempted to design the filter in Octave, which is essentially a Matlab clone. However, it wasn’t long before the many reasons why I disliked using Matlab came flooding back. This led me to change camps to the Python SciPy/NumPy/matplotlib alternative, which I will happily recommend to anyone doing something similar.

For my future project, I’ve settled on using the floating point function, arm_fir_f32(). With the floating point hardware on the STM32F407 enabling fast and accurate floating point calculations, I think I would be silly not to. The Q31 functions could squeeze out an extra drop of precision at the cost of only a few cycles. In my use case, however, I can afford to sacrifice some precision in order to avoid using datatypes non-native to C.

This was the first time I’ve dabbled with fixed point calculations in C, and I will admit to finding them fairly off-putting. If you are looking for somewhere to start, I found Fixed Point Arithmetic on the ARM to be short, sweet and to the point on the matter. This wasn’t the only document I consulted, but its presentation of the maths helped tie everything else I read together.
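As a taste of what the fractional formats involve, here is a small Python model of Q15 and Q31 conversion and a Q15 multiply. It is only a sketch of the general idea with helper names of my own (float_to_q(), q_to_float(), q15_mul()); it does not reproduce the CMSIS conversion routines or their exact rounding behaviour.

```python
def float_to_q(x, frac_bits):
    """Convert a float in [-1, 1) to a signed fixed point integer, saturating."""
    scale = 1 << frac_bits
    lo, hi = -scale, scale - 1
    return max(lo, min(hi, int(round(x * scale))))


def q_to_float(q, frac_bits):
    """Convert a fixed point integer back to a float."""
    return q / (1 << frac_bits)


def q15_mul(a, b):
    """Multiply two Q15 values: the Q30 product is shifted back to Q15."""
    product = a * b
    return max(-32768, min(32767, product >> 15))


half = float_to_q(0.5, 15)
print(half, q_to_float(q15_mul(half, half), 15))

# Smallest representable step for each format.
print(q_to_float(1, 15), q_to_float(1, 31))
```

The Q15 step is 2^-15, roughly 3.05e-05, against 2^-31, roughly 4.66e-10, for Q31, which gives some intuition for why the Q15 deviations in the results sit orders of magnitude above the others.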


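| Package           | Version                                  | Source                          |
|-------------------|------------------------------------------|---------------------------------|
| ChibiOS           | 20cb09025dc2884dd3269ac889c4ac2a5c74c93b |                                 |
| CMSIS             | 8c0e1a91341cde86532b05625f2ad584ce856118 |                                 |
| gcc-arm-none-eabi | 4.8.3-11ubuntu1+11                       | utopic/universe amd64 Packages  |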