On Raspberry Pi performance

Warning: somewhat technical post – look away now if you don’t know what bit twiddling means (that’s a technical term by the way)…

Because of the tight timing constraints required of a real-time audio application, code performance is highly important. It has been a while since I did any profiling of the Pithesiser, so it’s about time to experiment on it a bit. New features such as filters, configurable controllers and high-resolution MIDI support do eat into the available performance budget after all.

My theme for this round of profiling was “float vs fixed point”. As I originally plumped for fixed point math code in the Pithesiser, I thought I should do something to justify that choice rather than going on gut feel. So I made some floating point implementations of the filtering, procedural sine oscillator and wavetable oscillator components and set up some timing tests around repeating these chunks of code numerous times.

It started out interesting, then got a little unexpected – which all adds up to the fact one should alway adhere to the golden rule of performance coding – “time everything, assume nothing”.

Float is faster than fixed point…

With my filter implementation, I achieved a three times speedup with the float version. Yup, scarily big – I wasn’t expecting quite that leap. So I took a detailed look at the code generated by the compiler for the fixed point version – it was massive, way larger than the floating point alternative. Lots of register setup and manipulation on each iteration of the sample processing loop, really dragging down performance.

I did some digging around in ARM documentation, considering if I could hand-roll my own assembler version – but ultimately decided that I didn’t want to yet spend the time needed to achieve that now. Because of the relatively high precision I was using (18 bit fractional part), a lot of the fixed point math had to be done using 64-bit values to avoid overflow – and that’s what I think was behind the gnarly amount of code. The Pi’s ARM is 32 bit, so has to use multiple instructions to achieve the 64 bit math (apart from multiplication).

Float continues to be faster than fixed point…

Then I moved on to the oscillator code, setting up float versions of the procedural sine and wavetable driven oscillators. Timing tests of these also showed the float procedural version to be faster, but the wavetable version not to be so. That turned out to be the conversion of the phase parameter into an integer index into the wavetable; when I replaced that part with a fixed point approach the float-based wavetable implementation was also faster. Looking good for floats so far – a further average gain of 15% to 20%.

Another upside of this float-based code is that it looks simpler and is easier to read than the equivalent fixed point code. Double win!

Doh! Sound hardware doesn’t do floating point.

Then I hit a wall – the USB sound card only does signed 16-bit integer samples. So if I switch the Pithesiser to floating point, there will have to be a final conversion stage – and judging by the float-to-int conversion issue from the wavetable, that could be a significant hit on performance. I whipped up a simple “scale-&-cast” float-to-int audio buffer conversion and timed it – lo and behold, it was expensive. About as expensive as my fixed point filter implementation. Swings and roundabouts, frying pans and fires.

The news gets worse for floats (in a way).

Next I considered optimising the float-to-int conversion. The generated code seemed ok, there were no function calls involved (as there can be on x86) so no easy wins there – it would have to be a bit-twiddling approach. But to ensure it worked, I would need to generate a float waveform to check the end results – so I took the float oscillator code from the earlier test and made that into its own module so I could reuse it. To be on the safe side, I  ran the performance tests again with the new module… but now the float implementation ran slower than the fixed point! Up to 55% slower.

It turned out that when the float oscillator code was in the test module, it was being inlined into the test loop rather than being called. And I was running a very large number of iterations, so the cost of doing the call (including caching penalties) was murdering the speed. The fixed point oscillator code was in the original Pithesiser oscillator module, so that wasn’t being inlined. That meant the original test wasn’t really fair.

So I levelled the playing field, dialled down the number of iterations and ran the oscillator tests again. This time there wasn’t much in it – the fixed point oscillators were on the whole a little faster.

The moral of the story

There’s that golden profiling rule of “time everything, assume nothing” – and for me, this experience bears that out. To which I would add that “context is everything” – you can create an artificial benchmark to prove that approach A is faster than approach B, but reality may ultimately make a nonsense of your results.

Getting hard figures that I could use to plan my next steps really helped. As it would appear that there wasn’t much to chose between float or fixed for oscillators, it came down to whether I could accept the float-to-int conversion hit and have faster float filters, or avoid the conversion but have slower fixed point filters.

But maybe I don’t have to! With some tweaking and tuning, I got the fixed point filters to be about 10% quicker than the floating point versions by dialling down the precision to 14 bits and consequently avoiding a lot of the 64-bit math without hurting quality.

Consequently I’m going to go forward for now with fixed point, reasonably confident that there’s enough performance there to increase the workload of the synth without getting into trouble… yet.

(c) 2013 Nicholas Tuckett