[maemo-developers] Improving Cairo performance on the N800

Wed Jan 17 10:38:34 EET 2007

On Tuesday 16 January 2007 12:08, Zeeshan Ali wrote:

> > Now, the recently announced Nokia N800 is different from the 770 in
> > various ways that are interesting for Cairo performance. I've got my
> > eye on the ARMv6 SIMD instructions and the PowerVR MBX accelerator.
>
>    Yeah! Me too. The combined power of these two can make it possible
> to optimize a lot of nice free software out there for the N800 device.
>  However, while the former is fully documented and the documentation is
> available to the general public, it doesn't have a lot to offer. ARMv6
> SIMD only operates on 32-bit words, and hence I find it unlikely that it
> can be used to optimize double fp emulation, in contrast to the Intel
> Wireless MMX, which provides a big bunch of 128-bit (CORRECTME: or
> was it 64-bit?) SIMD instructions. OTOH, these few SIMD instructions
> can still be used to optimize a lot of code, but would it be a good
> idea for cairo if you need to convert the operand values to ints and
> the result(s) back to float?

Well, the OMAP2420 seems to support floating point in hardware, so all this
stuff is probably not needed anymore :)

>   I had already been thinking about utilizing ARMv6 before the N800 was
> released to the public. My proposed plan of attack for the community (and
> also the Nokia employees) is simply the following:
>
> 1. Patch GCC to provide ARMv6 intrinsics. (1 MM at most)
> 2. Patch liboil [1] to utilize these intrinsics when compiled for
> ARMv6 target (1-3 MM)
> 3. Make all the software utilize liboil wherever appropriate or ARMv6
> intrinsics directly if needed.
>
>    The 3rd step would ensure that you are optimizing your software for
> all the platforms for which liboil provides optimizations. OTOH! one
> can skip step#1 and write liboil implementations in assembly.
>
>    I have already made a little progress on this, and the result is two
> header files which provide inline functions abstracting the assembly
> instructions. One of my friends was supposed to convert them to gcc
> intrinsics and patch gcc, but I never got around to finishing them.
> However, I am attaching the headers so anyone can use them as a
> starting point if he/she likes.

According to my tests, the performance improvement from using such header
files is minimal. They are easy to use, but the gain is generally small.

When I benchmarked IDCT performance, I also tested, out of curiosity, a C
implementation with some macros for fast ARMv5TE 16-bit multiplication. The
performance improvement was only about 5%, whereas handcrafted code
improves performance by as much as 50% (and still has potential for further
optimization):
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-September/045837.html
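
For illustration, a macro of this kind might look roughly like the sketch
below. It is not the code used in the benchmark above: the name mul16 is
hypothetical, and the portable C fallback is added only so that the snippet
also builds on non-ARM machines. The ARM path assumes a compiler that defines
the ACLE macro __ARM_FEATURE_DSP; older compilers that lack it silently get
the C fallback.

```c
#include <stdint.h>

/* Hypothetical helper: signed 16x16 -> 32 multiply of the low halfwords
 * of a and b. On ARMv5TE and later this maps to a single SMULBB. */
static inline int32_t mul16(int32_t a, int32_t b)
{
#if defined(__arm__) && defined(__ARM_FEATURE_DSP)
    int32_t r;
    /* gcc treats this as an opaque one-instruction block; it has no idea
     * about the latency of the result. */
    __asm__ ("smulbb %0, %1, %2" : "=r" (r) : "r" (a), "r" (b));
    return r;
#else
    /* Portable fallback (assumption: the narrowing cast wraps, as it does
     * with gcc). */
    return (int32_t)(int16_t)a * (int32_t)(int16_t)b;
#endif
}
```

Only the low 16 bits of each operand take part in the multiplication, so
callers have to keep their values in 16-bit range.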

A very similar minimal effect is obtained from using such macros in the
ffmpeg MP3 decoder.

The explanation is simple. The compiler is not able to schedule instructions
as well as a human can, especially if it has some 'alien' pieces of code
inserted into the flow of its instructions via inline asm. For example, this
multiply instruction takes 1 cycle to execute, but its result has 1 extra
cycle of latency on ARM9 (on ARM11 the latency is even higher, 2 cycles), so
you can't use it immediately in the next instruction. As gcc does not know
about the scheduling of such instructions when they come from plain macros,
it may try to use the result immediately and suffer a penalty of 1 or more
cycles because of a pipeline interlock.
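
The idea behind manual scheduling can be sketched in C, with the caveat that
this only shows the ordering a human would aim for in assembly; the compiler
remains free to reorder plain C as it likes. The function names below are
made up for the example. The first loop uses every product right away, the
second starts two independent multiplications before consuming either
result, which gives each product time to become available.

```c
#include <stddef.h>
#include <stdint.h>

/* Each product is consumed by the add right after it. On a core where the
 * multiply result has extra latency, that add may have to wait. */
static int32_t dot_dependent(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Two independent products are started back to back, and only then
 * accumulated, so no instruction needs the result of the one just before
 * it. */
static int32_t dot_interleaved(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int32_t p0 = (int32_t)a[i] * b[i];
        int32_t p1 = (int32_t)a[i + 1] * b[i + 1];
        acc0 += p0;
        acc1 += p1;
    }
    if (i < n)                      /* leftover element for odd n */
        acc0 += (int32_t)a[i] * b[i];
    return acc0 + acc1;
}
```

Both functions compute the same sum; whether the second one is actually
faster depends on the core and on what the compiler makes of it.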

So if really good performance is required, nothing can beat handcrafted
assembly yet. Of course, it makes sense to profile the code and optimize only
the relatively small, time-critical leaf functions.

By the way, free software is really poorly optimized for ARM right now. For
example, SDL is not optimized for ARM, the xserver is probably not optimized
either, and a lot of performance-critical code in various software is still
implemented only in C for ARM, while it has had x86 assembly optimizations
for a long time. Considering that Internet Tablets might face tight
competition from x86 UMPC devices in the near future, ARM-powered devices
are at some disadvantage now. Is this something that we should try to
change? :-)