[maemo-developers] Cairo performance comparison, 770 / N800 / PXA-320

Sun Jan 14 00:11:37 EET 2007

On Saturday 13 January 2007 21:00, Kalle Vahlman wrote:

> We have all sorts of funny hardware at the office, so I thought I'd
> make a quick run of cairo-perf with the Cairo 1.3.10 snapshot and see
> how they relate to each other.
>
> There's some funny things I encountered in the results, and I hope
> people on both lists can offer insights on why.
>
> Details at
>
>   http://syslog.movial.fi
>
> but let's just say that the results were predictable in general, with
> some surprises:
>
> N800 is naturally faster than 770, but I didn't expect the xlib
> backend to have so big differences between the two.

Maybe these devices were just running different Linux kernels (the task
scheduler may be different) and X servers? So quite a lot of code could
be different, and these results can't be used to compare these CPUs
directly.

> For the cairo audience there's the question of the tessellation
> process, can it really be so fast on the PXA-320 or is there a bug
> somewhere that twists the results? What could be so good in PXA-320
> (or not-good on the other devices) that the results are so drastic?

How much cache do these devices have? If the PXA-320 has more cache, and
all the code and data needed for this test fit in it but not in the cache
of the competing devices, that could explain the difference.

By the way, here is some code you can use for benchmarking the CPU clock frequency:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/tests/testfreq.c?root=mplayer&view=markup
It performs two test runs: the first runs a loop containing 10 add
instructions, and the second runs the same loop with an empty body.
Subtracting the time of the second run from the time of the first gives
the time spent executing the add instructions alone. The number of such
instructions executed per second can be used to measure the CPU clock
frequency. For better precision you may want to increase the
TESTS_COUNT define, though it will make the test run longer.
This test program can show results a bit lower than the actual clock
frequency (as we have a multitasking OS and other processes also
take some time). But the real CPU clock frequency can't be lower than the
benchmarked result :) Even on superscalar CPUs, these add
instructions can't run in parallel, as each new instruction depends
on the result of the previous one (hmm, it just occurred to me that the last
add instruction in the loop can run in parallel with the subs that decrements
the loop counter, so maybe some additional tweak will be required).
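
The idea described above can be sketched roughly as follows. This is not
the code from testfreq.c, just a minimal illustration with some assumptions:
all names here (KEEP, ADD1, run_full_loop, run_empty_loop, estimate_hz,
measure_hz) are made up for this example, the empty inline asm trick needs
GCC or Clang, the code should be built with optimization, and the result
only approximates the clock frequency if the CPU retires one dependent add
per cycle. It uses clock() (CPU time) instead of wall-clock time, which is a
different choice from simply timing the run.

```c
#include <assert.h>
#include <time.h>

#define ADDS_PER_ITERATION 10

/* Empty asm that claims to read and modify x: the compiler cannot merge
 * or drop the additions around it (GCC/Clang extension, an assumption). */
#define KEEP(x) __asm__ volatile("" : "+r"(x))
#define ADD1(x) do { (x) += 1; KEEP(x); } while (0)

/* Loop with 10 dependent adds per pass; returns the accumulated value. */
static unsigned long run_full_loop(unsigned long iterations)
{
    unsigned long x = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        ADD1(x); ADD1(x); ADD1(x); ADD1(x); ADD1(x);
        ADD1(x); ADD1(x); ADD1(x); ADD1(x); ADD1(x);
    }
    return x;
}

/* Same loop with an empty body; KEEP stops it from being removed. */
static unsigned long run_empty_loop(unsigned long iterations)
{
    unsigned long x = 0;
    for (unsigned long i = 0; i < iterations; i++)
        KEEP(x);
    return x;
}

/* Adds per second from the two timings; 0.0 if the difference is unusable. */
static double estimate_hz(double t_full, double t_empty,
                          unsigned long iterations)
{
    assert(iterations > 0);
    double t_adds = t_full - t_empty;
    if (t_adds <= 0.0)
        return 0.0;
    return (double)iterations * ADDS_PER_ITERATION / t_adds;
}

/* Runs both loops and returns the estimated add rate in Hz. */
static double measure_hz(unsigned long iterations)
{
    clock_t t0 = clock();
    run_full_loop(iterations);
    clock_t t1 = clock();
    run_empty_loop(iterations);
    clock_t t2 = clock();
    return estimate_hz((double)(t1 - t0) / CLOCKS_PER_SEC,
                       (double)(t2 - t1) / CLOCKS_PER_SEC, iterations);
}
```

Calling measure_hz() with a large iteration count (tens of millions or more)
and printing the result divided by 1e6 should give a figure in MHz. With too
few iterations the timer resolution dominates and the function returns 0.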

Also, the Nokia 770 does not run at 220MHz as stated on your page, but at
something closer to 250MHz, as shown by this test program
(and confirmed to be actually 252MHz by somebody from Nokia
on #maemo about half a year ago).

As for optimizing code for ARM (targeting the Nokia 770), there are a few
things that are slow (this list may still be incomplete):
1. Floating point math is slow without VFP (cairo contains a lot of fp math).
2. Integer division is slow (the '/' and '%' operators), as ARM does not have
a hardware instruction for it and a much less efficient software
implementation is used.
3. Write access to memory that is not already in the cache is slow with the
read-allocate cache of the arm926 core (data is not loaded into the cache on
a write), see more details here:
http://maemo.org/pipermail/maemo-developers/2006-December/006579.html
I have a crude patch for valgrind (the callgrind part) that simulates
read-allocate cache behaviour (instead of write-allocate, which is simulated
by default); it can show the parts of code that have lots of cache misses. If
anybody is interested, I can try to clean it up and submit it upstream:
http://ufo2000.xcomufo.com/maemo/vg-read-allocate-cache-patch.diff

I also had a quick look at the cairo sources (without benchmarking, just to
see the general coding style). Some parts of the code are not optimal. For
example, this chunk from cairo-path-stroke.c relies on integer division
(it is unlikely to cause a severe performance decrease here, but it could
become a real problem in tight loops):
[cut]
	for (i=start; i != stop; i = (i+1) % pen->num_vertices) {
	    tri[2] = f->point;
	    _translate_point (&tri[2], &pen->vertices[i].point);
	    _cairo_traps_tessellate_triangle (stroker->traps, tri);
	    tri[1] = tri[2];
	}
[/cut]
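
For illustration, the wrap-around step `i = (i+1) % pen->num_vertices` can be
done with a compare instead of a division. The helper below is hypothetical
(next_vertex is not a cairo function) and assumes the index is always within
0 .. num_vertices-1, which is what the modulo version guarantees too:

```c
#include <assert.h>

/* Advance a circular index by one without using '%'.
 * Hypothetical helper: requires 0 <= i < num_vertices. */
static int next_vertex(int i, int num_vertices)
{
    assert(num_vertices > 0 && i >= 0 && i < num_vertices);
    i++;
    if (i == num_vertices)  /* wrapped past the last vertex */
        i = 0;
    return i;
}
```

The loop header would then read something like
`for (i = start; i != stop; i = next_vertex (i, pen->num_vertices))`, and on
ARM the compare and reset compile to a couple of instructions instead of a
call into the software division routine.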
If we go deeper into _cairo_traps_tessellate_triangle, we will notice the
following:
[cut]
    memcpy (tsort, t, 3 * sizeof (cairo_point_t));
    qsort (tsort, 3, sizeof (cairo_point_t), _compare_point_fixed_by_y);
[/cut]
There is an unnecessary memcpy operation, and qsort is called for just three
elements! Such performance bottlenecks are quite easy to spot almost
everywhere. Most likely the code that is performance critical is optimized a
lot better, but at least this part deserves a comment such as
/* I know that it is slow, but this code is not performance critical and I'm
too lazy to optimize it */ :-)
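
Just to show what I mean about qsort on three elements: three
compare-and-swap steps are enough to sort them, with no function pointer
calls. This is only a sketch, not a patch for cairo. The point_t struct is a
stand-in for cairo_point_t, the function names are made up, and the
comparison looks at y only, so it may not match exactly what
_compare_point_fixed_by_y does for points with equal y.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cairo_point_t (an assumption for this example). */
typedef struct {
    int x, y;
} point_t;

/* Swap the two points if b has a smaller y than a. */
static void swap_if_lower(point_t *a, point_t *b)
{
    if (b->y < a->y) {
        point_t tmp = *a;
        *a = *b;
        *b = tmp;
    }
}

/* Sort exactly three points by ascending y using three compare-and-swaps. */
static void sort3_by_y(point_t p[3])
{
    assert(p != NULL);
    swap_if_lower(&p[0], &p[1]);  /* now p[0].y <= p[1].y */
    swap_if_lower(&p[1], &p[2]);  /* now p[2] has the largest y */
    swap_if_lower(&p[0], &p[1]);  /* fix the order of the first two */
}
```

Points with equal y keep their relative order here, because a swap only
happens on a strict "less than".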

Anyway, now I am not surprised that such huge improvements were possible
recently :-)

Also, this code does not have any assembly optimizations, but almost every
self-respecting multimedia library has them for low-level stuff such as
blitting, filling, blending, etc., unless this stuff is hardware accelerated.
Once the obvious bottlenecks in the C code are fixed (which is the first
priority as I see it), assembly optimizations can also provide a nice
performance boost.