[maemo-developers] Optimized memory copying functions for Nokia 770 (final part)

Tue Dec 5 09:25:24 EET 2006

Hello All,

Here is an old link with some benchmarks and initial information:
http://maemo.org/pipermail/maemo-developers/2006-March/003269.html

For completeness, a memcpy equivalent is now available as well, and the
functions come in two flavours (gcc inline macros or plain assembly code).
All the sources are here:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/fastmem-arm9/?root=mplayer

The easiest way to try this code is to link 'fastmem-arm9.S' with your
code; it will override the glibc 'memcpy' and 'memset' functions with this
optimized implementation. It will probably not affect code contained in
other shared libraries, though: SDL, for example, will most likely still
use the functions from glibc. If you decide to try the gcc inline macros,
that may not be safe: beware of compiler bugs. More details and testcases
are here:
https://maemo.org/bugzilla/show_bug.cgi?id=733
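
If you do link in a replacement 'memcpy'/'memset', a quick sanity check over
different alignments and sizes may be worth running first. Below is a minimal
sketch of such a check. It is not part of fastmem-arm9; the function names,
buffer sizes and fill values are made up for illustration. It simply tests
whichever 'memcpy' and 'memset' the program ends up linked with:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical limits for the test, chosen arbitrarily. */
#define MAX_OFFSET 16
#define MAX_LENGTH 100
#define BUF_SIZE   (MAX_OFFSET + MAX_LENGTH + 16)

/* Byte loop through a volatile pointer, so the compiler cannot turn it
 * into a call to the memset under test. */
static void fill_bytes(volatile uint8_t *p, uint8_t value, size_t n)
{
    while (n > 0) {
        *p++ = value;
        n--;
    }
}

/* Returns 1 if dst holds 'fill' everywhere except [off, off + len),
 * where it must hold either the bytes of 'expect' or, when 'expect'
 * is NULL, the constant 'set_value'. */
static int region_ok(const uint8_t *dst, size_t off, size_t len,
                     const uint8_t *expect, uint8_t set_value, uint8_t fill)
{
    for (size_t i = 0; i < BUF_SIZE; i++) {
        uint8_t want = fill;
        if (i >= off && i < off + len)
            want = expect ? expect[i - off] : set_value;
        if (dst[i] != want)
            return 0;
    }
    return 1;
}

/* Exercises memcpy and memset for every combination of source offset,
 * destination offset and length. Returns the number of failed cases. */
int check_mem_functions(void)
{
    static uint8_t src[BUF_SIZE], dst[BUF_SIZE];
    int failures = 0;

    for (size_t i = 0; i < BUF_SIZE; i++)
        src[i] = (uint8_t)(i * 7 + 1);

    for (size_t d_off = 0; d_off < MAX_OFFSET; d_off++) {
        for (size_t s_off = 0; s_off < MAX_OFFSET; s_off++) {
            for (size_t len = 0; len <= MAX_LENGTH; len++) {
                fill_bytes(dst, 0x55, BUF_SIZE);
                memcpy(dst + d_off, src + s_off, len);
                if (!region_ok(dst, d_off, len, src + s_off, 0, 0x55))
                    failures++;

                fill_bytes(dst, 0x55, BUF_SIZE);
                memset(dst + d_off, 0xA3, len);
                if (!region_ok(dst, d_off, len, NULL, 0xA3, 0x55))
                    failures++;
            }
        }
    }
    return failures;
}
```

A return value of 0 means every case also left the bytes around the
destination region untouched.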

Anyway, this code may be useful for various games, emulators or any
software that needs to clear/initialize or copy large memory blocks
fast. So those who are interested may scavenge something useful there :)

At least, adding a variation of this code to the bitmap blitting/clearing
functions of the allegro game programming library improved the framerate
in ufo2000 quite a lot. Sure, that's because of the nonoptimal full screen
update method, which is neither very fast nor battery friendly anyway and
should be changed to update only the parts of the screen that have changed.
But sometimes you have to update the full screen anyway, for example
when it is filled with a fire and smoke animation. So having fast bitmap
blitting code, and being able to just update the full screen without
performance problems, may be a good thing.
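
To illustrate what updating only the changed parts of the screen could look
like, here is a minimal sketch. It is not taken from allegro or ufo2000; the
'rect_t' type and 'blit_rect16' function are hypothetical names. It assumes
16-bit pixels, that both buffers have the same pitch, and that the rectangle
lies fully inside them:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rectangle type: position and size in pixels. */
typedef struct {
    int x, y, w, h;
} rect_t;

/* Copies only the pixels inside 'r' from src to dst, one row at a time.
 * 'pitch_pixels' is the length of a full buffer row in pixels. */
void blit_rect16(uint16_t *dst, const uint16_t *src, int pitch_pixels,
                 rect_t r)
{
    for (int row = 0; row < r.h; row++) {
        size_t offset = (size_t)(r.y + row) * (size_t)pitch_pixels
                      + (size_t)r.x;
        memcpy(dst + offset, src + offset, (size_t)r.w * sizeof(uint16_t));
    }
}
```

Each row is still copied with 'memcpy', so a faster 'memcpy' helps here too,
but far fewer bytes are moved when only a small area has changed.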

The technical explanation (at least my understanding of it) is the following.
The Nokia 770 cpu has a small amount of write back cache, but it is not
write allocate. That means if a memory block is already cached, a write
operation is fast and the data is stored immediately to cache. But if a
memory block is not cached, it can get into the cpu data cache only after a
read operation, not a write (read allocate cache behaviour). If the
destination buffer is not in cache, a write to it goes directly to memory
through the write buffer. Transfers to memory are performed in blocks of 4,
16, or 32 bytes and these blocks should be aligned. See '5.7 TCM write buffer'
and '6.2.2 Transfer size' in http://www.arm.com/pdfs/DDI0198D_926_TRM.pdf
So if you write to memory one byte at a time, memory bandwidth is wasted (you
get only one byte written per memory bus transfer operation, while you could
easily get 4 bytes written instead). Here is the worst possible memcpy
implementation, for example; if you benchmark it, you will get some
interesting numbers:

#include <stdint.h>

/* Byte-at-a-time copy: only one byte per memory bus transfer. */
void memcpy_trivial(uint8_t *dst, const uint8_t *src, int count)
{
    while (--count >= 0) *dst++ = *src++;
}
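
For comparison, here is a minimal sketch of the "4 bytes per transfer"
variant. It is my illustration, not code from fastmem-arm9, and the name
'memcpy_words' is made up. It assumes both pointers are 4-byte aligned and
that the buffers may legally be accessed as 32-bit words (for example they
are uint32_t arrays, or the code is built with -fno-strict-aliasing). If
each store goes out as one bus transfer, as described above, copying 1024
bytes takes 256 word writes instead of 1024 byte writes:

```c
#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time copy. Assumption: dst and src are both 4-byte aligned. */
void memcpy_words(void *dst, const void *src, size_t count)
{
    uint32_t *d32 = dst;
    const uint32_t *s32 = src;

    /* Main loop: one 32-bit store per 4 bytes copied. */
    for (; count >= 4; count -= 4)
        *d32++ = *s32++;

    /* Tail: the remaining 0..3 bytes are copied one by one. */
    uint8_t *d8 = (uint8_t *)d32;
    const uint8_t *s8 = (const uint8_t *)s32;
    while (count > 0) {
        *d8++ = *s8++;
        count--;
    }
}
```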

But the best performance is achieved with 16 byte transfers (aligned on a
16 byte boundary, otherwise they will just be split into several 4 byte
transfers). This can't be coded in C; it needs the assembly STM instruction
with 4 registers as operands (or any number of registers that is a
multiple of 4).
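
To make the idea concrete, here is a rough sketch of how a copy could be split
into an unaligned head, 16 byte blocks stored with a single STM, and a tail.
This is an illustration under my own assumptions, not the code from
fastmem-arm9: the function names and the choice of registers (r3, r4, r5, r12)
are made up, the byte-wise fallback for a source that is not word aligned is a
simplification, and on anything other than gcc in ARM mode the block copy is
plain C. Keep the warning about gcc inline assembly above in mind as well:

```c
#include <stddef.h>
#include <stdint.h>

/* Copies one 16-byte block. Assumption: dst is 16-byte aligned and
 * src is 4-byte aligned. */
static void copy_block16(uint8_t *dst, const uint8_t *src)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    /* Load four words, then store them with one STM (4 registers). */
    __asm__ volatile(
        "ldmia %[s], {r3, r4, r5, r12}\n\t"
        "stmia %[d], {r3, r4, r5, r12}\n\t"
        :
        : [s] "r"(src), [d] "r"(dst)
        : "r3", "r4", "r5", "r12", "memory");
#else
    /* Portable fallback so the sketch also builds on other targets. */
    for (int i = 0; i < 16; i++)
        dst[i] = src[i];
#endif
}

/* Hypothetical memcpy variant built around 16-byte aligned stores. */
void memcpy_blocks16(void *dst, const void *src, size_t count)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: single bytes until the destination is 16-byte aligned. */
    while (count > 0 && ((uintptr_t)d & 15) != 0) {
        *d++ = *s++;
        count--;
    }

    /* Body: 16-byte blocks, only if the source is now word aligned. */
    if (((uintptr_t)s & 3) == 0) {
        for (; count >= 16; count -= 16, d += 16, s += 16)
            copy_block16(d, s);
    }

    /* Tail: whatever is left, byte by byte. */
    while (count > 0) {
        *d++ = *s++;
        count--;
    }
}
```

When the source and destination are misaligned relative to each other, this
sketch falls back to the slow byte loop for the whole copy; a real
implementation would need to handle that case properly.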