[maemo-developers] Optimized memory copying functions for Nokia 770

Tue Mar 14 09:55:26 EET 2006

Hello All,

Here are optimized memory copying functions for the Nokia 770 (memset is
more than twice as fast, and memcpy improves by about 10-40% depending
on the relative alignment of the data blocks).

http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz

These functions came out of an experiment in getting the maximum memory
bandwidth on the Nokia 770 (powered by a TI OMAP1710), and in learning
ARM assembler in the process. Making full use of memory bandwidth
matters for 2D games and probably for other applications that process a
lot of multimedia data. I'm particularly interested in getting the best
performance out of the Allegro game programming library
(http://alleg.sourceforge.net) on the Nokia 770, and that was the
motivation for writing this code.

After a few experiments with reading and writing memory using different
data sizes per memory access, it appears that using bigger chunks is
much more important for writing than for reading: writing 16 bits per
memory access is usually twice as fast as writing 8 bits, and 32-bit
access is in turn twice as fast as 16-bit access. Reading in smaller
chunks shows no such significant degradation, so optimizing reads seems
to be less important. After some further half-empirical experiments with
writing to memory, it seems that the best memory bandwidth is achieved
with 16-byte burst writes, aligned on a 16-byte boundary, using the STM
instruction. This seems to give at least twice the memory bandwidth
utilization of the standard 'memset' function on the Nokia 770. With
such fantastic results, I decided to try making some optimized functions
that can serve as replacements for the standard memset/memcpy functions.
The aligned 16-byte write with the STM instruction is the core of all
these functions; the rest of the code deals with leading/trailing
unaligned data chunks.
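
To illustrate the head / aligned body / tail structure described above,
here is a minimal sketch in portable C. It is not the code from the
archive: the name sketch_memset8 is made up for this example, and plain
C cannot guarantee that the compiler emits STM for the 16-byte block (the
real functions are written in ARM assembler for that reason).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only; not the memset8 from the archive. */
void *sketch_memset8(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;
    /* Replicate the fill byte into all four bytes of a 32-bit word. */
    uint32_t word = byte * 0x01010101u;

    /* Head: single-byte writes until p reaches a 16-byte boundary
       (or the buffer ends first). */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: one aligned 16-byte block per iteration. In assembler this
       would be a single STM of four registers; here memcpy of four
       words stands in for it and avoids aliasing problems. */
    while (n >= 16) {
        uint32_t block[4] = { word, word, word, word };
        memcpy(p, block, sizeof block);
        p += 16;
        n -= 16;
    }

    /* Tail: the remaining 0-15 bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```

A call such as sketch_memset8(buf + 3, 0xAB, 40) exercises all three
phases: a 13-byte head if buf is 16-byte aligned, one 16-byte block, and
an 11-byte tail.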

The code implements the following functions (see the more detailed
comments in the code):
memset8, memset16, memset32 - replacements for memset, optimized
                               for different alignment
memcpy16, memcpy32          - replacements for memcpy, optimized
                               for different alignment

A testing framework is included, which makes it possible to check that
this code produces valid results and is also really fast. To run the
tests, compile this file as C source with the
FASTMEM_ARM_TEST_FRAMEWORK macro defined.
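
As a rough idea of what a correctness test for a memset replacement can
look like, here is a small sketch. It is an assumption about the general
approach, not the test code from the archive; the names memset_fn and
count_memset_mismatches are invented for this example. It compares a
candidate function against the standard memset for every starting
alignment within a 16-byte block and a range of lengths, with guard
bytes on both sides to catch overruns.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: any function with the same signature as memset. */
typedef void *(*memset_fn)(void *, int, size_t);

/* Returns the number of (offset, length) cases in which `candidate`
   produces a buffer different from the one produced by memset(). */
int count_memset_mismatches(memset_fn candidate)
{
    enum { PAD = 32, MAX_LEN = 64, BUF = PAD + 16 + MAX_LEN + PAD };
    unsigned char expected[BUF], actual[BUF];
    int mismatches = 0;

    for (size_t offset = 0; offset < 16; offset++) {
        for (size_t len = 0; len <= (size_t)MAX_LEN; len++) {
            /* Fill both buffers with a guard pattern first. */
            memset(expected, 0x55, BUF);
            memset(actual, 0x55, BUF);

            memset(expected + PAD + offset, 0xAA, len);
            candidate(actual + PAD + offset, 0xAA, len);

            /* Comparing whole buffers also detects writes outside
               the requested range. */
            if (memcmp(expected, actual, BUF) != 0)
                mismatches++;
        }
    }
    return mismatches;
}
```

A result of zero means the candidate matched memset in all tested cases;
it does not prove correctness for longer buffers or other fill values.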

Requirements for running this code: a little-endian, ARMv4-compatible CPU

Results from my Nokia 770 are the following:

    --- running correctness tests ---
    all the correctness tests passed
    --- running performance tests (memory bandwidth benchmark) ---:
    memset() memory bandwidth: 121.22MB/s
    memset8() memory bandwidth: 275.94MB/s
    memcpy() memory bandwidth (perfectly aligned): 104.86MB/s
    memcpy16() memory bandwidth (perfectly aligned): 113.98MB/s
    memcpy() memory bandwidth (16-bit aligned): 70.37MB/s
    memcpy16() memory bandwidth (16-bit aligned): 101.31MB/s
    --- testing performance for random blocks (size 0-15 bytes) ---
    memset time: 0.410
    memset8 time: 0.260
    --- testing performance for random blocks (size 0-511 bytes) ---
    memset time: 2.360
    memset8 time: 1.140

TODO:
    1. implement memcpy8 function (direct replacement for memcpy)
    2. provide big endian support (currently the code is little endian)
    3. investigate possibilities for getting the best performance
       on short buffer sizes
    4. better testing in real world and on different ARM based devices

I'm especially interested in getting feedback from running this code on
different devices. It is quite possible that these functions are optimal
only for the OMAP1710 and do not provide any benefit on other devices.

Currently this code improves the performance of the Allegro game
programming library quite a lot (in my not yet finished patch), but it
might also be used for SDL. It would be interesting to see whether these
functions can improve GTK performance as well. In that case we could get
a nice improvement in user interface responsiveness.

As soon as a complete replacement for memcpy (memcpy8) is done, it can
probably also be used as a patch for glibc to improve the performance of
all programs automagically.

Waiting for feedback, suggestions and test results on other ARM devices
(not only Nokia 770).