[maemo-developers] Optimized memory copying functions for Nokia 770

From: Siarhei Siamashka siarhei.siamashka at gmail.com
Date: Tue Mar 14 17:04:14 EET 2006
Jack Jansen wrote:

> This looks very promising, especially if it could be used as a drop-in 
> replacement!

At least the improved memset can already be used as a drop-in
replacement; only a patch for glibc is needed. So we need to look at the
glibc sources and find the right place to integrate it.

But for the patch to be accepted upstream, it needs to be very clean:
it must not break big-endian machines or machines using ARM CPUs older
than v4. That's more work than it seems at first.

It is also critical to know whether this patch improves performance on
all ARM devices or helps only on the Nokia 770. Depending on that,
submitting a patch to glibc might in fact be useless, and keeping it
only as a local maemo patch would make more sense. That's why I'm still
waiting for benchmark results. I know there are some people from
Familiar Linux reading this mailing list; maybe they could test this
code on other devices.

By the way, it seems to be important to compile programs for maemo with
the '-march=armv5te' option or something similar. Older ARM CPUs (older
than v4) did not have 16-bit memory access instructions, so by default
the compiler generates two sequential byte access instructions for each
16-bit access.
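
To make the point about 16-bit accesses concrete, here is a minimal
sketch of the kind of code that is affected. The function name
sum_pixels16 is made up for illustration and is not part of the posted
code. Built with something like 'gcc -O2 -march=armv5te', each src[i]
can become a single halfword load; which instructions are actually
emitted depends on the compiler version and its default target, so
check the generated assembly (gcc -S) rather than taking this on faith.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sums an array of 16-bit values.
 * Every src[i] is a 16-bit memory read. On ARMv4 and later the
 * compiler may use one halfword load (LDRH) for it; when targeting
 * an older architecture it has to assemble the value from two
 * separate byte loads instead. */
uint32_t sum_pixels16(const uint16_t *src, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += src[i];
    return sum;
}
```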

Also, just improving glibc might not give the best results. Imagine
code for blitting 16bpp bitmaps. It contains a tight loop that copies
pixels one line at a time. If we need the best possible performance,
especially for small bitmaps that are only a few pixels wide, the extra
overhead of a memcpy function call and of the extra alignment check
(alignment is known to be 16-bit in this case) might make a noticeable
difference. So directly inlining the code from that 'memcpy16' macro
will be better in this case.
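
A rough sketch of what such an inlined blitting loop could look like is
below. This is not the actual 'memcpy16' macro; copy_pixels16 and
blit16 are hypothetical names, and the copy is a plain element loop
standing in for the optimized code. The point is only the structure:
the pointers are typed as 16-bit, so no runtime alignment check is
needed, and the copy is inlined so there is no call per scanline.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the optimized 16-bit copy: copies `count`
 * pixels. Both pointers are 16-bit aligned by type, so there is no
 * runtime alignment check. */
static inline void copy_pixels16(uint16_t *dst, const uint16_t *src,
                                 size_t count)
{
    while (count--)
        *dst++ = *src++;
}

/* Copies a w x h rectangle of 16bpp pixels, one line at a time.
 * Strides are given in pixels, not bytes. */
static inline void blit16(uint16_t *dst, size_t dst_stride,
                          const uint16_t *src, size_t src_stride,
                          size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++) {
        copy_pixels16(dst, src, w);
        dst += dst_stride;
        src += src_stride;
    }
}
```

For a bitmap only a few pixels wide, the inner copy runs for just a
handful of iterations, which is exactly the case where a function call
and an alignment check per line would be a large share of the work.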

> Have you by any chance checked whether malloc() returns aligned memory, 
> or could be made to do so for larger blocks?

Malloc should return memory aligned at least to the largest data type
used on the platform. So it is at least 32-bit aligned for sure, maybe
even 64-bit. Proper alignment is critical on ARM: improperly aligned
memory accesses produce 'unexpected' results (not really unexpected,
but different from what is observed on x86). Improper alignment is one
of the reasons why applications can work fine on x86 in the SDK but
fail on a real device. So malloc surely allocates aligned blocks of
memory.
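
The alignment malloc actually provides is easy to check on a given
system. The helper below is only an illustrative sketch
(pointer_alignment is a made-up name); it reports the largest power of
two, capped at 4096, that divides a pointer's address. It relies on
converting a pointer to uintptr_t giving the plain address, which holds
on the usual flat-memory platforms.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns the largest power-of-two alignment of
 * `ptr`, capped at 4096 so the loop also terminates for address 0. */
size_t pointer_alignment(const void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    size_t align = 1;

    /* Double the alignment while the corresponding address bit is 0. */
    while (align < 4096 && (addr & align) == 0)
        align <<= 1;
    return align;
}
```

Calling pointer_alignment(malloc(n)) for a range of sizes n shows what
the allocator guarantees in practice; a single result larger than 4 or
8 only means that one block happened to land on a rounder address.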

Nevertheless, 16-byte alignment seems to have some importance too. So
even copying blocks of memory that were returned by malloc and are
aligned at 4 bytes might show different performance. That can be
investigated. I just tried to find the best/worst case alignment for
testing these new functions, and those numbers (10-40% improvement)
reflect what I have seen so far.
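
One way to investigate the effect of alignment is to time the same
operation at different offsets from a malloc'ed block. The function
below is a minimal sketch of that idea, not the benchmark used for the
numbers above; time_memset_at_offset is a hypothetical name, it uses
the standard memset and clock(), and the clock resolution may be too
coarse unless the iteration count is large.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: runs `iterations` memset calls of `size` bytes
 * starting `offset` bytes past the start of a malloc'ed block.
 * Returns the CPU time used in seconds, or -1.0 if size is zero or
 * the allocation fails. */
double time_memset_at_offset(size_t offset, size_t size, long iterations)
{
    if (size == 0)
        return -1.0;

    unsigned char *block = malloc(offset + size);
    if (block == NULL)
        return -1.0;

    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        memset(block + offset, (int)(i & 0xff), size);
    clock_t end = clock();

    /* Read one byte back so the stores are less likely to be
     * discarded as dead by the optimizer. */
    volatile unsigned char sink = block[offset + size - 1];
    (void)sink;

    free(block);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running it for offsets 0 to 15 with a fixed size covers every position
inside a 16-byte window, which should show whether some alignments are
consistently better or worse than others on a particular device.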
