[maemo-developers] Optimized memory copying functions for Nokia 770
From: Siarhei Siamashka siarhei.siamashka at gmail.com
Date: Tue Mar 14 17:04:14 EET 2006
Jack Jansen wrote:
> This looks very promising, especially if it could be used as a drop-in
> replacement!

At least the improved memset can already be used as a drop-in replacement; only a patch for glibc is needed. So we need to have a look at the glibc sources and find a place to integrate it. But for the patch to be accepted upstream, it needs to be very clean: it must not break big-endian machines or machines using ARM CPUs older than v4. That's more work than it seems at first.

It is also critical to know whether this patch improves performance on all ARM devices or is only helpful on the Nokia 770. Depending on that, submitting a patch to glibc might in fact be useless, and keeping it only as a local maemo patch would make sense. That's why I'm still waiting for benchmark results. I know there are some people from Familiar Linux reading this mailing list; maybe they could test this code on other devices.

By the way, it seems to be important to compile programs for maemo with the '-march=armv5te' optimization option or something similar. Older ARM CPUs (older than v4) did not have 16-bit memory access instructions, so by default the compiler generates two sequential byte access instructions for each 16-bit access.

Also, just improving glibc might not give the best results. Imagine code for blitting 16bpp bitmaps. It contains a tight loop that copies pixels one line at a time. If we need the best performance possible, especially for small bitmaps with only a few horizontal pixels, the extra overhead of a memcpy function call plus an extra alignment check (the alignment is already known to be 16-bit in this case) might make a noticeable difference. So directly inlining the code from that 'memcpy16' macro will be better in this case.

> Have you by any chance checked whether malloc() returns aligned memory,
> or could be made to do so for larger blocks?

Malloc should return memory aligned at least to the largest data type used on the platform. So it is at least 32-bit aligned for sure, maybe even 64-bit. Proper alignment is critical on ARM: improperly aligned memory access operations produce 'unexpected' results (not that they are really unexpected, but they differ from what is observed on x86). Improper alignment is one of the reasons why applications can work fine on x86 in the SDK but fail on a real device. So malloc surely allocates aligned blocks of memory.

Nevertheless, 16-byte alignment seems to have some importance too. So even copying blocks of memory that were returned by malloc and are aligned at 4 bytes might show different performance depending on where they fall relative to a 16-byte boundary. That can be investigated. I just tried to find the best and worst case alignment for testing these new functions, and those numbers (10-40% improvement) reflect what I have seen so far.
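To make the blitting and malloc alignment points a bit more concrete, here is a minimal sketch in plain C. It is not the 'memcpy16' macro discussed in this thread; the names copy_row16, blit16, misalignment and blit_demo are hypothetical and exist only for illustration. It assumes both bitmaps are 16-bit aligned (true for any uint16_t array), so the row copy needs no function call and no alignment check, and it prints how far a malloc'ed block sits from a 4-byte and a 16-byte boundary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, NOT the memcpy16 macro from this thread.
 * Copies 'count' 16-bit pixels. Both pointers are known to be
 * 16-bit aligned, so there is no run-time alignment check. */
static inline void copy_row16(uint16_t *dst, const uint16_t *src, size_t count)
{
    while (count--)
        *dst++ = *src++;
}

/* Blit a w x h block of 16bpp pixels. Strides are given in pixels. */
static void blit16(uint16_t *dst, size_t dst_stride,
                   const uint16_t *src, size_t src_stride,
                   size_t w, size_t h)
{
    size_t y;
    for (y = 0; y < h; y++)
        copy_row16(dst + y * dst_stride, src + y * src_stride, w);
}

/* Distance in bytes from the previous 'align'-byte boundary. */
static unsigned misalignment(const void *p, unsigned align)
{
    return (unsigned)((uintptr_t)p % align);
}

/* Returns 0 on success, 1 if allocation failed, 2 if the copy is wrong. */
static int blit_demo(void)
{
    enum { W = 8, H = 4 };
    uint16_t *src = malloc(W * H * sizeof *src);
    uint16_t *dst = calloc(W * H, sizeof *dst);
    size_t i;
    int ok;

    if (!src || !dst) {
        free(src);
        free(dst);
        return 1;
    }
    for (i = 0; i < W * H; i++)
        src[i] = (uint16_t)i;

    /* Copy a 3x2 block from (x=1, y=1) in src to (x=2, y=1) in dst. */
    blit16(dst + 1 * W + 2, W, src + 1 * W + 1, W, 3, 2);

    /* Assumption: malloc results are at least 4-byte aligned here. */
    assert(misalignment(src, 4) == 0);
    printf("malloc offset from 4-byte boundary: %u, from 16-byte boundary: %u\n",
           misalignment(src, 4), misalignment(src, 16));

    ok = dst[10] == 9 && dst[12] == 11 && dst[18] == 17 && dst[20] == 19
         && dst[9] == 0 && dst[13] == 0;
    free(src);
    free(dst);
    return ok ? 0 : 2;
}
```

Calling blit_demo() from main() is enough to try it. Whether the inlined loop actually beats a memcpy call for short rows is something to measure on the device rather than assume, and the 16-byte offset printed will depend on the allocator.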