[maemo-developers] Optimized memory copying functions for Nokia 770
From: Siarhei Siamashka Siarhei.Siamashka at gmail.com
Date: Wed Mar 15 00:38:14 EET 2006
Siarhei Siamashka wrote:

> By the way, I tried to search for asm optimized versions of memcpy
> for ARM platforms. Did not do that before as my mistake was that I
> assumed glibc memcpy/memset implementations to be already optimized
> as much as possible.
>
> It appears that there is a fast memcpy implementation in uclibc and
> there are also many other implementations around. Seems like I tried
> to reinvent the wheel. Too bad if it turns out that spending the whole
> 2 days of the weekend was a useless waste of time :( Well, at least I
> did not try to steal someone else's code and 'copyright' it.
>
> As I said before, my observations show that it is better to align
> writes on 16-byte boundaries, at least on Nokia 770. The code I have
> posted is proof of concept code and it shows that it is faster than
> the default memset/memcpy on the device. I'm going to compare my code
> with the uclibc implementation; if uclibc is in fact faster or has the
> same performance, I'll have to apologize for causing this mess and go
> away ashamed.
Added uclibc benchmark to the test program:

--- running correctness tests ---
all the correctness tests passed
--- running performance tests (memory bandwidth benchmark) ---:
memset() memory bandwidth: 122.64MB/s
memset_uclibc() memory bandwidth: 121.93MB/s
memset8() memory bandwidth: 279.62MB/s
memcpy() memory bandwidth (perfectly aligned): 102.30MB/s
memcpy_uclibc() memory bandwidth (perfectly aligned): 110.96MB/s
memcpy16() memory bandwidth (perfectly aligned): 110.96MB/s
memcpy() memory bandwidth (16-bit aligned): 69.44MB/s
memcpy_uclibc() memory bandwidth (16-bit aligned): 49.58MB/s
memcpy16() memory bandwidth (16-bit aligned): 99.86MB/s
--- testing performance for random blocks (size 0-15 bytes) ---
memset time: 0.410
memset8 time: 0.270
--- testing performance for random blocks (size 0-511 bytes) ---
memset time: 2.360
memset8 time: 1.140

So while uclibc also uses the STM instruction to copy large chunks of
memory at once, it does not use 16-byte alignment and performs quite
poorly on data that is only 16-bit aligned (49.58MB/s against 99.86MB/s
for memcpy16).

It was good that I did not search for other memcpy implementations
first, but tried to make a new one. Beginner's luck probably :) Without
looking at other implementations, I just tried different instructions
(including the STRD instruction from the new DSP instruction set),
different orderings of instructions and different data block sizes in
the memset32 function, and almost accidentally stumbled upon the
combination which seems to work better. That's not really an
'invention', as there are not many things that can be varied within the
dozen or so instructions needed for a memset function.

It is strange that this 16-byte alignment trick has not been used in
either uclibc or glibc until now. Another possibility is that this
improvement is specific to Nokia 770 and nobody else has ever
encountered it or needed it. Well, do we really care anyway? ;)

Now I just really badly want to see the benchmark results from some
other CPU, preferably Intel XScale :)
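
For readers who want to see the 16-byte write alignment idea in portable form, here is a minimal C sketch. It is not the memset8 code from the test program: the name memset_align16 is made up for illustration, and the 16-byte block copy stands in for the multi-register STM store that the assembly version would use. Whether a compiler turns that block copy into a single store-multiple is an assumption, so this only shows the head/body/tail structure and says nothing about speed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original post): fills n bytes at dst
 * with value, arranging for every bulk store to start on a 16-byte
 * boundary. */
static void *memset_align16(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char c = (unsigned char)value;

    /* Head: single-byte stores until p reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = c;
        n--;
    }

    /* Body: one 16-byte block per iteration, always written to an
     * address that is a multiple of 16. The memcpy of a fixed-size
     * block is a portable stand-in for an STM of four registers. */
    uint32_t word = 0x01010101u * c;   /* replicate the byte 4 times */
    uint32_t block[4] = { word, word, word, word };
    while (n >= 16) {
        memcpy(p, block, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: the remaining 0..15 bytes. */
    while (n > 0) {
        *p++ = c;
        n--;
    }
    return dst;
}
```

Calling memset_align16(buf + 3, 0xAB, 40) writes exactly the 40 bytes starting at buf + 3, using byte stores up to the first 16-byte boundary, 16-byte blocks after that, and byte stores for whatever is left.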