[maemo-developers] Optimized memory copying functions for Nokia 770
From: Tomas Frydrych tf at o-hand.com
Date: Tue Mar 14 11:23:03 EET 2006
- Previous message: [maemo-developers] Optimized memory copying functions for Nokia 770
- Next message: [maemo-developers] Optimized memory copying functions for Nokia770
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
There seems to be no source for the functions in the tarball.

Tomas

Siarhei Siamashka wrote:
> Hello All,
>
> Here are the optimized memory copying functions for Nokia 770 (memset is
> more than twice as fast, memcpy improves by about 10-40% depending on
> the relative alignment of the data blocks).
>
> http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz
>
> These functions were created as an attempt to experiment with getting
> maximum memory bandwidth on Nokia 770 (powered by TI OMAP1710) and also
> to learn ARM assembler in the process. Getting maximum memory bandwidth
> utilization is needed for 2D games and probably other applications which
> need to process a lot of multimedia data. I'm particularly interested in
> getting the best performance for the Allegro game programming library
> (http://alleg.sourceforge.net) on Nokia 770 and that was the motivation
> for writing this code.
>
> After a few experiments with reading/writing memory using different data
> sizes for each memory access operation, it appears that using bigger
> chunks matters much more for writing than for reading: writing 16 bits
> per memory access is usually twice as fast as writing 8 bits, and
> 32-bit access is in turn twice as fast as 16-bit access. There is
> no such significant performance degradation for reading in smaller
> chunks, so optimizing reading seems to be less important. After some
> more half-empirical experiments with writing to memory, it seems that
> the best memory bandwidth is achieved by using 16-byte burst writes
> aligned on a 16-byte boundary using the STM instruction. This seems to
> provide at least twice the memory bandwidth utilization of the standard
> 'memset' function on Nokia 770. Having such fantastic results, I decided
> to try making some optimized functions that can serve as a replacement
> for the standard memset/memcpy functions. The aligned 16-byte write
> with the STM instruction is the core part of all these functions; the
> rest of the code deals with leading/trailing unaligned data chunks.
>
> It implements the following functions (see more detailed comments in the
> code):
>   memset8, memset16, memset32 - replacements for memset, optimized
>                                 for different alignment
>   memcpy16, memcpy32          - replacements for memcpy, optimized
>                                 for different alignment
>
> A testing framework is included, which makes it possible to check that
> this code provides valid results and is also really fast. In order to
> run the tests, this file should be compiled as C source with the
> FASTMEM_ARM_TEST_FRAMEWORK macro defined.
>
> Requirements for running this code: little endian ARM v4 compatible cpu
>
> Results from my Nokia 770 are the following:
>
> --- running correctness tests ---
> all the correctness tests passed
> --- running performance tests (memory bandwidth benchmark) ---:
> memset() memory bandwidth: 121.22MB/s
> memset8() memory bandwidth: 275.94MB/s
> memcpy() memory bandwidth (perfectly aligned): 104.86MB/s
> memcpy16() memory bandwidth (perfectly aligned): 113.98MB/s
> memcpy() memory bandwidth (16-bit aligned): 70.37MB/s
> memcpy16() memory bandwidth (16-bit aligned): 101.31MB/s
> --- testing performance for random blocks (size 0-15 bytes) ---
> memset time: 0.410
> memset8 time: 0.260
> --- testing performance for random blocks (size 0-511 bytes) ---
> memset time: 2.360
> memset8 time: 1.140
>
> TODO:
> 1. implement memcpy8 function (direct replacement for memcpy)
> 2. provide big endian support (currently the code is little endian)
> 3. investigate possibilities for getting the best performance
>    on short buffer sizes
> 4. better testing in real world and on different ARM based devices
>
> I'm especially interested in getting feedback from running this code on
> different devices. It is quite possible that these functions are only
> optimal for OMAP1710, but do not provide any benefit on other devices.
>
> Currently this code improves Allegro game programming library
> performance quite a lot (in my not yet finished patch), but it might
> also be used for SDL. It would be interesting to see whether using these
> functions can improve GTK performance as well. In that case we could
> have a nice improvement in user interface responsiveness.
>
> As soon as a complete replacement for memcpy (memcpy8) is done, it can
> probably also be used as a patch for glibc to improve the performance of
> all programs automagically.
>
> Waiting for feedback, suggestions and test results on other ARM devices
> (not only Nokia 770).
>
>
> _______________________________________________
> maemo-developers mailing list
> maemo-developers at maemo.org
> https://maemo.org/mailman/listinfo/maemo-developers
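
For readers following the thread without the source at hand, the structure
described in the quoted message (single-byte writes up to a 16-byte boundary,
aligned 16-byte writes for the bulk, single-byte writes for the remainder) can
be sketched in portable C roughly as below. This is only an illustration under
stated assumptions, not the code from the tarball: the names sketch_memset8
and sketch_selfcheck are made up here, and a fixed-size 16-byte memcpy stands
in for the ARM STM burst store, which the original is said to write in
assembler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; NOT the memset8 from fastmem-arm. */
static void *sketch_memset8(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;
    unsigned char block[16];
    size_t i;

    /* Prepare one 16-byte block filled with the target byte. */
    for (i = 0; i < sizeof block; i++)
        block[i] = byte;

    /* Head: single bytes until p reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: one aligned 16-byte write per iteration. The fixed-size
     * memcpy is a portable stand-in (an assumption of this sketch) for
     * a single STM of four 32-bit registers. */
    while (n >= 16) {
        memcpy(p, block, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: whatever is left after the last full block. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}

/* Compare against the standard memset for every start offset 0..15 and
 * every length 0..48, including the bytes around the filled region.
 * Returns the number of cases checked. */
static int sketch_selfcheck(void)
{
    unsigned char got[96], want[96];
    int cases = 0;
    size_t off, len;

    for (off = 0; off < 16; off++) {
        for (len = 0; len <= 48; len++) {
            memset(got, 0x11, sizeof got);
            memset(want, 0x11, sizeof want);
            sketch_memset8(got + off, 0xAB, len);
            memset(want + off, 0xAB, len);
            assert(memcmp(got, want, sizeof got) == 0);
            cases++;
        }
    }
    return cases;
}
```

The self-check mirrors the idea of the correctness tests mentioned in the
quoted message: sweeping all start offsets and a range of lengths exercises
the head and tail paths, which is where an implementation of this shape is
most likely to go wrong. It says nothing about speed; whether a C version like
this gets anywhere near the quoted bandwidth figures depends on the compiler
and the device.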