[maemo-developers] Optimized memory copying functions for Nokia 770
From: Siarhei Siamashka Siarhei.Siamashka at gmail.com
Date: Tue Mar 14 09:55:26 EET 2006
- Previous message: [maemo-developers] Re: Virtual Keyboard and hiding "AutoComplete" content
- Next message: [maemo-developers] Optimized memory copying functions for Nokia 770
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello All,

Here are optimized memory copying functions for the Nokia 770 (memset is more than twice as fast, memcpy improves by about 10-40% depending on the relative alignment of the data blocks):

http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz

These functions were created as an experiment in getting maximum memory bandwidth on the Nokia 770 (powered by TI OMAP1710), and also as a way of learning ARM assembler in the process. Maximum memory bandwidth utilization is needed for 2D games and probably for other applications which process a lot of multimedia data. I'm particularly interested in getting the best performance for the Allegro game programming library (http://alleg.sourceforge.net) on Nokia 770, and that was the motivation for writing this code.

After a few experiments with reading and writing memory using different data sizes for each memory access operation, it appears that using bigger chunks is much more important for writing than for reading. Writing 16 bits per memory access is usually twice as fast as writing 8 bits, and 32-bit access is again twice as fast as 16-bit access. There is no such significant performance degradation when reading in smaller chunks, so optimizing reading seems to be less important.

After some other half-empirical experiments with writing even bigger chunks, it seems that the best memory bandwidth is achieved by using 16-byte burst writes aligned on a 16-byte boundary using the STM instruction. This seems to provide at least twice the memory bandwidth utilization of the standard 'memset' function on Nokia 770.

Having such fantastic results, I decided to try making some optimized functions that can serve as replacements for the standard memset/memcpy functions. The aligned 16-byte write with the STM instruction is the core part of all these functions; all the rest of the code deals with leading/trailing unaligned data chunks.
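To illustrate the head/body/tail structure described above, here is a minimal portable C sketch. It is not the code from the tarball: the name sketch_memset8 is hypothetical, and where the real functions use hand-written ARM assembly with STM, this sketch uses a fixed-size 16-byte memcpy for the aligned body and leaves it to the compiler to choose the store instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only; not the memset8 from the tarball. */
void *sketch_memset8(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;
    unsigned char pattern[16];
    size_t i;

    /* Prepare one 16-byte chunk filled with the requested byte. */
    for (i = 0; i < sizeof pattern; i++)
        pattern[i] = byte;

    /* Head: single bytes until p is 16-byte aligned (or n runs out). */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: aligned 16-byte chunks. The original does this step with
     * an STM burst write; a fixed-size memcpy stands in for it here. */
    while (n >= 16) {
        assert(((uintptr_t)p & 15) == 0);
        memcpy(p, pattern, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: remaining 0-15 bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```

Calling sketch_memset8(buf + 3, 0x5C, 41) exercises all three parts: an unaligned head, some aligned 16-byte chunks and a short tail (how many bytes fall into each part depends on the address of buf).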
The code implements the following functions (see more detailed comments in the code):

memset8, memset16, memset32 - replacements for memset, optimized for different alignment
memcpy16, memcpy32 - replacements for memcpy, optimized for different alignment

A testing framework is included, which makes it possible to check that this code provides valid results and is also really fast. In order to run the tests, the file should be compiled as C source with the FASTMEM_ARM_TEST_FRAMEWORK macro defined.

Requirements for running this code: little endian ARM v4 compatible CPU

Results from my Nokia 770 are the following:

--- running correctness tests ---
all the correctness tests passed

--- running performance tests (memory bandwidth benchmark) ---
memset() memory bandwidth: 121.22MB/s
memset8() memory bandwidth: 275.94MB/s
memcpy() memory bandwidth (perfectly aligned): 104.86MB/s
memcpy16() memory bandwidth (perfectly aligned): 113.98MB/s
memcpy() memory bandwidth (16-bit aligned): 70.37MB/s
memcpy16() memory bandwidth (16-bit aligned): 101.31MB/s

--- testing performance for random blocks (size 0-15 bytes) ---
memset time: 0.410
memset8 time: 0.260

--- testing performance for random blocks (size 0-511 bytes) ---
memset time: 2.360
memset8 time: 1.140

TODO:
1. implement memcpy8 function (direct replacement for memcpy)
2. provide big endian support (currently the code is little endian)
3. investigate possibilities for getting the best performance on short buffer sizes
4. better testing in the real world and on different ARM based devices

I'm especially interested in getting feedback from running this code on different devices. It is quite possible that these functions are only optimal for OMAP1710 and do not provide any benefit on other devices.

Currently this code improves the performance of the Allegro game programming library quite a lot (in my not yet finished patch), but it might also be used for SDL. It would be interesting to see whether using these functions can improve GTK performance as well.
In that case we could have a nice improvement in user interface responsiveness. As soon as a complete replacement for memcpy (memcpy8) is done, it could probably also be used as a patch for glibc to improve the performance of all programs automagically.

Waiting for feedback, suggestions and test results on other ARM devices (not only Nokia 770).
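For anyone who wants to compare numbers from another device, a memory bandwidth figure in MB/s is simply the number of bytes written divided by the elapsed time. The following is only a rough sketch of such a measurement, not the test framework from the tarball. The names bandwidth_mb_per_s and measure_fill_bandwidth are hypothetical, 1 MB is assumed to be 1,000,000 bytes, and timing uses the standard clock() function, which may be coarse on some systems.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumption: 1 MB = 1,000,000 bytes. */
double bandwidth_mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}

/* Any function with the same signature as memset can be measured. */
typedef void *(*fill_fn)(void *, int, size_t);

/* Fills a buffer of buf_size bytes 'passes' times and returns the
 * resulting bandwidth in MB/s, or -1.0 if it cannot be measured
 * (allocation failure, empty buffer, or time below clock resolution). */
double measure_fill_bandwidth(fill_fn fill, size_t buf_size, int passes)
{
    unsigned char *buf;
    volatile unsigned char sink;
    clock_t start, end;
    double seconds;
    int i;

    if (buf_size == 0 || passes <= 0)
        return -1.0;
    buf = malloc(buf_size);
    if (buf == NULL)
        return -1.0;

    fill(buf, 0, buf_size);           /* warm-up pass, touches all pages */

    start = clock();
    for (i = 0; i < passes; i++)
        fill(buf, i, buf_size);
    end = clock();

    sink = buf[buf_size / 2];         /* read back so the fills are used */
    (void)sink;
    free(buf);

    seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return -1.0;
    return bandwidth_mb_per_s((double)buf_size * passes, seconds);
}
```

For example, measure_fill_bandwidth(memset, 4 * 1024 * 1024, 100) would time the standard memset; a replacement function with the same signature can be passed in the same way to get a comparable figure.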