[maemo-users] Memory corruption during WLAN use: detailed analysis and workaround

Wed Sep 12 01:06:04 EEST 2007

Hi!

This is going to be a bit long, but it may interest many Nokia 770
users, as I suspect that this problem is present on all 770s:

A few weeks ago I had a very bad spontaneous crash of my 770 making it
unbootable. The progress bar never showed up. I could reflash, but got
suspicious about what might have caused it. I started searching for a
memory checker and found:

http://pyropus.ca/software/memtester/

It compiles fine under scratchbox, here are my binaries:

http://freenet-homepage.de/tvogel/memtester-4.0.7-bin.tar.bz2

When I tested my device by running "./memtester 24" as root, I observed
memory corruption whenever there was WLAN activity: both while scanning
for networks and while data was actually being transmitted.

The problem does not always show up, as it depends on which memory
blocks the kernel assigns to memtester. The addresses where the errors
occur also vary, but since user-space processes live in virtual memory,
addresses have no fixed mapping to physical memory anyway.

After that, I wanted to find out where the problems actually come from
and how much memory is affected. I found the Running Unix Memory Tester
(rumt-0.2) from:

http://www.normalesup.org/~george/comp/rumt/

It does compile under scratchbox but needs an additional patch in order
to work correctly on the 770. Find the patch here:

http://freenet-homepage.de/tvogel/rumt-n770.patch

and my binaries here:

http://freenet-homepage.de/tvogel/rumt-bin.tar.bz2

Before I describe how to reproduce, here are my results:

Depending on the memory location at which the modules umac.ko and
cx3110x.ko get loaded, exactly two consecutive bytes at fixed physical
locations in memory get overwritten with zeroes every time there is WLAN
activity:

On a vanilla NOKIA770_2006SE_3.2006.49-2_PR_MR0, the modules get loaded
at (cat /proc/modules):
cx3110x 51420 0 - Live 0xbf03f000
umac 253316 1 cx3110x, Live 0xbf000000

In this case, the two bytes are at physical locations 0x1304b8b4 and
0x1304b8b5 (these addresses include an offset of 0x10000000, see
/proc/iomem).

When booting the same OS from an ext2-formatted MMC, the modules are:
cx3110x 51420 0 - Live 0xbf04e000
umac 253316 1 cx3110x, Live 0xbf00f000
ext2 43524 1 - Live 0xbf003000
mbcache 7716 0 - Live 0xbf000000

I.e. due to the two extra modules, umac.ko and cx3110x.ko are shifted by
0xf000. And surprise, surprise, the corrupted bytes also get shifted by
0xf000 to 0x1305a8b4 and 0x1305a8b5.
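
The shift can be checked with a little hexadecimal arithmetic. The
following Python sketch uses only the addresses quoted above; the
variable and function names are made up for illustration.

```python
# Load addresses reported by /proc/modules in the two boot configurations.
flash_boot = {"cx3110x": 0xbf03f000, "umac": 0xbf000000}
mmc_boot = {"cx3110x": 0xbf04e000, "umac": 0xbf00f000}

# Physical addresses of the corrupted bytes seen in each configuration.
flash_bad = [0x1304b8b4, 0x1304b8b5]
mmc_bad = [0x1305a8b4, 0x1305a8b5]


def module_shifts(before, after):
    """Return how far each module moved between two boots."""
    return {name: after[name] - before[name] for name in before}


shifts = module_shifts(flash_boot, mmc_boot)
byte_shifts = [b - a for a, b in zip(flash_bad, mmc_bad)]

for name, shift in shifts.items():
    print(f"{name}: shifted by {shift:#x}")
print("corrupted bytes shifted by", [hex(s) for s in byte_shifts])
```

Both modules and both corrupted bytes move by the same amount, which is
what ties the corruption to where the WLAN modules are loaded.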

Of course, I'd be very interested to know if this only occurs on my
device or if this is a common problem, so I'd be happy if some of you
could try to reproduce it. This procedure can be used:

- open two root shells on your 770
- start WLAN on the 770 and flood-ping your 770 ("ping -f") in order to
create network traffic
- in the first shell, start memtester with a size that shows the
corruption (the first argument is the size in MB)
- successively reduce the size until you no longer see corruption: this
makes it likely that the next alloc of 1 MB will get the block with the
bad bytes
- let memtester run and, in the second shell, start "urumt -p 256":
this will allocate 1 MB of memory, locate its physical addresses in
/dev/mem and start testing.

You'll get bit-precise location information about which bits get
corrupted and in which direction: + (1->0) or - (0->1).

(I used this procedure because memtester is much faster than urumt.)

I'd be interested to hear whether you also find this problem. If so, you
can try using my workaround:

My idea was to write a program that tries to allocate the bad memory
block, lock it and then just sleep forever. This would save other
processes from stepping into the trap.

You can find my source code at:

http://freenet-homepage.de/tvogel/blockbad.c

or the binary at:

http://freenet-homepage.de/tvogel/blockbad

The program takes the memory page to block as its argument. If urumt
reported 123ef:8bc, strip off the leading 1 and the last 3 digits, i.e.
use 0x23ef in this example. The program always allocates 32 MB of RAM
in order to search for the block; this is currently hardcoded. Once the
block is found, the other blocks are freed again. Of course, you
should stop memtester and urumt before running it.
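
The conversion from a urumt report to the blockbad argument can be
sketched in Python as follows. This is only an illustration: it assumes
that the report has the form page:offset in hexadecimal with 4 KiB
pages, and that the leading 1 corresponds to the 0x10000000 RAM offset
mentioned earlier. The function names are hypothetical and are not part
of urumt or blockbad.

```python
PAGE_SIZE = 0x1000      # assumed: 4 KiB pages, i.e. three hex digits of offset
RAM_BASE = 0x10000000   # physical RAM offset quoted above (see /proc/iomem)


def blockbad_page(report):
    """Turn a 'page:offset' string such as '123ef:8bc' into the page argument."""
    page_hex, offset_hex = report.split(":")
    page = int(page_hex, 16)
    offset = int(offset_hex, 16)
    if offset >= PAGE_SIZE:
        raise ValueError("offset does not fit in one page")
    # Dropping the leading 1 is the same as subtracting the RAM base
    # expressed in pages: 0x10000000 // 0x1000 == 0x10000.
    return page - RAM_BASE // PAGE_SIZE


def physical_address(report):
    """Combine page and offset into the full physical address."""
    page_hex, offset_hex = report.split(":")
    return int(page_hex, 16) * PAGE_SIZE + int(offset_hex, 16)


print(hex(blockbad_page("123ef:8bc")))
print(hex(physical_address("123ef:8bc")))
```

Under the same assumptions, the bad bytes found on my device at
0x1304b8b4 would be reported as 1304b:8b4 and give the argument 0x304b.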

If this works for you, you might consider starting blockbad 0x23ef at
the end of /etc/init.d/minircS.

Then you can check with "ps" if blockbad is running. If so, it found and
allocated the suspicious memory block. If not, it was out of luck and
didn't get that block assigned by the kernel.

Very interested in any feedback,

Tilman

PS. I had problems with some applications on my 770 (file manager and
bookmarks crashed) and it turned out the reason was a corrupted library
file (libhildonfm.so.1.0.0) which had erroneous zeroes at exactly the
suspicious offset 0x8b4 and 0x8b5!
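
A rough way to look for this pattern in other files is sketched below in
Python. It assumes 4 KiB pages and that file contents are cached
page-aligned, so that the in-page offsets 0x8b4 and 0x8b5 carry over to
file offsets; the function names are hypothetical. Pairs of zero bytes
are common in healthy binaries, so a hit is only a hint and should be
compared against a known-good copy of the file.

```python
PAGE_SIZE = 0x1000                 # assumed page size
SUSPECT_OFFSETS = (0x8b4, 0x8b5)   # in-page offsets of the corrupted bytes


def suspicious_pages(data, page_size=PAGE_SIZE, offsets=SUSPECT_OFFSETS):
    """Return indices of pages in which all suspect offsets hold zero bytes."""
    hits = []
    for page_start in range(0, len(data), page_size):
        page = data[page_start:page_start + page_size]
        # Skip a short final page that does not reach the suspect offsets.
        if len(page) > max(offsets) and all(page[o] == 0 for o in offsets):
            hits.append(page_start // page_size)
    return hits


def check_file(path):
    """Read a file and list the pages that match the pattern."""
    with open(path, "rb") as f:
        return suspicious_pages(f.read())
```

For example, check_file("libhildonfm.so.1.0.0") would list the page
indices worth comparing against a fresh copy of the library.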
