[maemo-developers] N800 & Video playback

From: Siarhei Siamashka siarhei.siamashka at gmail.com
Date: Mon Apr 30 14:27:49 EEST 2007
On Friday 27 April 2007 04:43, Daniel Stone wrote:

> > I'll make a really optimized version of YV12 -> YUV420 convertor on this
> > weekend (removing branch is good, but I feel that it can be improved
> > more) and will try to use it on Nokia 770, any extra video performance
> > improvement will be useful there. I hope that the framebuffer driver on
> > Nokia 770 supports YUV420 color format properly.
> I don't think Tornado supports YUV420, but I can check in the specs
> tomorrow.  My better C version basically does two macroblocks at a time,
> ensuring all 32-bit writes (which _really_ helps over 16-bit writes,
> believe me).  This eliminates the branch, since your surface is
> guaranteed to be word-aligned, so if you do all 32-bit writes, you can
> just drop the branch as you know every write will be aligned.
> This will be really fast.

The optimized YV12 -> YUV420 convertor is done. The sources can be found here:

Take a look at the 'arm_colorconv.h' and 'arm_colorconv.S' files. There is
also a test program ('test_colorconv') which checks that everything works
correctly and measures how fast it is:

~ $ ./test_colorconv
test: 'yv12_to_yuv420_xomap', 
time=7.332s, speed=32.878MP/s, memwritespeed=43.838MB/s

test: 'yv12_to_yuv420_xomap_nobranch', 
time=5.679s, speed=42.448MP/s, memwritespeed=56.597MB/s

test: 'yv12_to_yuv420_line_arm_', 
time=4.706s, speed=51.223MP/s, memwritespeed=68.297MB/s

test: 'yv12_to_yuv420_line_armv5_', 
time=3.356s, speed=71.824MP/s, memwritespeed=95.765MB/s

test: 'yv12_to_yuv420_line_armv6_', 
time=2.826s, speed=85.298MP/s, memwritespeed=113.731MB/s

The ARMv6 optimized YV12->YUV420 convertor is about 2.6x faster (2.826s
versus 7.332s in the test above) than the current code used in the N800
xserver. So it should provide a nice improvement for video :)

I doubt that your better C version can beat it or even get close. There
are two important optimizations in this code:
1. Cache prefetch with the PLD instruction (added in the '_armv5' version),
which boosts performance to about 70 megapixels per second. The inner loop
is unrolled to process 32 pixels per iteration (the cache line size is 32
bytes on ARM, so such unrolling is convenient). This is the most important
improvement. You can try using __builtin_prefetch() from C code to do the
same.
2. The use of the ARMv6 instruction REV16 to swap the bytes within the high
and low 16-bit halves of a register. This optimization was added in the
'_armv6' version and boosted performance further, to about 85 megapixels
per second. This optimization is very unlikely, probably impossible, in a
C version.
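
As a rough illustration of the two points above (this is not the code from
'arm_colorconv.S'), here is a minimal C sketch. It assumes GCC or a
compatible compiler for __builtin_prefetch(). The function names and the
prefetch distance of 64 bytes are made up for the example.
swap_bytes_in_halfwords() computes the same value that REV16 produces in a
single instruction; whether a compiler actually emits REV16 for it is not
assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swap the two bytes inside each 16-bit half of x.
 * This is the value REV16 produces; here it is spelled out in plain C. */
static uint32_t swap_bytes_in_halfwords(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}

/* Hypothetical helper: byte-swap a buffer, 8 words (32 bytes) per
 * iteration, which matches one 32-byte cache line per pass. */
static void swap16_buffer(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    assert(nwords % 8 == 0);
    for (size_t i = 0; i < nwords; i += 8) {
        /* Request the data two 32-byte blocks ahead of the current one.
         * The distance is an assumption and would need tuning. */
        if (i + 16 < nwords)
            __builtin_prefetch(&src[i + 16]);
        for (size_t k = 0; k < 8; k++)
            dst[i + k] = swap_bytes_in_halfwords(src[i + k]);
    }
}
```

The inner loop of 8 words is written as a loop for readability; a tuned
version would unroll it by hand, as the assembly version does.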

I was a bit wrong about the YUV420 format in my previous post.

Suppose we have a planar YV12 image with the following data:
Y plane: Y1 Y2 Y3 Y4 ...
U plane: U1 __ U2 __ ...

Normal YUV420 (according to the pictures in the Epson docs) would be the
following:
U1 Y1 Y2 U2 Y3 Y4 ...

But it appears (most likely because of the 16-bit interface and some endian
differences between ARM and the Epson chip) that each pair of bytes is
swapped, and we actually get the following somewhat weird layout:
Y1 U1 U2 Y2 Y4 Y3 ...

The ARMv6 instruction REV16 is very handy for this byte swapping.
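
The layout above can be written down as a slow, byte-by-byte reference in C.
This is only a sketch for checking the byte order, not the optimized code:
the function name is hypothetical, it handles a single line with one chroma
plane exactly as in the example above, and it assumes the width is a
multiple of 4 pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference packer for one line.
 * For every 4 luma bytes and 2 chroma bytes it builds the "normal" group
 *   U1 Y1 Y2 U2 Y3 Y4
 * and then swaps each pair of bytes, giving
 *   Y1 U1 U2 Y2 Y4 Y3 */
static void pack_line_swapped(uint8_t *dst, const uint8_t *y,
                              const uint8_t *c, size_t width)
{
    assert(width % 4 == 0);
    for (size_t i = 0; i < width; i += 4) {
        const uint8_t normal[6] = {
            c[i / 2], y[i], y[i + 1], c[i / 2 + 1], y[i + 2], y[i + 3]
        };
        for (size_t k = 0; k < 6; k += 2) {
            dst[k] = normal[k + 1];
            dst[k + 1] = normal[k];
        }
        dst += 6; /* 6 output bytes per 4 pixels */
    }
}
```

A reference like this is mainly useful for comparing its output with the
output of a faster implementation on the same input.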

The assembly sources for the ARMv6 code look a bit messy because the
instructions had to be reordered to schedule them correctly and avoid
ARM11 pipeline interlocks, which hurt performance. Now this code is really
fast, with few or no interlocks in the inner loop. Also, gcc does not do
a good job of optimizing code on ARM, so a C implementation would be at
a disadvantage here too.

By the way, the benchmarks posted in my previous message should be
discarded. I did not initialize the source buffers that time, and it looks
like the ARM11 CPU has some 'cheat' which treats empty data pages in a
special way and avoids reading from memory. So the numbers in the previous
benchmark were higher than they should be. This is now corrected.

As for the other possible Xv optimizations: you mentioned that the fallback
code is not important at all. But imagine 640x480 video playback in windowed
mode. Decoding it requires quite a lot of resources, and additionally
scaling it down with slow fallback code would be the finishing blow. Besides,
a solution for this problem (a fast JIT accelerated YV12->YUY2 scaler)
already exists. I can also modify this scaler to support YV12->YUV420
scaling. An interesting thing here is that this scaler could also be used
by the xserver to solve graphics bus bandwidth issues. Imagine that we have
some high resolution video with a high framerate which exceeds the graphics
bus capabilities. In this case the video can be downscaled in software,
using the JIT scaler, to a lower resolution before the data is sent to the
LCD controller. What do you think?
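
To make the downscaling idea concrete, here is a minimal C sketch of
shrinking one 8-bit plane before it goes to the display. It uses plain
nearest-neighbour sampling, which is a stand-in chosen for brevity and is
not how the JIT scaler works. The function name and parameters are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nearest-neighbour downscale of one 8-bit plane.
 * Each destination pixel copies the source pixel at the proportional
 * position; no filtering is done. */
static void downscale_plane_nearest(uint8_t *dst, size_t dst_w, size_t dst_h,
                                    const uint8_t *src, size_t src_w,
                                    size_t src_h)
{
    assert(dst_w > 0 && dst_h > 0 && dst_w <= src_w && dst_h <= src_h);
    for (size_t dy = 0; dy < dst_h; dy++) {
        const uint8_t *src_row = src + (dy * src_h / dst_h) * src_w;
        for (size_t dx = 0; dx < dst_w; dx++)
            dst[dy * dst_w + dx] = src_row[dx * src_w / dst_w];
    }
}
```

Halving both dimensions this way cuts the amount of data sent over the bus
to a quarter, at the cost of image quality.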

> Sure.  Unfortunately my job has other functions than to make video
> decoding really, really fast, so I'm happy to merge, review, offer
> feedback, and help you out where I can be useful, but I can't throw much
> time at this myself.

That's fine. Now I'm waiting for further instructions :) Should I try to
prepare a complete patch for the xserver? I'm really interested in getting
this optimization into the xserver, as it would help to play high resolution
videos. If you have any extra questions about the code or anything
else (for example, I wonder what free license would be appropriate
for it), don't hesitate to contact me.

I have not tried to build the xserver sources yet, as I did not have enough
time for that and the xserver requires quite a number of build dependencies.
Can you share some tips and tricks about maemo xserver development? Is it
difficult to compile (do I need any extra build scripts, tools, or
configuration options) and install on the N800 (is it safe to upgrade the
xserver on the N800 from a .deb file)?

I also tried to use YUV420 on the Nokia 770, but it did not work well.
According to Epson, this format should be supported by the hardware. There is
also a constant OMAPFB_COLOR_YUV420 defined in omapfb.h in the Nokia 770
kernel sources. But actually using YUV420 was not very successful. A full
screen 800x480 update in YUV420 seems to deadlock the Nokia 770. Playback of
a centered 640x480 video in YUV420 format was a bit better; at least I could
decipher what was on the screen. But it still looked like an old broken TV :)
The image was not stable but floating up and down, and there were mirrored
images, tearing, some color distortion, etc. After video playback finished,
the screen remained in an inconsistent state with striped garbage displayed
on it. Starting video playback with YUY2 output fixed it. So it looks like
YUV420 is not supported properly in the framebuffer driver from the latest
OS2006 kernel. That's bad, as it could provide a ~30% improvement in video
output performance for the Nokia 770. Maybe upgrading the framebuffer driver
can fix this issue (and add tearsync support).
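
The ~30% estimate is in line with the packed frame sizes. The YUV420 layout
described earlier in this message takes 6 bytes per 4 pixels (12 bits per
pixel), while YUY2 takes 16 bits per pixel. The small C sketch below only
does this arithmetic; the function name is hypothetical, and treating the
transfer size as the only cost is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size in bytes of one packed frame. */
static size_t frame_bytes(size_t width, size_t height, size_t bits_per_pixel)
{
    assert((width * height * bits_per_pixel) % 8 == 0);
    return width * height * bits_per_pixel / 8;
}
```

For an 800x480 frame this gives 768000 bytes in YUY2 and 576000 bytes in
YUV420, that is 25% less data per frame, or about 33% more frames for the
same bus bandwidth.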
