[maemo-developers] N800 & Video playback

From: Siarhei Siamashka siarhei.siamashka at gmail.com
Date: Tue May 1 11:51:50 EEST 2007
On Monday 30 April 2007 17:49, Daniel Stone wrote:

> > ARMv6 optimized YV12->YUV420 convertor is about 2.5x faster
> > than current code used in N800 xserver. So it should provide a nice
> > improvement for video :)
> Indeed.  Unfortunately this is slightly misleading in that it only shows
> the raw write speed.  RFBI can't deal with the sorts of speeds that your
> hyper-optimised version is pumping out, e.g.  So it's mainly just about
> cutting the latency into the critical path to low enough that it makes
> no difference.

The 'framebuffer' is just ordinary system memory, so converting the color
format and copying the data to the framebuffer runs at the same speed as
simulated in this test. RFBI performance only matters for the asynchronous
DMA transfer to the LCD controller, which adds no overhead for the ARM core
and runs while the ARM core is doing other work (decoding the next frame).
RFBI becomes a bottleneck only if the transfer to the LCD is still not
complete when the next frame is already decoded and ready to be displayed.
When playing video, the ARM core and the LCD controller are almost always
working in parallel on different tasks. I think I already explained these
details in [1].
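
The overlap argument can be written down as a tiny timing model. This is only
a sketch: the function names are hypothetical, and it assumes the DMA transfer
of one frame runs completely in parallel with the ARM work on the next one.

```c
#include <assert.h>

/* Per-frame period when the LCD transfer is an asynchronous DMA that
 * overlaps with CPU work (decode + color conversion) on the next frame.
 * Hypothetical model: the slower of the two stages sets the pace. */
double frame_period_overlapped(double cpu_ms, double transfer_ms)
{
    assert(cpu_ms >= 0.0 && transfer_ms >= 0.0);
    return cpu_ms > transfer_ms ? cpu_ms : transfer_ms;
}

/* Per-frame period if the CPU had to wait for every transfer to finish. */
double frame_period_serial(double cpu_ms, double transfer_ms)
{
    assert(cpu_ms >= 0.0 && transfer_ms >= 0.0);
    return cpu_ms + transfer_ms;
}
```

With made-up numbers such as 40 ms of CPU work and 25 ms of transfer per
frame, the overlapped period is 40 ms, so the transfer speed is invisible
until it exceeds the CPU time per frame.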

Since the xomap server is probably compiled for Thumb, I also compiled this
test program for the Thumb instruction set and got the results below (Thumb
is slower than normal ARM). I also fixed a bug in the test program that made
the memory throughput statistics slightly off, so the following results
should be final now:

# gcc -o test_colorconv -O2 -mthumb test_colorconv.c arm_colorconv.S

# ./test_colorconv
test: 'yv12_to_yuv420_xomap',
time=9.493s, speed=25.394MP/s, memwritespeed=38.091MB/s
test: 'yv12_to_yuv420_xomap_nobranch',
time=8.516s, speed=28.306MP/s, memwritespeed=42.460MB/s
test: 'yv12_to_yuv420_line_arm_',
time=4.736s, speed=50.895MP/s, memwritespeed=76.343MB/s
test: 'yv12_to_yuv420_line_armv5_',
time=3.395s, speed=71.011MP/s, memwritespeed=106.517MB/s
test: 'yv12_to_yuv420_line_armv6_',
time=2.876s, speed=83.817MP/s, memwritespeed=125.726MB/s
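
The two rates in each result line are tied together by the output format:
memwritespeed is 1.5 times speed, which matches 12 bits per pixel of YUV420
output. A minimal sketch that recomputes the derived figures from the numbers
above (the function names are made up for illustration):

```c
#include <assert.h>

/* YUV420 output carries 12 bits (1.5 bytes) per pixel, so the write
 * rate in MB/s is 1.5 times the pixel rate in MP/s. */
double yuv420_write_mb_per_s(double mpixels_per_s)
{
    return mpixels_per_s * 3.0 / 2.0;
}

/* How many times faster a run is than a baseline over the same workload. */
double speedup(double baseline_s, double optimized_s)
{
    assert(optimized_s > 0.0);
    return baseline_s / optimized_s;
}
```

For example, yuv420_write_mb_per_s(25.394) gives about 38.09, and
speedup(9.493, 2.876) gives about 3.3 for the ARMv6 routine against the
Thumb-compiled xomap code.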

If you remember the information posted in [2], mplayer used 12 seconds 
for video output when playing Nokia_N800.avi  (it contains the same number 
of frames of the same size as used in this test for benchmarking). Color
format conversion code taken from xserver and compiled for thumb uses
9.5 seconds for doing the same amount of work.

So now the test results are consistent: when doing video output, most of the
ARM core cycles are spent in this 'omapCopyPlanarDataYUV420' function.
Optimizing it using 'yv12_to_yuv420_line_armv6' will definitely have a huge
effect: video output overhead when using Xv will be at least halved, leaving
more CPU resources for video decoding.
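
A rough back-of-the-envelope version of the "at least halved" estimate, as a
sketch only. It assumes that the rest of the video output path costs the same
after the change and that the benchmark timings carry over to the xserver
unchanged; the function name is hypothetical.

```c
#include <assert.h>

/* Estimated video-output time after swapping the conversion routine,
 * assuming everything else in the output path costs the same as before. */
double estimated_vo_seconds(double vo_total_s, double old_conv_s,
                            double new_conv_s)
{
    assert(vo_total_s >= old_conv_s);
    return vo_total_s - old_conv_s + new_conv_s;
}
```

Plugging in the figures above, estimated_vo_seconds(12.0, 9.493, 2.876) comes
out at roughly 5.4 seconds, which is less than half of the original 12.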

> > That's fine. Now I'm waiting for further instructions :) Should I try to
> > prepare a complete patch for xserver? I'm really interested in getting
> > this optimization into xserver as it would help to play high resolution
> > videos. If you have any extra questions about the code or anything
> > else (for example I wonder what free license would be appropriate
> > for it), don't hesitate to contact me.
> If you wanted to prepare a complete patch for the server, that would be
> great, as I don't have time to get to it right now (trying to finish off
> the merge with upstream, among others).  As for the license, just the
> standard MIT boilerplate in hw/kdrive/omap/* is fine, but replace Nokia
> Corporation/Daniel Stone with Siarhei Siamaskha, obviously.
> > I did not try to build xserver sources yet as I did not have enough time
> > for that and xserver requires quite a number of build dependencies. Can
> > you share some tips and tricks about maemo xserver development? Is it
> > difficult to compile (do I need any extra build scripts, tools, or
> > configuration options) and install on N800 (is it safe to upgrade
> > xserver on N800 from .deb file)?
> It's completely safe to upgrade from a deb if it's not broken.  If you
> set up a standard Maemo build environment and run apt-get source
> xorg-server and apt-get build-dep xorg-server, it should work just fine,
> in theory.
> I don't have any tips, per se.  Once I get it all integrated it'll be in
> git, but for now, the only public source is the packages.

OK, thanks. It may take some time though. I'm still using an old scratchbox
with the mistral SDK here (I have not had enough free time to upgrade yet).
Until I clean up my scratchbox mess, I can only provide an untested patch, if
anybody courageous wants to try building it :)

> > I also tried to use YUV420 on Nokia 770, but it did not work well.
> > According to Epson, this format should be supported by hardware. Also
> > there is a constant OMAPFB_COLOR_YUV420 defined in omapfb.h in Nokia 770
> > kernel sources. But actually using YUV420 was not very successful. Full
> > screen update 800x480 in YUV420 seems to deadlock Nokia 770. Playback of
> > centered 640x480 video in YUV420 format was a bit better, at least I
> > could decipher what's on the screen. But anyway, it looked like an old
> > broken TV :) Image was not fixed but floating up and down, there were
> > mirrors, tearings, some color distortion, etc. After video playback
> > finished, the screen remained in inconsistent state with a striped
> > garbage displayed on it. Starting video playback with YUY2 output fixed
> > it. But anyway, looks like YUV420 is not supported properly in the
> > framebuffer driver from the latest OS2006 kernel. That's bad, it could
> > provide ~30% improvement in video output performance for Nokia 770. Maybe
> > upgrading framebuffer driver can fix this issue (and add tearsync
> > support).
> SoSSI is relatively quick, so you won't see much of a bandwidth win from
> using YUV420 over YUV422.  Aside from that, I don't know, though.

I do know that I will get this 30% improvement for video output, considering
all the information I have and initial test results. I just need an updated
Nokia 770 kernel with proper YUV420 support. I also hope that this kernel
(if it becomes available) will eventually be included in one of the next
"unofficial" hackers edition firmware updates.

Anyway, after failing to use YUV420 with direct framebuffer access on the
Nokia 770, I tried the same code on the N800 and, surprisingly, it worked
perfectly. I only had to figure out the framebuffer layout, which is actually
quite simple. When performing YUV420 screen updates, the framebuffer can be
treated as having the same layout as in RGB565 mode (two bytes for each
pixel). Any rectangular area within this 16bpp framebuffer can be updated in
YUV420 mode: each line of pixels in the rectangle is filled with YUV420 data.
This YUV420 data is of course shorter than the line itself (the end of the
line is left unused), but the screen update ioctl works fine. It works in a
similar way to pixel doubling, where a rectangular block of pixels is expanded
to twice its size and covers much more area on the screen than in the
framebuffer.
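
The layout described above can be sketched in C as follows. This is an
illustration under stated assumptions, not the actual mplayer or xserver code:
all function names are hypothetical, the rectangle width is assumed to be
even, the source lines are assumed to be already converted to the packed
YUV420 format the hardware expects, and the screen update ioctl itself is not
shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes of packed YUV420 data for one line of `width` pixels (12 bpp). */
size_t yuv420_line_bytes(size_t width)
{
    return width * 3 / 2;
}

/* Byte offset of the start of line `row` of a rectangle whose top-left
 * corner is (x, y), with the framebuffer addressed as 16 bpp (RGB565
 * layout, two bytes per pixel). */
size_t fb_line_offset(size_t fb_width, size_t x, size_t y, size_t row)
{
    return ((y + row) * fb_width + x) * 2;
}

/* Copy already-converted YUV420 lines into the rectangle. Each line
 * fills only the first 3/4 of the rectangle's 16 bpp line; the tail
 * of the line is left untouched. */
void fill_rect_yuv420(uint8_t *fb, size_t fb_width,
                      size_t x, size_t y, size_t w, size_t h,
                      const uint8_t *src)
{
    size_t line = yuv420_line_bytes(w);
    size_t row;

    assert(x + w <= fb_width);
    for (row = 0; row < h; row++)
        memcpy(fb + fb_line_offset(fb_width, x, y, row),
               src + row * line, line);
}
```

For a 640 pixel wide video centered on the 800 pixel wide screen, each line
carries 960 bytes of YUV420 data inside a 1280 byte slot, and consecutive
lines still start 1600 bytes apart.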

Well, anyway, everything worked perfectly and I could play 640x480 video 
on N800 with the following statistics:

VIDEO:  [DIVX]  640x480  12bpp  23.976 fps  886.7 kbps (108.2 kbyte/s)
BENCHMARKs: VC:  87,757s VO:   8,712s A:   1,314s Sys:   3,835s =  101,618s
BENCHMARK%: VC: 86,3592% VO:  8,5736% A:  1,2932% Sys:  3,7740% = 100,0000%
BENCHMARKn: disp: 2044 (20,11 fps)  drop: 355 (14%)  total: 2399 (23,61 fps)

As you see, mplayer took 8.712 seconds to display 2044 VGA resolution frames.
If we do the necessary calculations (2044 * 640 * 480 / 8.712s), that's
72 million pixels per second, quite close to the limit of
'yv12_to_yuv420_line_armv6', so this function is the only major contributor
to video output time. Video output took much less time than decoding, which
shows that video output overhead can be reduced to a minimum (tearsync was
not used in this test).
The same file played with Xv video output and also tearsync disabled 
(XV_OMAP_VSYNC explicitly set to 0):

BENCHMARKs: VC:  77,176s VO:  19,550s A:   1,880s Sys:   3,851s =  102,457s
BENCHMARK%: VC: 75,3260% VO: 19,0809% A:  1,8346% Sys:  3,7586% = 100,0000%
BENCHMARKn: disp: 1637 (15,98 fps)  drop: 762 (31%)  total: 2399 (23,41 fps)

Performing the calculation 1637 * 640 * 480 / 19.550s, we get 26 million
pixels per second, which is also more or less consistent with the
'yv12_to_yuv420_xomap' benchmark statistics.
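
The pixel rates for both runs can be reproduced from the mplayer benchmark
counters with a few lines of C. A minimal sketch; the helper name is made up,
and it simply divides the displayed pixels by the VO time:

```c
#include <assert.h>

/* Displayed pixels per second, in millions, from mplayer's benchmark
 * counters: frames actually shown (disp) and time spent in video
 * output (VO). */
double vo_mpixels_per_s(unsigned frames, unsigned width, unsigned height,
                        double vo_seconds)
{
    assert(vo_seconds > 0.0);
    return (double)frames * width * height / vo_seconds / 1e6;
}
```

vo_mpixels_per_s(2044, 640, 480, 8.712) gives about 72.1 for the direct
framebuffer run, and vo_mpixels_per_s(1637, 640, 480, 19.550) gives about
25.7 for the Xv run.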

When tearsync comes into action, everything gets a bit more complicated. I'm
still investigating its impact on video playback performance.

Well, I'm going to continue working on YUV420 direct framebuffer video output
for N800 for the next build of mplayer, as this code could also be used on
Nokia 770 if it gets YUV420 support. Also, while this method of video output
does not support hardware scaling, it seems to be quite good for unscaled VGA
resolution videos and may serve as a temporary solution until we get an
upgrade to a new xserver with yv12->yuv420 conversion optimizations.

1. http://maemo.org/pipermail/maemo-developers/2007-March/009202.html
2. http://maemo.org/pipermail/maemo-developers/2007-April/009925.html
