[maemo-developers] N800 & Video playback

Fri Apr 27 03:14:43 EEST 2007

On Tuesday 24 April 2007 12:36, Daniel Stone wrote:

> > My main performance concern is exactly about this
> > 'omapCopyPlanarDataYUV420' function. My experience from Nokia 770 video
> > output code optimization shows that optimization effect can be really
> > huge (it was 1.5x improvement on Nokia 770 for unscaled YV12 -> YUY2
> > conversion going from a simple loop in C to optimized assembly code, I
> > provided a link to the relevant code in my previous post). But N800 code
> > can be probably improved more because now it contains unnecessary branch
> > in the inner loop and branches are expensive on long pipeline CPUs. Such
> > color format conversion performance should be comparable to that of
> > memcpy if done right (it is about half memcpy speed on Nokia 770 for
> > unscaled YV12 -> YUY2 conversion).
>
> Right, the branch is a problem, and as I said, the branch can be avoided
> and the writes optimised to be three 32-bit writes for two macroblocks,
> instead of two 32-bit writes and two 16-bit writes.

I did not have much free time to do complete tests, but initial benchmarks
show that actually even removing this branch and using three 16-bit writes
improves performance quite significantly. The test program is here:
http://ufo2000.sourceforge.net/files/yuv420test.c

It produces the following results if compiled with optimization  
options "-O3 -fomit-frame-pointer -mcpu=arm1136j-s":

# ./yuv420test
test: 'yv12toyuv420_xomap', time=5.220, memory bandwidth=61.576MB/s
test: 'yv12toyuv420_yv12toyuv420_branch_removed', time=3.503, memory bandwidth=91.754MB/s

An interesting thing about this test is that it uses 2504 frames of 400x240
each, the same number of frames as the Nokia_N800.avi video has.
mplayer spent 12.365s on video output when playing this video, while
YV12->YUV420 conversion alone takes 5.220s as benchmarked in
this test. So color conversion currently accounts for somewhat less than half
of the time spent on video output at this resolution. Some tests with higher
resolution videos will be done later.

As you can see from the benchmark results, we already get a 1.5x improvement
in color conversion just from the trivial removal of a piece of
redundant code. Was that branch in the code supposed to improve
performance? It seems to have had quite the opposite effect.
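
For reference, the speedup and the share of video output time follow directly from the timings quoted above (5.220s and 3.503s from the benchmark, 12.365s from mplayer). A minimal sketch of that arithmetic; the function names are made up for illustration and are not part of yuv420test.c:

```c
/* Speedup factor: how many times faster the new code is than the old one.
 * Both arguments are elapsed times in seconds for the same amount of work. */
double speedup(double old_time, double new_time)
{
    return old_time / new_time;
}

/* Fraction (0.0 .. 1.0) of a total time that one part accounts for. */
double share(double part, double total)
{
    return part / total;
}
```

speedup(5.220, 3.503) comes out just under 1.5, and share(5.220, 12.365) is a little over 0.42.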

I'll make a really optimized version of the YV12 -> YUV420 converter this
weekend (removing the branch is good, but I feel that it can be improved
more) and will try to use it on Nokia 770, where any extra video performance
improvement will be useful. I hope that the framebuffer driver on
Nokia 770 supports the YUV420 color format properly.

By the way, does anybody know if it is possible to enable tearsync support
on Nokia 770 (by backporting some changes from N800 kernel or in some 
other way)?

> However, I don't think the lessons from the 770 are necessarily
> _directly_ applicable to the N800: on the 770, our bottleneck is
> decoding speed.  The bottleneck on the N800 is exactly the opposite:
> video output.

I can't agree here. Memory speed is actually a lot faster on N800, the only
trouble is graphics bus performance, but sending data to LCD controller
through this bus does not introduce any load on ARM core and it can freely
decode the next frame of video at the same time. At least this was the case
with the previous version of firmware (I did not have enough time to see what
was changed in framebuffer API and do any video tests with it).

But color conversion is done by ARM core and it consumes precious cpu 
cycles which could be used for decoding higher resolution/bitrate video.
Optimizing color conversion will improve video performance. The
improvement will most likely be only a few percent overall, but
every little bit helps.

> Bear in mind that, unless you explicitly disable it (the Xv attribute is
> something like XV_OMAP_VSYNC), the X server _will_ flush all pending
> writes before the next frame is put through.  Else you get tearing,
> because you can be halfway through an update, and writing the next frame
> to the framebuffer, so which frame is being picked up, changes halfway
> through.
>
> Try forcing XV_OMAP_VSYNC (or whatever it is) to 0, and comparing the
> results.

OK, thanks, I'll try this test too and check if it affects Xv performance.
But I thought that using a 12bpp color format _and_ sending only as much
data as needed should solve the problem. Of course 800x480 * 16bpp * 30fps
would be 23MB/s and it is too much. But for example 640x480 * 12bpp * 30fps =
13.8MB/s. Is the graphics bus fast enough to handle this?

Or is there some other problem I'm not aware of?
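
To recompute estimates like these for other resolutions, a small helper along the following lines can be used. It is only a sketch: the function name is made up, and it counts raw pixel data only, ignoring any bus protocol overhead.

```c
#include <stdint.h>

/* Bytes per second needed to transfer uncompressed video frames.
 * bits_per_pixel is 16 for a 4:2:2 format such as YUY2 and 12 for a
 * 4:2:0 format. The result is truncated to whole bytes. */
uint64_t video_bytes_per_second(uint32_t width, uint32_t height,
                                uint32_t bits_per_pixel, uint32_t fps)
{
    uint64_t bits_per_frame = (uint64_t)width * height * bits_per_pixel;
    return bits_per_frame * fps / 8;
}
```

Called with (800, 480, 16, 30) and (640, 480, 12, 30), it gives the byte rates for the two cases discussed above; divide by 1000000 to get MB/s.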

> > N800 is almost able to play VGA resolution videos properly, it only needs
> > a bit more optimizations. Color format conversion performance for video
> > output is one of the important things that can be improved.
>
> I don't believe it's on the critical path.  The optimisation I mentioned
> before will bring us up to the point where any improvement that we can
> make in that conversion will be eclipsed by the time taken to send it
> over the bus, I believe.  But I can't prove that.

Well, I believe that every optimization which can provide a visible
improvement (at least a few percent) is worth it. Optimizations are
cumulative: a number of small 1-3% improvements added together result
in a significant performance boost.

The opposite is also true, if one is lazy and adds inefficient code here and
there, these small performance regressions accumulate and the program 
starts to crawl :)

Of course everything depends on the task that is being solved, sometimes
optimizations do not make much sense and are too expensive. For example,
waiting 10% longer to get some data processed may not even be noticeable.
But video should be decoded in realtime, and the same 10% of performance
difference may have a huge effect (result in a watchable or totally
unwatchable movie).

> > Well, it was just the comment for the 'omapCopyPlanarDataYUV420' function
> > that was wrong and misleading, never mind :-) Now everything is clear.
>
> Hmm, is it?  Because, unless I was _really_ tired at the time I wrote it
> (which is entirely possible), that's what the code does, and it works,
> so ...

Yes, this part seems wrong to me (maybe it was an old stale comment?):
/*
 * Copy I420 data to the custom 'YUV420' format, which is actually:
 * y11 u11,u12,u21,u22 u13,u14,u23,u24 y12 y14 y13
 * y21 v11,v12,v21,v22 v13,v14,v23,v24 y22 y24 y23
 * ...
 */

I was pretty much confused until I actually looked at the code. Shouldn't it be
something like this:

/*
 * ... 'YUV420' format, which is actually:
 * | u/v1 y1 y2 | u/v2 y3 y4 | ...
 * ('u/v' means 'u' for even lines and 'v' for odd lines)
 * ...
 */
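
To make the proposed comment concrete, here is a minimal C sketch of a conversion that produces this layout from three separate planes. It is not the code from the X server or from yuv420test.c. The function name is hypothetical, rows are assumed to be counted from 0 (so the first line carries 'u'), and the planes are assumed to be tightly packed with no padding between lines. The chroma plane is picked once per line, so the inner loop itself contains no branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack planar 4:2:0 data into the layout
 *   | u/v1 y1 y2 | u/v2 y3 y4 | ...
 * with 'u' on even lines and 'v' on odd lines.
 * Assumptions (illustration only): rows are numbered from 0, planes have
 * no line padding, width and height are even.
 * dst must have room for width * height * 3 / 2 bytes.
 */
void pack_yuv420_sketch(uint8_t *dst, const uint8_t *y_plane,
                        const uint8_t *u_plane, const uint8_t *v_plane,
                        size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    for (size_t row = 0; row < height; row++) {
        /* Choose the chroma plane once per line, outside the inner loop. */
        const uint8_t *chroma = (row % 2 == 0) ? u_plane : v_plane;
        /* Two consecutive lines share one line of subsampled chroma. */
        const uint8_t *c = chroma + (row / 2) * (width / 2);
        const uint8_t *y = y_plane + row * width;

        for (size_t x = 0; x < width / 2; x++) {
            *dst++ = c[x];         /* one chroma sample per two pixels */
            *dst++ = y[2 * x];     /* first luma sample of the pair */
            *dst++ = y[2 * x + 1]; /* second luma sample of the pair */
        }
    }
}
```

This version writes single bytes for clarity. A tuned version would combine them into wider writes, as discussed earlier in the thread.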