[maemo-developers] Xvideo support for Nokia 770?

Wed Jan 17 10:42:42 EET 2007

On Wednesday 10 January 2007 01:51, Charles 'Buck' Krasic wrote:

> Siarhei Siamashka wrote:
> > Actually I have been thinking about trying to implement Xvideo
> > support on 770 for some time already. Now as N800 has Xvideo
> > support, it would be nice to have it on 770 as well for better
> > consistency and software compatibility.
>
> As you may recall, I was considering this back in August/September.
> I tried a few things, and reported some of my findings to this list.
> The code for all that is still available here:
> http://qstream.org/~krasic/770/dsp/

Yes, sure I remember. Thanks for doing these experiments and making 
the results available. It really helps to have more information around.

> > I see the following possible options:
> >
> > 1. Implement it just using ARM core and optimize it as much as
> > possible (using dynamically generated code for scaling to get the
> > best performance). Is quite a straightforward solution and only
> > needs time to implement it.
>
> It is my impression that this might be the most attractive option.
> I noticed that TCPMP which seems to be the most performant player for
> the ARM uses this approach, and it is available under GPL, so it may
> be possible to adapt some of its code.
>
> In the long run, I would hope that integrating TCPMP scaling code into
> libswscale of the ffmpeg project might be the most elegant approach,
> since that seems to be the most performant/featureful/widely adopted
> open-source scaling code (but not yet on ARM).   For mplayer, it works
> out of the box, since libswscale actually originated from mplayer, and
> only recently migrated to ffmpeg.

I see, thanks for the information (I checked the TCPMP sources some time ago,
but was interested in the runtime cpu capabilities detection code and did not
look at the scaler at that time). Using TCPMP code may be an interesting
option. But I may still try to make my own scaler implementation for two
reasons:
1. TCPMP is covered by the GPL license, and most parts of ffmpeg are LGPL, so
it probably makes sense to make a clean room implementation of a JIT powered
scaler for ARM under the LGPL license.
2. I'm worried about the performance. Knowing how the cache and write buffer
work on the arm926 core, it is possible to tune generated code for it and get
the best performance possible. So the results can be better than for TCPMP.

I have just committed some initial assembly optimizations for unscaled
yuv420p -> yuyv422 color format convertor to maemo mplayer SVN. It already
provides some performance improvement, for example on my test video file
(640x480 resolution, 24 fps) I get the following results now:

BENCHMARKs: VC: 114.526s VO:  21.055s A:   0.000s Sys:   1.582s =  137.163s
BENCHMARK%: VC: 83.4962% VO: 15.3503% A:  0.0000% Sys:  1.1535% = 100.0000%

We can compare it with the older results (decoding time was also 
improved a bit since that time because of recent assembly optimizations 
for dequantizer):
http://maemo.org/pipermail/maemo-developers/2006-December/006646.html

BENCHMARKs: VC: 121.282s VO:  31.538s A:   0.000s Sys:   1.577s =  154.397s
BENCHMARK%: VC: 78.5517% VO: 20.4267% A:  0.0000% Sys:  1.0216% = 100.0000%

Most of the speed improvement in color conversion and video output (VO: part)
is gained just from loop unrolling and avoiding some extra instructions that
gcc emits when compiling C code, but using an STM (store multiple) instruction
to store 16 bytes at once at an aligned location [1] provides at least a 10%
performance gain here.
If we estimate memory copy speed here with additional colorspace conversion
applied, it is about 70MB/s now for 640x480 24 fps video (though we need to
read a bit less data than write here, so it is a bit different from memcpy).
And I have observed peak memcpy performance about 110MB/s on Nokia 770. 
So this color convertor is quite close to memory bandwidth limit now. This
code can be optimized more by processing two image lines at once, so we can
get rid of some data read instructions and improve performance. Also
experimenting with prefetch reads may provide some improvement.
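
To make the "two image lines at once" idea concrete, here is a minimal plain C
sketch of an unscaled planar yuv420p -> packed yuyv422 conversion. It is not
the assembly code committed to maemo mplayer SVN; the function name and its
parameters are made up for illustration. Because each chroma row in 4:2:0 is
shared by two luma rows, handling both rows in one pass lets every U and V
byte be read only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference converter: planar YUV 4:2:0 -> packed YUYV 4:2:2,
 * no scaling. Two output lines are produced per outer iteration, so each
 * chroma sample is loaded once and stored into both lines. */
static void yuv420p_to_yuyv422(uint8_t *dst, size_t dst_stride,
                               const uint8_t *y, size_t y_stride,
                               const uint8_t *u, const uint8_t *v,
                               size_t uv_stride,
                               size_t width, size_t height)
{
    /* 4:2:0 subsampling needs even dimensions in this simple sketch. */
    assert(width % 2 == 0 && height % 2 == 0);

    for (size_t row = 0; row < height; row += 2) {
        const uint8_t *y0 = y + row * y_stride;        /* upper luma row   */
        const uint8_t *y1 = y0 + y_stride;             /* lower luma row   */
        const uint8_t *up = u + (row / 2) * uv_stride; /* shared U row     */
        const uint8_t *vp = v + (row / 2) * uv_stride; /* shared V row     */
        uint8_t *d0 = dst + row * dst_stride;          /* upper output row */
        uint8_t *d1 = d0 + dst_stride;                 /* lower output row */

        for (size_t x = 0; x < width; x += 2) {
            uint8_t cu = up[x / 2];
            uint8_t cv = vp[x / 2];

            /* Packed order is Y0 U Y1 V for each pair of pixels. */
            d0[0] = y0[x]; d0[1] = cu; d0[2] = y0[x + 1]; d0[3] = cv;
            d1[0] = y1[x]; d1[1] = cu; d1[2] = y1[x + 1]; d1[3] = cv;
            d0 += 4;
            d1 += 4;
        }
    }
}
```

An assembly version would additionally group the stores so that 16 bytes go
out at once to an aligned address, which this C loop leaves to the compiler.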

JIT generated code should have a bit worse performance, but not much. If we
decide to make 'nearest neighbour' scaling, the result should be probably as
fast as this nonscaled conversion. But I want to try some simplified variation
of bilinear scaling: each pixel in the destination buffer is either a copy of
some pixel in the source buffer or an average value of two pixels. This way it
should only introduce two extra instructions for each byte in output at
maximum: addition of two pixel color components and right shift.
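
The simplified bilinear idea can be sketched in plain C for one line of one
8-bit plane. This is only an illustration and not the JIT scaler itself: the
function name, the 16.16 fixed point stepping and the 1/4 and 3/4 thresholds
used to choose between copying and averaging are all assumptions. A code
generator would resolve the copy-or-average decision for every output position
once, ahead of time, and emit straight-line code without the branches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical horizontal scaler for a single 8-bit plane line.
 * Every destination pixel is either a copy of one source pixel or the
 * average of two neighbouring source pixels (one add and one shift). */
static void scale_line_simple(uint8_t *dst, size_t dst_w,
                              const uint8_t *src, size_t src_w)
{
    assert(dst_w > 0 && src_w > 0 && src_w < 65536);

    /* Source step per destination pixel in 16.16 fixed point. */
    uint32_t step = (uint32_t)(((uint64_t)src_w << 16) / dst_w);
    uint32_t pos = 0;

    for (size_t x = 0; x < dst_w; x++, pos += step) {
        size_t i = pos >> 16;                    /* left source pixel   */
        size_t j = (i + 1 < src_w) ? i + 1 : i;  /* right pixel, clamped */
        uint32_t frac = pos & 0xFFFF;            /* position between them */

        if (frac < 0x4000)            /* close to the left pixel: copy it  */
            dst[x] = src[i];
        else if (frac >= 0xC000)      /* close to the right pixel: copy it */
            dst[x] = src[j];
        else                          /* in between: average of the two    */
            dst[x] = (uint8_t)((src[i] + src[j]) >> 1);
    }
}
```

Replacing both copy thresholds with a plain "dst[x] = src[i]" turns this into
nearest neighbour scaling, which has no arithmetic per pixel at all.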

> > 2. Try using dsp tasks that already exist on the device and are
> > used for dspfbsink. But the sources of gst plugins contain code
> > that limits video resolution for dspfbsink. I wonder if this check
> > was introduced artificially or it is the limitation of DSP scaler
> > and it can't handle anything larger than that. Also I wonder if
> > existing video scaler DSP task can support direct rendering [2].
>
> I tried direct rendering in the above mentioned experimentation.  I
> never got it to work exactly correctly, i.e. I could get images
> fragments on the screen, but they were not the whole image, and never
> in exactly the correct screen position.   I suspected this was tied to
> the baroque memory addressing constraints of the DSP (e.g. 16bit data
> item limitations).   I tried very hard to work around them but was not
> successful.
>
> I think the benefits of direct rendering may be a false temptation on
> the DSP anyway.    My impression was that the DSP access to
> framebuffer memory slowed down the scaling algorithm tremendously, so
> it was actually faster to scale into DSP local memory, and then do a
> fast bulk copy to the FB, or to SDRAM on the ARM side.    Plus you
> have all the AV synchronization headaches.

Looks like performance heavily depends on how you do memory access. At
least on ARM, memory copy performance can vary by a factor of 4 depending on
implementation [1] (memcpy_trivial is about 4.2x slower than the most
optimized variant). Probably DSP also has its own tips and tricks ;-)

> I think these gains pale compared to the gain from just using the fb
> in YUV mode, and doing all the video stuff on the ARM side.
> Hence, option 1 seems to sound very attractive.

I don't know about your results, but even for nonscaled color conversion of
640x480 video, about 20% of cpu resources are used on it. That's quite a lot,
so I think that trying to implement some 'zero overhead color format
convertor' using DSP may be useful for video and may improve video playback
performance. As for AV synchronization, it should not be too hard to
compensate for it. We signal the screen update to the DSP and check the time;
after the DSP reports completion, we check the time again and calculate the
difference. As the video player handles audio delay anyway, it will probably
be just a trivial change.
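
A rough C sketch of that bookkeeping, with timestamps passed in by the caller
so that it does not depend on any particular clock or DSP API. The struct and
function names are hypothetical, and smoothing the measurement over several
frames is an extra assumption on top of the simple "take the difference"
scheme described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tracker for the delay between signalling a screen update
 * to the DSP and the DSP reporting completion. Times are in microseconds. */
typedef struct {
    int64_t latency_us;  /* current estimate of the update latency   */
    int     valid;       /* 0 until the first measurement is recorded */
} dsp_latency;

/* Record one measurement: time checked when the update was signalled and
 * time checked again when completion was reported. */
static void dsp_latency_sample(dsp_latency *l,
                               int64_t t_signal_us, int64_t t_done_us)
{
    int64_t diff = t_done_us - t_signal_us;
    assert(diff >= 0);

    if (!l->valid) {
        l->latency_us = diff;
        l->valid = 1;
    } else {
        /* Assumption: move a quarter of the way towards the new sample
         * to avoid reacting to a single slow frame. */
        l->latency_us += (diff - l->latency_us) / 4;
    }
}

/* When to signal the DSP so that the frame appears at present_us. */
static int64_t dsp_signal_time(const dsp_latency *l, int64_t present_us)
{
    return present_us - (l->valid ? l->latency_us : 0);
}
```

The player would call dsp_latency_sample() once per displayed frame and use
dsp_signal_time() in the same place where it already accounts for audio delay.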

> > Maybe we can ask Nokia developers to provide some information about
> > the internals of these plugins. The most important questions are:
> > * What are the real capabilities of DSP based scaler, can it be used
> > for resolutions let's say up to 800x480?
>
> I doubt 800x480.   The added quality benefit over 400x240 with pixel
> doubling in the fb is probably way too marginal to justify the
> effort.   The DSP hardware doesn't seem to have any meaningful support
> for general scaling (beyond doubling).

Even having some extra instructions for pixel averaging can be helpful. Of
course, some assembly programming for the DSP will be required, as scaling
large resolution video is a resource consuming process and we probably
can't rely on the compiler to do the job properly.

> > * Where is the screen update performed after dsp has finished
> > scaling/converting video from mapped buffer to framebuffer? Is it
> > done on ARM side, or probably screen update can be also triggered
> > from DSP directly?
>
> I seem to have the rough impression from inspecting X code that ARM
> side does the final update (copy) to fb memory.  I'm not 100% sure on
> that right now though.

I also had this impression, but did not find any reference that could be
related to performing a screen update with FB_COLOR_YUV422 or
FB_COLOR_YUV420 in the xserver-kdrive-6.6.3 sources. Maybe this screen
update call is somewhere else. I would expect the DSP core to report scaling
completion and the ARM core to perform the screen update. But after looking
for this code and not finding it, I thought that the DSP might be capable of
performing the screen update itself. Probably running grep for FB_COLOR_YUV
on all the maemo sources can help, but I don't have all the sources
downloaded now. So I thought that it would be easier to ask on the mailing
list :)

> > 3. Try implementing a new DSP based scaler from scratch. The most
> > important thing to know is how to access framebuffer directly from
> > DSP and move data to it from mapped buffer without any overhead.
> > The first test implementation can just perform nonscaled planar
> > YV12 -> packed YUV422 conversion, if it proves to be fast and
> > useful, it could be extended to also support scaling.
>
> This is what I did in August.   I did YUV -> YUV scaling plus RGB
> conversion on the DSP.   I think I did YUV->YUV scaling later. The
> results (performance) were abysmal.   Maybe I committed some mortal
> DSP programming sins that dragged the performance down, but it was soo
> slow I gave up even hoping.   I think my DSP code was maxed out on the
> DSP at like 20 fps, where the ARM was able to do 24fps with about
> 10-20% cpu.

Well, in my tests scaling 640x360 to fullscreen (actually to 400x226 for
pixel doubling) with the standard mplayer/ffmpeg fast bilinear scaler
takes 132 seconds for a 100 second movie, so it is also unable to do it in
realtime, at least for such a resolution :) But that might be a bad scaler
implementation in current ffmpeg.

> Anyway, my code is still there which may be a start if you want to
> attempt it.   However, I think your first option is probably the most
> fruitful option.    My little project made me very cynical of the
> value of the DSP.  ;-)

The recent release of N800 and the fact that it still uses the C55x DSP
renewed my interest in it. It should probably be around for at least one more
year, until the N900 release or whatever comes next :-) I'll definitely try to
do some development using it; getting some experience in DSP programming may
be interesting.

[1] http://maemo.org/pipermail/maemo-developers/2006-December/006579.html