[maemo-developers] N800 & Video playback
From: Siarhei Siamashka siarhei.siamashka at gmail.com
Date: Mon Apr 30 14:27:49 EEST 2007
- Previous message: N800 & Video playback
- Next message: N800 & Video playback
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Friday 27 April 2007 04:43, Daniel Stone wrote:
> > I'll make a really optimized version of YV12 -> YUV420 convertor on this
> > weekend (removing branch is good, but I feel that it can be improved
> > more) and will try to use it on Nokia 770, any extra video performance
> > improvement will be useful there. I hope that the framebuffer driver on
> > Nokia 770 supports YUV420 color format properly.
>
> I don't think Tornado supports YUV420, but I can check in the specs
> tomorrow. My better C version basically does two macroblocks at a time,
> ensuring all 32-bit writes (which _really_ helps over 16-bit writes,
> believe me). This eliminates the branch, since your surface is
> guaranteed to be word-aligned, so if you do all 32-bit writes, you can
> just drop the branch as you know every write will be aligned.
>
> This will be really fast.

The optimized YV12 -> YUV420 converter is done. The sources can be found
here:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libswscale_nokia770/?root=mplayer

Take a look at the 'arm_colorconv.h' and 'arm_colorconv.S' files. There is
also a test program ('test_colorconv') which checks that everything works
correctly and measures the speed:

~ $ ./test_colorconv
test: 'yv12_to_yuv420_xomap', time=7.332s, speed=32.878MP/s, memwritespeed=43.838MB/s
test: 'yv12_to_yuv420_xomap_nobranch', time=5.679s, speed=42.448MP/s, memwritespeed=56.597MB/s
test: 'yv12_to_yuv420_line_arm_', time=4.706s, speed=51.223MP/s, memwritespeed=68.297MB/s
test: 'yv12_to_yuv420_line_armv5_', time=3.356s, speed=71.824MP/s, memwritespeed=95.765MB/s
test: 'yv12_to_yuv420_line_armv6_', time=2.826s, speed=85.298MP/s, memwritespeed=113.731MB/s

The ARMv6 optimized YV12->YUV420 converter is about 2.5x faster than the
current code used in the N800 xserver, so it should provide a nice
improvement for video :) I doubt that your better C version can beat it
or even get close.

There are two important optimizations in this code:

1. Cache prefetch with the PLD instruction (added in the '_armv5'
   version), which boosts performance to about 70 megapixels per second.
   The inner loop is unrolled to process 32 pixels per iteration (cache
   line size is 32 bytes on ARM, so such unrolling is convenient). This
   is the most important improvement. You can try using
   __builtin_prefetch() from C code to do the same optimization.

2. The use of the ARMv6 instruction REV16 to swap the bytes within the
   high and low 16-bit halves of a register. This optimization was added
   in the '_armv6' version and boosted performance further, to about
   85 megapixels per second. Getting the same result from a C version is
   highly unlikely, probably impossible.

I was a bit wrong about the YUV420 format in my previous post. Suppose we
have a planar YV12 image with the following data:

Y plane: Y1 Y2 Y3 Y4 ...
U plane: U1 __ U2 __ ...

Normal YUV420 (according to the pictures in the Epson docs) would be the
following:

U1 Y1 Y2 U2 Y3 Y4 ...

But it appears (most likely because of the 16-bit interface and some
endian differences between ARM and the Epson chip) that each pair of
bytes is swapped, and we actually get the following somewhat weird
layout:

Y1 U1 U2 Y2 Y4 Y3 ...

The ARMv6 instruction REV16 is very handy for this byte swapping.

The assembly sources for the ARMv6 code look a bit messy because the
instructions had to be reordered to schedule them correctly and avoid
ARM11 pipeline interlocks, which hurt performance. Now this code is
really fast, with very few or no interlocks in the inner loop. And gcc
does not do a good job optimizing code on ARM, so a C implementation
would also be at a disadvantage here.

By the way, the benchmarks posted in my previous message should be
discarded. I did not initialize the source buffers that time, and it
looks like the ARM11 CPU has some 'cheat' which allows treating empty
data pages in some special way and avoiding reads from memory. So the
numbers posted in the previous benchmark were higher than they should
have been. This is corrected now.

As for the other possible Xv optimizations.
You mentioned that the fallback code is not important at all. But imagine
640x480 video playback in windowed mode. Decoding it will require quite
a lot of resources, and additionally scaling it down using slow fallback
code will be the finishing blow. Besides, a solution for this problem (a
fast JIT accelerated YV12->YUY2 scaler) already exists. I can also
modify this scaler to support YV12->YUV420 scaling.

An interesting thing here is that this scaler could also be used by
xserver to solve graphics bus bandwidth issues. Imagine that we have
some high resolution video with a high framerate which exceeds the
graphics bus capabilities. In this case the video can be downscaled in
software to a lower resolution using the JIT scaler before sending the
data to the LCD controller. What do you think?

> Sure. Unfortunately my job has other functions than to make video
> decoding really, really fast, so I'm happy to merge, review, offer
> feedback, and help you out where I can be useful, but I can't throw much
> time at this myself.

That's fine. Now I'm waiting for further instructions :) Should I try to
prepare a complete patch for xserver? I'm really interested in getting
this optimization into xserver as it would help to play high resolution
videos. If you have any extra questions about the code or anything else
(for example, I wonder what free license would be appropriate for it),
don't hesitate to contact me.

I have not tried to build the xserver sources yet, as I did not have
enough time for that and xserver requires quite a number of build
dependencies. Can you share some tips and tricks about maemo xserver
development? Is it difficult to compile (do I need any extra build
scripts, tools, or configuration options) and install on N800 (is it
safe to upgrade xserver on N800 from a .deb file)?

I also tried to use YUV420 on Nokia 770, but it did not work well.
According to Epson, this format should be supported by the hardware.
Also there is a constant OMAPFB_COLOR_YUV420 defined in omapfb.h in the
Nokia 770 kernel sources. But actually using YUV420 was not very
successful. A full screen 800x480 update in YUV420 seems to deadlock
Nokia 770. Playback of a centered 640x480 video in YUV420 format was a
bit better, at least I could decipher what's on the screen. But anyway,
it looked like an old broken TV :) The image was not fixed but floating
up and down, there were mirrored copies, tearing, some color distortion,
etc. After video playback finished, the screen remained in an
inconsistent state with striped garbage displayed on it. Starting video
playback with YUY2 output fixed it.

So it looks like YUV420 is not supported properly in the framebuffer
driver from the latest OS2006 kernel. That's bad, as it could provide a
~30% improvement in video output performance for Nokia 770. Maybe
upgrading the framebuffer driver can fix this issue (and add tearsync
support).
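
For anyone who wants to experiment with the swapped byte order described
earlier in this message without reading the assembly, here is a minimal
sketch in plain C. It is only an illustration and not the code from
'arm_colorconv.S'. The names swap_bytes_in_halfwords and
pack_line_swapped are made up for this example, the chroma pointer is
assumed to be whichever chroma row the caller wants interleaved with the
luma line, the width is assumed to be a multiple of 4, and
__builtin_prefetch is a GCC/Clang builtin that acts only as a hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable C expression for the result REV16 produces: the two bytes
 * inside each 16-bit half of a 32-bit word are exchanged. Whether a
 * compiler turns this into a single instruction is not assumed here. */
static inline uint32_t swap_bytes_in_halfwords(uint32_t w)
{
    return ((w >> 8) & 0x00FF00FFu) | ((w << 8) & 0xFF00FF00u);
}

/* Hypothetical helper: pack one line of planar data into the swapped
 * layout from this message, 6 output bytes per 4 pixels:
 *   Y1 U1 U2 Y2 Y4 Y3 ...
 * y     - luma samples for the line (width bytes)
 * c     - chroma samples for the line (width / 2 bytes)
 * dst   - output buffer (width * 3 / 2 bytes)
 * width - line width in pixels, assumed to be a multiple of 4 */
static inline void pack_line_swapped(uint8_t *dst, const uint8_t *y,
                                     const uint8_t *c, size_t width)
{
    assert(width % 4 == 0);
    for (size_t x = 0; x < width; x += 4) {
        /* Once per 32 pixels, hint that the next 32 luma bytes will be
         * needed soon; skipped near the end to stay inside the buffer. */
        if ((x & 31) == 0 && x + 32 < width)
            __builtin_prefetch(y + x + 32);

        const uint8_t *yp = y + x;
        const uint8_t *cp = c + x / 2;
        dst[0] = yp[0]; dst[1] = cp[0];   /* Y1 U1 */
        dst[2] = cp[1]; dst[3] = yp[1];   /* U2 Y2 */
        dst[4] = yp[3]; dst[5] = yp[2];   /* Y4 Y3 */
        dst += 6;
    }
}
```

Packing the unswapped order (U1 Y1 Y2 U2 Y3 Y4) first and then applying
swap_bytes_in_halfwords to every aligned 32-bit word would give the same
bytes; the byte-wise version above is just easier to follow and makes no
claim about speed.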