Box filter for YUV: use rows with an accumulation buffer for better memory behavior.  The old code accumulated columns into registers and then stored the result once.  This was slow from a memory point of view, because walking down a column touches a different source row on every load.  The new code processes one row of source at a time, adding it into an accumulation buffer.  The accumulation buffer is small and should fit in cache.  Before each accumulation of N rows, the buffer needs to be reset to zero.  If the memset turns out to be a bottleneck, it would be faster to store the first row directly into the accumulation buffer without an add, and then add only the remaining rows.
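
The row-at-a-time scheme described above can be sketched roughly as follows.  This is a minimal illustration in plain C, not the code from this change: the names add_row, store_row, reduce_row and box_scale_plane are hypothetical, the box factors are assumed to be integers, and the accumulator is assumed to be 16 bits wide, which limits the box height to 257 rows.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add one source row into the 16-bit accumulation row. */
static void add_row(const uint8_t* src, uint16_t* acc, int width) {
  for (int x = 0; x < width; ++x) {
    acc[x] = (uint16_t)(acc[x] + src[x]);
  }
}

/* Hypothetical helper: overwrite the accumulation row with a source row.
 * Used for the first row of a box so that no memset is needed. */
static void store_row(const uint8_t* src, uint16_t* acc, int width) {
  for (int x = 0; x < width; ++x) {
    acc[x] = src[x];
  }
}

/* Hypothetical helper: sum box_w adjacent accumulators for each output
 * pixel and divide by the box area, rounding to nearest. */
static void reduce_row(const uint16_t* acc, uint8_t* dst, int dst_width,
                       int box_w, int box_h) {
  const uint32_t area = (uint32_t)(box_w * box_h);
  for (int x = 0; x < dst_width; ++x) {
    uint32_t sum = 0;
    for (int i = 0; i < box_w; ++i) {
      sum += acc[x * box_w + i];
    }
    dst[x] = (uint8_t)((sum + area / 2) / area);
  }
}

/* Box-downscale one plane by integer factors box_w x box_h.
 * acc must hold src_width elements.  If store_first_row is nonzero, the
 * first row of each box is stored instead of clearing the buffer first. */
static void box_scale_plane(const uint8_t* src, int src_stride,
                            int src_width, int src_height,
                            uint8_t* dst, int dst_stride,
                            int box_w, int box_h,
                            uint16_t* acc, int store_first_row) {
  /* 255 * 257 == 65535, so taller boxes would overflow a uint16_t. */
  assert(box_w >= 1 && box_h >= 1 && box_h <= 257);
  const int dst_width = src_width / box_w;
  const int dst_height = src_height / box_h;
  for (int y = 0; y < dst_height; ++y) {
    int j = 0;
    if (store_first_row) {
      store_row(src, acc, src_width);
      src += src_stride;
      j = 1;
    } else {
      memset(acc, 0, (size_t)src_width * sizeof(uint16_t));
    }
    for (; j < box_h; ++j) {
      add_row(src, acc, src_width);
      src += src_stride;
    }
    reduce_row(acc, dst, dst_width, box_w, box_h);
    dst += dst_stride;
  }
}
```

Both settings of store_first_row should produce the same output; the flag only selects between clearing the buffer with memset and seeding it from the first row.  Every source load in the inner loops is sequential within a row, and the only state carried between rows is the one accumulation row.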
BUG=425
TESTED=out\release\libyuv_unittest --gtest_filter=*ScaleTo1x1*
R=harryjin@google.com

Review URL: https://webrtc-codereview.appspot.com/52659004

git-svn-id: http://libyuv.googlecode.com/svn/trunk@1428 16f28f9a-4ce2-e073-06de-1de4eb20be90
6 files changed