|
|
DescriptionAdding SSE2 Matrix-Matrix multiply to SetConcat (when in X64 mode).
Branch occurs at compile time rather than run time.
Performance goes from 3.25 units to 1.87 units.
Patch Set 1 #
Total comments: 4
Patch Set 2 : #Patch Set 3 : #
Total comments: 1
Patch Set 4 : #
Total comments: 1
Patch Set 5 : #Patch Set 6 : #
MessagesTotal messages: 29
https://codereview.appspot.com/7058077/diff/1/bench/Matrix44Bench.cpp File bench/Matrix44Bench.cpp (right): https://codereview.appspot.com/7058077/diff/1/bench/Matrix44Bench.cpp#newcode133 bench/Matrix44Bench.cpp:133: fM1.set(2, 0, 3.0f); Added these to make the benchmark use the general matrix-matrix rather than an early out.
Sign in to reply to this message.
1. do we already have adequate coverage in unittests to feel that we're getting the same "correctness" for setconcat in the two code paths? Do we need more tests there? 2. Lets not change the existing bench. Lets add a variant that does the more general case, so we can compare it against the existing special case(s) for translate-only and translate-scale. https://codereview.appspot.com/7058077/diff/1/include/utils/SkMatrix44.h File include/utils/SkMatrix44.h (right): https://codereview.appspot.com/7058077/diff/1/include/utils/SkMatrix44.h#newc... include/utils/SkMatrix44.h:104: class What are the implications of this pragma? I presume it means the sizeof this object, when embedded in other objects, may get larger? Do we need this always when _MSC_VER is defined, even if the other x86 || SSE flags are not true?
Sign in to reply to this message.
https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp File src/utils/SkMatrix44.cpp (right): https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... src/utils/SkMatrix44.cpp:438: p->w_zw = b0 * x_zw + b1 * y_zw + b2 * z_zw + b3 * w_zw; need { } around this "if" https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... src/utils/SkMatrix44.cpp:440: memcpy(fMat, &storageSSE2, sizeof(storageSSE2)); Can we cleanly refactor so we don't have to repeat this dirty+return code? Related, I wonder if its acceptable to jump to this (or the old C version) via a function ptr. If so, ... 1. it would be easier to read 2. it would allow us to access the SSE2 code on 32bit machines where we don't know at compile-time that we have SSE2, but we do detect that at runtime. Our blitter code does a *lot* of this sort of thing.
Sign in to reply to this message.
On 2013/01/10 19:28:03, reed1 wrote: > 1. do we already have adequate coverage in unittests to feel that we're getting > the same "correctness" for setconcat in the two code paths? Do we need more > tests there? > > 2. Lets not change the existing bench. Lets add a variant that does the more > general case, so we can compare it against the existing special case(s) for > translate-only and translate-scale. > > https://codereview.appspot.com/7058077/diff/1/include/utils/SkMatrix44.h > File include/utils/SkMatrix44.h (right): > > https://codereview.appspot.com/7058077/diff/1/include/utils/SkMatrix44.h#newc... > include/utils/SkMatrix44.h:104: class > What are the implications of this pragma? I presume it means the sizeof this > object, when embedded in other objects, may get larger? > > Do we need this always when _MSC_VER is defined, even if the other x86 || SSE > flags are not true? 1) Yes there's coverage in tests. Also, because setconcat is used in the invert test and other tests, I would assume almost nothing would work if setconcat was broken. 2) I added a new test. Amusingly, the new test is now significantly faster than the special case. 3) It forces alignment to 16 bits, if you had no additional flags in the matrix it wouldn't change the size. It may or may not change the size based on how many flags you have. I changed it to only set alignment if we're going to use it.
Sign in to reply to this message.
On 2013/01/10 19:30:21, reed1 wrote: > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp > File src/utils/SkMatrix44.cpp (right): > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > src/utils/SkMatrix44.cpp:438: p->w_zw = b0 * x_zw + b1 * y_zw + b2 * z_zw + b3 * > w_zw; > need { } around this "if" > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > src/utils/SkMatrix44.cpp:440: memcpy(fMat, &storageSSE2, sizeof(storageSSE2)); > Can we cleanly refactor so we don't have to repeat this dirty+return code? > > Related, I wonder if its acceptable to jump to this (or the old C version) via a > function ptr. If so, ... > 1. it would be easier to read > 2. it would allow us to access the SSE2 code on 32bit machines where we don't > know at compile-time that we have SSE2, but we do detect that at runtime. Our > blitter code does a *lot* of this sort of thing. We could certainly do a jump. This would however add a performance penalty to all calls to setconcat in all cases. Personally I find matrix math to be a primitive and therefore should be fast, but SkMatrix44 is neither primitive nor fast so in this case I really don't have an opinion. It's less work for us if we keep it the way it is and it doesn't impose a penalty. Note: I'm doing a study to see what % of machines have SSE2. (Intel started making SSE2 machines in 2000, so it's been 13 years). If the chrome base is anything like the steam hardware survey base it'll be over 99.5% and I'll petition to cut the tail and just assume SSE2 always.
Sign in to reply to this message.
https://codereview.appspot.com/7058077/diff/6001/src/utils/SkMatrix44.cpp File src/utils/SkMatrix44.cpp (right): https://codereview.appspot.com/7058077/diff/6001/src/utils/SkMatrix44.cpp#new... src/utils/SkMatrix44.cpp:395: sk_bzero(result, sizeof(storage)); This change *over doubled* the performance of this function. Please train yourself and anyone writing performance code to *NOT* use memset and memcpy. They are significantly slower than explicitly loops for small arrays (lets say less than ~1k bytes). Many of the performance problems in SkMatrix44 can be attributed to this behavior.
Sign in to reply to this message.
On 2013/01/10 22:14:41, whunt wrote: > On 2013/01/10 19:30:21, reed1 wrote: > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp > > File src/utils/SkMatrix44.cpp (right): > > > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > > src/utils/SkMatrix44.cpp:438: p->w_zw = b0 * x_zw + b1 * y_zw + b2 * z_zw + b3 > * > > w_zw; > > need { } around this "if" > > > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > > src/utils/SkMatrix44.cpp:440: memcpy(fMat, &storageSSE2, sizeof(storageSSE2)); > > Can we cleanly refactor so we don't have to repeat this dirty+return code? > > > > Related, I wonder if its acceptable to jump to this (or the old C version) via > a > > function ptr. If so, ... > > 1. it would be easier to read > > 2. it would allow us to access the SSE2 code on 32bit machines where we don't > > know at compile-time that we have SSE2, but we do detect that at runtime. Our > > blitter code does a *lot* of this sort of thing. > > We could certainly do a jump. This would however add a performance penalty to > all calls to setconcat in all cases. Personally I find matrix math to be a > primitive and therefore should be fast, but SkMatrix44 is neither primitive nor > fast so in this case I really don't have an opinion. It's less work for us if > we keep it the way it is and it doesn't impose a penalty. > > Note: I'm doing a study to see what % of machines have SSE2. (Intel started > making SSE2 machines in 2000, so it's been 13 years). If the chrome base is > anything like the steam hardware survey base it'll be over 99.5% and I'll > petition to cut the tail and just assume SSE2 always. This is just a product issue for Chrome on Windows. I'd be very happy if we can convince them to let us set SSE2 at compile-time, but today that is not the case.
Sign in to reply to this message.
On 2013/01/10 22:20:23, reed1 wrote: > On 2013/01/10 22:14:41, whunt wrote: > > On 2013/01/10 19:30:21, reed1 wrote: > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp > > > File src/utils/SkMatrix44.cpp (right): > > > > > > > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > > > src/utils/SkMatrix44.cpp:438: p->w_zw = b0 * x_zw + b1 * y_zw + b2 * z_zw + > b3 > > * > > > w_zw; > > > need { } around this "if" > > > > > > > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > > > src/utils/SkMatrix44.cpp:440: memcpy(fMat, &storageSSE2, > sizeof(storageSSE2)); > > > Can we cleanly refactor so we don't have to repeat this dirty+return code? > > > > > > Related, I wonder if its acceptable to jump to this (or the old C version) > via > > a > > > function ptr. If so, ... > > > 1. it would be easier to read > > > 2. it would allow us to access the SSE2 code on 32bit machines where we > don't > > > know at compile-time that we have SSE2, but we do detect that at runtime. > Our > > > blitter code does a *lot* of this sort of thing. > > > > We could certainly do a jump. This would however add a performance penalty to > > all calls to setconcat in all cases. Personally I find matrix math to be a > > primitive and therefore should be fast, but SkMatrix44 is neither primitive > nor > > fast so in this case I really don't have an opinion. It's less work for us if > > we keep it the way it is and it doesn't impose a penalty. > > > > Note: I'm doing a study to see what % of machines have SSE2. (Intel started > > making SSE2 machines in 2000, so it's been 13 years). If the chrome base is > > anything like the steam hardware survey base it'll be over 99.5% and I'll > > petition to cut the tail and just assume SSE2 always. > > This is just a product issue for Chrome on Windows. I'd be very happy if we can > convince them to let us set SSE2 at compile-time, but today that is not the > case. You can depend on SSE2 at compile time on x64 always for Chrome on anything (it's tautologically there)
Sign in to reply to this message.
I presume you meant 16 bytes, not bits. Doesn't that mean it also matters how we are embedded in other objects? i.e. struct Foo { SkMatrix44 a; } struct Bar { double a; SkMatrix44 b; } I assumed that 16byte alignment was to ensure that the matrix *began* on a 16byte boundary, which could affect the sizeof Foo or Bar.
Sign in to reply to this message.
We will need to turn the static_assert into an #ifdef, since we can build SkMatrix44 with floats instead of doubles.
Sign in to reply to this message.
On 2013/01/10 22:23:29, jamesr wrote: > On 2013/01/10 22:20:23, reed1 wrote: > > On 2013/01/10 22:14:41, whunt wrote: > > > On 2013/01/10 19:30:21, reed1 wrote: > > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp > > > > File src/utils/SkMatrix44.cpp (right): > > > > > > > > > > > > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > > > > src/utils/SkMatrix44.cpp:438: p->w_zw = b0 * x_zw + b1 * y_zw + b2 * z_zw > + > > b3 > > > * > > > > w_zw; > > > > need { } around this "if" > > > > > > > > > > > > > > https://codereview.appspot.com/7058077/diff/1/src/utils/SkMatrix44.cpp#newcod... > > > > src/utils/SkMatrix44.cpp:440: memcpy(fMat, &storageSSE2, > > sizeof(storageSSE2)); > > > > Can we cleanly refactor so we don't have to repeat this dirty+return code? > > > > > > > > Related, I wonder if its acceptable to jump to this (or the old C version) > > via > > > a > > > > function ptr. If so, ... > > > > 1. it would be easier to read > > > > 2. it would allow us to access the SSE2 code on 32bit machines where we > > don't > > > > know at compile-time that we have SSE2, but we do detect that at runtime. > > Our > > > > blitter code does a *lot* of this sort of thing. > > > > > > We could certainly do a jump. This would however add a performance penalty > to > > > all calls to setconcat in all cases. Personally I find matrix math to be a > > > primitive and therefore should be fast, but SkMatrix44 is neither primitive > > nor > > > fast so in this case I really don't have an opinion. It's less work for us > if > > > we keep it the way it is and it doesn't impose a penalty. > > > > > > Note: I'm doing a study to see what % of machines have SSE2. (Intel > started > > > making SSE2 machines in 2000, so it's been 13 years). If the chrome base is > > > anything like the steam hardware survey base it'll be over 99.5% and I'll > > > petition to cut the tail and just assume SSE2 always. > > > > This is just a product issue for Chrome on Windows. I'd be very happy if we > can > > convince them to let us set SSE2 at compile-time, but today that is not the > > case. > > You can depend on SSE2 at compile time on x64 always for Chrome on anything > (it's tautologically there) Yep, I understand we can't make that assumption yet. I'm hoping that we can sometime in the next year or two though. First step is to collect data showing that only a very small fraction of users doesn't have SSE2 so I'm adding an ISA histogram. SSE2 is defined in the AMD64 spec so yes it's always available in 64 bit mode on all intel/amd platforms.
Sign in to reply to this message.
On 2013/01/10 22:23:42, reed1 wrote: > I presume you meant 16 bytes, not bits. > Doesn't that mean it also matters how we are embedded in other objects? > > i.e. > > struct Foo { > SkMatrix44 a; > } > > struct Bar { > double a; > SkMatrix44 b; > } > > I assumed that 16byte alignment was to ensure that the matrix *began* on a > 16byte boundary, which could affect the sizeof Foo or Bar. The WebKit matrix class already has this alignment requirement for 64 bit builds and we want the same thing for chromium code.
Sign in to reply to this message.
On 2013/01/10 22:23:42, reed1 wrote: > I presume you meant 16 bytes, not bits. > Doesn't that mean it also matters how we are embedded in other objects? > > i.e. > > struct Foo { > SkMatrix44 a; > } > > struct Bar { > double a; > SkMatrix44 b; > } > > I assumed that 16byte alignment was to ensure that the matrix *began* on a > 16byte boundary, which could affect the sizeof Foo or Bar. Your assumptions are correct and yes I meant bytes. Basically because SSE stores and memory operands on math ops require 16 byte alignment we need to assume that doubles are aligned to 16 bytes rather than their natural 8. I believe NEON has the same restriction.
Sign in to reply to this message.
On 2013/01/10 22:25:51, reed1 wrote: > We will need to turn the static_assert into an #ifdef, since we can build > SkMatrix44 with floats instead of doubles.
Sign in to reply to this message.
On 2013/01/10 22:25:51, reed1 wrote: > We will need to turn the static_assert into an #ifdef, since we can build > SkMatrix44 with floats instead of doubles. I agree we need to assert that we're using doubles. I tried static assert but none was defined (the chrome COMPILE_ASSERT didn't work). How would you check? I tried adding sizeof(SkMScalar) == sizeof(double) in the assert but the compiler complained. Any advice?
Sign in to reply to this message.
On 2013/01/10 22:26:04, whunt wrote: > On 2013/01/10 22:23:29, jamesr wrote: > > You can depend on SSE2 at compile time on x64 always for Chrome on anything > > (it's tautologically there) > > Yep, I understand we can't make that assumption yet. I'm hoping that we can > sometime in the next year or two though. First step is to collect data showing > that only a very small fraction of users doesn't have SSE2 so I'm adding an ISA > histogram. SSE2 is defined in the AMD64 spec so yes it's always available in 64 > bit mode on all intel/amd platforms. Android x86 is incomplete - is it SSE2 and just not SSE3?
Sign in to reply to this message.
And LGTM, for what it's worth.
Sign in to reply to this message.
All Intel Atom (used in mobile Intel devices) processors support SSE, SSE2, SSE3 and SSE3E but not SSE4, SSE4.1, SSE4.2, AVX or AVX2. The next generation Atoms will support all SSE instruction sets. ARM based mobile devices do not use SSE but many use ARM's vector instruction set called NEON. I believe that NEON also has versions but I don't know as much about it. On Fri, Jan 11, 2013 at 2:08 AM, <tomhudson@google.com> wrote: > On 2013/01/10 22:26:04, whunt wrote: > >> On 2013/01/10 22:23:29, jamesr wrote: >> > You can depend on SSE2 at compile time on x64 always for Chrome on >> > anything > >> > (it's tautologically there) >> > > Yep, I understand we can't make that assumption yet. I'm hoping that >> > we can > >> sometime in the next year or two though. First step is to collect >> > data showing > >> that only a very small fraction of users doesn't have SSE2 so I'm >> > adding an ISA > >> histogram. SSE2 is defined in the AMD64 spec so yes it's always >> > available in 64 > >> bit mode on all intel/amd platforms. >> > > Android x86 is incomplete - is it SSE2 and just not SSE3? > > > https://codereview.appspot.**com/7058077/<https://codereview.appspot.com/7058... >
Sign in to reply to this message.
On 2013/01/11 18:14:01, whunt_google.com wrote: > All Intel Atom (used in mobile Intel devices) processors support SSE, SSE2, > SSE3 and SSE3E but not SSE4, SSE4.1, SSE4.2, AVX or AVX2. The next > generation Atoms will support all SSE instruction sets. ARM based mobile > devices do not use SSE but many use ARM's vector instruction set called > NEON. I believe that NEON also has versions but I don't know as much about > it. > > > On Fri, Jan 11, 2013 at 2:08 AM, <mailto:tomhudson@google.com> wrote: > > > On 2013/01/10 22:26:04, whunt wrote: > > > >> On 2013/01/10 22:23:29, jamesr wrote: > >> > You can depend on SSE2 at compile time on x64 always for Chrome on > >> > > anything > > > >> > (it's tautologically there) > >> > > > > Yep, I understand we can't make that assumption yet. I'm hoping that > >> > > we can > > > >> sometime in the next year or two though. First step is to collect > >> > > data showing > > > >> that only a very small fraction of users doesn't have SSE2 so I'm > >> > > adding an ISA > > > >> histogram. SSE2 is defined in the AMD64 spec so yes it's always > >> > > available in 64 > > > >> bit mode on all intel/amd platforms. > >> > > > > Android x86 is incomplete - is it SSE2 and just not SSE3? > > > > > > > https://codereview.appspot.**com/7058077/%3Chttps://codereview.appspot.com/70...> > > For skia, what's the process to go from LGTM to commit? Is there something I need to do or someone I need to poke?
Sign in to reply to this message.
On 2013/01/11 18:52:40, whunt wrote: > For skia, what's the process to go from LGTM to commit? Is there something I > need to do or someone I need to poke? If you don't have committer rights, one of us will happily do it for you. I can land it once Mike signs off on it.
Sign in to reply to this message.
https://codereview.appspot.com/7058077/diff/10001/include/utils/SkMatrix44.h File include/utils/SkMatrix44.h (right): https://codereview.appspot.com/7058077/diff/10001/include/utils/SkMatrix44.h#... include/utils/SkMatrix44.h:60: Lets consider consolidating the build flag tests into a single place, and we can also build-in the test for MScalar is a double. #if (defined(__x86_64__) || defined(_M_X64) || defined(__SSE2__)) && defined(SK_MSCALAR_IS_DOUBLE) #define SK_MATRIX44_USE_SSE #endif Then we can just say #ifdef SK_MATRIX44_USE_SSE to bracket the new code. Either way I think we need to treat SK_MSCALAR_IS_DOUBLE as a testable flag, and not an assert.
Sign in to reply to this message.
On 2013/01/14 14:53:00, reed1 wrote: > https://codereview.appspot.com/7058077/diff/10001/include/utils/SkMatrix44.h > File include/utils/SkMatrix44.h (right): > > https://codereview.appspot.com/7058077/diff/10001/include/utils/SkMatrix44.h#... > include/utils/SkMatrix44.h:60: > Lets consider consolidating the build flag tests into a single place, and we can > also build-in the test for MScalar is a double. > > #if (defined(__x86_64__) || defined(_M_X64) || defined(__SSE2__)) && > defined(SK_MSCALAR_IS_DOUBLE) > #define SK_MATRIX44_USE_SSE > #endif > > Then we can just say > > #ifdef SK_MATRIX44_USE_SSE > > to bracket the new code. > > Either way I think we need to treat SK_MSCALAR_IS_DOUBLE as a testable flag, and > not an assert. Done.
Sign in to reply to this message.
lgtm we have other examples where we name the test more explicitly (e.g. _scaletrans instead of _specialcase), but I can fix that after this lands.
Sign in to reply to this message.
Unfortunately, when I run out/Debug/tests on Linux with this patch installed, I get: [42/85] Matrix44... ../include/utils/SkMatrix44.h:131: failed assertion "sizeof(src) == sizeof(fMat) + sizeof(SkMIntScalar)"
Sign in to reply to this message.
On 2013/01/15 11:47:22, TomH wrote: > Unfortunately, when I run out/Debug/tests on Linux with this patch installed, I > get: > > [42/85] Matrix44... > ../include/utils/SkMatrix44.h:131: failed assertion "sizeof(src) == sizeof(fMat) > + sizeof(SkMIntScalar)" This must be due to the alignment pragma. I will work on a fix.
Sign in to reply to this message.
Mike reduced the stringency of our size checks, which avoided the assertion failure. Landed as r7241. Warren, once it's passed the bots (http://skia.googlecode.com/svn/buildbot/buildbots.html), please close this review.
Sign in to reply to this message.
Nope, can't close it. Worked for me in Ubuntu Debug, but tests fail on most platforms: [42/85] Matrix44... FAILED: ../tests/Matrix44Test.cpp:389: is_identity(iden1) FAILED: ../tests/Matrix44Test.cpp:391: is_identity(iden2) ---- FAILED I'm not sure why that test failure doesn't occur on my dev platform. The end of stdio on several platforms has additional error strings: not equal -nan 1 (Ubuntu, both Debug and Release) not equal 8.50898e-314 1 (Win Debug) not equal 2.59876e+126 1 (Win Release) Reverted with r7245; please make sure you address these failures and we'll try again: Skia_Shuttle_Ubuntu12_ATI5770_Float_Debug_32: RunTests - stdio Skia_Shuttle_Ubuntu12_ATI5770_Float_Release_32: RunTests - stdio Skia_Shuttle_Win7_Intel_Float_Debug_32: RunTests - stdio Skia_Shuttle_Win7_Intel_Float_Release_32: RunTests - stdio Skia_Shuttle_Win7_Intel_Float_ANGLE_Debug_32: RunTests - stdio Skia_Shuttle_Win7_Intel_Float_ANGLE_Release_32: RunTests - stdio Skia_Shuttle_Win7_Intel_Float_DirectWrite_Debug_32: RunTests - stdio Skia_Shuttle_Win7_Intel_Float_DirectWrite_Release_32: RunTests - stdio Skia_NexusS_4-1_Float_Release_32: RunTests - stdio Skia_Xoom_4-1_Float_Debug_32: RunTests - stdio Skia_Nexus7_4-1_Float_Debug_32: RunTests - stdio Skia_Nexus7_4-1_Float_Release_32: RunTests - stdio Skia_Nexus10_4-1_Float_Release_32: RunTests - stdio
Sign in to reply to this message.
|