Description

math/big: more conservative use of lock for divisor table
Minor performance impact running sequentially:
benchmark                     old ns/op  new ns/op   delta
BenchmarkString10Base2              389        391  +0.51%
BenchmarkString100Base2            1530       1534  +0.26%
BenchmarkString1000Base2          11789      11787  -0.02%
BenchmarkString10000Base2        111443     112030  +0.53%
BenchmarkString100000Base2      1017483    1015347  -0.21%
BenchmarkString10Base8              339        344  +1.47%
BenchmarkString100Base8             753        756  +0.40%
BenchmarkString1000Base8           4618       4641  +0.50%
BenchmarkString10000Base8         43217      43534  +0.73%
BenchmarkString100000Base8       397518     400602  +0.78%
BenchmarkString10Base10             630        630  +0.00%
BenchmarkString100Base10           1975       1960  -0.76%
BenchmarkString1000Base10         10179      10174  -0.05%
BenchmarkString10000Base10        44527      44416  -0.25%
BenchmarkString100000Base10    14404694   14425308  +0.14%
BenchmarkString10Base16             283        288  +1.77%
BenchmarkString100Base16            597        598  +0.17%
BenchmarkString1000Base16          3189       3186  -0.09%
BenchmarkString10000Base16        29403      29364  -0.13%
BenchmarkString100000Base16      265657     265587  -0.03%
Note that due to other improvements (faster assembly routines, better code generation by the compiler), these benchmarks now run up to 37% faster than when they were last measured (1/9/2012).
Minor performance impact for StringPiParallel running in parallel:
Current CL but with Lock/Unlock commented out (removed):

BenchmarkStringPiParallel       5000   343581 ns/op
BenchmarkStringPiParallel-2    10000   184511 ns/op
BenchmarkStringPiParallel-3    10000   129768 ns/op
BenchmarkStringPiParallel-4    10000   102326 ns/op

Current CL:

BenchmarkStringPiParallel       5000   345169 ns/op
BenchmarkStringPiParallel-2    10000   185827 ns/op
BenchmarkStringPiParallel-3    10000   131168 ns/op
BenchmarkStringPiParallel-4    10000   102353 ns/op
Fixes issue 4218.
Patch Set 1
Patch Set 2: diff -r 1b587887ceb2 https://code.google.com/p/go
Patch Set 3: diff -r 1b587887ceb2 https://code.google.com/p/go
Total comments: 4
Patch Set 4: diff -r 0558d15b2681 https://code.google.com/p/go
Patch Set 5: diff -r 236ee151d1bc https://code.google.com/p/go
Patch Set 6: diff -r bc50fc903d6e https://code.google.com/p/go
Patch Set 7: diff -r bc50fc903d6e https://code.google.com/p/go
Patch Set 8: diff -r 7677524b8a24 https://code.google.com/p/go
Patch Set 9: diff -r 7677524b8a24 https://code.google.com/p/go
Patch Set 10: diff -r 7677524b8a24 https://code.google.com/p/go
Patch Set 11: diff -r 2aa2c1c93060 https://code.google.com/p/go

Messages

Total messages: 23
Hello dvyukov@google.com, michael.jones@gmail.com (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go
https://codereview.appspot.com/6643053/diff/3/src/pkg/math/big/nat.go
File src/pkg/math/big/nat.go (right):

src/pkg/math/big/nat.go:925: mu sync.Mutex // protects table
s/mu//

src/pkg/math/big/nat.go:949: cacheBase10.mu.Lock()
s/mu//

src/pkg/math/big/nat.go:950: defer cacheBase10.mu.Unlock()
ditto
PTAL.

https://codereview.appspot.com/6643053/diff/3/src/pkg/math/big/nat.go
File src/pkg/math/big/nat.go (right):

src/pkg/math/big/nat.go:925: mu sync.Mutex // protects table
On 2012/10/10 00:11:01, dfc wrote:
> s/mu//

but of course! :-)
Thank you. I'm not sure about this; I think the race still exists through table to the memory that it shares with cacheBase10.table.
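The concern here is slice aliasing: a Go slice header copied into another variable still points at the same backing array, so two seemingly separate slices can race on shared elements. A tiny generic illustration (not the nat.go code itself):

```go
package main

import "fmt"

func main() {
	// table and view share one backing array; copying the slice
	// header does not copy the elements.
	table := []int{1, 2, 3, 4}
	view := table[:2]

	table[0] = 99        // a write through table...
	fmt.Println(view[0]) // ...is visible through view: prints 99
}
```

If one goroutine performs the write while another reads through the aliased slice, that is a data race even though each goroutine holds its own slice value.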
I do not see the race now, looks correct. But I am concerned about performance implications. On the following benchmark:

func BenchmarkStringPi(b *testing.B) {
	const STR = "123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990123456789012345678990"
	x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
	_ = x.decimalString()

	procs := runtime.GOMAXPROCS(-1)
	N := int32(b.N)
	c := make(chan bool, procs)
	for p := 0; p < procs; p++ {
		go func() {
			x, _, _ := nat(nil).scan(strings.NewReader(STR), 10)
			for atomic.AddInt32(&N, -1) >= 0 {
				_ = x.decimalString()
			}
			c <- true
		}()
	}
	for p := 0; p < procs; p++ {
		<-c
	}
}

w/o mutex lock/defer unlock:

BenchmarkStringPi        200000   4495 ns/op
BenchmarkStringPi-4      500000   1484 ns/op
BenchmarkStringPi-8     1000000    971 ns/op
BenchmarkStringPi-16    1000000    738 ns/op

with mutex lock/defer unlock:

BenchmarkStringPi        200000   4708 ns/op
BenchmarkStringPi-4      500000   1857 ns/op
BenchmarkStringPi-8      500000   1555 ns/op
BenchmarkStringPi-16     500000   1633 ns/op

I would prefer a lock-free fast path (but correct).
PTAL.

I am not concerned about the performance. Correctness trumps speed, and the performance degradation is small (0.5-1.5%) for the relevant benchmarks (now attached). I removed the defer statement for a minor improvement. Note that these benchmarks run up to 37% faster than they did at the beginning of the year; I suspect due to compiler improvements and the manual assembly improvements for the core math routines. It's safer to tweak those (there's more to get out of them) than to try lock-free approaches that are hard to understand.

On 2012/10/10 14:20:58, dvyukov wrote:
> I do not see the race now, looks correct. But I am concerned about
> performance implications.
> [benchmark and numbers elided; see previous message]
>
> I would prefer lock-free fast path (but correct).
On Wed, Oct 10, 2012 at 10:18 PM, <gri@golang.org> wrote:
> PTAL.
>
> I am not concerned about the performance. Correctness trumps speed and
> the performance degradation is small (0.5 - 1.5%) for the relevant
> benchmarks (now attached).

They are not parallel.

> I removed the defer statement for a minor improvement. Note that these
> benchmarks run up to 37% faster

My parallel measurements show 121% degradation, so I guess it is now slower than at the beginning of the year.

> than they used to beginning of the year;
> I suspect due to compiler improvements and due to the manual assembly
> improvements for the core math routines. It's safer to tweak those
> (there's more to get out of them) than trying lock-free approaches that
> are hard to understand.
> [remainder of quoted thread elided]
I'm still not concerned. Also, I cannot reproduce the slowdown you are reporting: I have added a parallel benchmark and it doesn't show the behavior you are seeing (albeit I can only go up to 4 on my machine). Perhaps there is contention in your benchmark on the benchmark side due to the use of atomic?

But more importantly: this is about converting internal numbers to a human-readable decimal form for human consumption (for machine consumption, hex conversion would be more appropriate, way faster, and lock-free). Outputting large amounts of huge decimal numbers for human consumption is not going to be the bottleneck of a program (if it is, there's probably something else wrong, or it doesn't matter).

One approach is to fill in the entire table at initialization time, or at first use. I am not comfortable using atomic operations for some of the slice accesses; maybe I am wrong, but I think it's very tricky to get right. Happy to be proven otherwise.

Again, this is about correctness first and fixing the issue, performance second; and as far as I can tell, performance is ok. If it becomes a real issue for a real application, we can dive into it more.

- gri

On Wed, Oct 10, 2012 at 11:38 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> They are not parallel.
>
> My parallel measurements show 121% degradation, so I guess now it is
> slower than beginning of the year.
> [remainder of quoted thread elided]
Hello dvyukov@google.com, michael.jones@gmail.com, dave@cheney.net (cc: golang-dev@googlegroups.com), Please take another look.
Looks correct to me
Dmitry;

Can you please give this CL a LGTM if you're ok with it now? That way I can submit it. Thanks.

https://codereview.appspot.com/6643053/

- Robert
LGTM

On Oct 11, 2012 8:38 PM, "Robert Griesemer" <gri@golang.org> wrote:
> Dmitry;
>
> Can you please give this CL a LGTM if you're ok with it now? That way I
> can submit it. Thanks.
>
> https://codereview.appspot.com/6643053/
>
> - Robert
*** Submitted as http://code.google.com/p/go/source/detail?r=57e13fc87f43 ***

math/big: more conservative use of lock for divisor table

[commit description and benchmark numbers identical to the CL description above]

Fixes issue 4218.

R=dvyukov, michael.jones, dave
CC=golang-dev
http://codereview.appspot.com/6643053
Just seeing this now (the email went to a mostly unused personal account rather than my work account).

The blocking and concurrent waiting on data around this structure is by design.

To run fast, when there are two or more large numbers to be converted from internal base 2^n to an external base that is not a power of two (say, base 10), we need a table of divisors. For a (say, million-word) number N, we need N**(1/2), N**(1/4), N**(1/8), ... down to a "small" size, say 8 words, where "**" is the exponentiation operator. This table can be big; however, it can be built just once and reused many times. The code does this now.

When a first big conversion (bigger than the existing table size) is needed, the code updates the table. When more than one goroutine tries the same large conversion at the same time, the first one builds the table and the rest wait for each entry to be built. Building them redundantly would be no faster and would waste memory and CPU. Once the table has grown suitably, this phase is a no-op because the table is cached.

While we could wish that the table were already built, or assume that in a real program a first operation builds the table without other threads waiting, in the benchmark case all goroutines converting the same number will see delays during the first iterations.

I see no better way.

Michael

On Thu, Oct 11, 2012 at 4:04 PM, <gri@golang.org> wrote:
> *** Submitted as
> http://code.google.com/p/go/source/detail?r=57e13fc87f43 ***
>
> math/big: more conservative use of lock for divisor table
> [remainder of quoted commit message elided]

--
Michael T. Jones
michael.jones@gmail.com
http://www.google.com/profiles/michael.jones
I said that wrong (just woke up). It's not so much N**(1/2), N**(1/4), ... as it is the other way around: start with the largest power of 10 (say) that fits in a Word, and then square it repeatedly (that value **2, **4, **8, and so on) until we get close to sqrt(N). The essence is the same.

On Sat, Oct 13, 2012 at 7:50 AM, Michael Jones <michael.jones@gmail.com> wrote:
> Just seeing this now (email was to a mostly unused personal account rather
> than my work account.)
>
> The blocking and concurrent-waiting on data around this structure is by
> design.
> [remainder of quoted message elided]

--
Michael T. Jones
michael.jones@gmail.com
http://www.google.com/profiles/michael.jones
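The corrected starting point is easy to check concretely: on a 64-bit Word, the largest power of 10 that fits is 10**19, since 10**19 < 2**64 < 10**20. A small sketch (largestPow10 is a name made up here, not a nat.go function):

```go
package main

import "fmt"

// largestPow10 returns the largest power of 10 that fits in a uint64
// word, together with its exponent. It multiplies by 10 while doing so
// cannot overflow.
func largestPow10() (uint64, int) {
	p, n := uint64(1), 0
	for p <= (1<<64-1)/10 {
		p *= 10
		n++
	}
	return p, n
}

func main() {
	p, n := largestPow10()
	fmt.Printf("10**%d = %d fits in a 64-bit word\n", n, p) // 10**19
}
```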
On Sat, Oct 13, 2012 at 6:50 PM, Michael Jones <michael.jones@gmail.com> wrote:
> The blocking and concurrent-waiting on data around this structure is by
> design.
>
> When a first big (bigger than the existing table size) conversion is
> needed, the code updates the table. When more than one goroutine tries the
> same large conversion at the same time, then the first one builds the table
> and the rest of them wait for each entry to be built.
> [remainder of quoted message elided]

There is only one problem: the code does a different thing, which leads to data races, crashes, and exceptions. I suggested that Robert make the code work the way you describe.
Thanks! Performance-wise, it is important that other goroutines be able to use the read-only smaller entries while a larger one is being created.

On Sun, Oct 14, 2012 at 2:05 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> I suggested to Robert to make the code work the way you describe.

--
Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1 650-335-5765
It is not the case now.

On Sun, Oct 14, 2012 at 4:45 PM, Michael Jones <mtj@google.com> wrote:
> Thanks! Performance wise, it is important that other goroutines be able to
> use the read-only smaller entries while a larger one is being created.
>
> On Sun, Oct 14, 2012 at 2:05 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> I suggested to Robert to make the code work the way you describe.
Ok. Will look. Thank you for bringing this to my attention!

On Mon, Oct 15, 2012 at 3:31 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> It is not the case now.
> [remainder of quoted thread elided]

--
Michael T. Jones
michael.jones@gmail.com
http://www.google.com/profiles/michael.jones
The only safe way I see at the moment to do this is to lock/unlock the mutex for each new table entry. That's not hard, but I don't know if it's worth it. I am worried about the correctness of lock-free approaches (using package atomic) that operate on slice elements. But I'm happy to be proven otherwise. Either way, I believe what we have now is correct, if possibly not optimal performance-wise. But again, I don't think there's an immediate urgency for this to be fixed, as I am not aware of programs caring about this too much at the moment.

- gri

On Mon, Oct 15, 2012 at 1:51 PM, Michael Jones <michael.jones@gmail.com> wrote:
> Ok. Will look. Thank you for bringing this to my attention!
> [remainder of quoted thread elided]
OK.

On Oct 15, 2012 2:35 PM, "Robert Griesemer" <gri@golang.org> wrote:
> The only safe way I see at the moment to do this is to lock/unlock the
> mutex for each new table entry. That's not hard, but I don't know if it's
> worth it. I am worried about correctness of lock-free approaches (using
> package atomic) that operate on slice elements. But I'm happy to be proven
> otherwise.
> [remainder of quoted message elided]
On 2012/10/15 21:35:07, gri wrote:
> The only safe way I see at the moment to do this is to lock/unlock the
> mutex for each new table entry. That's not hard, but I don't know if it's
> worth it. I am worried about correctness of lock-free approaches (using
> package atomic) that operate on slice elements. But I'm happy to be proven
> otherwise.

It's quite easy to implement w/o atomic ops on slice elements. That's what I had in mind:
https://codereview.appspot.com/6766047/diff/2001/src/pkg/math/big/nat.go
In an airport on the way to Cypress. Will look ASAP.

On Wed, Oct 24, 2012 at 12:03 PM, <dvyukov@google.com> wrote:
> It's quite easy to implement w/o atomic ops on slice elements. That's
> what I had in mind:
> https://codereview.appspot.com/6766047/diff/2001/src/pkg/math/big/nat.go
> [remainder of quoted thread elided]

--
Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1 650-335-5765