Issue 7455049: code review 7455049: runtime: use lock-free ring for work scheduling

Can't Edit
Can't Publish+Mail
Start Review

Created:
13 years, 2 months ago by dho

Modified:
12 years, 4 months ago

Reviewers:
minux1, rsc, dave, matthew.horsnell, remyoudompheng, dvyukov

CC:
golang-codereviews

Visibility:
Public.

Description

runtime: use lock-free ring for work scheduling The algorithm used is based on the implementation in Concurrency Kit (http://concurrencykit.org). The ring has single producer, multiple consumer access semantics; is wait-free on insert and lock-free on retrieval. The size must be a power of 2. This changeset removes the work stealing code; it may be possible to add that back, but it is a bit hairy. When a P's ring is full, the G "overflows" into the global run queue.

Patch Set 1 #

Patch Set 2 : diff -r f545866390ab https://code.google.com/p/go/ #

Patch Set 3 : diff -r f545866390ab https://code.google.com/p/go/ #

Total comments: 22

Created: 13 years, 2 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+227 lines, -209 lines)			Patch
M	src/pkg/runtime/asm_386.s	View	1	2 chunks	+31 lines, -0 lines	0 comments	Download
M	src/pkg/runtime/asm_amd64.s	View	1	2 chunks	+30 lines, -0 lines	0 comments	Download
M	src/pkg/runtime/asm_arm.s	View	1	1 chunk	+31 lines, -0 lines	0 comments	Download
M	src/pkg/runtime/atomic_arm.c	View	1	1 chunk	+13 lines, -0 lines	0 comments	Download
M	src/pkg/runtime/export_test.go	View	1	1 chunk	+0 lines, -2 lines	0 comments	Download
M	src/pkg/runtime/proc.c	View	1	22 chunks	+75 lines, -197 lines	17 comments	Download
M	src/pkg/runtime/proc_test.go	View	1	1 chunk	+0 lines, -4 lines	0 comments	Download
M	src/pkg/runtime/runtime.h	View	1	3 chunks	+9 lines, -6 lines	5 comments	Download
M	src/pkg/runtime/sys_freebsd_arm.s	View	1	1 chunk	+3 lines, -0 lines	0 comments	Download
M	src/pkg/runtime/sys_linux_arm.s	View	1	1 chunk	+32 lines, -0 lines	0 comments	Download
M	src/pkg/runtime/sys_netbsd_arm.s	View	1	1 chunk	+3 lines, -0 lines	0 comments	Download

Messages

Total messages: 23

Expand All Messages | Collapse All Messages

dho

Hello dvyukov@google.com, rsc@golang.org, minux.ma@gmail.com, dave@cheney.net (cc: golang-dev@googlegroups.com), I'd like you to review this change to ...

13 years, 2 months ago (2013-03-04 18:43:48 UTC) #1

dho

I haven't tested this on ARM yet; I'm not sure if it compiles. This algorithm ...

13 years, 2 months ago (2013-03-04 18:52:07 UTC) #2

I haven't tested this on ARM yet; I'm not sure if it compiles. This
algorithm is sensitive to store / load ordering, which means that if
this causes ARM to fail, we're going to need to implement DSB/DMB
instructions in 5a. I'm not sure if the LL/SC loops are enough, but I
figured that work should go in a different CL. (Also, I have no idea
how to do it having never really messed with the compilers.)

Interesting benchmarks:

BenchmarkFields                           30561068     24279386  -20.55%
BenchmarkFieldsFunc                       30054191     24357377  -18.96%
BenchmarkReaderCopyOptimal                    1509         1534   +1.66%
BenchmarkReaderCopyUnoptimal                  7562         5602  -25.92%
BenchmarkReaderCopyNoWriteTo                 10675        11640   +9.04%
BenchmarkWriterCopyOptimal                    9261         4473  -51.70%
BenchmarkWriterCopyUnoptimal                  7318         2757  -62.33%
BenchmarkWriterCopyNoReadFrom                10609        12184  +14.85%
BenchmarkReadString                           5494         6213  +13.09%
BenchmarkIndexByte32                             6            7  +13.89%
BenchmarkIndexByte4K                           238          253   +6.30%
BenchmarkIndexByte4M                        261574       261425   -0.06%
BenchmarkIndexByte64M                      4331658      5021060  +15.92%
BenchmarkIndexBytePortable32                    42           44   +5.19%
BenchmarkIndexBytePortable4K                  2345         2448   +4.39%
BenchmarkIndexBytePortable4M               2416851      2533766   +4.84%
BenchmarkIndexBytePortable64M             38112846     38451022   +0.89%
BenchmarkEqual32                                39           42   +7.91%
BenchmarkEqual4K                              2331         2418   +3.73%
BenchmarkEqual4M                           2586779      2677882   +3.52%
BenchmarkEqual64M                         40851645     41153023   +0.74%
BenchmarkEqualPort32                            46           49   +6.29%
BenchmarkEqualPort4K                          3496         3624   +3.66%
BenchmarkEqualPortable4M                   3776748      3765554   -0.30%
BenchmarkEqualPortable64M                 58643898     60004963   +2.32%
BenchmarkIndex32                               664          685   +3.16%
BenchmarkIndex4K                            101669       105488   +3.76%
BenchmarkIndex4M                         104609738    111066889   +6.17%
BenchmarkIndex64M                       1676273044   1750548375   +4.43%
BenchmarkIndexEasy32                            58           60   +3.09%
BenchmarkIndexEasy4K                           288          292   +1.39%
BenchmarkIndexEasy4M                        254642       282289  +10.86%
BenchmarkIndexEasy64M                      4385937      4525697   +3.19%
BenchmarkCount32                               673          667   -0.89%
BenchmarkCount4K                            103380       103880   +0.48%
BenchmarkCount4M                         106307010    104518944   -1.68%
BenchmarkCount64M                       1702085125   1677917865   -1.42%
BenchmarkCountEasy32                            59           59   +1.01%
BenchmarkCountEasy4K                           289          292   +1.04%
BenchmarkCountEasy4M                        262517       267442   +1.88%
BenchmarkCountEasy64M                      4344423      4299636   -1.03%
BenchmarkTrimSpace                              59           55   -6.61%
BenchmarkSprintfEmpty                           98           99   +1.63%
BenchmarkSprintfString                         284          279   -1.76%
BenchmarkSprintfInt                            226          217   -3.98%
BenchmarkSprintfIntInt                         345          335   -2.90%
BenchmarkSprintfPrefixedInt                    345          356   +3.19%
BenchmarkSprintfFloat                          470          468   -0.43%
BenchmarkManyArgs                             1338         1301   -2.77%
BenchmarkScanInts                           964898       974407   +0.99%
BenchmarkScanRecursiveInt                  1180694      1250752   +5.93%

BenchmarkTCPOneShot                         121556       118597   -2.43%
BenchmarkTCPOneShotTimeout                  636851       601332   -5.58%
BenchmarkTCPPersistent                       50440        48279   -4.28%
BenchmarkTCPPersistentTimeout                50095        49055   -2.08%

BenchmarkSelectUncontended                     243          234   -3.70%
BenchmarkSelectContended                       241          236   -2.07%
BenchmarkSelectNonblock                        105          104   -0.95%

BenchmarkChanUncontended                        64           65   +2.03%
BenchmarkChanContended                          64           64   -0.47%
BenchmarkChanSync                              137          120  -12.41%
BenchmarkChanProdCons0                         140          121  -13.57%
BenchmarkChanProdCons10                         87           83   -5.24%
BenchmarkChanProdCons100                        70           66   -5.55%
BenchmarkChanProdConsWork0                     633          584   -7.74%
BenchmarkChanProdConsWork10                    544          549   +0.92%
BenchmarkChanProdConsWork100                   526          533   +1.33%
BenchmarkChanCreation                          164          172   +4.88%
BenchmarkChanSem                                61           65   +6.01%

BenchmarkCallClosure                             2            2   +7.24%
BenchmarkCallClosure1                            3            3   +0.32%
BenchmarkCallClosure2                           42           42   +0.70%
BenchmarkCallClosure3                           43           45   +3.90%
BenchmarkCallClosure4                           43           43   +0.23%

BenchmarkSyscall                                28           27   -2.84%
BenchmarkSyscallWork                           192          187   -2.60%
BenchmarkSyscallExcess                          28           27   -2.47%
BenchmarkSyscallExcessWork                     189          193   +2.12%

BenchmarkCreateGoroutines                      111           87  -21.08%
BenchmarkCreateGoroutinesParallel              114           90  -20.79%

BenchmarkUncontendedSemaphore                   25           26   +3.59%
BenchmarkContendedSemaphore                     80           26  -66.50%
BenchmarkMutexUncontended                       17           18   +3.43%
BenchmarkMutex                                  17           18   +0.56%
BenchmarkMutexSlack                             17           17   +0.56%
BenchmarkMutexWork                             176          178   +1.14%
BenchmarkMutexWorkSlack                        178          176   -1.12%
BenchmarkOnce                                    5            5   -2.09%
BenchmarkSemaUncontended                        25           26   +3.17%
BenchmarkSemaSyntNonblock                       25           26   +0.77%
BenchmarkSemaSyntBlock                          25           26   +1.54%
BenchmarkSemaWorkNonblock                      191          192   +0.52%
BenchmarkSemaWorkBlock                         191          192   +0.52%
BenchmarkRWMutexUncontended                     66           69   +4.35%
BenchmarkRWMutexWrite100                        22           22   +1.33%
BenchmarkRWMutexWrite10                         24           24   +0.00%
BenchmarkRWMutexWorkWrite100                   189          186   -1.59%
BenchmarkRWMutexWorkWrite10                    173          177   +2.31%

BenchmarkWaitGroupUncontended                   23           24   +1.26%
BenchmarkWaitGroupAddDone                       21           21   +1.42%
BenchmarkWaitGroupAddDoneWork                  184          184   +0.00%
BenchmarkWaitGroupWait                           5            5   +0.92%
BenchmarkWaitGroupWaitWork                     170          173   +1.76%

BenchmarkAfterFunc                             428          362  -15.42%
BenchmarkAfter                                 829          607  -26.78%
BenchmarkStop                                  358          280  -21.79%
BenchmarkTicker                              17123        14846  -13.30%
BenchmarkNow                                    14           11  -19.31%
BenchmarkNowUnixNano                            12           11   -5.56%
BenchmarkFormat                                904          617  -31.75%
BenchmarkFormatNow                             685          584  -14.74%
BenchmarkParse                                 486          426  -12.35%
BenchmarkHour                                   19           17  -10.88%
BenchmarkSecond                                 16           15   -2.48%
BenchmarkYear                                   29           30   +2.04%
BenchmarkDay                                    37           40   +6.08%

Devons-MacBook-Pro:shootout dho$ cat thread-ring-old.txt
        3.02 real         3.01 user         0.00 sys
       10.21 real        11.51 user         4.07 sys
        6.73 real         7.91 user         3.08 sys
        6.11 real         7.27 user         2.74 sys
Devons-MacBook-Pro:shootout dho$ cat thread-ring-new.txt
        2.67 real         2.67 user         0.00 sys
        5.87 real         6.30 user         2.54 sys
        4.66 real         4.88 user         2.17 sys
        4.42 real         4.81 user         1.98 sys

remyoudompheng

2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: > BenchmarkReaderCopyOptimal 1509 1534 +1.66% > BenchmarkReaderCopyUnoptimal 7562 5602 -25.92% ...

13 years, 2 months ago (2013-03-04 18:55:46 UTC) #3

remyoudompheng

2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: > 2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: >> BenchmarkReaderCopyOptimal 1509 1534 +1.66% ...

13 years, 2 months ago (2013-03-04 19:20:56 UTC) #4

dho

2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: > 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: >> 2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: ...

13 years, 2 months ago (2013-03-04 19:24:52 UTC) #5

remyoudompheng

2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>: > 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: >> 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>: ...

13 years, 2 months ago (2013-03-04 19:39:11 UTC) #6

2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>:
> 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>:
>> 2013/3/4 Rémy Oudompheng <remyoudompheng@gmail.com>:
>>> 2013/3/4 Devon H. O'Dell <devon.odell@gmail.com>:
>>>> BenchmarkReaderCopyOptimal                    1509         1534   +1.66%
>>>> BenchmarkReaderCopyUnoptimal                  7562         5602  -25.92%
>>>> BenchmarkReaderCopyNoWriteTo                 10675        11640   +9.04%
>>>> BenchmarkWriterCopyOptimal                    9261         4473  -51.70%
>>>> BenchmarkWriterCopyUnoptimal                  7318         2757  -62.33%
>>>> BenchmarkWriterCopyNoReadFrom                10609        12184  +14.85%
>>>
>>> Why are there differences here? Is it because of a stack split?
>>>
>>
>> GC accounts for 75% of the profile, and it runs thousands of times
>> during the benchmark, I'm not sure anything can be said reliably.
>
> Ok, that's fair. I was curious about this myself. (But I'm still
> ramping up on figuring out how to profile Go programs; apparently it's
> unreliable on mac, or so says #go-nuts.)
>
> I'd hope this isn't also true of the select/chan/goroutine benchmarks.

The improvement seems real in the threadring benchmark.

Before:
Total: 657 samples
     184  28.0%  28.0%      184  28.0% runtime.xchg
      87  13.2%  41.2%       87  13.2% runtime.futex
      73  11.1%  52.4%      306  46.6% runtime.chansend
      42   6.4%  58.8%      472  71.8% testa.f
      37   5.6%  64.4%       37   5.6% dequeue
      35   5.3%  69.7%      118  18.0% runtime.chanrecv
      35   5.3%  75.0%      103  15.7% runtime.unlock
      23   3.5%  78.5%      124  18.9% schedule
      22   3.3%  81.9%      149  22.7% runtime.lock
      19   2.9%  84.8%       43   6.5% runqget

After:
Total: 597 samples
     116  19.4%  19.4%      116  19.4% runtime.xchg
      84  14.1%  33.5%       84  14.1% runtime.futex
      56   9.4%  42.9%      427  71.5% testa.f
      54   9.0%  51.9%      263  44.1% runtime.chansend
      36   6.0%  58.0%       95  15.9% runtime.chanrecv
      30   5.0%  63.0%       30   5.0% dequeue
      29   4.9%  67.8%      104  17.4% runtime.lock
      27   4.5%  72.4%       27   4.5% runtime.atomicload
      26   4.4%  76.7%      113  18.9% schedule
      21   3.5%  80.2%       43   7.2% runqget

package main

import (
	"sync"
	"testing"
)

const Nthread = 503

func f(i int, in <-chan int, out chan<- int, wg *sync.WaitGroup) {
	for {
		n := <-in
		if n == 0 {
			wg.Done()
			out <- 0
		}
		out <- n - 1
	}
}

func BenchmarkThreadRing(b *testing.B) {
	one := make(chan int) // will be input to thread 1
	var in, out chan int = nil, one
	wg := new(sync.WaitGroup)
	wg.Add(Nthread)
	for i := 1; i <= Nthread-1; i++ {
		in, out = out, make(chan int)
		go f(i, in, out, wg)
	}
	go f(Nthread, out, one, wg)
	one <- b.N
	wg.Wait()
}

rsc

Devon, Thanks for working on this. Have you coordinated with Dmitriy about this? I don't ...

13 years, 2 months ago (2013-03-04 19:54:43 UTC) #7

dho

2013/3/4 Russ Cox <rsc@golang.org>: > Devon, > > Thanks for working on this. Have you ...

13 years, 2 months ago (2013-03-04 19:57:20 UTC) #8

dvyukov

https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c File src/pkg/runtime/proc.c (right): https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode279 src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) { Perhaps combine it ...

13 years, 2 months ago (2013-03-08 18:23:27 UTC) #9

dho

Hi, Haven't had a lot of time to fix this recently but I wanted to ...

13 years, 2 months ago (2013-03-12 15:38:43 UTC) #10

dho

2013/3/12 Russ Cox <rsc@golang.org>: > The C compiler does not inline. Maybe we should revisit ...

13 years, 2 months ago (2013-03-12 15:50:25 UTC) #12

rsc

On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell <devon.odell@gmail.com>wrote: > 2013/3/12 Russ ...

13 years, 2 months ago (2013-03-12 15:56:01 UTC) #13

dho

2013/3/12 Russ Cox <rsc@golang.org>: > On Tue, Mar 12, 2013 at 11:50 AM, Devon H. ...

13 years, 2 months ago (2013-03-12 16:07:28 UTC) #14

dvyukov

On Tue, Mar 12, 2013 at 7:38 PM, Devon H. O'Dell <devon.odell@gmail.com> wrote: >> https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newcode279 ...

13 years, 2 months ago (2013-03-12 16:18:01 UTC) #15

On Tue, Mar 12, 2013 at 7:38 PM, Devon H. O'Dell <devon.odell@gmail.com> wrote:

>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:279: if (runqput(m->p, gp) == false) {
>> Perhaps combine it in a separate function, because it's copied several
>> times.
>
> That will get inlined, yes?


No, it won't.



>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:2087: if((delta & mask) == (start & mask))
>> Is it intended that you do not full the last slot?
>> If so, it can be:
>> if((delta^start)&mask))
>> the compiler is not particularly good at code optimization.
>
> This check won't work. We can't enqueue a new value into the ring if
> that would overflow the ring. So, if delta == start, we don't insert.
> If we have a ring of size 4, the tail is at 0x3 and the head is at
> 0x0, this will end up being if(0&mask), which is of course always
> false.

I meant ((delta^start)&mask) == 0


>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:2093: runtime·fastatomicstore(&p->r_tail, delta);
>> atomicstorerelease should contain the necessary fence
>
> The store fence is required to serialize the update of the tail with
> respect to putting the entry in the ring, so it's not really a
> question of whether this interface provides the fence. It's to make
> sure that all other consumers do not read that the tail is updated
> without being guaranteed to be able to read the entry at the tail.


atomicstorerelease should contain the necessary fence, the TODO is not
necessary



>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:2109: /* TODO: load fence */
>> atomicload contains the fence
>
> This is to guarantee consistency of our view of the snapshot of the head.

Yes, I understand, but please remove the TODO, because it's not
actually the TODO, the fence is already there.


>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:2114: /* TODO: load fence */
>> why is it needed?
>
> Same reason, guarantee we're looking at the proper state of the head
> when we do the assignment from the ring.


The necessary fence is in atomicload/cas.


>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
>> I think you can implement steal in a similar way, just copying a batch
>> of elements here and incrementing head by N instead on 1.
>
> It would be possible to do this with a memcpy indeed, but you have to
> do the copy every iteration of the loop

I think it is fine

> because head may have changed.
> Another way I was thinking of doing it was just to "steal" the blocks
> and do the copies once that was successful -- but the problem is that
> another producer might invalidate the consistency of the ring by the
> time the copy started. I'll do the memcpy approach for now.
>
>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:2116: gp = p->ring[start & mask];
>> I think you can implement steal() in a similar way, just copying a batch
>> of elements here and incrementing head by N instead of 1.
>>
>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/proc.c#newco...
>> src/pkg/runtime/proc.c:2118: /* TODO: full memory fence */
>> why is it needed?
>
> To serialize loading with respect to updating the head poitner.

cas contains the necessary fence, these are *not* relaxed atomic



>>
https://codereview.appspot.com/7455049/diff/4001/src/pkg/runtime/runtime.h#ne...
>> src/pkg/runtime/runtime.h:690: bool     runtime·casvalue(uint32*, uint32,
>> uint32, uint32*);
>> Change cas(uint32*, uint32, uint32) to cas(uint32*, uint32*, uint32)
>> similar to cas64. Then you don't need the separate function.
>> This should go in a separate patch.
>
> Then we pay the cost for assigning to the old value everywhere, even
> when we don't need it. I think this just adds overhead.


Local assignment is not a great cost. The cc compiler makes
unnecessary assignments everywhere. And it's needed in mutexes and
will be needed in chans. Most other places are not performance
critical.

dvyukov

On Tue, Mar 12, 2013 at 8:07 PM, Devon H. O'Dell <devon.odell@gmail.com> wrote: > 2013/3/12 ...

13 years, 2 months ago (2013-03-12 16:21:32 UTC) #16

rsc

On Tue, Mar 12, 2013 at 12:07 PM, Devon H. O'Dell <devon.odell@gmail.com>wrote: > 2013/3/12 Russ ...

13 years, 2 months ago (2013-03-12 16:26:09 UTC) #17

dvyukov

On Tue, Mar 12, 2013 at 8:26 PM, Russ Cox <rsc@golang.org> wrote: > On Tue, ...

13 years, 2 months ago (2013-03-12 16:37:36 UTC) #18

On Tue, Mar 12, 2013 at 8:26 PM, Russ Cox <rsc@golang.org> wrote:
> On Tue, Mar 12, 2013 at 12:07 PM, Devon H. O'Dell <devon.odell@gmail.com>
> wrote:
>>
>> 2013/3/12 Russ Cox <rsc@golang.org>:
>> > On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell
>> > <devon.odell@gmail.com>
>> > wrote:
>> >>
>> >> 2013/3/12 Russ Cox <rsc@golang.org>:
>> >> > The C compiler does not inline.
>> >>
>> >> Maybe we should revisit this after Go 1.1.
>> >>
>> >>
>> >> I need help implementing support for the DMB instruction on ARM. I
>> >> don't understand the assembler and linker code enough to add the
>> >> instruction and this won't work on ARM without that. Additionally,
>> >> we'd need an interface for generating memory barriers. If the C
>> >> compiler doesn't inline, that's going to be a ton of overhead just to
>> >> generate sfence/lfence/dmb.
>> >
>> >
>> > I am happy to add intrinsics if you demonstrate that the performance win
>> > is
>> > enough. We did this already with prefetch.
>>
>> What benchmarks do you want me to run? I'll only be able to run them
>> reliably on x86 and AMD64 until there's support for DMB. This
>> instruction is required for consistency using the lock-free data
>> structure to guarantee that stores / loads are observed by other
>> threads; it's not really for performance. I guess in the meantime I
>> could make an ASM stub using .BYTE?
>
>
> I'd like to see a version that calls into assembly (you want WORD $x, not
> BYTE, and no dot), compared with a version that doesn't. In the case of
> prefetch, the difference was significant enough to add to the compiler.
>
> To compare with the version that doesn't call into assembly, you can patch
> in CL 7756043, with two caveats:
> 1. It's not done well, and it would need redoing if we keep it.
> 2. The linker emits DSB as 0xFFFFFFFF, which I assume is not the actual
> encoding. You'll need to tweak that before using.


We do not want stand-alone memory fences, the atomic operations we
have include the memory fences. We can make them finer-grained if
necessary (e.g. load-acquire, store-release, cas-release, etc).
While the atomic operations are in assembly, intrinsic for DMB makes no sense.
If anything, we need ATOMIC_LOAD/ATOMIC_STORE/ETC intrinsics.

matthew.horsnell_gmail.com

Late to the party, but keen to test this change list. I am trying to ...

13 years, 2 months ago (2013-03-13 21:55:52 UTC) #19

Late to the party, but keen to test this change list. I am trying to use 
the DSB primitive from inside a linked assembly routine. Is DSB expected to 
work from assembly? I'm currently getting the following errors:

test_arm.s:19 syntax error, last name: DSB

On Tuesday, 12 March 2013 16:26:07 UTC, rsc wrote:
>
> On Tue, Mar 12, 2013 at 12:07 PM, Devon H. O'Dell
<devon...@gmail.com<javascript:>
> > wrote:
>
>> 2013/3/12 Russ Cox <r...@golang.org <javascript:>>:
>> > On Tue, Mar 12, 2013 at 11:50 AM, Devon H. O'Dell
<devon...@gmail.com<javascript:>
>> >
>> > wrote:
>> >>
>> >> 2013/3/12 Russ Cox <r...@golang.org <javascript:>>:
>> >> > The C compiler does not inline.
>> >>
>> >> Maybe we should revisit this after Go 1.1.
>> >>
>> >>
>> >> I need help implementing support for the DMB instruction on ARM. I
>> >> don't understand the assembler and linker code enough to add the
>> >> instruction and this won't work on ARM without that. Additionally,
>> >> we'd need an interface for generating memory barriers. If the C
>> >> compiler doesn't inline, that's going to be a ton of overhead just to
>> >> generate sfence/lfence/dmb.
>> >
>> >
>> > I am happy to add intrinsics if you demonstrate that the performance 
>> win is
>> > enough. We did this already with prefetch.
>>
>> What benchmarks do you want me to run? I'll only be able to run them
>> reliably on x86 and AMD64 until there's support for DMB. This
>> instruction is required for consistency using the lock-free data
>> structure to guarantee that stores / loads are observed by other
>> threads; it's not really for performance. I guess in the meantime I
>> could make an ASM stub using .BYTE?
>>
>
> I'd like to see a version that calls into assembly (you want WORD $x, not 
> BYTE, and no dot), compared with a version that doesn't. In the case of 
> prefetch, the difference was significant enough to add to the compiler. 
>
> To compare with the version that doesn't call into assembly, you can patch 
> in CL 7756043, with two caveats:
> 1. It's not done well, and it would need redoing if we keep it.
> 2. The linker emits DSB as 0xFFFFFFFF, which I assume is not the actual 
> encoding. You'll need to tweak that before using.
>
> Russ
>

rsc

No, it doesn't. You can use WORD $0x12345678 with the constant corrected. Russ

13 years, 2 months ago (2013-03-13 22:00:50 UTC) #20

dvyukov

12 years, 4 months ago (2014-01-01 09:09:03 UTC) #23

R=close
replacement CL: https://codereview.appspot.com/46170044/

Expand All Messages | Collapse All Messages