Issue 317090043: i#1834 memval: Adds memval sample

toshi

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c File api/samples/memval.c (right): https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222 api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void *)memcpy, false, 3, Should we ...

7 years, 3 months ago (2017-01-12 12:12:57 UTC) #2

bruening

https://codereview.appspot.com/317090043/diff/1/api/samples/CMakeLists.txt File api/samples/CMakeLists.txt (right): https://codereview.appspot.com/317090043/diff/1/api/samples/CMakeLists.txt#newcode214 api/samples/CMakeLists.txt:214: add_sample_client(memval "memval.c;utils.c" "drmgr;drreg;drutil;drx") nit: align to match line below ...

7 years, 3 months ago (2017-01-12 20:31:50 UTC) #3

toshi

Responded to your immediate questions, and will mark Done to the style issues and comment ...

7 years, 3 months ago (2017-01-12 21:33:20 UTC) #4

Responded to your immediate questions, and will mark Done to the style issues
and comment requests soon.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c
File api/samples/memval.c (right):

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode165
api/samples/memval.c:165: dr_save_reg(drcontext, ilist, where, reg_tmp,
slot_tmp);
On 2017/01/12 20:31:49, bruening wrote:
> Not sure that this is needed, will comment down below...

Acknowledged.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode198
api/samples/memval.c:198: dr_restore_reg(drcontext, ilist, where, reg_tmp,
slot_tmp);
On 2017/01/12 20:31:50, bruening wrote:
> This is unnecessary.  It's only *DR* spill slots that have that prohibition:
> *drreg* slots are valid across app instrs.
> 
> If you can think of a way to augment the drreg (or DR) docs to make this
clear,
> please add it to the CL or send as a separate CL.

reg_tmp is a drreg-reserved register, so yeah it's value is persisted across app
instructions. But, the problem shown here is that, after we call
drutil_insert_get_mem_addr(), which loads the write address into reg_tmp, we
subsequently clobber it later when loading type/size. However, we need that
value in a reg to persist till after the current instruction. That was the
motivation for performing this spill/restore.

I also considered re-calling drutil_insert_get_mem_addr(), but it's also
non-trivial to do so because the reserved registers could possibly be the same
as the ones used in the memref, which have since been clobbered.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode220
api/samples/memval.c:220: * opt to simply use memcpy to copy the memory into the
write buffer.
On 2017/01/12 20:31:50, bruening wrote:
> What I would do is have fast single-store cases for the common sizes: 1, 2, 4,
8
> (for 64-bit) bytes, and fall back to this only for larger or unusual sizes. 
> Otherwise this is going to dominate the overhead.

Can do. Should I add a method in drx.c?

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222
api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void
*)memcpy, false, 3,
On 2017/01/12 20:31:49, bruening wrote:
> On 2017/01/12 12:12:57, toshi wrote:
> > Should we use memcpy here? It simplifies the code, but we don't know whether
> it
> > is compatible with drx_buf. If we have too many large writes in a row,
> faulting
> > the buffer, will the application crash?
> > 
> > Should we write our own drx_buf_insert_create_memcpy()?
> 
> First, this is a circular buffer, which can't handle var-len writes, right? 
> You're risking overflow instead of wrapping around.

The circular buffer, without the 16K buffer size optimization uses faults as
well. However, I highly doubt that if memcpy() goes over, we'd be able to handle
it, as our handler expects some faulting instruction of the form mov $imm/%reg,
(%base + $disp). I'm assuming memcpy() does some sort of word copy, i.e. *p++ =
*q++, which probably wouldn't produce the necessary instrumentation for us to
catch it.

Also, circular buffer does allow variable-length writes.

> 
> Re: faults: do circular buffers have a redzone for faults at all?
> 
> For faults in memcpy, good question: today a fault in a "client lib" results
in
> an abort.  If memcpy were inlined into the client lib rather than in a private
> libc, it won't work.  But, it seems like we should give the client a chance to
> handle all faults, even in its own lib -- xref i#50.  The risk is that if the
> client state is compromised it might flail while we try to call its event
> handler.  Prob still worth it: I would say let's put in calling of the handler
> (separate CL).

Acknowledged.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode292
api/samples/memval.c:292: * assume no instruction has multiple distinct memory
operands.
On 2017/01/12 20:31:49, bruening wrote:
> AVX2's VSIB unfortunately breaks this: it can reference 4 different doubles
> scattered around memory.  In DR we decided not to try and split it in some
way. 
> Dr. Memory does not yet handle them
> (https://github.com/DynamoRIO/drmemory/issues/1375) so fine to punt here, but
at
> least talk about them with a XXX comment. 

Understood. If it's not worth it, I won't pursue it. Though, one way this would
be possible to implement would be to read the last few elements from the
trace_buffer (as many as there are distinct dest operands) and extract the write
address from each of them, then perform the memcpy per operand.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode317
api/samples/memval.c:317: * instrumentation after the last instruction in a
basic block.
On 2017/01/12 20:31:49, bruening wrote:
> We don't need this if we just have a check for drmgr_is_last_instr() that
> unreserves the reg, right?

If I understand correctly, instead of adding a NOP, we would just always
unreserve all outstanding registers on the last instruction, right?

Usually, this would work fine, since a terminating instruction exists for most
basic blocks. But I've been bitten by this where there was a basic block that
was not terminated by any special instruction -- it was a simple move that wrote
memory. If we blindly unreserved the register, then we'd never write it's memref
to disk (and we can get other insidious errors because the trace buffer has an
entry that's not in the write buffer).

Sign in to reply to this message.

bruening

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c File api/samples/memval.c (right): https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode198 api/samples/memval.c:198: dr_restore_reg(drcontext, ilist, where, reg_tmp, slot_tmp); On 2017/01/12 21:33:20, toshi ...

7 years, 3 months ago (2017-01-12 23:11:01 UTC) #5

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c
File api/samples/memval.c (right):

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode198
api/samples/memval.c:198: dr_restore_reg(drcontext, ilist, where, reg_tmp,
slot_tmp);
On 2017/01/12 21:33:20, toshi wrote:
> On 2017/01/12 20:31:50, bruening wrote:
> > This is unnecessary.  It's only *DR* spill slots that have that prohibition:
> > *drreg* slots are valid across app instrs.
> > 
> > If you can think of a way to augment the drreg (or DR) docs to make this
> clear,
> > please add it to the CL or send as a separate CL.
> 
> reg_tmp is a drreg-reserved register, so yeah it's value is persisted across
app
> instructions. But, the problem shown here is that, after we call
> drutil_insert_get_mem_addr(), which loads the write address into reg_tmp, we
> subsequently clobber it later when loading type/size. However, we need that
> value in a reg to persist till after the current instruction. That was the
> motivation for performing this spill/restore.
> 
> I also considered re-calling drutil_insert_get_mem_addr(), but it's also
> non-trivial to do so because the reserved registers could possibly be the same
> as the ones used in the memref, which have since been clobbered. 

I see.  It may be more efficient to ask drreg for a 3rd reg and copy the address
there, as drreg will avoid extra spill+restore memrefs if there are dead regs.

reg_tmp is only clobbered for ARM, right?

It would be nice to automate the ARM-vs-x86 scratch w/o ifdefs here, but that's
not easy to do w/o moving some drreg reservations inside drx_buf which is a
whole different approach from the current library register model.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode220
api/samples/memval.c:220: * opt to simply use memcpy to copy the memory into the
write buffer.
On 2017/01/12 21:33:20, toshi wrote:
> On 2017/01/12 20:31:50, bruening wrote:
> > What I would do is have fast single-store cases for the common sizes: 1, 2,
4,
> 8
> > (for 64-bit) bytes, and fall back to this only for larger or unusual sizes. 
> > Otherwise this is going to dominate the overhead.
> 
> Can do. Should I add a method in drx.c?

Hmm, since it hasn't come up before maybe just here instead of a library
routine.  OTOH maybe that would be useful.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222
api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void
*)memcpy, false, 3,
On 2017/01/12 21:33:20, toshi wrote:
> On 2017/01/12 20:31:49, bruening wrote:
> > On 2017/01/12 12:12:57, toshi wrote:
> > > Should we use memcpy here? It simplifies the code, but we don't know
whether
> > it
> > > is compatible with drx_buf. If we have too many large writes in a row,
> > faulting
> > > the buffer, will the application crash?
> > > 
> > > Should we write our own drx_buf_insert_create_memcpy()?
> > 
> > First, this is a circular buffer, which can't handle var-len writes, right? 
> > You're risking overflow instead of wrapping around.
> 
> The circular buffer, without the 16K buffer size optimization uses faults as
> well. However, I highly doubt that if memcpy() goes over, we'd be able to
handle
> it, as our handler expects some faulting instruction of the form mov
$imm/%reg,
> (%base + $disp). I'm assuming memcpy() does some sort of word copy, i.e. *p++
=
> *q++, which probably wouldn't produce the necessary instrumentation for us to
> catch it.

Right, that's another issue with memcpy and the fault: we'd have to relax the
fault handler pattern.  OK, so this is why you suggested a drx_buf_*_memcpy. 
Makes more sense now.  Seems reasonable to add it.

> Also, circular buffer does allow variable-length writes.

For the 16K wraparound how does it handle a single var-len store that goes off
the end?  It only wraps on the buf ptr update, not within each write.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode317
api/samples/memval.c:317: * instrumentation after the last instruction in a
basic block.
On 2017/01/12 21:33:20, toshi wrote:
> On 2017/01/12 20:31:49, bruening wrote:
> > We don't need this if we just have a check for drmgr_is_last_instr() that
> > unreserves the reg, right?
> 
> If I understand correctly, instead of adding a NOP, we would just always
> unreserve all outstanding registers on the last instruction, right?
> 
> Usually, this would work fine, since a terminating instruction exists for most
> basic blocks. But I've been bitten by this where there was a basic block that
> was not terminated by any special instruction -- it was a simple move that
wrote
> memory. If we blindly unreserved the register, then we'd never write it's
memref
> to disk (and we can get other insidious errors because the trace buffer has an
> entry that's not in the write buffer).

Yes, right.  So either you have to specially post-insert when on the last instr,
or you insert sthg -- but can you just insert a label instead of a nop?  The
is-app check is afterward (document that users of the drx routine need that
ordering).  Then there's zero cost in executed instrs.

Sign in to reply to this message.

toshi

Done w/ all except the discussion on drx_buf_*_memcpy(). I'll write that up within the next ...

7 years, 3 months ago (2017-01-13 09:53:44 UTC) #6

Done w/ all except the discussion on drx_buf_*_memcpy(). I'll write that up
within the next few patchsets.

https://codereview.appspot.com/317090043/diff/1/api/samples/CMakeLists.txt
File api/samples/CMakeLists.txt (right):

https://codereview.appspot.com/317090043/diff/1/api/samples/CMakeLists.txt#ne...
api/samples/CMakeLists.txt:214: add_sample_client(memval "memval.c;utils.c"
"drmgr;drreg;drutil;drx")
On 2017/01/12 20:31:49, bruening wrote:
> nit: align to match line below

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/CMakeLists.txt#ne...
api/samples/CMakeLists.txt:214: add_sample_client(memval "memval.c;utils.c"
"drmgr;drreg;drutil;drx")
On 2017/01/12 20:31:49, bruening wrote:
> nit: group with memtrace+instrace

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c
File api/samples/memval.c (right):

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode36
api/samples/memval.c:36: * Tracks all app writes and dumps them to a file.
On 2017/01/12 20:31:50, bruening wrote:
> suggest: elaborate on what is being recorded+dumped: the address *and* the
value

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode38
api/samples/memval.c:38: * (1) It fills two per-thread-buffers with inlined
instrumentation.
On 2017/01/12 20:31:49, bruening wrote:
> Explain why two (see related comments below.)

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode39
api/samples/memval.c:39: * (2) It calls a clean call to dump the buffers into a
file.
On 2017/01/12 20:31:49, bruening wrote:
> This doesn't seem accurate: I see a fault handler for that

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode49
api/samples/memval.c:49: *   manner
On 2017/01/12 20:31:49, bruening wrote:
> Isn't the key thing in this sample, getting the value post-instr, missing from
> this list?

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode67
api/samples/memval.c:67: app_pc addr; /* mem ref addr */
On 2017/01/12 20:31:50, bruening wrote:
> Without thinking hard, most people reading this would say "why not put the
value
> here and have just one buffer".  Explaining why not in comments would help.
> How hard would it be to have a var-len component?
> 
> Explain how the code correlates values between the buffers (i.e., how to find
> the value in the other buffer for a given mem_ref_t entry in the trace
buffer).

Right now, I believe the only reason why we cannot use a single buffer is
because we don't know how well memcpy() will work with drx_buf(). If and when we
get around to writing drx_buf_*_memcpy(), would you think it a good idea to
consolidate the buffers?

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode75
api/samples/memval.c:75: * but we give ourselves some leeway and say a write
should generally be under 32 bytes.
On 2017/01/12 20:31:49, bruening wrote:
> Some are quite large: e.g., OP_xsave is hundreds of bytes.  So we're assuming
> here that the *average* size for any 4096-long consecutive sequence is <= 32
> bytes.  Seems reasonable for a sample: can you update the comment: "generally"
> should be more about the "average".

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode102
api/samples/memval.c:102: static void
On 2017/01/12 20:31:50, bruening wrote:
> suggest: a comment explaining "fault" (i.e., it's not an unexpected fatal
crash)

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode115
api/samples/memval.c:115: for (mem_ref = trace_base; mem_ref < trace_ptr;
mem_ref++)
On 2017/01/12 20:31:50, bruening wrote:
> style violation: {} for multi-line body

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode127
api/samples/memval.c:127: write_hexdump(hex_buf, write_base, mem_ref));
On 2017/01/12 20:31:49, bruening wrote:
> Use libc fprintf (see memtrace_simple.c and i#1929)

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode127
api/samples/memval.c:127: write_hexdump(hex_buf, write_base, mem_ref));
On 2017/01/12 20:31:49, bruening wrote:
> Add comments that a binary dump is *much* faster than fprintf.
> 
> Should this be renamed memval_simple.c?  memtrace and instrace have the
_simple
> use text, while the _x86 uses binary by default (along with having other
instru
> opts).

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode145
api/samples/memval.c:145: if (drreg_reserve_register(drcontext, ilist, where,
NULL, &reg_tmp) != DRREG_SUCCESS) {
On 2017/01/12 20:31:50, bruening wrote:
> style: line too long

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode149
api/samples/memval.c:149: if (drreg_reserve_register(drcontext, ilist, where,
NULL, &reg_ptr) != DRREG_SUCCESS) {
On 2017/01/12 20:31:49, bruening wrote:
> style: line too long

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode162
api/samples/memval.c:162: opnd_create_reg(reg_tmp), OPSZ_PTR,
offsetof(mem_ref_t, addr));
On 2017/01/12 20:31:50, bruening wrote:
> style: line too long

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode174
api/samples/memval.c:174: OPND_CREATE_INT16(size), OPSZ_2, offsetof(mem_ref_t,
size));
On 2017/01/12 20:31:50, bruening wrote:
> style: line too long

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode178
api/samples/memval.c:178: if (instr_is_call(where)) {
On 2017/01/12 20:31:50, bruening wrote:
> This should be x86-only as on ARM a call writes to the link register, not
> memory.  I guess this function won't be called on ARM: so maybe an assert or
> comment if not an actual ifdef?  Either way.

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode186
api/samples/memval.c:186: drx_buf_insert_buf_store(drcontext, trace_buffer,
ilist, where, reg_ptr, DR_REG_NULL,
On 2017/01/12 20:31:50, bruening wrote:
> style: line too long

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode198
api/samples/memval.c:198: dr_restore_reg(drcontext, ilist, where, reg_tmp,
slot_tmp);
On 2017/01/12 23:11:01, bruening wrote:
> On 2017/01/12 21:33:20, toshi wrote:
> > On 2017/01/12 20:31:50, bruening wrote:
> > > This is unnecessary.  It's only *DR* spill slots that have that
prohibition:
> > > *drreg* slots are valid across app instrs.
> > > 
> > > If you can think of a way to augment the drreg (or DR) docs to make this
> > clear,
> > > please add it to the CL or send as a separate CL.
> > 
> > reg_tmp is a drreg-reserved register, so yeah it's value is persisted across
> app
> > instructions. But, the problem shown here is that, after we call
> > drutil_insert_get_mem_addr(), which loads the write address into reg_tmp, we
> > subsequently clobber it later when loading type/size. However, we need that
> > value in a reg to persist till after the current instruction. That was the
> > motivation for performing this spill/restore.
> > 
> > I also considered re-calling drutil_insert_get_mem_addr(), but it's also
> > non-trivial to do so because the reserved registers could possibly be the
same
> > as the ones used in the memref, which have since been clobbered. 
> 
> I see.  It may be more efficient to ask drreg for a 3rd reg and copy the
address
> there, as drreg will avoid extra spill+restore memrefs if there are dead regs.
> 
> reg_tmp is only clobbered for ARM, right?
> 
> It would be nice to automate the ARM-vs-x86 scratch w/o ifdefs here, but
that's
> not easy to do w/o moving some drreg reservations inside drx_buf which is a
> whole different approach from the current library register model.

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode212
api/samples/memval.c:212: if (drreg_reserve_register(drcontext, ilist, where,
NULL, &reg_ptr) != DRREG_SUCCESS) {
On 2017/01/12 20:31:50, bruening wrote:
> style: line too long

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode220
api/samples/memval.c:220: * opt to simply use memcpy to copy the memory into the
write buffer.
On 2017/01/12 23:11:01, bruening wrote:
> On 2017/01/12 21:33:20, toshi wrote:
> > On 2017/01/12 20:31:50, bruening wrote:
> > > What I would do is have fast single-store cases for the common sizes: 1,
2,
> 4,
> > 8
> > > (for 64-bit) bytes, and fall back to this only for larger or unusual
sizes. 
> > > Otherwise this is going to dominate the overhead.
> > 
> > Can do. Should I add a method in drx.c?
> 
> Hmm, since it hasn't come up before maybe just here instead of a library
> routine.  OTOH maybe that would be useful.

Acknowledged, see comment below.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222
api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void
*)memcpy, false, 3,
On 2017/01/12 23:11:01, bruening wrote:
> On 2017/01/12 21:33:20, toshi wrote:
> > On 2017/01/12 20:31:49, bruening wrote:
> > > On 2017/01/12 12:12:57, toshi wrote:
> > > > Should we use memcpy here? It simplifies the code, but we don't know
> whether
> > > it
> > > > is compatible with drx_buf. If we have too many large writes in a row,
> > > faulting
> > > > the buffer, will the application crash?
> > > > 
> > > > Should we write our own drx_buf_insert_create_memcpy()?
> > > 
> > > First, this is a circular buffer, which can't handle var-len writes,
right? 
> > > You're risking overflow instead of wrapping around.
> > 
> > The circular buffer, without the 16K buffer size optimization uses faults as
> > well. However, I highly doubt that if memcpy() goes over, we'd be able to
> handle
> > it, as our handler expects some faulting instruction of the form mov
> $imm/%reg,
> > (%base + $disp). I'm assuming memcpy() does some sort of word copy, i.e.
*p++
> =
> > *q++, which probably wouldn't produce the necessary instrumentation for us
to
> > catch it.
> 
> Right, that's another issue with memcpy and the fault: we'd have to relax the
> fault handler pattern.  OK, so this is why you suggested a drx_buf_*_memcpy. 
> Makes more sense now.  Seems reasonable to add it.

Gotcha, will add drx_buf_*_memcpy(). This will also potentially have the
optimization for common write sizes mentioned above (2, 4, 8 etc).

> 
> > Also, circular buffer does allow variable-length writes.
> 
> For the 16K wraparound how does it handle a single var-len store that goes off
> the end?  It only wraps on the buf ptr update, not within each write.

Yup, we were thinking of adding some slack to the 16K buffer, but we also
admonish in the docs that you should only write fixed-length elements whose
sizes are all divisors of the 16K buffer size, so I didn't add any slack region.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode292
api/samples/memval.c:292: * assume no instruction has multiple distinct memory
operands.
On 2017/01/12 20:31:50, bruening wrote:
> You mean "no instruction has multiple distinct *stores*", right?  Certainly
> several have a load and a store (push, movs, etc.)

Done.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode317
api/samples/memval.c:317: * instrumentation after the last instruction in a
basic block.
On 2017/01/12 23:11:01, bruening wrote:
> On 2017/01/12 21:33:20, toshi wrote:
> > On 2017/01/12 20:31:49, bruening wrote:
> > > We don't need this if we just have a check for drmgr_is_last_instr() that
> > > unreserves the reg, right?
> > 
> > If I understand correctly, instead of adding a NOP, we would just always
> > unreserve all outstanding registers on the last instruction, right?
> > 
> > Usually, this would work fine, since a terminating instruction exists for
most
> > basic blocks. But I've been bitten by this where there was a basic block
that
> > was not terminated by any special instruction -- it was a simple move that
> wrote
> > memory. If we blindly unreserved the register, then we'd never write it's
> memref
> > to disk (and we can get other insidious errors because the trace buffer has
an
> > entry that's not in the write buffer).
> 
> Yes, right.  So either you have to specially post-insert when on the last
instr,
> or you insert sthg -- but can you just insert a label instead of a nop?  The
> is-app check is afterward (document that users of the drx routine need that
> ordering).  Then there's zero cost in executed instrs.

Done, great idea.

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode378
api/samples/memval.c:378: if(drreg_init(&ops) != DRREG_SUCCESS)
On 2017/01/12 20:31:49, bruening wrote:
> style: "if ("

Done.

Sign in to reply to this message.

toshi

Commit log for latest patchset: --------------- i#1834 memval: Adds memval sample Introduces a sample client ...

7 years, 3 months ago (2017-01-13 09:59:12 UTC) #7

bruening

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c File api/samples/memval.c (right): https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222 api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void *)memcpy, false, 3, On 2017/01/13 ...

7 years, 3 months ago (2017-01-13 17:26:59 UTC) #8

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c
File api/samples/memval.c (right):

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222
api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void
*)memcpy, false, 3,
On 2017/01/13 09:53:44, toshi wrote:
> On 2017/01/12 23:11:01, bruening wrote:
> > On 2017/01/12 21:33:20, toshi wrote:
> > > On 2017/01/12 20:31:49, bruening wrote:
> > > > On 2017/01/12 12:12:57, toshi wrote:
> > > > > Should we use memcpy here? It simplifies the code, but we don't know
> > whether
> > > > it
> > > > > is compatible with drx_buf. If we have too many large writes in a row,
> > > > faulting
> > > > > the buffer, will the application crash?
> > > > > 
> > > > > Should we write our own drx_buf_insert_create_memcpy()?
> > > > 
> > > > First, this is a circular buffer, which can't handle var-len writes,
> right? 
> > > > You're risking overflow instead of wrapping around.
> > > 
> > > The circular buffer, without the 16K buffer size optimization uses faults
as
> > > well. However, I highly doubt that if memcpy() goes over, we'd be able to
> > handle
> > > it, as our handler expects some faulting instruction of the form mov
> > $imm/%reg,
> > > (%base + $disp). I'm assuming memcpy() does some sort of word copy, i.e.
> *p++
> > =
> > > *q++, which probably wouldn't produce the necessary instrumentation for us
> to
> > > catch it.
> > 
> > Right, that's another issue with memcpy and the fault: we'd have to relax
the
> > fault handler pattern.  OK, so this is why you suggested a drx_buf_*_memcpy.

> > Makes more sense now.  Seems reasonable to add it.
> 
> Gotcha, will add drx_buf_*_memcpy(). This will also potentially have the
> optimization for common write sizes mentioned above (2, 4, 8 etc).
> 
> > 
> > > Also, circular buffer does allow variable-length writes.
> > 
> > For the 16K wraparound how does it handle a single var-len store that goes
off
> > the end?  It only wraps on the buf ptr update, not within each write.
> 
> Yup, we were thinking of adding some slack to the 16K buffer, but we also
> admonish in the docs that you should only write fixed-length elements whose
> sizes are all divisors of the 16K buffer size, so I didn't add any slack
region.

So you're retracting your comment "circular buffer does allow variable-length
writes" :) ?  Didn't look at new patchset yet, presumably it does not use a
circular anymore.

Sign in to reply to this message.

bruening

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c File api/samples/memval.c (right): https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222 api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void *)memcpy, false, 3, On 2017/01/13 ...

7 years, 3 months ago (2017-01-13 17:35:50 UTC) #9

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c
File api/samples/memval.c (right):

https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222
api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void
*)memcpy, false, 3,
On 2017/01/13 17:26:59, bruening wrote:
> On 2017/01/13 09:53:44, toshi wrote:
> > On 2017/01/12 23:11:01, bruening wrote:
> > > On 2017/01/12 21:33:20, toshi wrote:
> > > > On 2017/01/12 20:31:49, bruening wrote:
> > > > > On 2017/01/12 12:12:57, toshi wrote:
> > > > > > Should we use memcpy here? It simplifies the code, but we don't know
> > > whether
> > > > > it
> > > > > > is compatible with drx_buf. If we have too many large writes in a
row,
> > > > > faulting
> > > > > > the buffer, will the application crash?
> > > > > > 
> > > > > > Should we write our own drx_buf_insert_create_memcpy()?
> > > > > 
> > > > > First, this is a circular buffer, which can't handle var-len writes,
> > right? 
> > > > > You're risking overflow instead of wrapping around.
> > > > 
> > > > The circular buffer, without the 16K buffer size optimization uses
faults
> as
> > > > well. However, I highly doubt that if memcpy() goes over, we'd be able
to
> > > handle
> > > > it, as our handler expects some faulting instruction of the form mov
> > > $imm/%reg,
> > > > (%base + $disp). I'm assuming memcpy() does some sort of word copy, i.e.
> > *p++
> > > =
> > > > *q++, which probably wouldn't produce the necessary instrumentation for
us
> > to
> > > > catch it.
> > > 
> > > Right, that's another issue with memcpy and the fault: we'd have to relax
> the
> > > fault handler pattern.  OK, so this is why you suggested a
drx_buf_*_memcpy.
> 
> > > Makes more sense now.  Seems reasonable to add it.
> > 
> > Gotcha, will add drx_buf_*_memcpy(). This will also potentially have the
> > optimization for common write sizes mentioned above (2, 4, 8 etc).
> > 
> > > 
> > > > Also, circular buffer does allow variable-length writes.
> > > 
> > > For the 16K wraparound how does it handle a single var-len store that goes
> off
> > > the end?  It only wraps on the buf ptr update, not within each write.
> > 
> > Yup, we were thinking of adding some slack to the 16K buffer, but we also
> > admonish in the docs that you should only write fixed-length elements whose
> > sizes are all divisors of the 16K buffer size, so I didn't add any slack
> region.
> 
> So you're retracting your comment "circular buffer does allow variable-length
> writes" :) ?  Didn't look at new patchset yet, presumably it does not use a
> circular anymore.

Sent too soon.  Further discussion with myself to be sure I understand: so we
both agree 16K circular buffer does not handle var-len.  But for non-16K
circular buffer that uses faults: for a single-instr var-len write, the fault
handling code re-executes the store with the buf ptr set back to the top?  So
it's up to the user to know where the final write at the bottom ends (could be a
gap) -- which could work here if the code reading the buffer checks for the end
and moves to the start if it doesn't fit.

But for memcpy with multiple stores, it's going to split and wrap around.  I
guess that can be made to work too, if the code here checks for the end and
merges together the two pieces.

But imagine a 4-byte gap at the bottom, and we want to write 8 bytes.  How does
the user know whether the memcpy optimized and did a single 8-byte store which
faulted and got moved (leaving a gap) or put 4 bytes at the bottom and 4 at the
top?  The fault handler has to be consistent I guess.

Sign in to reply to this message.

toshi

On 2017/01/13 17:35:50, bruening wrote: > https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c > File api/samples/memval.c (right): > > https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222 > ...

7 years, 3 months ago (2017-01-13 20:15:48 UTC) #10

On 2017/01/13 17:35:50, bruening wrote:
> https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c
> File api/samples/memval.c (right):
> 
>
https://codereview.appspot.com/317090043/diff/1/api/samples/memval.c#newcode222
> api/samples/memval.c:222: dr_insert_clean_call(drcontext, ilist, where, (void
> *)memcpy, false, 3,
> On 2017/01/13 17:26:59, bruening wrote:
> > On 2017/01/13 09:53:44, toshi wrote:
> > > On 2017/01/12 23:11:01, bruening wrote:
> > > > On 2017/01/12 21:33:20, toshi wrote:
> > > > > On 2017/01/12 20:31:49, bruening wrote:
> > > > > > On 2017/01/12 12:12:57, toshi wrote:
> > > > > > > Should we use memcpy here? It simplifies the code, but we don't
know
> > > > whether
> > > > > > it
> > > > > > > is compatible with drx_buf. If we have too many large writes in a
> row,
> > > > > > faulting
> > > > > > > the buffer, will the application crash?
> > > > > > > 
> > > > > > > Should we write our own drx_buf_insert_create_memcpy()?
> > > > > > 
> > > > > > First, this is a circular buffer, which can't handle var-len writes,
> > > right? 
> > > > > > You're risking overflow instead of wrapping around.
> > > > > 
> > > > > The circular buffer, without the 16K buffer size optimization uses
> faults
> > as
> > > > > well. However, I highly doubt that if memcpy() goes over, we'd be able
> to
> > > > handle
> > > > > it, as our handler expects some faulting instruction of the form mov
> > > > $imm/%reg,
> > > > > (%base + $disp). I'm assuming memcpy() does some sort of word copy,
i.e.
> > > *p++
> > > > =
> > > > > *q++, which probably wouldn't produce the necessary instrumentation
for
> us
> > > to
> > > > > catch it.
> > > > 
> > > > Right, that's another issue with memcpy and the fault: we'd have to
relax
> > the
> > > > fault handler pattern.  OK, so this is why you suggested a
> drx_buf_*_memcpy.
> > 
> > > > Makes more sense now.  Seems reasonable to add it.
> > > 
> > > Gotcha, will add drx_buf_*_memcpy(). This will also potentially have the
> > > optimization for common write sizes mentioned above (2, 4, 8 etc).
> > > 
> > > > 
> > > > > Also, circular buffer does allow variable-length writes.
> > > > 
> > > > For the 16K wraparound how does it handle a single var-len store that
goes
> > off
> > > > the end?  It only wraps on the buf ptr update, not within each write.
> > > 
> > > Yup, we were thinking of adding some slack to the 16K buffer, but we also
> > > admonish in the docs that you should only write fixed-length elements
whose
> > > sizes are all divisors of the 16K buffer size, so I didn't add any slack
> > region.
> > 
> > So you're retracting your comment "circular buffer does allow
variable-length
> > writes" :) ?  Didn't look at new patchset yet, presumably it does not use a
> > circular anymore.
> 
> Sent too soon.  Further discussion with myself to be sure I understand: so we
> both agree 16K circular buffer does not handle var-len.  But for non-16K
> circular buffer that uses faults: for a single-instr var-len write, the fault
> handling code re-executes the store with the buf ptr set back to the top?  So
> it's up to the user to know where the final write at the bottom ends (could be
a
> gap) -- which could work here if the code reading the buffer checks for the
end
> and moves to the start if it doesn't fit.
> 
> But for memcpy with multiple stores, it's going to split and wrap around.  I
> guess that can be made to work too, if the code here checks for the end and
> merges together the two pieces.
> 
> But imagine a 4-byte gap at the bottom, and we want to write 8 bytes.  How
does
> the user know whether the memcpy optimized and did a single 8-byte store which
> faulted and got moved (leaving a gap) or put 4 bytes at the bottom and 4 at
the
> top?  The fault handler has to be consistent I guess.

Right, we're using the faulting buffer which allows var-length writes. For the
fast circular buffer we advise not to.

And yes, I think memcpy() would have this problem -- in our implementation of
drx_buf_memcpy(), we would instead have to incrementally update the buffer
pointer each iteration of the loop, as a sort of "commit" operation to let the
fault handler know where the last partial write ended. That way we would know
exactly where the partial write left off, and we could fix it in the handler.

example instrumentation for drx_buf_memcpy():

loop:
  test ecx, ecx
  jz done
  ; edx holds write_addr
  ; ebx holds buf_ptr
  mov eax, [edx + ecx*4]
  mov [ebx], eax
  ; instrumentation for drx_insert_update_buf_ptr()
  add ebx, 4
  mov tls_slot, ebx  ; acts as a "commit" operation to write back to TLS
  jmp loop
done:
  ...

Sign in to reply to this message.

toshi

Commit log for latest patchset: --------------- i#1834 memval: Adds memval sample Introduces a sample client ...

7 years, 3 months ago (2017-01-15 07:01:20 UTC) #11

toshi

Patch set message says it all, I implemented the optimization for gpr-sized writes instead of ...

7 years, 3 months ago (2017-01-15 07:24:35 UTC) #12

toshi

https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c File ext/drx/drx_buf.c (right): https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c#newcode695 ext/drx/drx_buf.c:695: /* This could happen if, for example we tried ...

7 years, 3 months ago (2017-01-16 01:55:43 UTC) #13

bruening

https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c File ext/drx/drx_buf.c (right): https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c#newcode653 ext/drx/drx_buf.c:653: insert_load(void *drcontext, instrlist_t *ilist, instr_t *where, On 2017/01/15 07:24:34, ...

7 years, 3 months ago (2017-01-16 20:22:51 UTC) #14

https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c
File ext/drx/drx_buf.c (right):

https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c#newcode653
ext/drx/drx_buf.c:653: insert_load(void *drcontext, instrlist_t *ilist, instr_t
*where,
On 2017/01/15 07:24:34, toshi wrote:
> I've come into this problem a few times before, where I wanted a generic
> load/store that worked on any gpr. Maybe in a separate CL we change
> drx_buf_insert_buf_store() to drx_insert_store(), and make
> drx_buf_insert_buf_store() just redirect to drx_insert_store() so there's no
API
> change.
> 
> Conceivably we could also write a drx_insert_load() with insert_load() as the
> base (though it needs a bit of work done on it, as it's currently tailored to
> fit our needs as of now).

For a load, see below re: movzx.

https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c#newcode695
ext/drx/drx_buf.c:695: /* This could happen if, for example we tried to resize
the base
On 2017/01/16 01:55:43, toshi wrote:
> On 2017/01/15 07:24:34, toshi wrote:
> > Is this a good solution? Should we instead exclude ebp and some other
> registers
> > from drreg()? Not sure which is worse for performance: drreg excluding some
> > registers in 32-bit and thus thrashing trying to spill/restore registers, or
> > drreg allowing ebp and us having to fall back onto the clean call for many
> (how
> > many?) 1-byte writes.
> 
> This is actually a problem with drx_buf_insert_buf_store() as well. No client
> that I know of uses XINST_CREATE_store_1byte(), so I've never run into this
> problem before.
> 
> Should we put this burden on the client to explicitly disallow reserving esi,
> edi and ebp? This would fix all our problems, but in that case maybe we should
> create a method drreg_init_vector_at_least_size() which takes an opnd_size_t
and
> will fill a drvector with only registers which can be resized to that operand
> size (a no-op on AArchXX). This way clients only hit a potential slowdown when
> they know that they require a register which can be resized to it's 1-byte
> counterpart.
> 
> I can do the above in a separate CL before moving on here, if need be.

For a load, often the simplest solution is to use movzx to load 1 byte into the
full register.  This matches the behavior on ARM where the ISA only provides
this type of 1-byte or 2-byte load that zero-expands and writes the whole
register.  Maybe there should be an XINST_CREATE_load_1byte_zextend() or sthg
for this.

For this particular store to the buffer, on x86, we may be able to get away with
writing the whole word, rely on it being little-endian, and only update the buf
ptr by 1 (or 2).  And on ARM there's no problem.

Another alternative for x86 is OP_movs but it takes too much setup as you need 3
specific regs.

Sign in to reply to this message.

toshi

On 2017/01/16 20:22:51, bruening wrote: > https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c > File ext/drx/drx_buf.c (right): > > https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c#newcode653 > ...

7 years, 3 months ago (2017-01-18 04:17:14 UTC) #15

On 2017/01/16 20:22:51, bruening wrote:
> https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c
> File ext/drx/drx_buf.c (right):
> 
>
https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c#newcode653
> ext/drx/drx_buf.c:653: insert_load(void *drcontext, instrlist_t *ilist,
instr_t
> *where,
> On 2017/01/15 07:24:34, toshi wrote:
> > I've come into this problem a few times before, where I wanted a generic
> > load/store that worked on any gpr. Maybe in a separate CL we change
> > drx_buf_insert_buf_store() to drx_insert_store(), and make
> > drx_buf_insert_buf_store() just redirect to drx_insert_store() so there's no
> API
> > change.
> > 
> > Conceivably we could also write a drx_insert_load() with insert_load() as
the
> > base (though it needs a bit of work done on it, as it's currently tailored
to
> > fit our needs as of now).
> 
> For a load, see below re: movzx.
> 
>
https://codereview.appspot.com/317090043/diff/40001/ext/drx/drx_buf.c#newcode695
> ext/drx/drx_buf.c:695: /* This could happen if, for example we tried to resize
> the base
> On 2017/01/16 01:55:43, toshi wrote:
> > On 2017/01/15 07:24:34, toshi wrote:
> > > Is this a good solution? Should we instead exclude ebp and some other
> > registers
> > > from drreg()? Not sure which is worse for performance: drreg excluding
some
> > > registers in 32-bit and thus thrashing trying to spill/restore registers,
or
> > > drreg allowing ebp and us having to fall back onto the clean call for many
> > (how
> > > many?) 1-byte writes.
> > 
> > This is actually a problem with drx_buf_insert_buf_store() as well. No
client
> > that I know of uses XINST_CREATE_store_1byte(), so I've never run into this
> > problem before.
> > 
> > Should we put this burden on the client to explicitly disallow reserving
esi,
> > edi and ebp? This would fix all our problems, but in that case maybe we
should
> > create a method drreg_init_vector_at_least_size() which takes an opnd_size_t
> and
> > will fill a drvector with only registers which can be resized to that
operand
> > size (a no-op on AArchXX). This way clients only hit a potential slowdown
when
> > they know that they require a register which can be resized to it's 1-byte
> > counterpart.
> > 
> > I can do the above in a separate CL before moving on here, if need be.
> 
> For a load, often the simplest solution is to use movzx to load 1 byte into
the
> full register.  This matches the behavior on ARM where the ISA only provides
> this type of 1-byte or 2-byte load that zero-expands and writes the whole
> register.  Maybe there should be an XINST_CREATE_load_1byte_zextend() or sthg
> for this.
> 
> For this particular store to the buffer, on x86, we may be able to get away
with
> writing the whole word, rely on it being little-endian, and only update the
buf
> ptr by 1 (or 2).  And on ARM there's no problem.
> 
> Another alternative for x86 is OP_movs but it takes too much setup as you need
3
> specific regs.

Done.

Sign in to reply to this message.

toshi

Commit log for latest patchset: --------------- i#1834 memval: Adds memval sample Introduces a sample client ...

7 years, 3 months ago (2017-01-18 04:23:27 UTC) #16

toshi

https://codereview.appspot.com/317090043/diff/60001/core/arch/x86/instr_create.h File core/arch/x86/instr_create.h (right): https://codereview.appspot.com/317090043/diff/60001/core/arch/x86/instr_create.h#newcode172 core/arch/x86/instr_create.h:172: * followed by a zextend. Oops, thinking about it, ...

7 years, 3 months ago (2017-01-18 04:25:19 UTC) #17

bruening

It's cleaner to separate each feature into its own CL: so the drx addition on ...

7 years, 3 months ago (2017-01-18 04:32:11 UTC) #18

toshi

On 2017/01/18 04:32:11, bruening wrote: > It's cleaner to separate each feature into its own ...

7 years, 3 months ago (2017-01-18 16:23:53 UTC) #19

Edmund.Grimley.Evans

https://codereview.appspot.com/317090043/diff/60001/api/samples/memval_simple.c File api/samples/memval_simple.c (right): https://codereview.appspot.com/317090043/diff/60001/api/samples/memval_simple.c#newcode142 api/samples/memval_simple.c:142: fprintf(data->logf, ""PFX": %s %2d %s\n", Putting no space between ...

7 years, 3 months ago (2017-01-19 09:15:26 UTC) #20

toshi

Derek: For memval Part 3, should I upload only the patchset related to drx_buf_insert_memcpy(), or ...

7 years, 3 months ago (2017-01-23 18:29:52 UTC) #21

bruening

On 2017/01/23 18:29:52, toshi wrote: > Derek: For memval Part 3, should I upload only ...

7 years, 3 months ago (2017-01-23 18:59:04 UTC) #22

toshi.piazza

7 years, 2 months ago (2017-02-13 05:53:47 UTC) #23

On 2017/01/23 18:59:04, bruening wrote:
> On 2017/01/23 18:29:52, toshi wrote:
> > Derek: For memval Part 3, should I upload only the patchset related to
> > drx_buf_insert_memcpy(), or should I upload all of memval_simple (i.e. this
> > patch w/o the zextend and drx padding stuff).
> 
> I guess it seems cleanest to have the drx_buf feature be separate with a test
of
> its own in suite/ and then have the new memval sample added in its own CL.  Of
> course that's easy for me to say, you're the one doing the work -- doesn't
seem
> so bad to combine them, but I would vote for separate.

Moved to GitHub UI: https://github.com/DynamoRIO/dynamorio/pull/2174

Sign in to reply to this message.

Issue 317090043: i#1834 memval: Adds memval sample

Description

Patch Set 1 #

Patch Set 2 : Addresses reviewer concerns #

Patch Set 3 : Addresses optimization for gpr-sized writes only #

Patch Set 4 : Introduces correct load/store behavior in presence of E{BP,SI,DI} #

Messages