|
|
Descriptionencoding/xml: Marshal/Escape allows invalid characters
Fixes issue 4235.
Patch Set 1 #Patch Set 2 : diff -r 2f04e02f33c0 https://code.google.com/p/go #Patch Set 3 : diff -r 2f04e02f33c0 https://code.google.com/p/go #
Total comments: 14
Patch Set 4 : diff -r a1ce0637ab2b https://code.google.com/p/go #Patch Set 5 : diff -r 149360268947 https://code.google.com/p/go #
Total comments: 2
Patch Set 6 : diff -r 149360268947 https://code.google.com/p/go #MessagesTotal messages: 19
Hello golang-dev@googlegroups.com (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go
Sign in to reply to this message.
On 2013/03/01 01:56:47, osaingre wrote: > Hello mailto:golang-dev@googlegroups.com (cc: mailto:golang-dev@googlegroups.com), > > I'd like you to review this change to > https://code.google.com/p/go Thank you for this proposal. Can you provide any benchmark details to assert the cost of this additional checking ?
Sign in to reply to this message.
On 2013/03/02 02:01:59, dfc wrote: > On 2013/03/01 01:56:47, osaingre wrote: > > Hello mailto:golang-dev@googlegroups.com (cc: > mailto:golang-dev@googlegroups.com), > > > > I'd like you to review this change to > > https://code.google.com/p/go > > > Thank you for this proposal. Can you provide any benchmark details to assert the > cost of this additional checking ? BenchmarkMarshal marshals a single struct to XML. It runs 7% slower after the patch. Before: BenchmarkMarshal 100000 26231 ns/op After: BenchmarkMarshal 100000 28058 ns/op
Sign in to reply to this message.
On 2013/03/03 19:56:24, osaingre wrote: > On 2013/03/02 02:01:59, dfc wrote: > > On 2013/03/01 01:56:47, osaingre wrote: > > > Hello mailto:golang-dev@googlegroups.com (cc: > > mailto:golang-dev@googlegroups.com), > > > > > > I'd like you to review this change to > > > https://code.google.com/p/go > > > > > > Thank you for this proposal. Can you provide any benchmark details to assert > the > > cost of this additional checking ? > > BenchmarkMarshal marshals a single struct to XML. It runs 7% slower after the > patch. > > Before: > BenchmarkMarshal 100000 26231 ns/op > After: > BenchmarkMarshal 100000 28058 ns/op Ping :-)
Sign in to reply to this message.
This is an inefficient implementation. It adds another scan over the data, plus some allocation through bytes.Map. It's possible to do this much more efficiently. I suggest starting over with an implementation that incorporates the error check into the escape processing loop that's already there. That said, there's no rush. I doubt this will make it into Go 1.1.
Sign in to reply to this message.
On 2013/03/08 21:03:34, r wrote: > This is an inefficient implementation. It adds another scan over the data, plus > some allocation through bytes.Map. It's possible to do this much more > efficiently. > > I suggest starting over with an implementation that incorporates the error check > into the escape processing loop that's already there. > > That said, there's no rush. I doubt this will make it into Go 1.1. Ok thanks, I'll look into it.
Sign in to reply to this message.
https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go File src/pkg/encoding/xml/xml.go (right): https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1725: func isInvalidChar(r rune) bool { This function is redundant. You may use the existing isInCharacterRange from line 1002. https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1743: return '\uFFFD' Instead of '\uFFFD' you might want to use utf8.RuneError which might be clearer, even without the comment.
Sign in to reply to this message.
This CL can be made simple enough that I would like to see this fix go into Go 1.1. See below. Thanks. https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go File src/pkg/encoding/xml/xml.go (right): https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1737: // Returns a byte slice in which all non XML 1.0 compliant runes http://golang.org/doc/effective_go.html#commentary // fixRune converts an invalid rune to the Unicode replacement character. func fixRune(r rune) rune { if isInCharacterRange(r) { return r } return '\uFFFD' } https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1754: func escape(w io.Writer, s []byte, strictCharset bool) error { Please make this EscapeText, drop strictCharset (assume it is always true) and restore the original doc comment. https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1755: if i := bytes.IndexFunc(s, isInvalidChar); i != -1 { bytes.Map is going to do this anyway. It's not worth doing twice. s = bytes.Map(fixRune, s) https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1809: escape(w, s, false) Please restore this to just EscapeText(w, s). The only difference between Escape and EscapeText should concern the returned error, as it says in the doc comment; there should not be additional changes like Escape returns invalid XML and EscapeText returns valid XML.
Sign in to reply to this message.
https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go File src/pkg/encoding/xml/xml.go (right): https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1782: default: Actually, you can just skip the whole bytes.Map thing entirely by making this: default: if !isInCharacterRange(c) { esc = esc_fffd break } continue with esc_fffd = []byte("\uFFFD") above.
Sign in to reply to this message.
Hello, I made the suggested changes. Benchmark is as follows: benchmark old ns/op new ns/op delta BenchmarkMarshal 25989 31170 +19.94% Olivier https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go File src/pkg/encoding/xml/xml.go (right): https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1725: func isInvalidChar(r rune) bool { On 2013/03/11 12:10:55, volker.dobler wrote: > This function is redundant. You may use the existing > isInCharacterRange from line 1002. Done. https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1737: // Returns a byte slice in which all non XML 1.0 compliant runes On 2013/03/12 19:51:37, rsc wrote: > http://golang.org/doc/effective_go.html#commentary > > // fixRune converts an invalid rune to the Unicode replacement character. > func fixRune(r rune) rune { > if isInCharacterRange(r) { > return r > } > return '\uFFFD' > } > Done. https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1754: func escape(w io.Writer, s []byte, strictCharset bool) error { On 2013/03/12 19:51:37, rsc wrote: > Please make this EscapeText, drop strictCharset (assume it is always true) and > restore the original doc comment. > Done. https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1755: if i := bytes.IndexFunc(s, isInvalidChar); i != -1 { On 2013/03/12 19:51:37, rsc wrote: > bytes.Map is going to do this anyway. It's not worth doing twice. > > s = bytes.Map(fixRune, s) Done. https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1782: default: On 2013/03/12 19:54:07, rsc wrote: > Actually, you can just skip the whole bytes.Map thing entirely by making this: > > default: > if !isInCharacterRange(c) { > esc = esc_fffd > break > } > continue > > with esc_fffd = []byte("\uFFFD") above. I feel I can't do that unless I iterate on the runes. Am I missing something? https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1809: escape(w, s, false) On 2013/03/12 19:51:37, rsc wrote: > Please restore this to just EscapeText(w, s). > The only difference between Escape and EscapeText should concern the returned > error, as it says in the doc comment; there should not be additional changes > like Escape returns invalid XML and EscapeText returns valid XML. Done.
Sign in to reply to this message.
https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go File src/pkg/encoding/xml/xml.go (right): https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... src/pkg/encoding/xml/xml.go:1782: default: s is a slice of bytes, so you need to be a little smarter but yes, you can. the loop needs to extract the next utf-8-encoded rune, using utf8.DecodeRune. that's what bytes.Map is doing anyway, so it won't cost more than you have now and will iterate over the slice only once. r, width := utf8.DecodeRune(s) switch r { ... default: if isInvalidChar(r) { .. } } etc.
Sign in to reply to this message.
On 2013/03/13 01:27:04, r wrote: > https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go > File src/pkg/encoding/xml/xml.go (right): > > https://codereview.appspot.com/7438051/diff/5001/src/pkg/encoding/xml/xml.go#... > src/pkg/encoding/xml/xml.go:1782: default: > s is a slice of bytes, so you need to be a little smarter but yes, you can. > > the loop needs to extract the next utf-8-encoded rune, using utf8.DecodeRune. > that's what bytes.Map is doing anyway, so it won't cost more than you have now > and will iterate over the slice only once. > > r, width := utf8.DecodeRune(s) > switch r { > ... > default: > if isInvalidChar(r) { > .. > } > } > etc. Ok. That's what I intended to do in the first place when I say your remark last week. I got confused when I saw the latest suggestions. I'll do that.
Sign in to reply to this message.
I removed the call to bytes.Map(). PTAL. benchmark old ns/op new ns/op delta BenchmarkMarshal 25833 29745 +15.14%
Sign in to reply to this message.
https://codereview.appspot.com/7438051/diff/25001/src/pkg/encoding/xml/xml.go File src/pkg/encoding/xml/xml.go (right): https://codereview.appspot.com/7438051/diff/25001/src/pkg/encoding/xml/xml.go... src/pkg/encoding/xml/xml.go:1737: esc_fffd := []byte("\uFFFD") Please put this into the var block above so that it is only done once instead of being allocated on every call to EscapeText.
Sign in to reply to this message.
https://codereview.appspot.com/7438051/diff/25001/src/pkg/encoding/xml/xml.go File src/pkg/encoding/xml/xml.go (right): https://codereview.appspot.com/7438051/diff/25001/src/pkg/encoding/xml/xml.go... src/pkg/encoding/xml/xml.go:1737: esc_fffd := []byte("\uFFFD") On 2013/03/13 22:02:30, rsc wrote: > Please put this into the var block above so that it is only done once instead of > being allocated on every call to EscapeText. Done.
Sign in to reply to this message.
Did that affect the benchmark?
Sign in to reply to this message.
On 2013/03/13 22:13:35, rsc wrote: > Did that affect the benchmark? Yes indeed. After moving esc_fffd: benchmark old ns/op new ns/op delta BenchmarkMarshal 25741 28539 +10.87% benchmark old allocs new allocs delta BenchmarkMarshal 26 26 0.00% benchmark old bytes new bytes delta BenchmarkMarshal 5429 5429 0.00% before: benchmark old ns/op new ns/op delta BenchmarkMarshal 25741 29827 +15.87% benchmark old allocs new allocs delta BenchmarkMarshal 26 40 53.85% benchmark old bytes new bytes delta BenchmarkMarshal 5429 5542 2.08% I'll keep that in mind.
Sign in to reply to this message.
LGTM Thanks for iterating with us.
Sign in to reply to this message.
*** Submitted as https://code.google.com/p/go/source/detail?r=62c96dd0ddc3 *** encoding/xml: rewrite invalid code points to U+FFFD in Marshal, Escape Fixes issue 4235. R=rsc, dave, r, dr.volker.dobler CC=golang-dev https://codereview.appspot.com/7438051 Committer: Russ Cox <rsc@golang.org>
Sign in to reply to this message.
|