|
|
Descriptiongo.text/cases: first stab at cases package.
Overview
Includes language-specific casing. Title caser uses a close
approximation of the Unicode Word Breaking Algorithm. Most
information needed for case mapping is stored in a single
trie.
Core files:
cases.go: API
casemap.go: per-language, high-level case mappings
context.go: trie lookup wrapper and state management
Patch Set 1 #Patch Set 2 : diff -r d65bffbc88a1 https://code.google.com/p/go.text #Patch Set 3 : diff -r d65bffbc88a1 https://code.google.com/p/go.text #
Total comments: 68
Patch Set 4 : diff -r d65bffbc88a1 https://code.google.com/p/go.text #
Total comments: 14
Patch Set 5 : diff -r a18e7d21c06d https://code.google.com/p/go.text #Patch Set 6 : diff -r 12288f41f508 https://code.google.com/p/go.text #
Total comments: 43
Patch Set 7 : diff -r 12288f41f508 https://code.google.com/p/go.text #
Total comments: 21
Patch Set 8 : diff -r 024681b033be https://code.google.com/p/go.text #Patch Set 9 : diff -r 024681b033be https://code.google.com/p/go.text #
Total comments: 18
Patch Set 10 : diff -r 024681b033be https://code.google.com/p/go.text #
Total comments: 41
Patch Set 11 : diff -r 1ac75970ff9e https://code.google.com/p/go.text #Patch Set 12 : diff -r 1ac75970ff9e https://code.google.com/p/go.text #
Total comments: 48
Patch Set 13 : diff -r f8db539672d0 https://code.google.com/p/go.text #
Total comments: 14
Patch Set 14 : diff -r f8db539672d0 https://code.google.com/p/go.text #Patch Set 15 : diff -r f8db539672d0 https://code.google.com/p/go.text #Patch Set 16 : diff -r f8db539672d0 https://code.google.com/p/go.text #Patch Set 17 : diff -r b8406a47db8d https://code.google.com/p/go.text #
MessagesTotal messages: 28
Hello nigeltao@golang.org (cc: golang-codereviews@googlegroups.com, r@golang.org), I'd like you to review this change to https://code.google.com/p/go.text
Sign in to reply to this message.
I haven't digested all of this yet, but here's a first round of mostly trivial comments. https://codereview.appspot.com/122420043/diff/40001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35 cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has returned a nil error. Would it be safer to supply an explicit Reset method instead of relying on callers carefully checking for nil/non-nil error? https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35 cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has returned a nil error. Change "it" to "its Transform method"? https://codereview.appspot.com/122420043/diff/40001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode13 cases/context.go:13: // This file contains the high-level lookup code for casing properties as well I like to put file-level comments between the package and import lines, not after the import: ---- // Copyright package foo // This file etc. import ( ---- https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode45 cases/context.go:45: func makeContext(dst, src []byte, atEOF bool) context { Return a *context instead of a context? https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode91 cases/context.go:91: c.doTitle = true Add a comment to say why?? I was surprised that c.ret() can modify c's state. I'm also surprised that the doTitle state depends on how big or small the Transform buffers are. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode105 cases/context.go:105: // done returns true if Transform should terminate immediately. I'd s/true if/whether/ and similarly below and in trieval.go. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode122 cases/context.go:122: func (c *context) insertString(str string) { Personally, I'd shorten "str" to just "s". https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode140 cases/context.go:140: func (c *context) writeString(str string) { s/str/s/ https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode150 cases/context.go:150: func (c *context) writeBytes(str []byte) { s/str/b/, or s/str/s/ and similarly for insertBytes. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode150 cases/context.go:150: func (c *context) writeBytes(str []byte) { If you changed the exceptions variable from a [...]byte to a string, then you could delete this method, and use writeString instead. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode171 cases/context.go:171: // Fast path for 6-bit XOR pattern, which covers the fast majority of Change the last "fast" to "vast"? Or just say "which covers most cases." https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode201 cases/context.go:201: e := c.info You could delete this line, and on the next: cm := c.info & 0x7 Below, "e&0x3" becomes "cm&0x3". https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode225 cases/context.go:225: } else { I'd add a return before this line, drop the else, and outdent the rest. Similarly in the upper and title methods. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode228 cases/context.go:228: c.writeBytes(e[1:][:ln]) Personally, I find e[1:1+ln] easier to read than e[1:][:ln] and similarly below. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode278 cases/context.go:278: nu := (e[0] >> lengthBits) & lengthMask I must confess that I don't understand where the "nu" name comes from. It's also a little weird that ln is the low bits here, but the high bits in func little and func upper. https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go File cases/context_test.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:47: return case_ignorable[r] // in data_test.go s/data/tables/, or just delete the comment. https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:97: t.Errorf("lowerSpecial:(%X): got %+q; want %+q", r, got, want) %U instead of %X? https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:137: case 230: s/230/above/ ? https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:139: default: Delete. https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go File cases/gen_test.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode73 cases/gen_test.go:73: // chars contains runeInfo structs for all runes. It is filled by parseUCD. Have parseUCD return chars instead of sharing a global variable? https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode295 cases/gen_test.go:295: fmt.Fprintln(w, "\n}\n") I'd say fmt.Fprint(w, "\n}\n\n") to keep "go vet" happy. https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode301 cases/gen_test.go:301: func makeEntry(ri *runeInfo) { Would it be easier if this returned caseInfo instead of setting ri.entry as a side-effect? https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode529 cases/gen_test.go:529: // See context.go for the definition of lookup. Definition? https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode536 cases/gen_test.go:536: if err == nil && e != nil { Personally, I'd rename "e" to "err" and rename "err" to "retErr". https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode672 cases/gen_test.go:672: logger.Fatalf("%U: cased, but got break type %s; want ALetter", r, c.CCC, c.BreakType) The %-verbs don't match the arguments. https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go File cases/trieval.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode36 cases/trieval.go:36: casedMask = 0x3 // uncased if 0 Is the comment necessary when you spell out cUncased == 0 (and cTitle, cLower and cUpper) below? https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode61 cases/trieval.go:61: // case runes to alternate for increasing rune values. We use this property Perhaps give an example of this, to help those of us familiar with only ASCII's 'a' and 'A' not being consecutive. https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode100 cases/trieval.go:100: // The case mapping implementation will need to know about various CCC values. I'd spell out at least once, maybe here, that CCC stands for Canonical Combining Class. https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode133 cases/trieval.go:133: iotaSubscrupt = 240 Typo in "script". https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode146 cases/trieval.go:146: // for title casing. For example, a4a -> A4a, but a\u03004a -> A\u03004A. Is it possible to use a different middle rune than '4'? It's not obvious that the last "4" doesn't belong to the "\u0300". https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode240 cases/trieval.go:240: noChange = 0x7 Is it easier (if possible?) to define noChange = 0?
Sign in to reply to this message.
In the CL description: s/gettest/gentest/ and s/differes/differs/
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/40001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35 cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has returned a nil error. On 2014/08/13 06:26:29, nigeltao wrote: > Change "it" to "its Transform method"? Done. https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35 cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has returned a nil error. On 2014/08/13 06:26:29, nigeltao wrote: > Would it be safer to supply an explicit Reset method instead of relying on > callers carefully checking for nil/non-nil error? A few points on this: - I'm okay with this, as long as it doesn't have to be called after using Bytes() or String(). However, if we do this it is good to take a step back and look at Transformers. So far, almost no Transformer has state (Chain is a notable exception and it could benefit from a Reset method). Also, only the Title Caser has minimal state. We could add a Reset(t Transformer) to transform that will call Reset on a transformer if it implements Reset or either 1) relies on all Transformers with state to implement it, or 2) will rely on a nil error to mean the state is clear. In both cases we need to augment the spec for Transfomer. https://codereview.appspot.com/122420043/diff/40001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode13 cases/context.go:13: // This file contains the high-level lookup code for casing properties as well Yes, of course. My bad. On 2014/08/13 06:26:29, nigeltao wrote: > I like to put file-level comments between the package and import lines, not > after the import: > > ---- > // Copyright > > package foo > > // This file etc. > > import ( > ---- https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode45 cases/context.go:45: func makeContext(dst, src []byte, atEOF bool) context { This is deliberate. It is easier to use this and ensure the returned context does not escape. Granted, so far I did not have any implementation that did not escape because of the use of function pointers, but I added a special case for the majority-case upper caser where this is not the case. (sorry for that sentence :-) ). This gives a speedup of about 50% for small strings (never mind the avoiding of garbage) and a noticeable speedup even for larger strings. I considered making it a member variable for other implementations, but this did not give worthy enough speedup (1-2%) considering the drawbacks. Would be good to make such a special case for lowercase as well, as it is the dominant use case. Added a TODO for this. On 2014/08/13 06:26:29, nigeltao wrote: > Return a *context instead of a context? https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode91 cases/context.go:91: c.doTitle = true On 2014/08/13 06:26:29, nigeltao wrote: > Add a comment to say why?? I was surprised that c.ret() can modify c's state. The way context is designed is that, by default, the case mappers do not deal with ErrShortSrc and ErrShortDst, only context does this. c.ret translates the checkpoints and other buffer pointers to the ErrShortSrc if necessary. It makes it very easy to verify that the code is correct. Note that all Transforms use c.ret before returning. The only tricky part is that a checkpoint should always be called before writing any rune in title case. I remove this tricky part by having a separate retTitle where I always return ErrShortSrc if c.atEOF == false. This is not desirable for Lower or Upper as it has a small but noticeable performance impact for them (less important and noticeable for title casing). > I'm also surprised that the doTitle state depends on how big or small the > Transform buffers are. This block does not signify a short or long buffer, but rather whether processing has completed. A If it has not, and if https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode105 cases/context.go:105: // done returns true if Transform should terminate immediately. On 2014/08/13 06:26:29, nigeltao wrote: > I'd s/true if/whether/ and similarly below and in trieval.go. Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode122 cases/context.go:122: func (c *context) insertString(str string) { On 2014/08/13 06:26:30, nigeltao wrote: > Personally, I'd shorten "str" to just "s". Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode140 cases/context.go:140: func (c *context) writeString(str string) { On 2014/08/13 06:26:30, nigeltao wrote: > s/str/s/ Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode150 cases/context.go:150: func (c *context) writeBytes(str []byte) { On 2014/08/13 06:26:30, nigeltao wrote: > s/str/b/, or s/str/s/ and similarly for insertBytes. Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode150 cases/context.go:150: func (c *context) writeBytes(str []byte) { On 2014/08/13 06:26:30, nigeltao wrote: > If you changed the exceptions variable from a [...]byte to a string, then you > could delete this method, and use writeString instead. Nice. Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode171 cases/context.go:171: // Fast path for 6-bit XOR pattern, which covers the fast majority of On 2014/08/13 06:26:29, nigeltao wrote: > Change the last "fast" to "vast"? Or just say "which covers most cases." Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode201 cases/context.go:201: e := c.info On 2014/08/13 06:26:29, nigeltao wrote: > You could delete this line, and on the next: > cm := c.info & 0x7 > Below, "e&0x3" becomes "cm&0x3". Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode225 cases/context.go:225: } else { On 2014/08/13 06:26:29, nigeltao wrote: > I'd add a return before this line, drop the else, and outdent the rest. > Similarly in the upper and title methods. Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode228 cases/context.go:228: c.writeBytes(e[1:][:ln]) On 2014/08/13 06:26:30, nigeltao wrote: > Personally, I find > e[1:1+ln] > easier to read than > e[1:][:ln] > and similarly below. Done. https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode278 cases/context.go:278: nu := (e[0] >> lengthBits) & lengthMask it is length of upper part. should be un, to be consistent. Changed to something more meaningful. On 2014/08/13 06:26:29, nigeltao wrote: > I must confess that I don't understand where the "nu" name comes from. > > It's also a little weird that ln is the low bits here, but the high bits in func > little and func upper. https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go File cases/context_test.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:47: return case_ignorable[r] // in data_test.go On 2014/08/13 06:26:30, nigeltao wrote: > s/data/tables/, or just delete the comment. Done. https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:97: t.Errorf("lowerSpecial:(%X): got %+q; want %+q", r, got, want) On 2014/08/13 06:26:30, nigeltao wrote: > %U instead of %X? Done. https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:137: case 230: On 2014/08/13 06:26:30, nigeltao wrote: > s/230/above/ ? Done. https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new... cases/context_test.go:139: default: On 2014/08/13 06:26:30, nigeltao wrote: > Delete. Done. https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go File cases/gen_test.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode73 cases/gen_test.go:73: // chars contains runeInfo structs for all runes. It is filled by parseUCD. On 2014/08/13 06:26:30, nigeltao wrote: > Have parseUCD return chars instead of sharing a global variable? Done. https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode295 cases/gen_test.go:295: fmt.Fprintln(w, "\n}\n") removed as I'm not generating a string. BTW, how do I run go vet on test files? On 2014/08/13 06:26:30, nigeltao wrote: > I'd say > fmt.Fprint(w, "\n}\n\n") > to keep "go vet" happy. https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode301 cases/gen_test.go:301: func makeEntry(ri *runeInfo) { It is a bit tricky to do so. entry should really be seen as a struct, or bit union, to be correct. It is assembled in two functions (makeEntry and makeException), and combining these two subresults is not a self-evident operation. So I think it would actually make things worse. I could do this only for makeEntry and let makeException modify it as a side effect, but I rather be consistent. On 2014/08/13 06:26:30, nigeltao wrote: > Would it be easier if this returned caseInfo instead of setting ri.entry as a > side-effect? https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode529 cases/gen_test.go:529: // See context.go for the definition of lookup. Redid comment. On 2014/08/13 06:26:30, nigeltao wrote: > Definition? https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode536 cases/gen_test.go:536: if err == nil && e != nil { On 2014/08/13 06:26:30, nigeltao wrote: > Personally, I'd rename "e" to "err" and rename "err" to "retErr". Done. https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode672 cases/gen_test.go:672: logger.Fatalf("%U: cased, but got break type %s; want ALetter", r, c.CCC, c.BreakType) On 2014/08/13 06:26:30, nigeltao wrote: > The %-verbs don't match the arguments. Done. https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go File cases/trieval.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode36 cases/trieval.go:36: casedMask = 0x3 // uncased if 0 removed. On 2014/08/13 06:26:31, nigeltao wrote: > Is the comment necessary when you spell out cUncased == 0 (and cTitle, cLower > and cUpper) below? https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode61 cases/trieval.go:61: // case runes to alternate for increasing rune values. We use this property On 2014/08/13 06:26:30, nigeltao wrote: > Perhaps give an example of this, to help those of us familiar with only ASCII's > 'a' and 'A' not being consecutive. Done. https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode100 cases/trieval.go:100: // The case mapping implementation will need to know about various CCC values. On 2014/08/13 06:26:31, nigeltao wrote: > I'd spell out at least once, maybe here, that CCC stands for Canonical Combining > Class. Done. https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode133 cases/trieval.go:133: iotaSubscrupt = 240 On 2014/08/13 06:26:30, nigeltao wrote: > Typo in "script". Done. https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode146 cases/trieval.go:146: // for title casing. For example, a4a -> A4a, but a\u03004a -> A\u03004A. I think it needs to be a Number class. How about replacing \u0300 with ~ (\u0308) and mentioning this? On 2014/08/13 06:26:31, nigeltao wrote: > Is it possible to use a different middle rune than '4'? It's not obvious that > the last "4" doesn't belong to the "\u0300". https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode240 cases/trieval.go:240: noChange = 0x7 0 is already reserved for a zero-length string. I think it is better to have the length always as is and the "no change/ undefined" be the special value. On 2014/08/13 06:26:30, nigeltao wrote: > Is it easier (if possible?) to define noChange = 0?
Sign in to reply to this message.
Some more mostly trivial comments. I haven't had the time this week to give it a proper look. Sorry. https://codereview.appspot.com/122420043/diff/40001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35 cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has returned a nil error. On 2014/08/13 15:26:02, mpvl wrote: > On 2014/08/13 06:26:29, nigeltao wrote: > > Would it be safer to supply an explicit Reset method instead of relying on > > callers carefully checking for nil/non-nil error? > A few points on this: > - I'm okay with this, as long as it doesn't have to be called after using > Bytes() or String(). > > However, if we do this it is good to take a step back and look at Transformers. > So far, almost no Transformer has state (Chain is a notable exception and it > could benefit from a Reset method). Also, only the Title Caser has minimal > state. > > We could add a Reset(t Transformer) to transform that will call Reset on a > transformer if it implements Reset or either 1) relies on all Transformers with > state to implement it, or 2) will rely on a nil error to mean the state is > clear. In both cases we need to augment the spec for Transfomer. Yes, there's a TODO in go.text/transform/transform.go to think about resets. You said that almost no Transformer so far has state, other than Chain, and the Title Caser. I don't know the grand scope of what else you have coming down the pipe. Do you anticipate any more complicated (stateful) transformers? I haven't settled on what I think is the best design, but a couple of thoughts: We could add a (mandatory) Reset method to the Transformer interface, and have the transform.Bytes and transform.String functions' implementations start with a Reset call. I wouldn't expect that to be a huge performance cost, but I also have no actual data. I'm not sure if implicitly resetting after a nil error is entirely safe. You could be io.Copy'ing to a destination io.Writer that fails for whatever reason, and you might exit early out of your transformation loop without ever seeing a nil error, and you have no way of explicitly resetting the Transformer for re-use. https://codereview.appspot.com/122420043/diff/40001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode91 cases/context.go:91: c.doTitle = true On 2014/08/13 15:26:05, mpvl wrote: > On 2014/08/13 06:26:29, nigeltao wrote: > > I'm also surprised that the doTitle state depends on how big or small the > > Transform buffers are. > This block does not signify a short or long buffer, but rather whether > processing has completed. A > If it has not, and if Unfinished codereview response?? https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go File cases/trieval.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode240 cases/trieval.go:240: noChange = 0x7 On 2014/08/13 15:26:05, mpvl wrote: > 0 is already reserved for a zero-length string. I think it is better to have the > length always as is and the "no change/ undefined" be the special value. Casing can consume or output a zero-length string?? I should really know well enough that, even if it can, that won't be the strangest thing in Unicode. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode61 cases/casemap.go:61: titleInfos = []titleInfo{ Just a suggestion, but if you want, you don't need to name the titleInfo type: titleInfos = []struct { title, lower func(*context) rewrite func(*context) caseInfo }{ {title, lower, nil}, // und etc, } https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode73 cases/casemap.go:73: // options are not used (yet), but ensure that all make* functions have the same I'd just drop the "options are not used (yet), but ... in the tests." Similarly below. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode80 cases/casemap.go:80: // makeLower configures a Caser from a basic lowercase implementation. s/from/for/? Or just delete the comment entirely. It's not an exported function. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode83 cases/casemap.go:83: func makeLower(t language.Tag, o options) *lowerCaser { returning a transform.Transformer would be consistent with makeUpper. Similarly below, for makeTitle. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode251 cases/casemap.go:251: // determines whether a lowercased sigma should be σ or ς, by looking ahead for Two spaces after the comma. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode320 cases/casemap.go:320: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342, 0x0308, 0x0304, 0x345: Adding another "0" to "0x345" would be more consistent. https://codereview.appspot.com/122420043/diff/60001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/60001/cases/cases.go#newcode37 cases/cases.go:37: type Caser struct { If a Caser is just a transfomer.Transformer, then I'm not totally convinced of the utility of this type. An alternative design is that cases.Lower, cases.Title and cases.Upper all return a transform.Transformer, and instead of calling caser.Bytes(b), you call transform.Bytes(caser, b). I haven't really decided either way which design I prefer, as I've only got example_test.go to go on, and haven't tried to use it for real, but perhaps something to think about. One thing is do you expect users of the cases package to almost always lower/title/upper-case individual words (i.e. call Bytes or String), or will they often want to convert blocks of text (i.e. use it with transform.NewReader).
Sign in to reply to this message.
[sorry, had a bunch of replies, but forgot to send out.] In the mean time warmed up more to the idea of making Reset part of the interface. Will do some experiments to see now this looks. https://codereview.appspot.com/122420043/diff/40001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35 cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has returned a nil error. On 2014/08/22 07:00:22, nigeltao wrote: > On 2014/08/13 15:26:02, mpvl wrote: > > On 2014/08/13 06:26:29, nigeltao wrote: > > > Would it be safer to supply an explicit Reset method instead of relying on > > > callers carefully checking for nil/non-nil error? > > A few points on this: > > - I'm okay with this, as long as it doesn't have to be called after using > > Bytes() or String(). > > > > However, if we do this it is good to take a step back and look at > Transformers. > > So far, almost no Transformer has state (Chain is a notable exception and it > > could benefit from a Reset method). Also, only the Title Caser has minimal > > state. > > > > We could add a Reset(t Transformer) to transform that will call Reset on a > > transformer if it implements Reset or either 1) relies on all Transformers > with > > state to implement it, or 2) will rely on a nil error to mean the state is > > clear. In both cases we need to augment the spec for Transfomer. > > Yes, there's a TODO in go.text/transform/transform.go to think about resets. > > You said that almost no Transformer so far has state, other than Chain, and the > Title Caser. I don't know the grand scope of what else you have coming down the > pipe. Do you anticipate any more complicated (stateful) transformers? There are a few more, but most of them will be simpler and probably no state. Note that most lower and upper are currently stateless. I could imagine having a versions with state for this, for example, if users want to have an unlimited number of modifiers or something alike. > I haven't settled on what I think is the best design, but a couple of thoughts: > > We could add a (mandatory) Reset method to the Transformer interface, and have > the transform.Bytes and transform.String functions' implementations start with a > Reset call. I wouldn't expect that to be a huge performance cost, but I also > have no actual data. Probably not a huge overhead indeed. I'm warming up to the idea, as it makes things simple. It just seems a bit of a shame to do if most Transformers don't need this. I did some tests and an unconditional Reset is a lot less overhead than a conditional one (not surprisingly so), so it could work. > I'm not sure if implicitly resetting after a nil error is entirely safe. For casing it is. It is pretty simple. For Chain I'm not sure and would need to check, but we're talking about Transformer here. > could be io.Copy'ing to a destination io.Writer that fails for whatever reason, > and you might exit early out of your transformation loop without ever seeing a > nil error, and you have no way of explicitly resetting the Transformer for > re-use. Iterating until err == nil is explicit, just cumbersome. And probably more expensive than unconditionally calling a Reset(). It is worth considering adding a mandatory Reset to Transfomer, but probably separately from this CL? https://codereview.appspot.com/122420043/diff/40001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode91 cases/context.go:91: c.doTitle = true Sometimes they disappear on me, and then I have to write them anew. So see the previous one for that. Sometimes they come back, like this one, so please ignore. On 2014/08/22 07:00:22, nigeltao wrote: > On 2014/08/13 15:26:05, mpvl wrote: > > On 2014/08/13 06:26:29, nigeltao wrote: > > > I'm also surprised that the doTitle state depends on how big or small the > > > Transform buffers are. > > This block does not signify a short or long buffer, but rather whether > > processing has completed. A > > If it has not, and if > > Unfinished codereview response?? https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go File cases/trieval.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode240 cases/trieval.go:240: noChange = 0x7 Yes! For example, the modifier "combining dot above" is dropped when lowercasing in Turkish. However, Looking at it more closely, though, this seems to only occur for conditional mappings, which I do not include in this table. So 0 now means no change. I kept the constant until a rune comes up that is removed under casing or if conditional mappings are included in the map as well. But indeed, stranger things are happening in Unicode, so this would not be surprising. (Even with casing: runes that are case-ignorable AND cased; cased letters that do not change when cased to the other form, etc.) On 2014/08/22 07:00:22, nigeltao wrote: > On 2014/08/13 15:26:05, mpvl wrote: > > 0 is already reserved for a zero-length string. I think it is better to have > the > > length always as is and the "no change/ undefined" be the special value. > > Casing can consume or output a zero-length string?? I should really know well > enough that, even if it can, that won't be the strangest thing in Unicode. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode61 cases/casemap.go:61: titleInfos = []titleInfo{ Don't always like this approach, but here it is nicer. Done. On 2014/08/22 07:00:22, nigeltao wrote: > Just a suggestion, but if you want, you don't need to name the titleInfo type: > > titleInfos = []struct { > title, lower func(*context) > rewrite func(*context) caseInfo > }{ > {title, lower, nil}, // und > etc, > } https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode73 cases/casemap.go:73: // options are not used (yet), but ensure that all make* functions have the same On 2014/08/22 07:00:22, nigeltao wrote: > I'd just drop the "options are not used (yet), but ... in the tests." > > Similarly below. Done. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode80 cases/casemap.go:80: // makeLower configures a Caser from a basic lowercase implementation. I'll delete. It is actually from, as it is wrapping a basic lowercase function into a more fancy one. On 2014/08/22 07:00:22, nigeltao wrote: > s/from/for/? Or just delete the comment entirely. It's not an exported function. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode83 cases/casemap.go:83: func makeLower(t language.Tag, o options) *lowerCaser { For makeUpper it has to be a Transformer, for the others not (yet). I've changed it for consistency and future proofness. Had to move the noFinalSigma option from test to lib code to do this, but ultimately something like that option should go there anyway. On 2014/08/22 07:00:22, nigeltao wrote: > returning a transform.Transformer would be consistent with makeUpper. Similarly > below, for makeTitle. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode251 cases/casemap.go:251: // determines whether a lowercased sigma should be σ or ς, by looking ahead for On 2014/08/22 07:00:23, nigeltao wrote: > Two spaces after the comma. Done. https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode320 cases/casemap.go:320: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342, 0x0308, 0x0304, 0x345: On 2014/08/22 07:00:22, nigeltao wrote: > Adding another "0" to "0x345" would be more consistent. Done. https://codereview.appspot.com/122420043/diff/60001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/60001/cases/cases.go#newcode37 cases/cases.go:37: type Caser struct { On 2014/08/22 07:00:23, nigeltao wrote: > If a Caser is just a transfomer.Transformer, then I'm not totally convinced of > the utility of this type. An alternative design is that cases.Lower, cases.Title > and cases.Upper all return a transform.Transformer, and instead of calling > caser.Bytes(b), you call transform.Bytes(caser, b). The vast majority of uses for casing are to operate on small pieces of text. The will mostly mean not in a streaming setting. If you compare s, _, _ := transform.String(cases.Lower(language.Und), "Foo") Use(s) with Use(cases.Lower(language.Und).String("Foo")) or even: Use(cases.Root.Lower().String("Foo")) // not (yet?) implemented. The first looks quite awkward. It will only drive people towards using strings.Lower, even when this is the wrong choice. There are two main uses for the Transformer interface. One of them is to make it easy to implement some transformation that can be wrapped in a Reader or a Writer. The other is to allow chaining together text transformations, even when not used in a streaming setting. The transforms in the ICU library are of the latter ilk. For example, to lowercase and normalize, one could write: t := transform.Chain(cases.Lower(language.Und), norm.NFC) ICU uses a proprietary syntax for chaining transforms, go.text does it programmatically, but the idea is roughly the same. Either way, the vast majority of people will need simple transforms on short strings, not dissimilar from the examples. Casing is also a bit special in that there are no errors. This allows simplifying the API a bit. This is especially useful if people care about casing but not about Transformers in general, which I think will be common. > > I haven't really decided either way which design I prefer, as I've only got > example_test.go to go on, and haven't tried to use it for real, but perhaps > something to think about. One thing is do you expect users of the cases package > to almost always lower/title/upper-case individual words (i.e. call Bytes or > String), or will they often want to convert blocks of text (i.e. use it with > transform.NewReader).
Sign in to reply to this message.
Another round of comments. I still haven't found the time to try to digest the whole thing in one sitting. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode12 cases/casemap.go:12: "code.google.com/p/go.text/transform" Move this import to the second group (with the "code.google.com" prefix). https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode81 cases/casemap.go:81: // makeLower configures a Caser from a basic lowercase implementation. Delete this comment (and the one for makeUpper)? It still looks weird how one is "for" and the other is "from". https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode146 cases/casemap.go:146: // checkpointing. lower case functions use for mid-word runes should checkpoint s/lower/Lower/ https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode196 cases/casemap.go:196: // buffer to avoid possibilities for DNS attacks. s/DNS/DoS/ ?? https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode203 cases/casemap.go:203: // Annex #29, “Unicode Text Segmentation.” For each word boundary, find the Change the “smart quotes” to "regular quotes"? https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode269 cases/casemap.go:269: c.dst[p+1]-- // σ → ς Do you know that p+1 is in bounds? https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode289 cases/casemap.go:289: r, _ := utf8.DecodeRune(c.src[c.pSrc:]) What happens if the src is invalid UTF-8? https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode297 cases/casemap.go:297: // Take the properties of the uppercased rune that is already written to the I'm not overly familiar with Unicode. I'm guessing that this code here is covered by the test cases in casemap_test.go, but I think it would be helpful if you gave an example of what this code does, such as: // This handles converting "Οδός" to "ΟΔΟΣ". or whatever the relevant test case is. Maybe that's already covered by the "From CLDR:" comment above, but I must confess that I don't grok it. Similarly with all the other non-trivial functions in this file. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode314 cases/casemap.go:314: for ; !c.done() && i <= maxIgnorable; i++ { When comparing against maxIgnorable, sometimes you use < and sometimes <=. Similarly when comparing > or >=. It'd be nice to be consistent. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go File cases/casemap_test.go (right): https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:32: func (tc *testCase) mkUpper(tag language.Tag) Caser { I'm not really convinced that you need these testCase methods. (See below). https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:44: // testCases contains test cases The redundant comment seems redundant. I'd delete it. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:46: { If a test fails, below, you say t.Errorf("%d: etc", i, etc) Thus, it might be nice here if the { on this line was 0: { and similarly below: "1: {", "2: {", etc., so that if a test fails, it's easy to see which one it is. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:56: "und", As you sometimes use named fields, I'd be consistent and always use them: lang: "und", src: "etc", etc You could then drop the "options{}," lines. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:178: for i, tt := range testCases { I'd s/tt/tc/ https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:186: testEntry := func(name string, mk func(language.Tag) Caser, gold interface{}) { Instead of mk func(language.Tag) Caser I'd have mk func(language.Tag, options) transform.Transformer and then c := mk(tag) becomes c := Caser{mk(tag, tt.opts)} below, tt.mkUpper becomes makeUpper https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:189: wanted, ok := gold.([]string) I'd s/wanted/wants/ https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:236: const minBufSize = norm.MaxSegmentSize If this constant is only used by TestShortBuffersAndOverflow, I'd move it into that function's body. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:239: for i, tt := range []struct { s/tt/tc/ https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:397: func goldBench(b *testing.B, s string, f func(b []byte) []byte) { I'd rename goldBench and doBench to be benchFunc and benchTransformer, and also make their argument order consistent. https://codereview.appspot.com/122420043/diff/100001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode12 cases/context.go:12: import ( Separate the standard library imports from the others: import ( "unicode" "unicode/utf8" "code.google.com/p/go.text/transform" ) https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode26 cases/context.go:26: // ignorables as lookahead (analogous to the limit in norm) and to use state if IANAUnicodeExpert: is "case ignorables" a Unicode term? Can I more or less think of it as "rune" or can it mean something different? https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode278 cases/context.go:278: // write exception bits. s/bits/bytes/ ?
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode12 cases/casemap.go:12: "code.google.com/p/go.text/transform" On 2014/09/12 09:58:10, nigeltao wrote: > Move this import to the second group (with the "code.google.com" prefix). Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode81 cases/casemap.go:81: // makeLower configures a Caser from a basic lowercase implementation. On 2014/09/12 09:58:10, nigeltao wrote: > Delete this comment (and the one for makeUpper)? It still looks weird how one is > "for" and the other is "from". Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode146 cases/casemap.go:146: // checkpointing. lower case functions use for mid-word runes should checkpoint On 2014/09/12 09:58:10, nigeltao wrote: > s/lower/Lower/ Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode196 cases/casemap.go:196: // buffer to avoid possibilities for DNS attacks. On 2014/09/12 09:58:09, nigeltao wrote: > s/DNS/DoS/ ?? Done. Must be muscle memory or something; I keep typing one when I mean the other. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode203 cases/casemap.go:203: // Annex #29, “Unicode Text Segmentation.” For each word boundary, find the On 2014/09/12 09:58:10, nigeltao wrote: > Change the “smart quotes” to "regular quotes"? Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode269 cases/casemap.go:269: c.dst[p+1]-- // σ → ς On 2014/09/12 09:58:09, nigeltao wrote: > Do you know that p+1 is in bounds? Yes, although it is subtle. 1) both sigmas are two bytes in size. So if there is enough space to completely write sigma, p+1 is valid. 2) If not, nextRune and initRune will not be called and the for loop will be skipped (c.done() is true). 3) This means that c.info still contains the info from the sigma. 4) sigma is cased, so the if block in question will be skipped. Q.E.D. Added comments to clarify and more tests to verify. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode289 cases/casemap.go:289: r, _ := utf8.DecodeRune(c.src[c.pSrc:]) On 2014/09/12 09:58:10, nigeltao wrote: > What happens if the src is invalid UTF-8? Nothing, this is fine. It will return FFFD which doesn't map to any of the special runes. It will then do a trie lookup which will, once again, detect it is an invalid rune. Invalid runes have a zero CCC, so it will return soon thereafter. Note that the Decoding result is never used for iteration increment. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode297 cases/casemap.go:297: // Take the properties of the uppercased rune that is already written to the On 2014/09/12 09:58:10, nigeltao wrote: > I'm not overly familiar with Unicode. I'm guessing that this code here is > covered by the test cases in casemap_test.go, but I think it would be helpful if > you gave an example of what this code does, such as: > // This handles converting "Οδός" to "ΟΔΟΣ". > or whatever the relevant test case is. > > Maybe that's already covered by the "From CLDR:" comment above, but I must > confess that I don't grok it. > > Similarly with all the other non-trivial functions in this file. Done. Yeah, the CLDR syntax is a bit special. They are production rules. The stuff before { and after } is merely context for the production rules to trigger. Production rules need to be applied to newly replaced text. The square brackets are used to indicate sets of runes. https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode314 cases/casemap.go:314: for ; !c.done() && i <= maxIgnorable; i++ { On 2014/09/12 09:58:10, nigeltao wrote: > When comparing against maxIgnorable, sometimes you use < and sometimes <=. That actually was because of a bug! Fixed that. > Similarly when comparing > or >=. It'd be nice to be consistent. Moved some of these comparisons to the for loop. However, in both cases I actually need to check for maxIgnorables-1. So I did <= maxIgnorables-1 instead of < maxIgnorables. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go File cases/casemap_test.go (right): https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:32: func (tc *testCase) mkUpper(tag language.Tag) Caser { On 2014/09/12 09:58:11, nigeltao wrote: > I'm not really convinced that you need these testCase methods. (See below). Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:44: // testCases contains test cases On 2014/09/12 09:58:11, nigeltao wrote: > The redundant comment seems redundant. I'd delete it. :) Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:46: { On 2014/09/12 09:58:11, nigeltao wrote: > If a test fails, below, you say > t.Errorf("%d: etc", i, etc) > Thus, it might be nice here if the > { > on this line was > 0: { > and similarly below: "1: {", "2: {", etc., so that if a test fails, it's easy to > see which one it is. Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:56: "und", On 2014/09/12 09:58:10, nigeltao wrote: > As you sometimes use named fields, I'd be consistent and always use them: > lang: "und", > src: "etc", > etc > You could then drop the "options{}," lines. Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:178: for i, tt := range testCases { On 2014/09/12 09:58:10, nigeltao wrote: > I'd s/tt/tc/ Why exactly? I know that tc matches testCases nicely. However, tt is also a natural choice and seems by far the most common one in core. Only if the var iterated upon is called testCases is the use of tc slightly more prevalent. So is it because it is a new standard we're converting to, because in this case particular it is called testCases or for other reasons? https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:186: testEntry := func(name string, mk func(language.Tag) Caser, gold interface{}) { On 2014/09/12 09:58:10, nigeltao wrote: > Instead of > mk func(language.Tag) Caser > I'd have > mk func(language.Tag, options) transform.Transformer > > and then > > c := mk(tag) > becomes > c := Caser{mk(tag, tt.opts)} > > below, > > tt.mkUpper > becomes > makeUpper Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:189: wanted, ok := gold.([]string) On 2014/09/12 09:58:11, nigeltao wrote: > I'd s/wanted/wants/ Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:236: const minBufSize = norm.MaxSegmentSize On 2014/09/12 09:58:11, nigeltao wrote: > If this constant is only used by TestShortBuffersAndOverflow, I'd move it into > that function's body. Done. https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne... cases/casemap_test.go:397: func goldBench(b *testing.B, s string, f func(b []byte) []byte) { On 2014/09/12 09:58:11, nigeltao wrote: > I'd rename goldBench and doBench to be benchFunc and benchTransformer, and also > make their argument order consistent. Done. https://codereview.appspot.com/122420043/diff/100001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode12 cases/context.go:12: import ( On 2014/09/12 09:58:11, nigeltao wrote: > Separate the standard library imports from the others: > > import ( > "unicode" > "unicode/utf8" > > "code.google.com/p/go.text/transform" > ) Sorry. Done. https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode26 cases/context.go:26: // ignorables as lookahead (analogous to the limit in norm) and to use state if On 2014/09/12 09:58:11, nigeltao wrote: > IANAUnicodeExpert: IANAUE henceforth. :) > is "case ignorables" a Unicode term? Yep. > Can I more or less think > of it as "rune" or can it mean something different? It is a rune with a derived property. See Case_Ignorable in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt. https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode278 cases/context.go:278: // write exception bits. On 2014/09/12 09:58:11, nigeltao wrote: > s/bits/bytes/ ? The whole comment made not sense. Replaced it.
Sign in to reply to this message.
I dropped a few more comments, but I have a bigger design question: Currently, a context peeks *ahead* of pSrc: you have separate initRune, nextRune and done methods, and c.src[c.pSrc:c.pSrc+c.sz] contains the next rune. It's also a little weird how you've written both done := c.atEOF && c.pSrc == len(c.src) and separately: func (c *context) done() bool { return c.pSrc >= len(c.src) || c.err != nil } and the two definitions of doneness differ. What if there was only one method instead, called next, which returns (ok bool) and if ok, the rune just read is *behind* pSrc, in c.src[c.pSrc-c.sz:c.pSrc]. Like a bufio.Scanner, the idiom becomes: for c.next() { // Do stuff with the rune just read. } I think this simplifies a few things. I prototyped what upper casing would look like with this design in https://codereview.appspot.com/142380043 I'm not 100% sure that I'm doing the right thing wrt Unicode in that CL, but the tests pass. WDYT? https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode322 cases/casemap.go:322: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342, 0x0308, 0x0304, 0x0345: Add comments like "U+0300 COMBINING GRAVE ACCENT"? https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode432 cases/casemap.go:432: if softDotted { I'd say if !softDotted { return } and outdent the rest. https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433 cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; i++ { Just double-checking... I'm still not sure why there's a "-1" in "maxIgnorable-1". Not that I know what an ignorable is, but I'm just pattern matching how other comparisons are "i <= maxIgnorable". https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode434 cases/casemap.go:434: switch ccc := c.info.cccType(); { switch c.info.cccType() { https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go File cases/casemap_test.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne... cases/casemap_test.go:19: src interface{} // string, []string, or nil to skip test src is always a string, right? https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne... cases/casemap_test.go:175: t.Errorf("%d:%s:%s.String(%q):\n\tgot %+q;\n\twant %+q", i, lang, name, src[j], got, want) The "\t"s should be unnecessary, as "go test" should indent. https://codereview.appspot.com/122420043/diff/120001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode46 cases/context.go:46: func makeContext(dst, src []byte, atEOF bool) context { Just a thought: instead of returning a context (which isn't a large struct, but it's not a small struct), what about func (c *context) reset(dst, src []byte, atEOF bool) { *c = context{ etc } c.initRune() } and in casemap.go, embed the context: type upperCaser struct{} becomes type upperCaser struct{ context } and similarly for simpleCaser, lowerCaser, titleCaser. https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode104 cases/context.go:104: if done := c.atEOF && c.pSrc == len(c.src); done { It may be worth factoring out a method func (c *context) doneAndAtEOF() bool { return c.atEOF && c.pSrc == len(c.src) } and use it both here and above, in context.ret. https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode105 cases/context.go:105: return true, c.pDst, c.pSrc, nil Here, we are atEOF, so can we replace return true, etc with return c.doTitle, etc since there shouldn't be any more source bytes anyway? Or is this necessary because Transformers don't have an explicit Reset method? Still, as I think I've previously said, "callers should read until EOF to reset" seems fragile, especially if we can chain Transformers and possibly return early. If we do that, there's no need for a separate retTitle method, and instead, the caller can say: t.doTitle = c.doTitle return c.ret()
Sign in to reply to this message.
Some comments for now. Still working on this, including reworking checkpointing etc. to allow your simplification proposal. https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode322 cases/casemap.go:322: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342, 0x0308, 0x0304, 0x0345: On 2014/09/22 09:57:36, nigeltao wrote: > Add comments like "U+0300 COMBINING GRAVE ACCENT"? will do. https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433 cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; i++ { On 2014/09/22 09:57:36, nigeltao wrote: > Just double-checking... I'm still not sure why there's a "-1" in > "maxIgnorable-1". Not that I know what an ignorable is, but I'm just pattern > matching how other comparisons are "i <= maxIgnorable". As usual, we are matching a maximum number of ignorables before matching the \u0307. However, the latter is also an ignorable, so to keep the total number consistent, we have to subtract 1. https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode434 cases/casemap.go:434: switch ccc := c.info.cccType(); { On 2014/09/22 09:57:36, nigeltao wrote: > switch c.info.cccType() { Done. https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go File cases/casemap_test.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne... cases/casemap_test.go:19: src interface{} // string, []string, or nil to skip test On 2014/09/22 09:57:36, nigeltao wrote: > src is always a string, right? No, either a space-separated list of strings or an actual list of strings. https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne... cases/casemap_test.go:175: t.Errorf("%d:%s:%s.String(%q):\n\tgot %+q;\n\twant %+q", i, lang, name, src[j], got, want) On 2014/09/22 09:57:36, nigeltao wrote: > The "\t"s should be unnecessary, as "go test" should indent. Done. https://codereview.appspot.com/122420043/diff/120001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode104 cases/context.go:104: if done := c.atEOF && c.pSrc == len(c.src); done { On 2014/09/22 09:57:36, nigeltao wrote: > It may be worth factoring out a method > func (c *context) doneAndAtEOF() bool { > return c.atEOF && c.pSrc == len(c.src) > } > and use it both here and above, in context.ret. As retTitle is gone, I guess it is not that useful anymore. https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode105 cases/context.go:105: return true, c.pDst, c.pSrc, nil On 2014/09/22 09:57:36, nigeltao wrote: > Here, we are atEOF, so can we replace > return true, etc > with > return c.doTitle, etc > since there shouldn't be any more source bytes anyway? > > Or is this necessary because Transformers don't have an explicit Reset method? > Still, as I think I've previously said, "callers should read until EOF to reset" > seems fragile, especially if we can chain Transformers and possibly return > early. > > If we do that, there's no need for a separate retTitle method, and instead, the > caller can say: > > t.doTitle = c.doTitle > return c.ret() Done. Was indeed to work around not having an explicit reset method.
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433 cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; i++ { On 2014/09/24 13:08:18, mpvl wrote: > On 2014/09/22 09:57:36, nigeltao wrote: > > Just double-checking... I'm still not sure why there's a "-1" in > > "maxIgnorable-1". Not that I know what an ignorable is, but I'm just pattern > > matching how other comparisons are "i <= maxIgnorable". > > As usual, we are matching a maximum number of ignorables before matching the > \u0307. However, the latter is also an ignorable, so to keep the total number > consistent, we have to subtract 1. OK, putting this in a comment would be nice.
Sign in to reply to this message.
Did another iteration incorporating many of your ideas from CL 142380043, but taking a different approach to checkpointing. The mapping functions now returns whether the result was "short", including whether future runes may change what has been written. This is then used my the main Transformer implementations to do the checkpointing. This localizes things a bit more. Also simplified things a bit as a result of the Reset. I would like to have a reentrant Upper and Lower caser as these are the dominant use cases and this will save allocs. But other than that I've embedded the context in the caser types. Let me know if you would like me to only check in uppercase first or create a new CL. At least I will move the casemap* files to mappings*, but wanted to keep it for now to keep the CL diffs and comments. https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode322 cases/casemap.go:322: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342, 0x0308, 0x0304, 0x0345: On 2014/09/22 09:57:36, nigeltao wrote: > Add comments like "U+0300 COMBINING GRAVE ACCENT"? Done. https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode432 cases/casemap.go:432: if softDotted { On 2014/09/22 09:57:36, nigeltao wrote: > I'd say > if !softDotted { > return > } > and outdent the rest. Done. https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433 cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; i++ { On 2014/09/24 13:08:18, mpvl wrote: > On 2014/09/22 09:57:36, nigeltao wrote: > > Just double-checking... I'm still not sure why there's a "-1" in > > "maxIgnorable-1". Not that I know what an ignorable is, but I'm just pattern > > matching how other comparisons are "i <= maxIgnorable". > > As usual, we are matching a maximum number of ignorables before matching the > \u0307. However, the latter is also an ignorable, so to keep the total number > consistent, we have to subtract 1. added comment https://codereview.appspot.com/122420043/diff/120001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode46 cases/context.go:46: func makeContext(dst, src []byte, atEOF bool) context { Now we have a reset this seemsprevent an allocation where this applies and it avoids the need for saving state for titlecase. It just makes some casers are currently reentrant no longer so. This is not necessarily a bad thing, though. Will investigate.
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode25 cases/casemap.go:25: type MapFunc func(*context) bool s/MapFunc/mapFunc/ unless there's a reason to export the type. https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode171 cases/casemap.go:171: var cc context Huh? https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode346 cases/casemap.go:346: if c.hasPrefix("Σ") { if !c.hasPrefix("Σ") { return f(c) } and out-dent the rest. https://codereview.appspot.com/122420043/diff/160001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode42 cases/context.go:42: func makeContext(dst, src []byte, atEOF bool) context { This might be better as func (c *context) initialize(dst, src []byte, atEOF bool) { etc } although it's weird how there's both a Reset method (above) and an initialize method (this one). We could just delete this method, and at the call sites, say: t.context = context{dst: dst, src: src, atEOF: atEOF} and also add isMidWord for the titleCaser. https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode66 cases/context.go:66: if ok && c.err == nil { I'd expect the mapFunc to return ok==false if c.err != nil, so this could be just if ok { https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode99 cases/context.go:99: // insertBytes adds bytes without incrementing the src indexes. s/insert/write/ to match the function name, and it looks odd how this function's doc comment is different from writeString's. https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode105 cases/context.go:105: for _, ch := range b { c.pDst += copy(c.dst[c.pDst:], b) https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode118 cases/context.go:118: for i := 0; i < len(s); i++ { c.pDst += copy(c.dst[c.pDst:], s) https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode125 cases/context.go:125: // copy writes the current rune to dst and increases the pointers "The pointers" looks odd since only pDst changes.
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode25 cases/casemap.go:25: type MapFunc func(*context) bool On 2014/09/29 07:15:14, nigeltao wrote: > s/MapFunc/mapFunc/ unless there's a reason to export the type. Done. Not sure what I was thinking there. https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode171 cases/casemap.go:171: var cc context On 2014/09/29 07:15:14, nigeltao wrote: > Huh? Oops: leftover from performance experiments. Removed. https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode346 cases/casemap.go:346: if c.hasPrefix("Σ") { On 2014/09/29 07:15:14, nigeltao wrote: > if !c.hasPrefix("Σ") { > return f(c) > } > > and out-dent the rest. Done. https://codereview.appspot.com/122420043/diff/160001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode42 cases/context.go:42: func makeContext(dst, src []byte, atEOF bool) context { On 2014/09/29 07:15:14, nigeltao wrote: > This might be better as > > func (c *context) initialize(dst, src []byte, atEOF bool) { etc } > > although it's weird how there's both a Reset method (above) and an initialize > method (this one). We could just delete this method, and at the call sites, say: > t.context = context{dst: dst, src: src, atEOF: atEOF} > and also add isMidWord for the titleCaser. I've removed the func altogether and use context{} at the call sites instead. https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode66 cases/context.go:66: if ok && c.err == nil { On 2014/09/29 07:15:14, nigeltao wrote: > I'd expect the mapFunc to return ok==false if c.err != nil, so this could be > just > if ok { Theoretically it is possible for c.err to be non-nil, while ok is false (e.g. a mapFunc may return true if it doesn't write any bytes, even if there is already an error). Instead of proving that in all cases it is safe to not check for c.err == nil, and that it is safe to not check for the return code when it is not checked, I prefer to keep things simple and check it here. https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode99 cases/context.go:99: // insertBytes adds bytes without incrementing the src indexes. On 2014/09/29 07:15:14, nigeltao wrote: > s/insert/write/ to match the function name, and it looks odd how this function's > doc comment is different from writeString's. Done. https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode105 cases/context.go:105: for _, ch := range b { On 2014/09/29 07:15:14, nigeltao wrote: > c.pDst += copy(c.dst[c.pDst:], b) This is a performance optimization. Added comment to clarify. https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode118 cases/context.go:118: for i := 0; i < len(s); i++ { On 2014/09/29 07:15:14, nigeltao wrote: > c.pDst += copy(c.dst[c.pDst:], s) Same here. https://codereview.appspot.com/122420043/diff/160001/cases/context.go#newcode125 cases/context.go:125: // copy writes the current rune to dst and increases the pointers On 2014/09/29 07:15:14, nigeltao wrote: > "The pointers" looks odd since only pDst changes. Updated comment, including the ones for the other write* methods.
Sign in to reply to this message.
In an effort to cut this CL into managable chunks, I split off the Unicode tables and their generator code into a separate CL. In doing so, it wasn't trivial for me to bootstrap the "go test --tags=generate" process, since the Venn diagram of what .go files are involved at separate stages wasn't immediately obvious to me. In general, it always seemed hacky to use "go test" to run the generation process. It's not obvious that e.g. gen_test.go and gentest_test.go aren't actually tests, but is effectively a program, "package main", in disguise, that runs in an init function. Rather than hijacking the "go test" mechanism to share the trieval code between a generator program and its generated package, I'd rather explicitly share the code. IMHO, I think it becomes quite simple: to generate the code, run "go run gen*.go". Please review https://codereview.appspot.com/153220043 instead.
Sign in to reply to this message.
SGTM. I'll remove the generation code from this CL once the other one is checked in. On Wed, Oct 8, 2014 at 10:34 AM, <nigeltao@golang.org> wrote: > In an effort to cut this CL into managable chunks, I split off the > Unicode tables and their generator code into a separate CL. > > In doing so, it wasn't trivial for me to bootstrap the "go test > --tags=generate" process, since the Venn diagram of what .go files are > involved at separate stages wasn't immediately obvious to me. > > In general, it always seemed hacky to use "go test" to run the > generation process. It's not obvious that e.g. gen_test.go and > gentest_test.go aren't actually tests, but is effectively a program, > "package main", in disguise, that runs in an init function. > > Rather than hijacking the "go test" mechanism to share the trieval code > between a generator program and its generated package, I'd rather > explicitly share the code. IMHO, I think it becomes quite simple: to > generate the code, run "go run gen*.go". > > Please review https://codereview.appspot.com/153220043 instead. > > https://codereview.appspot.com/122420043/ >
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode52 cases/casemap.go:52: // Transformers and safe a bit on runtime allocations. s/safe/save/ https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode140 cases/casemap.go:140: // Standard upper case does not need any look ahead so we can safely not use You usually spell "lookahead" without the space. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode176 cases/casemap.go:176: // state inter-word Why isn't inter-word vs first vs mid-word state not stored in the context, the same way as the c.isMidWord field? Would it be better to make the state machine explicit instead of being implied by gotos? for { switch c.lowerCaserState: case lcsInterWord: etc case lcsFirst: etc case lcsMidWord: etc } } return c.ret() https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode349 cases/casemap.go:349: c.writeString("ς") if !c.writeString("ς") { return false } Similarly elsewhere in this file. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode361 cases/casemap.go:361: c.copy() if !c.copy() { return false } Similarly elsewhere in this file. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode402 cases/casemap.go:402: c.writeBytes(b[:sz]) if !c.writeBytes(b[:sz]) { return false } Similarly elsewhere in this file. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode536 cases/casemap.go:536: for ; i <= maxIgnorable-1 && c.next(); i++ { I'm still confused. You subtract 1, but you also say <= instead of <, which means that this loop still executes maxIgnorable times. If that's really what you want, idiomatic is checking "< N" instead of "<= N-1". How about? // We check for up to maxIgnorable ignorables, up to and including the // U+0307 COMBINING DOT ABOVE, which is itself an ignorable. for i := 0; i < maxIgnorable && c.next(); i++ { https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode574 cases/casemap.go:574: return i > maxIgnorable How can this be true? https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode616 cases/casemap.go:616: for ; i <= maxIgnorable-1 && c.next(); i++ { Ditto. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode622 cases/casemap.go:622: return c.writeBytes(c.src[start:c.pSrc]) // ignore u+0307 s/u/U/ https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode634 cases/casemap.go:634: if i > maxIgnorable { Ditto. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode637 cases/casemap.go:637: c.writeString("ı") return ok && c.writeString(etc) && c.writeBytes(etc) https://codereview.appspot.com/122420043/diff/180001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/cases.go#newcode33 cases/cases.go:33: // A Caser may be stateful and should therefore not be shared between go "goroutines" is one word. https://codereview.appspot.com/122420043/diff/180001/cases/cases.go#newcode70 cases/cases.go:70: // No options (yet) for lowercase. I'm not sure why there's a comment here, but not a corresponding one in Upper or Title. Do you have options in mind? https://codereview.appspot.com/122420043/diff/180001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode12 cases/context.go:12: // writing to a a destination buffer. Typo in "a a". https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode22 cases/context.go:22: dst, src []byte It might be worth pulling up atEOF up here: type context struct { // dst, src and atEOF are the arguments to Transform. dst, src []byte atEOF bool pDst etc etc } Otherwise, it looks odd how context.Reset sets c.isMidWord but not c.atEOF. https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32 cases/context.go:32: info caseInfo // case information of currently scanned rune Maybe s/info/caseInfo/ https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode57 cases/context.go:57: if ok && c.err == nil { Can this just be if ok { https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode159 cases/context.go:159: func caseType(c *context) caseInfo { s/function/method/ ?
Sign in to reply to this message.
On 2014/10/08 09:29:29, mpvl wrote: > SGTM. I'll remove the generation code from this CL once the other one is > checked in. The generation code is checked in.
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode24 cases/casemap.go:24: // cause written bytes to be modified (analogous to ErrShortSrc). Updated comment to reflect reality: mapFunc doesn't have to return false in case of a ErrShortDst. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode52 cases/casemap.go:52: // Transformers and safe a bit on runtime allocations. On 2014/10/13 06:37:42, nigeltao wrote: > s/safe/save/ Done. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode140 cases/casemap.go:140: // Standard upper case does not need any look ahead so we can safely not use On 2014/10/13 06:37:42, nigeltao wrote: > You usually spell "lookahead" without the space. Done. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode176 cases/casemap.go:176: // state inter-word On 2014/10/13 06:37:42, nigeltao wrote: > Why isn't inter-word vs first vs mid-word state not stored in the context, the > same way as the c.isMidWord field? For lowercase it is not necessary to save the state. This allows me to create a lowercaser for the root locale without allocating memory (this is not done yet, but would be similar to the root uppercaser). > > Would it be better to make the state machine explicit instead of being implied > by gotos? I think gotos is often clearer in this case and gives also a better handle on performance (less unnecessary ifs). However, I changed it to an implementation without gotos. As there are really only two states, I just use an if statement. The performance is roughly equal to the previous implementation (seems about 0.5-1% slower, but hard to say). An advantage of this implementation is that it is closer to an implementation that stores the state and checkpoints more often. > > for { > switch c.lowerCaserState: > case lcsInterWord: > etc > case lcsFirst: > etc > case lcsMidWord: > etc > } > } > return c.ret() https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode349 cases/casemap.go:349: c.writeString("ς") On 2014/10/13 06:37:42, nigeltao wrote: > if !c.writeString("ς") { > return false > } > > Similarly elsewhere in this file. I realize my comment for mapFunc was not entirely correct. It is not necessary to return false in case of a ErrShortDst. I prefer no to add the if statements as it just adds additional checks and, making the code messier and possibly causing a performance hit (the check is likely to be far more often true than false for reasonable dst buffer sizes). However, I've at least made it such that if a write fails it will return false as well (though not strictly necessary, it helps to leave a casing loop early). In this particular case, though, it is not necessary, as call to next that follows will return false if there was an error. I would like to keep this check in next and checkpoint as this means I'm not relying on not missing out on an if statement. In general, it makes it much easier to reason about the correctness of the code. Let me know if you still strongly feel about this check given that it is not necessary that for a mapFunc to return false in case of a ErrShortDst and that in this case (and others) it will return false anyway. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode361 cases/casemap.go:361: c.copy() On 2014/10/13 06:37:42, nigeltao wrote: > if !c.copy() { > return false > } > > Similarly elsewhere in this file. Same as before. This is not necessary: not a requirement to return false if this fails, plus it will do so anyway as it is immediately followed by a next(). https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode402 cases/casemap.go:402: c.writeBytes(b[:sz]) On 2014/10/13 06:37:42, nigeltao wrote: > if !c.writeBytes(b[:sz]) { > return false > } > > Similarly elsewhere in this file. Acknowledged. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode536 cases/casemap.go:536: for ; i <= maxIgnorable-1 && c.next(); i++ { I've simplified this now. The choice of maxIgnorable is somewhat arbitrary. So I now say that both input and output should not span maxIgnorable, which is also a reasonable choice. This makes all checks identical except for finalSigma, where the check involves checking ignorables followed by a cased letter, rather than another ignorable. Anyway, I hope it is clearer now. I've also simplified the loop for finalSigma and added a comment there to explain the difference in treatment. On 2014/10/13 06:37:42, nigeltao wrote: > I'm still confused. You subtract 1, but you also say <= instead of <, which > means that this loop still executes maxIgnorable times. If that's really what > you want, idiomatic is checking "< N" instead of "<= N-1". How about? > > // We check for up to maxIgnorable ignorables, up to and including the > // U+0307 COMBINING DOT ABOVE, which is itself an ignorable. > for i := 0; i < maxIgnorable && c.next(); i++ { https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode574 cases/casemap.go:574: return i > maxIgnorable On 2014/10/13 06:37:42, nigeltao wrote: > How can this be true? Good catch. It is safe to always return false here, it is just not the most efficient as it would result in a ErrShortSrc unnecessarily. That's why the tests still passed. I've modified the tests to check the correct errors codes are returned. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode616 cases/casemap.go:616: for ; i <= maxIgnorable-1 && c.next(); i++ { On 2014/10/13 06:37:42, nigeltao wrote: > Ditto. Done. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode622 cases/casemap.go:622: return c.writeBytes(c.src[start:c.pSrc]) // ignore u+0307 On 2014/10/13 06:37:42, nigeltao wrote: > s/u/U/ Done. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode634 cases/casemap.go:634: if i > maxIgnorable { On 2014/10/13 06:37:42, nigeltao wrote: > Ditto. Done. https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode637 cases/casemap.go:637: c.writeString("ı") On 2014/10/13 06:37:42, nigeltao wrote: > return ok && c.writeString(etc) && c.writeBytes(etc) This is incorrect: returning false means that that it is not safe to checkpoint as the written output may change if we have more bytes. ok == false means we ran out of source bytes. This still is okay if atEOF == true. So we need to write the bytes even if ok == false. It is fine checking for ok last, though. https://codereview.appspot.com/122420043/diff/180001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/cases.go#newcode33 cases/cases.go:33: // A Caser may be stateful and should therefore not be shared between go On 2014/10/13 06:37:42, nigeltao wrote: > "goroutines" is one word. Done. https://codereview.appspot.com/122420043/diff/180001/cases/cases.go#newcode70 cases/cases.go:70: // No options (yet) for lowercase. I do (e.g. ignoring the final sigma handling), but I've removed the comment. On 2014/10/13 06:37:42, nigeltao wrote: > I'm not sure why there's a comment here, but not a corresponding one in Upper or > Title. Do you have options in mind? https://codereview.appspot.com/122420043/diff/180001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode12 cases/context.go:12: // writing to a a destination buffer. On 2014/10/13 06:37:42, nigeltao wrote: > Typo in "a a". Done. https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode22 cases/context.go:22: dst, src []byte It makes the struct about 7% bigger. This difference is measurable but, albeit hardly. Changed it. On 2014/10/13 06:37:42, nigeltao wrote: > It might be worth pulling up atEOF up here: > > type context struct { > // dst, src and atEOF are the arguments to Transform. > dst, src []byte > atEOF bool > > pDst etc > etc > } > > Otherwise, it looks odd how context.Reset sets c.isMidWord but not c.atEOF. https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32 cases/context.go:32: info caseInfo // case information of currently scanned rune On 2014/10/13 06:37:42, nigeltao wrote: > Maybe s/info/caseInfo/ Done. Although I'm a bit worried we may end up with 2 tries at some point. We'll see. For now we're good. https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode57 cases/context.go:57: if ok && c.err == nil { On 2014/10/13 06:37:42, nigeltao wrote: > Can this just be > if ok { This would require mapFuncs to always consistently return false on ErrShortDst. With this check it doesn't matter. I like the fact that I can guarantee no checkpoint is written after a write failure without looking at every specific mapFunc implementation. https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode159 cases/context.go:159: func caseType(c *context) caseInfo { On 2014/10/13 06:37:42, nigeltao wrote: > s/function/method/ ? Done.
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/180001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32 cases/context.go:32: info caseInfo // case information of currently scanned rune On 2014/10/13 18:36:54, mpvl wrote: > On 2014/10/13 06:37:42, nigeltao wrote: > > Maybe s/info/caseInfo/ > > Done. Although I'm a bit worried we may end up with 2 tries at some point. We'll > see. For now we're good. Heh, I meant to change the other way, but c'est la vie. It's fine as "info info". https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode22 cases/casemap.go:22: // version to the same context. It may advance the context to the next rune. It I found "It returns false when future runes may cause written bytes to be modified (analogous to ErrShortSrc)" hard to understand. How about: // etc. It // returns whether a checkpoint is possible: whether the pDst bytes written to // dst so far won't need changing as we see more source bytes. https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode374 cases/casemap.go:374: // A case ignorable may also introduce a word break, so me may need s/me/we/ https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go File cases/casemap_test.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go#ne... cases/casemap_test.go:226: const minBufSize = norm.MaxSegmentSize Out of scope of this CL, but I saw that norm.MaxSegmentSize doesn't have a doc comment. An ignorant-of-Unicode question, but does "segment" here have a precise meaning, or is simply the general-English meaning synonymous with "chunk", "part" or "piece"? https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go#ne... cases/casemap_test.go:441: t.Errorf("%d:%s: error was %v; want %v", i, tt.desc, err, tt.firstErr) It can be hard to tell where the test description end and the error message begins, since the description also contains ":". How about replacing "%d:%s: " with "%d:%s:\n", and similarly below. https://codereview.appspot.com/122420043/diff/220001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode50 cases/context.go:50: if c.atEOF && c.pSrc == len(c.src) || c.nSrc == len(c.src) { Does c.pSrc need to be c.pSrc+c.sz here and on the next line?? https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode51 cases/context.go:51: return c.pDst, c.pSrc, nil If we get here by the second condition (c.nSrc == len(c.src)), are you sure that we want c.pDst instead of c.nDst?? How about: // ret returns the Transform method's return values. func (c *context) ret() (nDst, nSrc int, err error) { // If we hit an error, or if the checkpoint was at the end of the source // data, then return the checkpoint. if c.err != nil || c.nSrc == len(c.src) { return c.nDst, c.nSrc, c.err } // There is an implicit checkpoint if we are at the end of the source data // and that source data is atEOF. if c.atEOF && c.pSrc == len(c.src) { return c.pDst, c.pSrc, nil } // nSrc < len(c.src), so return ErrShortSrc. return c.nDst, c.nSrc, transform.ErrShortSrc } https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode159 cases/context.go:159: // caseType returns a info with only the case bits, normalized to either s/a info/an info/ https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go File cases/context_test.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:17: const lastRune = rune(0x1FFFF) Delete this, use lastRuneForTesting instead. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:68: Move the contextFromRune definition up here, since it's used by TestCaseProperties and other tests up here, not just TestContext? https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:91: func TestMapping(t *testing.T) { If applyFunc is used only by this test, I'd make it local to this test (and s/applyFunc/apply/): func TestMapping(t *testing.T) { apply := func(r rune, f func(c *context) bool) string { etc } etc } https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:105: if _, ok := special[r]; ok { Isn't this just for r := range special { modulo map iteration order? If you really want a stable order (even though testing a rune should be independent), you could accumulate special's keys beforehand and sort them. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:110: t.Errorf("lower:%q (%U): got %q %U; want %q %U", r, rune(r), got, []rune(got), want, []rune(want)) s/rune(r)/r/ and similarly below. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:160: Delete the blank line. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:174: atEOF = true Having these in global scope seems weird, and while everything's actually correct, it seems potentially confusing how the atEOF parameter to the Caser.Transform method shadows this global constant. I'd delete these constants (see below). https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:190: src, out string Since your slice literal's elements are one-field-per-line, I'd split this line: src string out string https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:194: prefix prefix I'm not convinced that the "type prefix struct" is worth it. This field could easily be two: prefixArg string prefixWant bool https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:195: op []func() I'd pluralize the name of a slice: s/op/ops/ Alternatively, instead of a []func(), you could have this field be a string, like "next;next;checkpoint;next", and below, instead of for _, f := range tt.op { f() } you'd have: for _, op := range strings.Split(tt.ops, ";") { switch op { case "next": c.next() case "checkpoint": c.checkpoint() etc } } You could then delete the "var c context" above. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:197: { BTW, I like to indent slice-of-struct literals like this: tests := []struct { etc }{{ etc }, { etc }, { etc }} as the average indentation is one less, and the struct definition fields vertically align with the struct value fields. Similarly in the other _test.go files. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:200: atEOF, Instead of an atEOF named constant, I'd name the fields (which also helps e.g. remember which field is src and which is out): dstSize: 10, atEOF: true, src: "12", etc. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:206: prefix{"x", false}, I'm not sure why there's an "x" here. I'd expect either {"", true} or {"1", false}. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:445: t.Errorf("%s: prefix was %v; want %v", tt.desc, got, tt.prefix.res) It can be hard to tell where the test description end and the error message begins, since the description also contains ":". How about replacing "%s: " with "%s:\n", and similarly below.
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/180001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32 cases/context.go:32: info caseInfo // case information of currently scanned rune On 2014/10/21 05:42:00, nigeltao wrote: > On 2014/10/13 18:36:54, mpvl wrote: > > On 2014/10/13 06:37:42, nigeltao wrote: > > > Maybe s/info/caseInfo/ > > > > Done. Although I'm a bit worried we may end up with 2 tries at some point. > We'll > > see. For now we're good. > > Heh, I meant to change the other way, but c'est la vie. It's fine as "info > info". Acknowledged. https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode22 cases/casemap.go:22: // version to the same context. It may advance the context to the next rune. It On 2014/10/21 05:42:00, nigeltao wrote: > I found "It returns false when future runes may cause written bytes to be > modified (analogous to ErrShortSrc)" hard to understand. How about: > > // etc. It > // returns whether a checkpoint is possible: whether the pDst bytes written to > // dst so far won't need changing as we see more source bytes. Done. https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode374 cases/casemap.go:374: // A case ignorable may also introduce a word break, so me may need On 2014/10/21 05:42:00, nigeltao wrote: > s/me/we/ Done. https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go File cases/casemap_test.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go#ne... cases/casemap_test.go:226: const minBufSize = norm.MaxSegmentSize I use Segment in norm to refer to a starter followed by a sequence of non-starters, representing a single character. Character is somewhat overloaded and ambiguous term, so we need something else. There is not really a word for this in the Unicode standard, as far as I can tell. The closest thing to it is probably Grapheme Cluster (e.g. referred to in the Segmentation annex), but this is not necessarily the same, so I used the more generic term Segment. On 2014/10/21 05:42:00, nigeltao wrote: > Out of scope of this CL, but I saw that norm.MaxSegmentSize doesn't have a doc > comment. > > An ignorant-of-Unicode question, but does "segment" here have a precise meaning, > or is simply the general-English meaning synonymous with "chunk", "part" or > "piece"? https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go#ne... cases/casemap_test.go:441: t.Errorf("%d:%s: error was %v; want %v", i, tt.desc, err, tt.firstErr) On 2014/10/21 05:42:00, nigeltao wrote: > It can be hard to tell where the test description end and the error message > begins, since the description also contains ":". How about replacing "%d:%s: " > with "%d:%s:\n", and similarly below. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode50 cases/context.go:50: if c.atEOF && c.pSrc == len(c.src) || c.nSrc == len(c.src) { On 2014/10/21 05:42:00, nigeltao wrote: > Does > c.pSrc > need to be > c.pSrc+c.sz > here and on the next line?? Conceptually, yes. However, this point will only be reached if there is no ErrShortDst. This means that the source was exhausted, and that c.sz must be 0. It is fine to add the c.sz in both cases, though, so if you want I can still do that. https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode51 cases/context.go:51: return c.pDst, c.pSrc, nil As c.pSrc increases monotonically, the invariant c.pSrc >= c.nSrc holds. So if c.nDst == len(c.src), c.pSrc == len(c.src) must be true as well, as well as c.nSrc == c.pSrc. But I'm fine with this change. On 2014/10/21 05:42:00, nigeltao wrote: > If we get here by the second condition (c.nSrc == len(c.src)), are you sure that > we want c.pDst instead of c.nDst?? How about: > > // ret returns the Transform method's return values. > func (c *context) ret() (nDst, nSrc int, err error) { > // If we hit an error, or if the checkpoint was at the end of the source > // data, then return the checkpoint. > if c.err != nil || c.nSrc == len(c.src) { > return c.nDst, c.nSrc, c.err > } > // There is an implicit checkpoint if we are at the end of the source data > // and that source data is atEOF. > if c.atEOF && c.pSrc == len(c.src) { > return c.pDst, c.pSrc, nil > } > // nSrc < len(c.src), so return ErrShortSrc. > return c.nDst, c.nSrc, transform.ErrShortSrc > } https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode159 cases/context.go:159: // caseType returns a info with only the case bits, normalized to either On 2014/10/21 05:42:00, nigeltao wrote: > s/a info/an info/ Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go File cases/context_test.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:17: const lastRune = rune(0x1FFFF) On 2014/10/21 05:42:01, nigeltao wrote: > Delete this, use lastRuneForTesting instead. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:68: On 2014/10/21 05:42:01, nigeltao wrote: > Move the contextFromRune definition up here, since it's used by > TestCaseProperties and other tests up here, not just TestContext? done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:91: func TestMapping(t *testing.T) { On 2014/10/21 05:42:01, nigeltao wrote: > If applyFunc is used only by this test, I'd make it local to this test (and > s/applyFunc/apply/): > > func TestMapping(t *testing.T) { > apply := func(r rune, f func(c *context) bool) string { > etc > } > etc > } I'm getting a bit mixed recommendations from reviewers here, but I'm fine with what your suggestion. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:105: if _, ok := special[r]; ok { On 2014/10/21 05:42:01, nigeltao wrote: > Isn't this just > for r := range special { > modulo map iteration order? If you really want a stable order (even though > testing a rune should be independent), you could accumulate special's keys > beforehand and sort them. No. It is the inverse set of the runes in special. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:110: t.Errorf("lower:%q (%U): got %q %U; want %q %U", r, rune(r), got, []rune(got), want, []rune(want)) On 2014/10/21 05:42:00, nigeltao wrote: > s/rune(r)/r/ and similarly below. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:160: On 2014/10/21 05:42:00, nigeltao wrote: > Delete the blank line. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:174: atEOF = true On 2014/10/21 05:42:01, nigeltao wrote: > Having these in global scope seems weird, and while everything's actually > correct, it seems potentially confusing how the atEOF parameter to the > Caser.Transform method shadows this global constant. I'd delete these constants > (see below). Done. Added field names in the literal struct, though, to clarify what the values mean. done turned out to be unused (more or less replaced by err). https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:190: src, out string On 2014/10/21 05:42:01, nigeltao wrote: > Since your slice literal's elements are one-field-per-line, I'd split this line: > src string > out string Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:194: prefix prefix On 2014/10/21 05:42:01, nigeltao wrote: > I'm not convinced that the "type prefix struct" is worth it. This field could > easily be two: > prefixArg string > prefixWant bool Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:195: op []func() On 2014/10/21 05:42:00, nigeltao wrote: > I'd pluralize the name of a slice: s/op/ops/ > > Alternatively, instead of a []func(), you could have this field be a string, > like "next;next;checkpoint;next", and below, instead of > > for _, f := range tt.op { > f() > } > > you'd have: > > for _, op := range strings.Split(tt.ops, ";") { > switch op { > case "next": > c.next() > case "checkpoint": > c.checkpoint() > etc > } > } > > You could then delete the "var c context" above. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:197: { On 2014/10/21 05:42:00, nigeltao wrote: > BTW, I like to indent slice-of-struct literals like this: > > tests := []struct { > etc > }{{ > etc > }, { > etc > }, { > etc > }} > > as the average indentation is one less, and the struct definition fields > vertically align with the struct value fields. > > Similarly in the other _test.go files. Interesting. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:200: atEOF, On 2014/10/21 05:42:00, nigeltao wrote: > Instead of an atEOF named constant, I'd name the fields (which also helps e.g. > remember which field is src and which is out): > > dstSize: 10, > atEOF: true, > src: "12", > etc. Done. https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:206: prefix{"x", false}, On 2014/10/21 05:42:01, nigeltao wrote: > I'm not sure why there's an "x" here. I'd expect either {"", true} or {"1", > false}. added comment https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne... cases/context_test.go:445: t.Errorf("%s: prefix was %v; want %v", tt.desc, got, tt.prefix.res) On 2014/10/21 05:42:01, nigeltao wrote: > It can be hard to tell where the test description end and the error message > begins, since the description also contains ":". How about replacing "%s: " with > "%s:\n", and similarly below. Done.
Sign in to reply to this message.
I ran out of time today, and haven't looked closely at the latest round, but I've dropped a couple of (maybe obsolete) replies. https://codereview.appspot.com/122420043/diff/220001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode50 cases/context.go:50: if c.atEOF && c.pSrc == len(c.src) || c.nSrc == len(c.src) { On 2014/10/21 15:15:49, mpvl wrote: > On 2014/10/21 05:42:00, nigeltao wrote: > > Does > > c.pSrc > > need to be > > c.pSrc+c.sz > > here and on the next line?? > > Conceptually, yes. However, this point will only be reached if there is no > ErrShortDst. This means that the source was exhausted, and that c.sz must be 0. > > It is fine to add the c.sz in both cases, though, so if you want I can still do > that. I'd rather add c.sz in both cases, because I still think that it'll be cleaner if pSrc points after instead of before the rune read by c.next, but I can try that out in a follow-up CL to see if it actually simplifies things. https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode51 cases/context.go:51: return c.pDst, c.pSrc, nil On 2014/10/21 15:15:49, mpvl wrote: > As c.pSrc increases monotonically, the invariant c.pSrc >= c.nSrc holds. So if > c.nDst == len(c.src), c.pSrc == len(c.src) must be true as well, as well as > c.nSrc == c.pSrc. I get that c.pSrc == len(c.src) implies c.nSrc == len(c.src) but I'm worried about dst pointers, not source pointers, and whether returning c.pDst, etc instead of c.nDst etc is correct if we catch the c.nSrc == len(c.src) case.
Sign in to reply to this message.
https://codereview.appspot.com/122420043/diff/220001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode50 cases/context.go:50: if c.atEOF && c.pSrc == len(c.src) || c.nSrc == len(c.src) { On 2014/10/22 06:52:58, nigeltao wrote: > On 2014/10/21 15:15:49, mpvl wrote: > > On 2014/10/21 05:42:00, nigeltao wrote: > > > Does > > > c.pSrc > > > need to be > > > c.pSrc+c.sz > > > here and on the next line?? > > > > Conceptually, yes. However, this point will only be reached if there is no > > ErrShortDst. This means that the source was exhausted, and that c.sz must be > 0. > > > > It is fine to add the c.sz in both cases, though, so if you want I can still > do > > that. > > I'd rather add c.sz in both cases, because I still think that it'll be cleaner > if pSrc points after instead of before the rune read by c.next, but I can try > that out in a follow-up CL to see if it actually simplifies things. Not sure if you saw the comment I added. Adding c.sz does seem to degrade performance a small amount (about 1-3%, using BenchmarkSigmaLower, consistent over 10 measurements), so I rather not. If you feel strongly about it, though, I'll add the c.sz. https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode51 cases/context.go:51: return c.pDst, c.pSrc, nil On 2014/10/22 06:52:58, nigeltao wrote: > On 2014/10/21 15:15:49, mpvl wrote: > > As c.pSrc increases monotonically, the invariant c.pSrc >= c.nSrc holds. So if > > c.nDst == len(c.src), c.pSrc == len(c.src) must be true as well, as well as > > c.nSrc == c.pSrc. > > I get that c.pSrc == len(c.src) implies c.nSrc == len(c.src) but I'm worried > about dst pointers, not source pointers, and whether returning c.pDst, etc > instead of c.nDst etc is correct if we catch the c.nSrc == len(c.src) case. Same logic would apply. Either way, I moved the check as per your suggestion.
Sign in to reply to this message.
LGTM, but wait for r. Thanks for your patience with this code review. One last thing: I'd rename casemap.go to map.go (and likewise for casemap_test.go) before you submit, since the file is already in the "cases/" directory. https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode374 cases/casemap.go:374: // A case ignorable may also introduce a word break, so me may need On 2014/10/21 15:15:49, mpvl wrote: > On 2014/10/21 05:42:00, nigeltao wrote: > > s/me/we/ > > Done. Not done? https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode51 cases/casemap.go:51: // Many Uppercase mappers are reentrant, so we can precompute the s/Uppercase// if this comment also applies to lower casers?? If so, add a blank line after the final "//" so that it doesn't look like the comment only applies to the upperFunc variable on the subsequent line? Also s/reentrant/stateless/ ?? It's not like an upper caser calls something that calls back into that upper caser. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode231 cases/casemap.go:231: p = t.rewrite(c) I don't know if you have plans for other rewrites, but AFAICT t.rewrite can either be nil or afnlRewrite and afnlRewrite doesn't change c.info, so you could change this to if t.rewrite != nil { t.rewrite(c) } and rewrite changes from "func(*context) info" to "func(*context)". After that, "rewrite" doesn't seem like quite the right name. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode234 cases/casemap.go:234: wasMid := c.info.isCaseIgnorableAndNonBreakStarter() Do you mean s/c.info/p/ ?? https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode296 cases/casemap.go:296: ln := (e[0] >> lengthBits) & lengthMask I'd s/ln/n/ unless there's some meaning for "ln" that I'm missing. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode483 cases/casemap.go:483: // modifier if the last rune of the decomposition is in [0x300-0x311]. Unless I'm missing some subtlety, I'd change "0x300" to "\u0300" or "U+0300" since for me, "0x" means bytes. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode497 cases/casemap.go:497: // that is less than 230 (Above). We will insert the 0x0307, if Ditto. https://codereview.appspot.com/122420043/diff/240001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/240001/cases/context.go#newcode149 cases/context.go:149: func (c *context) hasPrefix(s string) bool { Is s always going to be a one-rune string? There may be an opportunity to optimize and/or simplify, but I guess that that can be a follow-up CL.
Sign in to reply to this message.
Thanks for the very thorough review! It was worth the wait. :) https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode374 cases/casemap.go:374: // A case ignorable may also introduce a word break, so me may need On 2014/10/23 06:29:44, nigeltao wrote: > On 2014/10/21 15:15:49, mpvl wrote: > > On 2014/10/21 05:42:00, nigeltao wrote: > > > s/me/we/ > > > > Done. > > Not done? Take 2. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode51 cases/casemap.go:51: // Many Uppercase mappers are reentrant, so we can precompute the On 2014/10/23 06:29:44, nigeltao wrote: > s/Uppercase// if this comment also applies to lower casers?? If so, add a blank > line after the final "//" so that it doesn't look like the comment only applies > to the upperFunc variable on the subsequent line? > > Also s/reentrant/stateless/ ?? It's not like an upper caser calls something that > calls back into that upper caser. This only applies to some uppercase mappers. Lowercasers use simpleCaser. which is not stateless. Changed reentrant to stateless. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode231 cases/casemap.go:231: p = t.rewrite(c) On 2014/10/23 06:29:44, nigeltao wrote: > I don't know if you have plans for other rewrites, but AFAICT t.rewrite can > either be nil or afnlRewrite and afnlRewrite doesn't change c.info, so you could > change this to > if t.rewrite != nil { > t.rewrite(c) > } > and rewrite changes from "func(*context) info" to "func(*context)". After that, > "rewrite" doesn't seem like quite the right name. It was changing it before, but not anymore. Good catch. Removing it. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode234 cases/casemap.go:234: wasMid := c.info.isCaseIgnorableAndNonBreakStarter() On 2014/10/23 06:29:44, nigeltao wrote: > Do you mean s/c.info/p/ ?? Yes, changed. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode296 cases/casemap.go:296: ln := (e[0] >> lengthBits) & lengthMask On 2014/10/23 06:29:44, nigeltao wrote: > I'd s/ln/n/ unless there's some meaning for "ln" that I'm missing. Done. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode483 cases/casemap.go:483: // modifier if the last rune of the decomposition is in [0x300-0x311]. On 2014/10/23 06:29:44, nigeltao wrote: > Unless I'm missing some subtlety, I'd change "0x300" to "\u0300" or "U+0300" > since for me, "0x" means bytes. Done. https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode497 cases/casemap.go:497: // that is less than 230 (Above). We will insert the 0x0307, if On 2014/10/23 06:29:44, nigeltao wrote: > Ditto. Done. https://codereview.appspot.com/122420043/diff/240001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/240001/cases/context.go#newcode149 cases/context.go:149: func (c *context) hasPrefix(s string) bool { On 2014/10/23 06:29:44, nigeltao wrote: > Is s always going to be a one-rune string? There may be an opportunity to > optimize and/or simplify, but I guess that that can be a follow-up CL. Yes, but we need a decoded rune, hence the string. Possibly faster, indeed, would be to pass b [4]byte, size uint8, but I would consider that option later.
Sign in to reply to this message.
LGTM For future big packages like this, please make it easier for the reviewers by starting with a small, maybe even non-functioning CL that defines the API and maybe a couple of examples. Once a huge blob of code lands, it's hard to see the overall design, and that is more important than the implementation. Once the API is approved and checked in, the implementation can start and, I hope, be reviewed in manageable pieces.
Sign in to reply to this message.
Thanks for the review. I did write down the gist of the design a while back in https://docs.google.com/document/d/1Q64ktYh7XptpEI3L2G7xYqsusohOeLUft865Zd7fb..., although granted it has changed a bit. Most notably, I've renamed Transform to Caser and dropped some of its functionality as well as the Mappings type, and changed the way options work. But will check in a skeleton design asap next time. On Thu, Nov 6, 2014 at 12:42 AM, <r@golang.org> wrote: > LGTM > > For future big packages like this, please make it easier for the > reviewers by starting with a small, maybe even non-functioning CL that > defines the API and maybe a couple of examples. Once a huge blob of code > lands, it's hard to see the overall design, and that is more important > than the implementation. > > Once the API is approved and checked in, the implementation can start > and, I hope, be reviewed in manageable pieces. > > https://codereview.appspot.com/122420043/ >
Sign in to reply to this message.
*** Submitted as https://code.google.com/p/go/source/detail?r=f230eca7a943&repo=text *** go.text/cases: first stab at cases package. Overview Includes language-specific casing. Title caser uses a close approximation of the Unicode Word Breaking Algorithm. Most information needed for case mapping is stored in a single trie. Core files: cases.go: API casemap.go: per-language, high-level case mappings context.go: trie lookup wrapper and state management LGTM=nigeltao, r R=nigeltao, r CC=golang-codereviews https://codereview.appspot.com/122420043
Sign in to reply to this message.
|