Issue 122420043: code review 122420043: go.text/cases: first stab at cases package.

mpvl

Hello nigeltao@golang.org (cc: golang-codereviews@googlegroups.com, r@golang.org), I'd like you to review this change to https://code.google.com/p/go.text

10 years, 6 months ago (2014-08-12 19:18:17 UTC) #1

nigeltao

I haven't digested all of this yet, but here's a first round of mostly trivial ...

10 years, 6 months ago (2014-08-13 06:26:31 UTC) #2

nigeltao

In the CL description: s/gettest/gentest/ and s/differes/differs/

10 years, 6 months ago (2014-08-13 06:27:07 UTC) #3

mpvl

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go File cases/cases.go (right): https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35 cases/cases.go:35: // to Bytes or String or to reuse a ...

10 years, 6 months ago (2014-08-13 15:26:06 UTC) #4

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go
File cases/cases.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35
cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has
returned a nil error.
On 2014/08/13 06:26:29, nigeltao wrote:
> Change "it" to "its Transform method"?

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35
cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has
returned a nil error.
On 2014/08/13 06:26:29, nigeltao wrote:
> Would it be safer to supply an explicit Reset method instead of relying on
> callers carefully checking for nil/non-nil error?
A few points on this:
- I'm okay with this, as long as it doesn't have to be called after using
Bytes() or String().

However, if we do this it is good to take a step back and look at Transformers.
So far, almost no Transformer has state (Chain is a notable exception and it
could benefit from a Reset method). Also, only the Title Caser has minimal
state.

We could add a Reset(t Transformer) to transform that will call Reset on a
transformer if it implements Reset or either 1) relies on all Transformers with
state to implement it, or 2) will rely on a nil error to mean the state is
clear. In both cases we need to augment the spec for Transfomer.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode13
cases/context.go:13: // This file contains the high-level lookup code for casing
properties as well
Yes, of course. My bad.

On 2014/08/13 06:26:29, nigeltao wrote:
> I like to put file-level comments between the package and import lines, not
> after the import:
> 
> ----
> // Copyright
> 
> package foo
> 
> // This file etc.
> 
> import (
> ----

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode45
cases/context.go:45: func makeContext(dst, src []byte, atEOF bool) context {
This is deliberate. It is easier to use this and ensure the returned context
does not escape. Granted, so far I did not have any implementation that did not
escape because of the use of function pointers, but I added a special case for
the majority-case upper caser where this is not the case. (sorry for that
sentence :-) ). This gives a speedup of about 50% for small strings (never mind
the avoiding of garbage) and a noticeable speedup even for larger strings. 
I considered making it a member variable for other implementations, but this did
not give worthy enough speedup (1-2%) considering the drawbacks.

Would be good to make such a special case for lowercase as well, as it is the
dominant use case. Added a TODO for this.

On 2014/08/13 06:26:29, nigeltao wrote:
> Return a *context instead of a context?

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode91
cases/context.go:91: c.doTitle = true
On 2014/08/13 06:26:29, nigeltao wrote:
> Add a comment to say why?? I was surprised that c.ret() can modify c's state.
The way context is designed is that, by default, the case mappers do not deal
with ErrShortSrc and ErrShortDst, only context does this. c.ret translates the
checkpoints and other buffer pointers to the ErrShortSrc if necessary. It makes
it very easy to verify that the code is correct. Note that all Transforms use
c.ret before returning. The only tricky part is that a checkpoint should always
be called before writing any rune in title case. I remove this tricky part by
having a separate retTitle where I always return ErrShortSrc if c.atEOF ==
false. This is not desirable for Lower or Upper as it has a small but noticeable
performance impact for them (less important and noticeable for title casing).

> I'm also surprised that the doTitle state depends on how big or small the
> Transform buffers are.
This block does not signify a short or long buffer, but rather whether
processing has completed. A
If it has not, and if

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode105
cases/context.go:105: // done returns true if Transform should terminate
immediately.
On 2014/08/13 06:26:29, nigeltao wrote:
> I'd s/true if/whether/ and similarly below and in trieval.go.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode122
cases/context.go:122: func (c *context) insertString(str string) {
On 2014/08/13 06:26:30, nigeltao wrote:
> Personally, I'd shorten "str" to just "s".

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode140
cases/context.go:140: func (c *context) writeString(str string) {
On 2014/08/13 06:26:30, nigeltao wrote:
> s/str/s/

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode150
cases/context.go:150: func (c *context) writeBytes(str []byte) {
On 2014/08/13 06:26:30, nigeltao wrote:
> s/str/b/, or s/str/s/ and similarly for insertBytes.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode150
cases/context.go:150: func (c *context) writeBytes(str []byte) {
On 2014/08/13 06:26:30, nigeltao wrote:
> If you changed the exceptions variable from a [...]byte to a string, then you
> could delete this method, and use writeString instead.

Nice. Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode171
cases/context.go:171: // Fast path for 6-bit XOR pattern, which covers the fast
majority of
On 2014/08/13 06:26:29, nigeltao wrote:
> Change the last "fast" to "vast"? Or just say "which covers most cases."

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode201
cases/context.go:201: e := c.info
On 2014/08/13 06:26:29, nigeltao wrote:
> You could delete this line, and on the next:
> cm := c.info & 0x7
> Below, "e&0x3" becomes "cm&0x3".

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode225
cases/context.go:225: } else {
On 2014/08/13 06:26:29, nigeltao wrote:
> I'd add a return before this line, drop the else, and outdent the rest.
> Similarly in the upper and title methods.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode228
cases/context.go:228: c.writeBytes(e[1:][:ln])
On 2014/08/13 06:26:30, nigeltao wrote:
> Personally, I find
> e[1:1+ln]
> easier to read than
> e[1:][:ln]
> and similarly below.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode278
cases/context.go:278: nu := (e[0] >> lengthBits) & lengthMask
it is length of upper part. should be un, to be consistent. Changed to something
more meaningful.

On 2014/08/13 06:26:29, nigeltao wrote:
> I must confess that I don't understand where the "nu" name comes from.
> 
> It's also a little weird that ln is the low bits here, but the high bits in
func
> little and func upper.

https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go
File cases/context_test.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new...
cases/context_test.go:47: return case_ignorable[r] // in data_test.go
On 2014/08/13 06:26:30, nigeltao wrote:
> s/data/tables/, or just delete the comment.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new...
cases/context_test.go:97: t.Errorf("lowerSpecial:(%X): got %+q; want %+q", r,
got, want)
On 2014/08/13 06:26:30, nigeltao wrote:
> %U instead of %X?

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new...
cases/context_test.go:137: case 230:
On 2014/08/13 06:26:30, nigeltao wrote:
> s/230/above/ ?

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/context_test.go#new...
cases/context_test.go:139: default:
On 2014/08/13 06:26:30, nigeltao wrote:
> Delete.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go
File cases/gen_test.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode73
cases/gen_test.go:73: // chars contains runeInfo structs for all runes. It is
filled by parseUCD.
On 2014/08/13 06:26:30, nigeltao wrote:
> Have parseUCD return chars instead of sharing a global variable?

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode295
cases/gen_test.go:295: fmt.Fprintln(w, "\n}\n")
removed as I'm not generating a string. BTW, how do I run go vet on test files?

On 2014/08/13 06:26:30, nigeltao wrote:
> I'd say
> fmt.Fprint(w, "\n}\n\n")
> to keep "go vet" happy.

https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode301
cases/gen_test.go:301: func makeEntry(ri *runeInfo) {
It is a bit tricky to do so. entry should really be seen as a struct, or bit
union, to be correct. It is assembled in two functions (makeEntry and
makeException), and combining these two subresults is not a self-evident
operation. So I think it would actually make things worse.
I could do this only for makeEntry and let makeException modify it as a side
effect, but I rather be consistent.

On 2014/08/13 06:26:30, nigeltao wrote:
> Would it be easier if this returned caseInfo instead of setting ri.entry as a
> side-effect?

https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode529
cases/gen_test.go:529: // See context.go for the definition of lookup.
Redid comment.
On 2014/08/13 06:26:30, nigeltao wrote:
> Definition?

https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode536
cases/gen_test.go:536: if err == nil && e != nil {
On 2014/08/13 06:26:30, nigeltao wrote:
> Personally, I'd rename "e" to "err" and rename "err" to "retErr".

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/gen_test.go#newcode672
cases/gen_test.go:672: logger.Fatalf("%U: cased, but got break type %s; want
ALetter", r, c.CCC, c.BreakType)
On 2014/08/13 06:26:30, nigeltao wrote:
> The %-verbs don't match the arguments.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go
File cases/trieval.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode36
cases/trieval.go:36: casedMask      = 0x3 // uncased if 0
removed.

On 2014/08/13 06:26:31, nigeltao wrote:
> Is the comment necessary when you spell out cUncased == 0 (and cTitle, cLower
> and cUpper) below?

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode61
cases/trieval.go:61: // case runes to alternate for increasing rune values. We
use this property
On 2014/08/13 06:26:30, nigeltao wrote:
> Perhaps give an example of this, to help those of us familiar with only
ASCII's
> 'a' and 'A' not being consecutive.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode100
cases/trieval.go:100: // The case mapping implementation will need to know about
various CCC values.
On 2014/08/13 06:26:31, nigeltao wrote:
> I'd spell out at least once, maybe here, that CCC stands for Canonical
Combining
> Class.

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode133
cases/trieval.go:133: iotaSubscrupt = 240
On 2014/08/13 06:26:30, nigeltao wrote:
> Typo in "script".

Done.

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode146
cases/trieval.go:146: // for title casing. For example, a4a -> A4a, but
a\u03004a -> A\u03004A.
I think it needs to be a Number class. How about replacing \u0300 with ~
(\u0308) and mentioning this?

On 2014/08/13 06:26:31, nigeltao wrote:
> Is it possible to use a different middle rune than '4'? It's not obvious that
> the last "4" doesn't belong to the "\u0300".

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode240
cases/trieval.go:240: noChange   = 0x7
0 is already reserved for a zero-length string. I think it is better to have the
length always as is and the "no change/ undefined" be the special value.

On 2014/08/13 06:26:30, nigeltao wrote:
> Is it easier (if possible?) to define noChange = 0?

Sign in to reply to this message.

nigeltao

Some more mostly trivial comments. I haven't had the time this week to give it ...

10 years, 6 months ago (2014-08-22 07:00:23 UTC) #5

Some more mostly trivial comments. I haven't had the time this week to give it a
proper look. Sorry.

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go
File cases/cases.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35
cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has
returned a nil error.
On 2014/08/13 15:26:02, mpvl wrote:
> On 2014/08/13 06:26:29, nigeltao wrote:
> > Would it be safer to supply an explicit Reset method instead of relying on
> > callers carefully checking for nil/non-nil error?
> A few points on this:
> - I'm okay with this, as long as it doesn't have to be called after using
> Bytes() or String().
> 
> However, if we do this it is good to take a step back and look at
Transformers.
> So far, almost no Transformer has state (Chain is a notable exception and it
> could benefit from a Reset method). Also, only the Title Caser has minimal
> state.
> 
> We could add a Reset(t Transformer) to transform that will call Reset on a
> transformer if it implements Reset or either 1) relies on all Transformers
with
> state to implement it, or 2) will rely on a nil error to mean the state is
> clear. In both cases we need to augment the spec for Transfomer.

Yes, there's a TODO in go.text/transform/transform.go to think about resets.

You said that almost no Transformer so far has state, other than Chain, and the
Title Caser. I don't know the grand scope of what else you have coming down the
pipe. Do you anticipate any more complicated (stateful) transformers?

I haven't settled on what I think is the best design, but a couple of thoughts:

We could add a (mandatory) Reset method to the Transformer interface, and have
the transform.Bytes and transform.String functions' implementations start with a
Reset call. I wouldn't expect that to be a huge performance cost, but I also
have no actual data.

I'm not sure if implicitly resetting after a nil error is entirely safe. You
could be io.Copy'ing to a destination io.Writer that fails for whatever reason,
and you might exit early out of your transformation loop without ever seeing a
nil error, and you have no way of explicitly resetting the Transformer for
re-use.

https://codereview.appspot.com/122420043/diff/40001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode91
cases/context.go:91: c.doTitle = true
On 2014/08/13 15:26:05, mpvl wrote:
> On 2014/08/13 06:26:29, nigeltao wrote:
> > I'm also surprised that the doTitle state depends on how big or small the
> > Transform buffers are.
> This block does not signify a short or long buffer, but rather whether
> processing has completed. A
> If it has not, and if 

Unfinished codereview response??

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go
File cases/trieval.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode240
cases/trieval.go:240: noChange   = 0x7
On 2014/08/13 15:26:05, mpvl wrote:
> 0 is already reserved for a zero-length string. I think it is better to have
the
> length always as is and the "no change/ undefined" be the special value.

Casing can consume or output a zero-length string?? I should really know well
enough that, even if it can, that won't be the strangest thing in Unicode.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode61
cases/casemap.go:61: titleInfos = []titleInfo{
Just a suggestion, but if you want, you don't need to name the titleInfo type:

titleInfos = []struct {
  title, lower func(*context)
  rewrite      func(*context) caseInfo
}{
  {title, lower, nil},                // und
  etc,
}

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode73
cases/casemap.go:73: // options are not used (yet), but ensure that all make*
functions have the same
I'd just drop the "options are not used (yet), but ... in the tests."

Similarly below.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode80
cases/casemap.go:80: // makeLower configures a Caser from a basic lowercase
implementation.
s/from/for/? Or just delete the comment entirely. It's not an exported function.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode83
cases/casemap.go:83: func makeLower(t language.Tag, o options) *lowerCaser {
returning a transform.Transformer would be consistent with makeUpper. Similarly
below, for makeTitle.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode251
cases/casemap.go:251: // determines whether a lowercased sigma should be σ or ς,
 by looking ahead for
Two spaces after the comma.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode320
cases/casemap.go:320: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342,
0x0308, 0x0304, 0x345:
Adding another "0" to "0x345" would be more consistent.

https://codereview.appspot.com/122420043/diff/60001/cases/cases.go
File cases/cases.go (right):

https://codereview.appspot.com/122420043/diff/60001/cases/cases.go#newcode37
cases/cases.go:37: type Caser struct {
If a Caser is just a transfomer.Transformer, then I'm not totally convinced of
the utility of this type. An alternative design is that cases.Lower, cases.Title
and cases.Upper all return a transform.Transformer, and instead of calling
caser.Bytes(b), you call transform.Bytes(caser, b).

I haven't really decided either way which design I prefer, as I've only got
example_test.go to go on, and haven't tried to use it for real, but perhaps
something to think about. One thing is do you expect users of the cases package
to almost always lower/title/upper-case individual words (i.e. call Bytes or
String), or will they often want to convert blocks of text (i.e. use it with
transform.NewReader).

Sign in to reply to this message.

mpvl

[sorry, had a bunch of replies, but forgot to send out.] In the mean time ...

10 years, 5 months ago (2014-09-03 07:51:54 UTC) #6

[sorry, had a bunch of replies, but forgot to send out.]

In the mean time warmed up more to the idea of making Reset part of the
interface. Will do some experiments to see now this looks.

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go
File cases/cases.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/cases.go#newcode35
cases/cases.go:35: // to Bytes or String or to reuse a Caser for after it has
returned a nil error.
On 2014/08/22 07:00:22, nigeltao wrote:
> On 2014/08/13 15:26:02, mpvl wrote:
> > On 2014/08/13 06:26:29, nigeltao wrote:
> > > Would it be safer to supply an explicit Reset method instead of relying on
> > > callers carefully checking for nil/non-nil error?
> > A few points on this:
> > - I'm okay with this, as long as it doesn't have to be called after using
> > Bytes() or String().
> > 
> > However, if we do this it is good to take a step back and look at
> Transformers.
> > So far, almost no Transformer has state (Chain is a notable exception and it
> > could benefit from a Reset method). Also, only the Title Caser has minimal
> > state.
> > 
> > We could add a Reset(t Transformer) to transform that will call Reset on a
> > transformer if it implements Reset or either 1) relies on all Transformers
> with
> > state to implement it, or 2) will rely on a nil error to mean the state is
> > clear. In both cases we need to augment the spec for Transfomer.
> 
> Yes, there's a TODO in go.text/transform/transform.go to think about resets.
> 
> You said that almost no Transformer so far has state, other than Chain, and
the
> Title Caser. I don't know the grand scope of what else you have coming down
the
> pipe. Do you anticipate any more complicated (stateful) transformers?
There are a few more, but most of them will be simpler and probably no state.
Note that most lower and upper are currently stateless. I could imagine having a
versions with state for this, for example, if users want to have an unlimited
number of modifiers or something alike.

> I haven't settled on what I think is the best design, but a couple of
thoughts:
> 
> We could add a (mandatory) Reset method to the Transformer interface, and have
> the transform.Bytes and transform.String functions' implementations start with
a
> Reset call. I wouldn't expect that to be a huge performance cost, but I also
> have no actual data.
Probably not a huge overhead indeed. I'm warming up to the idea, as it makes
things simple. It just seems a bit of a shame to do if most Transformers don't
need this.
I did some tests and an unconditional Reset is a lot less overhead than a
conditional one (not surprisingly so), so it could work.

> I'm not sure if implicitly resetting after a nil error is entirely safe.
For casing it is. It is pretty simple. For Chain I'm not sure and would need to
check, but we're talking about Transformer here.

> could be io.Copy'ing to a destination io.Writer that fails for whatever
reason,
> and you might exit early out of your transformation loop without ever seeing a
> nil error, and you have no way of explicitly resetting the Transformer for
> re-use.
Iterating until err == nil is explicit, just cumbersome. And probably more
expensive than unconditionally calling a Reset(). It is worth considering adding
a mandatory Reset to Transfomer, but probably separately from this CL?

https://codereview.appspot.com/122420043/diff/40001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/context.go#newcode91
cases/context.go:91: c.doTitle = true
Sometimes they disappear on me, and then I have to write them anew. So see the
previous one for that. Sometimes they come back, like this one, so please
ignore.

On 2014/08/22 07:00:22, nigeltao wrote:
> On 2014/08/13 15:26:05, mpvl wrote:
> > On 2014/08/13 06:26:29, nigeltao wrote:
> > > I'm also surprised that the doTitle state depends on how big or small the
> > > Transform buffers are.
> > This block does not signify a short or long buffer, but rather whether
> > processing has completed. A
> > If it has not, and if 
> 
> Unfinished codereview response??

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go
File cases/trieval.go (right):

https://codereview.appspot.com/122420043/diff/40001/cases/trieval.go#newcode240
cases/trieval.go:240: noChange   = 0x7
Yes! For example, the modifier "combining dot above" is dropped when lowercasing
in Turkish.

However, Looking at it more closely, though, this seems to only occur for
conditional mappings, which I do not include in this table. So 0 now means no
change. I kept the constant until a rune comes up that is removed under casing
or if conditional mappings are included in the map as well.

But indeed, stranger things are happening in Unicode, so this would not be
surprising. (Even with casing: runes that are case-ignorable AND cased; cased
letters that do not change when cased to the other form, etc.)

On 2014/08/22 07:00:22, nigeltao wrote:
> On 2014/08/13 15:26:05, mpvl wrote:
> > 0 is already reserved for a zero-length string. I think it is better to have
> the
> > length always as is and the "no change/ undefined" be the special value.
> 
> Casing can consume or output a zero-length string?? I should really know well
> enough that, even if it can, that won't be the strangest thing in Unicode.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode61
cases/casemap.go:61: titleInfos = []titleInfo{
Don't always like this approach, but here it is nicer. Done.

On 2014/08/22 07:00:22, nigeltao wrote:
> Just a suggestion, but if you want, you don't need to name the titleInfo type:
> 
> titleInfos = []struct {
>   title, lower func(*context)
>   rewrite      func(*context) caseInfo
> }{
>   {title, lower, nil},                // und
>   etc,
> }

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode73
cases/casemap.go:73: // options are not used (yet), but ensure that all make*
functions have the same
On 2014/08/22 07:00:22, nigeltao wrote:
> I'd just drop the "options are not used (yet), but ... in the tests."
> 
> Similarly below.

Done.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode80
cases/casemap.go:80: // makeLower configures a Caser from a basic lowercase
implementation.
I'll delete. 

It is actually from, as it is wrapping a basic lowercase function into a more
fancy one.

On 2014/08/22 07:00:22, nigeltao wrote:
> s/from/for/? Or just delete the comment entirely. It's not an exported
function.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode83
cases/casemap.go:83: func makeLower(t language.Tag, o options) *lowerCaser {
For makeUpper it has to be a Transformer, for the others not (yet).  I've
changed it for consistency and future proofness.

Had to move the noFinalSigma option from test to lib code to do this, but
ultimately something like that option should go there anyway.

On 2014/08/22 07:00:22, nigeltao wrote:
> returning a transform.Transformer would be consistent with makeUpper.
Similarly
> below, for makeTitle.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode251
cases/casemap.go:251: // determines whether a lowercased sigma should be σ or ς,
 by looking ahead for
On 2014/08/22 07:00:23, nigeltao wrote:
> Two spaces after the comma.

Done.

https://codereview.appspot.com/122420043/diff/60001/cases/casemap.go#newcode320
cases/casemap.go:320: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342,
0x0308, 0x0304, 0x345:
On 2014/08/22 07:00:22, nigeltao wrote:
> Adding another "0" to "0x345" would be more consistent.

Done.

https://codereview.appspot.com/122420043/diff/60001/cases/cases.go
File cases/cases.go (right):

https://codereview.appspot.com/122420043/diff/60001/cases/cases.go#newcode37
cases/cases.go:37: type Caser struct {
On 2014/08/22 07:00:23, nigeltao wrote:
> If a Caser is just a transfomer.Transformer, then I'm not totally convinced of
> the utility of this type. An alternative design is that cases.Lower,
cases.Title
> and cases.Upper all return a transform.Transformer, and instead of calling
> caser.Bytes(b), you call transform.Bytes(caser, b).
The vast majority of uses for casing are to operate on small pieces of text. The
will mostly mean not in a streaming setting.  If you compare

  s, _, _ := transform.String(cases.Lower(language.Und), "Foo")
  Use(s)

with 
  Use(cases.Lower(language.Und).String("Foo"))

or even: 
  Use(cases.Root.Lower().String("Foo"))  // not (yet?) implemented.

The first looks quite awkward. It will only drive people towards using
strings.Lower, even when this is the wrong choice.

There are two main uses for the Transformer interface. One of them is to make it
easy to implement some transformation that can be wrapped in a Reader or a
Writer. The other is to allow chaining together text transformations, even when
not used in a streaming setting.  The transforms in the ICU library are of the
latter ilk.
For example, to lowercase and normalize, one could write:
  t := transform.Chain(cases.Lower(language.Und), norm.NFC)

ICU uses a proprietary syntax for chaining transforms, go.text does it
programmatically, but the idea is roughly the same.

Either way, the vast majority of people will need simple transforms on short
strings, not dissimilar from the examples. Casing is also a bit special in that
there are no errors. This allows simplifying the API a bit. This is especially
useful if people care about casing but not about Transformers in general, which
I think will be common.

> 
> I haven't really decided either way which design I prefer, as I've only got
> example_test.go to go on, and haven't tried to use it for real, but perhaps
> something to think about. One thing is do you expect users of the cases
package
> to almost always lower/title/upper-case individual words (i.e. call Bytes or
> String), or will they often want to convert blocks of text (i.e. use it with
> transform.NewReader).

Sign in to reply to this message.

nigeltao

Another round of comments. I still haven't found the time to try to digest the ...

10 years, 5 months ago (2014-09-12 09:58:11 UTC) #7

mpvl

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode12 cases/casemap.go:12: "code.google.com/p/go.text/transform" On 2014/09/12 09:58:10, nigeltao wrote: > Move this ...

10 years, 5 months ago (2014-09-13 10:01:39 UTC) #8

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode12
cases/casemap.go:12: "code.google.com/p/go.text/transform"
On 2014/09/12 09:58:10, nigeltao wrote:
> Move this import to the second group (with the "code.google.com" prefix).

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode81
cases/casemap.go:81: // makeLower configures a Caser from a basic lowercase
implementation.
On 2014/09/12 09:58:10, nigeltao wrote:
> Delete this comment (and the one for makeUpper)? It still looks weird how one
is
> "for" and the other is "from".

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode146
cases/casemap.go:146: // checkpointing. lower case functions use for mid-word
runes should checkpoint
On 2014/09/12 09:58:10, nigeltao wrote:
> s/lower/Lower/

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode196
cases/casemap.go:196: // buffer to avoid possibilities for DNS attacks.
On 2014/09/12 09:58:09, nigeltao wrote:
> s/DNS/DoS/ ??

Done. Must be muscle memory or something; I keep typing one when I mean the
other.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode203
cases/casemap.go:203: // Annex #29, “Unicode Text Segmentation.” For each word
boundary, find the
On 2014/09/12 09:58:10, nigeltao wrote:
> Change the “smart quotes” to "regular quotes"?

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode269
cases/casemap.go:269: c.dst[p+1]-- // σ → ς
On 2014/09/12 09:58:09, nigeltao wrote:
> Do you know that p+1 is in bounds?

Yes, although it is subtle. 
1) both sigmas are two bytes in size. So if there is enough space to completely
write sigma, p+1 is valid.
2) If not, nextRune and initRune will not be called and the for loop will be
skipped (c.done() is true).
3) This means that c.info still contains the info from the sigma.
4) sigma is cased, so the if block in question will be skipped.
Q.E.D.

Added comments to clarify and more tests to verify.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode289
cases/casemap.go:289: r, _ := utf8.DecodeRune(c.src[c.pSrc:])
On 2014/09/12 09:58:10, nigeltao wrote:
> What happens if the src is invalid UTF-8?

Nothing, this is fine. It will return FFFD which doesn't map to any of the
special runes. It will then do a trie lookup which will, once again, detect it
is an invalid rune. Invalid runes have a zero CCC, so it will return soon
thereafter.

Note that the Decoding result is never used for iteration increment.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode297
cases/casemap.go:297: // Take the properties of the uppercased rune that is
already written to the
On 2014/09/12 09:58:10, nigeltao wrote:
> I'm not overly familiar with Unicode. I'm guessing that this code here is
> covered by the test cases in casemap_test.go, but I think it would be helpful
if
> you gave an example of what this code does, such as:
> // This handles converting "Οδός" to "ΟΔΟΣ".
> or whatever the relevant test case is.
> 
> Maybe that's already covered by the "From CLDR:" comment above, but I must
> confess that I don't grok it.
> 
> Similarly with all the other non-trivial functions in this file.

Done.

Yeah, the CLDR syntax is a bit special.  They are production rules. The stuff
before { and after } is merely context for the production rules to trigger.
Production rules need to be applied to newly replaced text. The square brackets
are used to indicate sets of runes.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap.go#newcode314
cases/casemap.go:314: for ; !c.done() && i <= maxIgnorable; i++ {
On 2014/09/12 09:58:10, nigeltao wrote:
> When comparing against maxIgnorable, sometimes you use < and sometimes <=.
That actually was because of a bug! Fixed that. 

> Similarly when comparing > or >=. It'd be nice to be consistent.
Moved some of these comparisons to the for loop. However, in both cases I
actually need to check for maxIgnorables-1.  So I did <= maxIgnorables-1 instead
of < maxIgnorables.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go
File cases/casemap_test.go (right):

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:32: func (tc *testCase) mkUpper(tag language.Tag) Caser {
On 2014/09/12 09:58:11, nigeltao wrote:
> I'm not really convinced that you need these testCase methods. (See below).

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:44: // testCases contains test cases
On 2014/09/12 09:58:11, nigeltao wrote:
> The redundant comment seems redundant. I'd delete it.

:) Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:46: {
On 2014/09/12 09:58:11, nigeltao wrote:
> If a test fails, below, you say
> t.Errorf("%d: etc", i, etc)
> Thus, it might be nice here if the
> {
> on this line was
> 0: {
> and similarly below: "1: {", "2: {", etc., so that if a test fails, it's easy
to
> see which one it is.

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:56: "und",
On 2014/09/12 09:58:10, nigeltao wrote:
> As you sometimes use named fields, I'd be consistent and always use them:
> lang: "und",
> src: "etc",
> etc
> You could then drop the "options{}," lines.

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:178: for i, tt := range testCases {
On 2014/09/12 09:58:10, nigeltao wrote:
> I'd s/tt/tc/

Why exactly? I know that tc matches testCases nicely. However, tt is also a
natural choice and seems by far the most common one in core. Only if the var
iterated upon is called testCases is the use of tc slightly more prevalent.

So is it because it is a new standard we're converting to, because in this case
particular it is called testCases or for other reasons?

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:186: testEntry := func(name string, mk func(language.Tag)
Caser, gold interface{}) {
On 2014/09/12 09:58:10, nigeltao wrote:
> Instead of
> mk func(language.Tag) Caser
> I'd have
> mk func(language.Tag, options) transform.Transformer
> 
> and then
> 
> c := mk(tag)
> becomes
> c := Caser{mk(tag, tt.opts)}
> 
> below,
> 
> tt.mkUpper
> becomes
> makeUpper

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:189: wanted, ok := gold.([]string)
On 2014/09/12 09:58:11, nigeltao wrote:
> I'd s/wanted/wants/

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:236: const minBufSize = norm.MaxSegmentSize
On 2014/09/12 09:58:11, nigeltao wrote:
> If this constant is only used by TestShortBuffersAndOverflow, I'd move it into
> that function's body.

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/casemap_test.go#ne...
cases/casemap_test.go:397: func goldBench(b *testing.B, s string, f func(b
[]byte) []byte) {
On 2014/09/12 09:58:11, nigeltao wrote:
> I'd rename goldBench and doBench to be benchFunc and benchTransformer, and
also
> make their argument order consistent.

Done.

https://codereview.appspot.com/122420043/diff/100001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode12
cases/context.go:12: import (
On 2014/09/12 09:58:11, nigeltao wrote:
> Separate the standard library imports from the others:
> 
> import (
>   "unicode"
>   "unicode/utf8"
> 
>   "code.google.com/p/go.text/transform"
> )

Sorry. Done.

https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode26
cases/context.go:26: // ignorables as lookahead (analogous to the limit in norm)
and to use state if
On 2014/09/12 09:58:11, nigeltao wrote:
> IANAUnicodeExpert:
IANAUE henceforth. :)

> is "case ignorables" a Unicode term?
Yep.

> Can I more or less think
> of it as "rune" or can it mean something different?
It is a rune with a derived property. See Case_Ignorable in
http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt.

https://codereview.appspot.com/122420043/diff/100001/cases/context.go#newcode278
cases/context.go:278: // write exception bits.
On 2014/09/12 09:58:11, nigeltao wrote:
> s/bits/bytes/ ?

The whole comment made not sense. Replaced it.

Sign in to reply to this message.

nigeltao

I dropped a few more comments, but I have a bigger design question: Currently, a ...

10 years, 5 months ago (2014-09-22 09:57:37 UTC) #9

I dropped a few more comments, but I have a bigger design question:

Currently, a context peeks *ahead* of pSrc: you have separate initRune, nextRune
and done methods, and c.src[c.pSrc:c.pSrc+c.sz] contains the next rune. It's
also a little weird how you've written both
done := c.atEOF && c.pSrc == len(c.src)
and separately:
func (c *context) done() bool {
  return c.pSrc >= len(c.src) || c.err != nil
}
and the two definitions of doneness differ.

What if there was only one method instead, called next, which returns (ok bool)
and if ok, the rune just read is *behind* pSrc, in c.src[c.pSrc-c.sz:c.pSrc].
Like a bufio.Scanner, the idiom becomes:
for c.next() {
  // Do stuff with the rune just read.
}

I think this simplifies a few things. I prototyped what upper casing would look
like with this design in https://codereview.appspot.com/142380043

I'm not 100% sure that I'm doing the right thing wrt Unicode in that CL, but the
tests pass.

WDYT?

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode322
cases/casemap.go:322: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342,
0x0308, 0x0304, 0x0345:
Add comments like "U+0300 COMBINING GRAVE ACCENT"?

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode432
cases/casemap.go:432: if softDotted {
I'd say
if !softDotted {
  return
}
and outdent the rest.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433
cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; i++ {
Just double-checking... I'm still not sure why there's a "-1" in
"maxIgnorable-1". Not that I know what an ignorable is, but I'm just pattern
matching how other comparisons are "i <= maxIgnorable".

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode434
cases/casemap.go:434: switch ccc := c.info.cccType(); {
switch c.info.cccType() {

https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go
File cases/casemap_test.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne...
cases/casemap_test.go:19: src   interface{} // string, []string, or nil to skip
test
src is always a string, right?

https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne...
cases/casemap_test.go:175: t.Errorf("%d:%s:%s.String(%q):\n\tgot  %+q;\n\twant
%+q", i, lang, name, src[j], got, want)
The "\t"s should be unnecessary, as "go test" should indent.

https://codereview.appspot.com/122420043/diff/120001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode46
cases/context.go:46: func makeContext(dst, src []byte, atEOF bool) context {
Just a thought: instead of returning a context (which isn't a large struct, but
it's not a small struct), what about

func (c *context) reset(dst, src []byte, atEOF bool) {
  *c = context{ etc }
  c.initRune()
}

and in casemap.go, embed the context:

type upperCaser struct{}

becomes

type upperCaser struct{
  context
}

and similarly for simpleCaser, lowerCaser, titleCaser.

https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode104
cases/context.go:104: if done := c.atEOF && c.pSrc == len(c.src); done {
It may be worth factoring out a method
func (c *context) doneAndAtEOF() bool {
  return c.atEOF && c.pSrc == len(c.src)
}
and use it both here and above, in context.ret.

https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode105
cases/context.go:105: return true, c.pDst, c.pSrc, nil
Here, we are atEOF, so can we replace
return true, etc
with
return c.doTitle, etc
since there shouldn't be any more source bytes anyway?

Or is this necessary because Transformers don't have an explicit Reset method?
Still, as I think I've previously said, "callers should read until EOF to reset"
seems fragile, especially if we can chain Transformers and possibly return
early.

If we do that, there's no need for a separate retTitle method, and instead, the
caller can say:

t.doTitle = c.doTitle
return c.ret()

Sign in to reply to this message.

mpvl

Some comments for now. Still working on this, including reworking checkpointing etc. to allow your ...

10 years, 5 months ago (2014-09-24 13:08:18 UTC) #10

Some comments for now. Still working on this, including reworking checkpointing
etc. to allow your simplification proposal.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode322
cases/casemap.go:322: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342,
0x0308, 0x0304, 0x0345:
On 2014/09/22 09:57:36, nigeltao wrote:
> Add comments like "U+0300 COMBINING GRAVE ACCENT"?

will do.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433
cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; i++ {
On 2014/09/22 09:57:36, nigeltao wrote:
> Just double-checking... I'm still not sure why there's a "-1" in
> "maxIgnorable-1". Not that I know what an ignorable is, but I'm just pattern
> matching how other comparisons are "i <= maxIgnorable".

As usual, we are matching a maximum number of ignorables before matching the
\u0307. However, the latter is also an ignorable, so to keep the total number
consistent, we have to subtract 1.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode434
cases/casemap.go:434: switch ccc := c.info.cccType(); {
On 2014/09/22 09:57:36, nigeltao wrote:
> switch c.info.cccType() {

Done.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go
File cases/casemap_test.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne...
cases/casemap_test.go:19: src   interface{} // string, []string, or nil to skip
test
On 2014/09/22 09:57:36, nigeltao wrote:
> src is always a string, right?
No, either a space-separated list of strings or an actual list of strings.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap_test.go#ne...
cases/casemap_test.go:175: t.Errorf("%d:%s:%s.String(%q):\n\tgot  %+q;\n\twant
%+q", i, lang, name, src[j], got, want)
On 2014/09/22 09:57:36, nigeltao wrote:
> The "\t"s should be unnecessary, as "go test" should indent.

Done.

https://codereview.appspot.com/122420043/diff/120001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode104
cases/context.go:104: if done := c.atEOF && c.pSrc == len(c.src); done {
On 2014/09/22 09:57:36, nigeltao wrote:
> It may be worth factoring out a method
> func (c *context) doneAndAtEOF() bool {
>   return c.atEOF && c.pSrc == len(c.src)
> }
> and use it both here and above, in context.ret.

As retTitle is gone, I guess it is not that useful anymore.

https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode105
cases/context.go:105: return true, c.pDst, c.pSrc, nil
On 2014/09/22 09:57:36, nigeltao wrote:
> Here, we are atEOF, so can we replace
> return true, etc
> with
> return c.doTitle, etc
> since there shouldn't be any more source bytes anyway?
> 
> Or is this necessary because Transformers don't have an explicit Reset method?
> Still, as I think I've previously said, "callers should read until EOF to
reset"
> seems fragile, especially if we can chain Transformers and possibly return
> early.
> 
> If we do that, there's no need for a separate retTitle method, and instead,
the
> caller can say:
> 
> t.doTitle = c.doTitle
> return c.ret()

Done. Was indeed to work around not having an explicit reset method.

Sign in to reply to this message.

nigeltao

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433 cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; ...

10 years, 4 months ago (2014-09-25 00:06:20 UTC) #11

mpvl

Did another iteration incorporating many of your ideas from CL 142380043, but taking a different ...

10 years, 4 months ago (2014-09-25 19:37:38 UTC) #12

Did another iteration incorporating many of your ideas from CL 142380043, but
taking a different approach to checkpointing. The mapping functions now returns
whether the result was "short", including whether future runes may change what
has been written. This is then used my the main Transformer implementations to
do the checkpointing. This localizes things a bit more.

Also simplified things a bit as a result of the Reset. I would like to have a
reentrant Upper and Lower caser as these are the dominant use cases and this
will save allocs. But other than that I've embedded the context in the caser
types.

Let me know if you would like me to only check in uppercase first or create a
new CL. At least I will move the casemap* files to mappings*, but wanted to keep
it for now to keep the CL diffs and comments.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode322
cases/casemap.go:322: case 0x0313, 0x0314, 0x0301, 0x0300, 0x0306, 0x0342,
0x0308, 0x0304, 0x0345:
On 2014/09/22 09:57:36, nigeltao wrote:
> Add comments like "U+0300 COMBINING GRAVE ACCENT"?

Done.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode432
cases/casemap.go:432: if softDotted {
On 2014/09/22 09:57:36, nigeltao wrote:
> I'd say
> if !softDotted {
>   return
> }
> and outdent the rest.

Done.

https://codereview.appspot.com/122420043/diff/120001/cases/casemap.go#newcode433
cases/casemap.go:433: for i := 0; !c.done() && i <= maxIgnorable-1; i++ {
On 2014/09/24 13:08:18, mpvl wrote:
> On 2014/09/22 09:57:36, nigeltao wrote:
> > Just double-checking... I'm still not sure why there's a "-1" in
> > "maxIgnorable-1". Not that I know what an ignorable is, but I'm just pattern
> > matching how other comparisons are "i <= maxIgnorable".
> 
> As usual, we are matching a maximum number of ignorables before matching the
> \u0307. However, the latter is also an ignorable, so to keep the total number
> consistent, we have to subtract 1.

added comment

https://codereview.appspot.com/122420043/diff/120001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/120001/cases/context.go#newcode46
cases/context.go:46: func makeContext(dst, src []byte, atEOF bool) context {

Now we have a reset this seemsprevent an allocation where this applies and it
avoids the need for saving state for titlecase. It just makes some casers are
currently reentrant no longer so. This is not necessarily a bad thing, though.
Will investigate.

Sign in to reply to this message.

nigeltao

https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode25 cases/casemap.go:25: type MapFunc func(*context) bool s/MapFunc/mapFunc/ unless there's a reason ...

10 years, 4 months ago (2014-09-29 07:15:14 UTC) #13

mpvl

https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/160001/cases/casemap.go#newcode25 cases/casemap.go:25: type MapFunc func(*context) bool On 2014/09/29 07:15:14, nigeltao wrote: ...

10 years, 4 months ago (2014-10-06 09:41:16 UTC) #14

nigeltao

In an effort to cut this CL into managable chunks, I split off the Unicode ...

10 years, 4 months ago (2014-10-08 08:34:07 UTC) #15

mpvl

SGTM. I'll remove the generation code from this CL once the other one is checked ...

10 years, 4 months ago (2014-10-08 09:29:29 UTC) #16

nigeltao

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode52 cases/casemap.go:52: // Transformers and safe a bit on runtime allocations. ...

10 years, 4 months ago (2014-10-13 06:37:43 UTC) #17

nigeltao

On 2014/10/08 09:29:29, mpvl wrote: > SGTM. I'll remove the generation code from this CL ...

10 years, 4 months ago (2014-10-13 06:38:30 UTC) #18

mpvl

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go File cases/casemap.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode24 cases/casemap.go:24: // cause written bytes to be modified (analogous to ...

10 years, 4 months ago (2014-10-13 18:36:55 UTC) #19

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode24
cases/casemap.go:24: // cause written bytes to be modified (analogous to
ErrShortSrc).
Updated comment to reflect reality: mapFunc doesn't have to return false in case
of a ErrShortDst.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode52
cases/casemap.go:52: // Transformers and safe a bit on runtime allocations.
On 2014/10/13 06:37:42, nigeltao wrote:
> s/safe/save/

Done.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode140
cases/casemap.go:140: // Standard upper case does not need any look ahead so we
can safely not use
On 2014/10/13 06:37:42, nigeltao wrote:
> You usually spell "lookahead" without the space.

Done.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode176
cases/casemap.go:176: // state inter-word
On 2014/10/13 06:37:42, nigeltao wrote:
> Why isn't inter-word vs first vs mid-word state not stored in the context, the
> same way as the c.isMidWord field?
For lowercase it is not necessary to save the state. This allows me to create a
lowercaser for the root locale without allocating memory (this is not done yet,
but would be similar to the root uppercaser).

> 
> Would it be better to make the state machine explicit instead of being implied
> by gotos?
I think gotos is often clearer in this case and gives also a better handle on
performance (less unnecessary ifs).

However, I changed it to an implementation without gotos. As there are really
only two states, I just use an if statement. The performance is roughly equal to
the previous implementation (seems about 0.5-1% slower, but hard to say). An
advantage of this implementation is that it is closer to an implementation that
stores the state and checkpoints more often.

> 
> for {
>   switch c.lowerCaserState:
>   case lcsInterWord:
>     etc
>   case lcsFirst:
>     etc
>   case lcsMidWord:
>     etc
>   }
> }
> return c.ret()

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode349
cases/casemap.go:349: c.writeString("ς")
On 2014/10/13 06:37:42, nigeltao wrote:
> if !c.writeString("ς") {
>   return false
> }
> 
> Similarly elsewhere in this file.

I realize my comment for mapFunc was not entirely correct. It is not necessary
to return false in case of a ErrShortDst.
I prefer no to add the if statements as it  just adds additional checks and,
making the code messier and possibly causing a performance hit (the check is
likely to be far more often true than false for reasonable dst buffer sizes).

However, I've at least made it such that if a write fails it will return false
as well (though not strictly necessary, it helps to leave a casing loop early).
In this particular case, though, it is not necessary, as call to next that
follows will return false if there was an error.

I would like to keep this check in next and checkpoint as this means I'm not
relying on not missing out on an if statement. In general, it makes it much
easier to reason about the correctness of the code. 

Let me know if you still strongly feel about this check given that it is not
necessary that for a mapFunc to return false in case of a ErrShortDst and that
in this case (and others) it will return false anyway.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode361
cases/casemap.go:361: c.copy()
On 2014/10/13 06:37:42, nigeltao wrote:
> if !c.copy() {
>   return false
> }
> 
> Similarly elsewhere in this file.

Same as before. This is not necessary: not a requirement to return false if this
fails, plus it will do so anyway as it is immediately followed by a next().

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode402
cases/casemap.go:402: c.writeBytes(b[:sz])
On 2014/10/13 06:37:42, nigeltao wrote:
> if !c.writeBytes(b[:sz]) {
>   return false
> }
> 
> Similarly elsewhere in this file.

Acknowledged.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode536
cases/casemap.go:536: for ; i <= maxIgnorable-1 && c.next(); i++ {
I've simplified this now. The choice of maxIgnorable is somewhat arbitrary. So I
now say that both input and output should not span maxIgnorable, which is also a
reasonable choice. This makes all checks identical except for finalSigma, where
the check involves checking ignorables followed by a cased letter, rather than
another ignorable.

Anyway, I hope it is clearer now. I've also simplified the loop for finalSigma
and added a comment there to explain the difference in treatment.

On 2014/10/13 06:37:42, nigeltao wrote:
> I'm still confused. You subtract 1, but you also say <= instead of <, which
> means that this loop still executes maxIgnorable times. If that's really what
> you want, idiomatic is checking "< N" instead of "<= N-1". How about?
> 
> // We check for up to maxIgnorable ignorables, up to and including the
> // U+0307 COMBINING DOT ABOVE, which is itself an ignorable.
> for i := 0; i < maxIgnorable && c.next(); i++ {

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode574
cases/casemap.go:574: return i > maxIgnorable
On 2014/10/13 06:37:42, nigeltao wrote:
> How can this be true?

Good catch. It is safe to always return false here, it is just not the most
efficient as it would result in a ErrShortSrc unnecessarily. That's why the
tests still passed.

I've modified the tests to check the correct errors codes are returned.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode616
cases/casemap.go:616: for ; i <= maxIgnorable-1 && c.next(); i++ {
On 2014/10/13 06:37:42, nigeltao wrote:
> Ditto.

Done.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode622
cases/casemap.go:622: return c.writeBytes(c.src[start:c.pSrc]) // ignore u+0307
On 2014/10/13 06:37:42, nigeltao wrote:
> s/u/U/

Done.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode634
cases/casemap.go:634: if i > maxIgnorable {
On 2014/10/13 06:37:42, nigeltao wrote:
> Ditto.

Done.

https://codereview.appspot.com/122420043/diff/180001/cases/casemap.go#newcode637
cases/casemap.go:637: c.writeString("ı")
On 2014/10/13 06:37:42, nigeltao wrote:
> return ok && c.writeString(etc) && c.writeBytes(etc)

This is incorrect: returning false means that that it is not safe to checkpoint
as the written output may change if we have more bytes. ok == false means we ran
out of source bytes. This still is okay if atEOF == true. So we need to write
the bytes even if ok == false. It is fine checking for ok last, though.

https://codereview.appspot.com/122420043/diff/180001/cases/cases.go
File cases/cases.go (right):

https://codereview.appspot.com/122420043/diff/180001/cases/cases.go#newcode33
cases/cases.go:33: // A Caser may be stateful and should therefore not be shared
between go
On 2014/10/13 06:37:42, nigeltao wrote:
> "goroutines" is one word.

Done.

https://codereview.appspot.com/122420043/diff/180001/cases/cases.go#newcode70
cases/cases.go:70: // No options (yet) for lowercase.
I do (e.g. ignoring the final sigma handling), but I've removed the comment.

On 2014/10/13 06:37:42, nigeltao wrote:
> I'm not sure why there's a comment here, but not a corresponding one in Upper
or
> Title. Do you have options in mind?

https://codereview.appspot.com/122420043/diff/180001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode12
cases/context.go:12: // writing to a a destination buffer.
On 2014/10/13 06:37:42, nigeltao wrote:
> Typo in "a a".

Done.

https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode22
cases/context.go:22: dst, src []byte
It makes the struct about 7% bigger. This difference is measurable but, albeit
hardly. Changed it.

On 2014/10/13 06:37:42, nigeltao wrote:
> It might be worth pulling up atEOF up here:
> 
> type context struct {
>   // dst, src and atEOF are the arguments to Transform.
>   dst, src []byte
>   atEOF bool
> 
>   pDst etc
>   etc
> }
> 
> Otherwise, it looks odd how context.Reset sets c.isMidWord but not c.atEOF.

https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32
cases/context.go:32: info      caseInfo // case information of currently scanned
rune
On 2014/10/13 06:37:42, nigeltao wrote:
> Maybe s/info/caseInfo/

Done. Although I'm a bit worried we may end up with 2 tries at some point. We'll
see. For now we're good.

https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode57
cases/context.go:57: if ok && c.err == nil {
On 2014/10/13 06:37:42, nigeltao wrote:
> Can this just be
> if ok {
This would require mapFuncs to always consistently return false on ErrShortDst. 
With this check it doesn't matter. 
I like the fact that I can guarantee no checkpoint is written after a write
failure without looking at every specific mapFunc implementation.

https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode159
cases/context.go:159: func caseType(c *context) caseInfo {
On 2014/10/13 06:37:42, nigeltao wrote:
> s/function/method/ ?

Done.

Sign in to reply to this message.

nigeltao

https://codereview.appspot.com/122420043/diff/180001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32 cases/context.go:32: info caseInfo // case information of currently scanned rune ...

10 years, 4 months ago (2014-10-21 05:42:01 UTC) #20

mpvl

https://codereview.appspot.com/122420043/diff/180001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32 cases/context.go:32: info caseInfo // case information of currently scanned rune ...

10 years, 4 months ago (2014-10-21 15:15:50 UTC) #21

https://codereview.appspot.com/122420043/diff/180001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/180001/cases/context.go#newcode32
cases/context.go:32: info      caseInfo // case information of currently scanned
rune
On 2014/10/21 05:42:00, nigeltao wrote:
> On 2014/10/13 18:36:54, mpvl wrote:
> > On 2014/10/13 06:37:42, nigeltao wrote:
> > > Maybe s/info/caseInfo/
> > 
> > Done. Although I'm a bit worried we may end up with 2 tries at some point.
> We'll
> > see. For now we're good.
> 
> Heh, I meant to change the other way, but c'est la vie. It's fine as "info
> info".

Acknowledged.

https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode22
cases/casemap.go:22: // version to the same context. It may advance the context
to the next rune. It
On 2014/10/21 05:42:00, nigeltao wrote:
> I found "It returns false when future runes may cause written bytes to be
> modified (analogous to ErrShortSrc)" hard to understand. How about:
> 
> // etc. It
> // returns whether a checkpoint is possible: whether the pDst bytes written to
> // dst so far won't need changing as we see more source bytes.

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode374
cases/casemap.go:374: // A case ignorable may also introduce a word break, so me
may need
On 2014/10/21 05:42:00, nigeltao wrote:
> s/me/we/

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go
File cases/casemap_test.go (right):

https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go#ne...
cases/casemap_test.go:226: const minBufSize = norm.MaxSegmentSize
I use Segment in norm to refer to a starter followed by a sequence of
non-starters, representing a single character. Character is somewhat overloaded
and ambiguous term, so we need something else. There is not really a word for
this in the Unicode standard, as far as I can tell. The closest thing to it is
probably Grapheme Cluster (e.g. referred to in the Segmentation annex), but this
is not necessarily the same, so I used the more generic term Segment.

On 2014/10/21 05:42:00, nigeltao wrote:
> Out of scope of this CL, but I saw that norm.MaxSegmentSize doesn't have a doc
> comment.
> 
> An ignorant-of-Unicode question, but does "segment" here have a precise
meaning,
> or is simply the general-English meaning synonymous with "chunk", "part" or
> "piece"?

https://codereview.appspot.com/122420043/diff/220001/cases/casemap_test.go#ne...
cases/casemap_test.go:441: t.Errorf("%d:%s: error was %v; want %v", i, tt.desc,
err, tt.firstErr)
On 2014/10/21 05:42:00, nigeltao wrote:
> It can be hard to tell where the test description end and the error message
> begins, since the description also contains ":". How about replacing "%d:%s: "
> with "%d:%s:\n", and similarly below.

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode50
cases/context.go:50: if c.atEOF && c.pSrc == len(c.src) || c.nSrc == len(c.src)
{
On 2014/10/21 05:42:00, nigeltao wrote:
> Does
> c.pSrc
> need to be
> c.pSrc+c.sz
> here and on the next line??

Conceptually, yes. However, this point will only be reached if there is no
ErrShortDst. This means that the source was exhausted, and that c.sz must be 0.

It is fine to add the c.sz in both cases, though, so if you want I can still do
that.

https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode51
cases/context.go:51: return c.pDst, c.pSrc, nil
As c.pSrc increases monotonically, the invariant c.pSrc >= c.nSrc holds. So if
c.nDst == len(c.src), c.pSrc == len(c.src) must be true as well, as well as
c.nSrc == c.pSrc.

But I'm fine with this change.

On 2014/10/21 05:42:00, nigeltao wrote:
> If we get here by the second condition (c.nSrc == len(c.src)), are you sure
that
> we want c.pDst instead of c.nDst?? How about:
> 
> // ret returns the Transform method's return values.
> func (c *context) ret() (nDst, nSrc int, err error) {
> 	// If we hit an error, or if the checkpoint was at the end of the source
> 	// data, then return the checkpoint.
> 	if c.err != nil || c.nSrc == len(c.src) {
> 		return c.nDst, c.nSrc, c.err
> 	}
> 	// There is an implicit checkpoint if we are at the end of the source data
> 	// and that source data is atEOF.
> 	if c.atEOF && c.pSrc == len(c.src) {
> 		return c.pDst, c.pSrc, nil
> 	}
> 	// nSrc < len(c.src), so return ErrShortSrc.
> 	return c.nDst, c.nSrc, transform.ErrShortSrc
> }

https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode159
cases/context.go:159: // caseType returns a info with only the case bits,
normalized to either
On 2014/10/21 05:42:00, nigeltao wrote:
> s/a info/an info/

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go
File cases/context_test.go (right):

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:17: const lastRune = rune(0x1FFFF)
On 2014/10/21 05:42:01, nigeltao wrote:
> Delete this, use lastRuneForTesting instead.

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:68: 
On 2014/10/21 05:42:01, nigeltao wrote:
> Move the contextFromRune definition up here, since it's used by
> TestCaseProperties and other tests up here, not just TestContext?

done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:91: func TestMapping(t *testing.T) {
On 2014/10/21 05:42:01, nigeltao wrote:
> If applyFunc is used only by this test, I'd make it local to this test (and
> s/applyFunc/apply/):
> 
> func TestMapping(t *testing.T) {
>   apply := func(r rune, f func(c *context) bool) string {
>     etc
>   }
>   etc
> }

I'm getting a bit mixed recommendations from reviewers here, but I'm fine with
what your suggestion. Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:105: if _, ok := special[r]; ok {
On 2014/10/21 05:42:01, nigeltao wrote:
> Isn't this just
> for r := range special {
> modulo map iteration order? If you really want a stable order (even though
> testing a rune should be independent), you could accumulate special's keys
> beforehand and sort them.

No. It is the inverse set of the runes in special.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:110: t.Errorf("lower:%q (%U): got %q %U; want %q %U", r,
rune(r), got, []rune(got), want, []rune(want))
On 2014/10/21 05:42:00, nigeltao wrote:
> s/rune(r)/r/ and similarly below.

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:160: 
On 2014/10/21 05:42:00, nigeltao wrote:
> Delete the blank line.

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:174: atEOF    = true
On 2014/10/21 05:42:01, nigeltao wrote:
> Having these in global scope seems weird, and while everything's actually
> correct, it seems potentially confusing how the atEOF parameter to the
> Caser.Transform method shadows this global constant. I'd delete these
constants
> (see below).

Done.  Added field names in the literal struct, though, to clarify what the
values mean.  done turned out to be unused (more or less replaced by err).

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:190: src, out string
On 2014/10/21 05:42:01, nigeltao wrote:
> Since your slice literal's elements are one-field-per-line, I'd split this
line:
> src string
> out string

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:194: prefix   prefix
On 2014/10/21 05:42:01, nigeltao wrote:
> I'm not convinced that the "type prefix struct" is worth it. This field could
> easily be two:
> prefixArg  string
> prefixWant bool

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:195: op       []func()
On 2014/10/21 05:42:00, nigeltao wrote:
> I'd pluralize the name of a slice: s/op/ops/
> 
> Alternatively, instead of a []func(), you could have this field be a string,
> like "next;next;checkpoint;next", and below, instead of
> 
> for _, f := range tt.op {
>   f()
> }
> 
> you'd have:
> 
> for _, op := range strings.Split(tt.ops, ";") {
>   switch op {
>   case "next":
>     c.next()
>   case "checkpoint":
>     c.checkpoint()
>   etc
>   }
> }
> 
> You could then delete the "var c context" above.

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:197: {
On 2014/10/21 05:42:00, nigeltao wrote:
> BTW, I like to indent slice-of-struct literals like this:
> 
> tests := []struct {
>   etc
> }{{
>   etc
> }, {
>   etc
> }, {
>   etc
> }}
> 
> as the average indentation is one less, and the struct definition fields
> vertically align with the struct value fields.
> 
> Similarly in the other _test.go files.

Interesting. Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:200: atEOF,
On 2014/10/21 05:42:00, nigeltao wrote:
> Instead of an atEOF named constant, I'd name the fields (which also helps e.g.
> remember which field is src and which is out):
> 
> dstSize: 10,
> atEOF: true,
> src: "12",
> etc.

Done.

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:206: prefix{"x", false},
On 2014/10/21 05:42:01, nigeltao wrote:
> I'm not sure why there's an "x" here. I'd expect either {"", true} or {"1",
> false}.

added comment

https://codereview.appspot.com/122420043/diff/220001/cases/context_test.go#ne...
cases/context_test.go:445: t.Errorf("%s: prefix was %v; want %v", tt.desc, got,
tt.prefix.res)
On 2014/10/21 05:42:01, nigeltao wrote:
> It can be hard to tell where the test description end and the error message
> begins, since the description also contains ":". How about replacing "%s: "
with
> "%s:\n", and similarly below.

Done.

Sign in to reply to this message.

nigeltao

I ran out of time today, and haven't looked closely at the latest round, but ...

10 years, 4 months ago (2014-10-22 06:52:58 UTC) #22

mpvl

https://codereview.appspot.com/122420043/diff/220001/cases/context.go File cases/context.go (right): https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode50 cases/context.go:50: if c.atEOF && c.pSrc == len(c.src) || c.nSrc == ...

10 years, 4 months ago (2014-10-22 12:19:53 UTC) #23

https://codereview.appspot.com/122420043/diff/220001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode50
cases/context.go:50: if c.atEOF && c.pSrc == len(c.src) || c.nSrc == len(c.src)
{
On 2014/10/22 06:52:58, nigeltao wrote:
> On 2014/10/21 15:15:49, mpvl wrote:
> > On 2014/10/21 05:42:00, nigeltao wrote:
> > > Does
> > > c.pSrc
> > > need to be
> > > c.pSrc+c.sz
> > > here and on the next line??
> > 
> > Conceptually, yes. However, this point will only be reached if there is no
> > ErrShortDst. This means that the source was exhausted, and that c.sz must be
> 0.
> > 
> > It is fine to add the c.sz in both cases, though, so if you want I can still
> do
> > that.
> 
> I'd rather add c.sz in both cases, because I still think that it'll be cleaner
> if pSrc points after instead of before the rune read by c.next, but I can try
> that out in a follow-up CL to see if it actually simplifies things.

Not sure if you saw the comment I added. Adding c.sz does seem to degrade
performance a small amount (about 1-3%, using BenchmarkSigmaLower, consistent
over 10 measurements), so I rather not. If you feel strongly about it, though,
I'll add the c.sz.

https://codereview.appspot.com/122420043/diff/220001/cases/context.go#newcode51
cases/context.go:51: return c.pDst, c.pSrc, nil
On 2014/10/22 06:52:58, nigeltao wrote:
> On 2014/10/21 15:15:49, mpvl wrote:
> > As c.pSrc increases monotonically, the invariant c.pSrc >= c.nSrc holds. So
if
> > c.nDst == len(c.src), c.pSrc == len(c.src) must be true as well, as well as
> > c.nSrc == c.pSrc.
> 
> I get that c.pSrc == len(c.src) implies c.nSrc == len(c.src) but I'm worried
> about dst pointers, not source pointers, and whether returning c.pDst, etc
> instead of c.nDst etc is correct if we catch the c.nSrc == len(c.src) case.

Same logic would apply. Either way, I moved the check as per your suggestion.

Sign in to reply to this message.

nigeltao

LGTM, but wait for r. Thanks for your patience with this code review. One last ...

10 years, 4 months ago (2014-10-23 06:29:44 UTC) #24

mpvl

Thanks for the very thorough review! It was worth the wait. :) https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go File cases/casemap.go ...

10 years, 3 months ago (2014-10-27 12:38:31 UTC) #25

Thanks for the very thorough review! It was worth the wait. :)

https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/220001/cases/casemap.go#newcode374
cases/casemap.go:374: // A case ignorable may also introduce a word break, so me
may need
On 2014/10/23 06:29:44, nigeltao wrote:
> On 2014/10/21 15:15:49, mpvl wrote:
> > On 2014/10/21 05:42:00, nigeltao wrote:
> > > s/me/we/
> > 
> > Done.
> 
> Not done?

Take 2.

https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go
File cases/casemap.go (right):

https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode51
cases/casemap.go:51: // Many Uppercase mappers are reentrant, so we can
precompute the
On 2014/10/23 06:29:44, nigeltao wrote:
> s/Uppercase// if this comment also applies to lower casers?? If so, add a
blank
> line after the final "//" so that it doesn't look like the comment only
applies
> to the upperFunc variable on the subsequent line?
> 
> Also s/reentrant/stateless/ ?? It's not like an upper caser calls something
that
> calls back into that upper caser.

This only applies to some uppercase mappers. Lowercasers use simpleCaser. which
is not stateless. Changed reentrant to stateless.

https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode231
cases/casemap.go:231: p = t.rewrite(c)
On 2014/10/23 06:29:44, nigeltao wrote:
> I don't know if you have plans for other rewrites, but AFAICT t.rewrite can
> either be nil or afnlRewrite and afnlRewrite doesn't change c.info, so you
could
> change this to
> if t.rewrite != nil {
>   t.rewrite(c)
> }
> and rewrite changes from "func(*context) info" to "func(*context)". After
that,
> "rewrite" doesn't seem like quite the right name.

It was changing it before, but not anymore. Good catch. Removing it.

https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode234
cases/casemap.go:234: wasMid := c.info.isCaseIgnorableAndNonBreakStarter()
On 2014/10/23 06:29:44, nigeltao wrote:
> Do you mean s/c.info/p/ ??

Yes, changed.

https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode296
cases/casemap.go:296: ln := (e[0] >> lengthBits) & lengthMask
On 2014/10/23 06:29:44, nigeltao wrote:
> I'd s/ln/n/ unless there's some meaning for "ln" that I'm missing.

Done.

https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode483
cases/casemap.go:483: // modifier if the last rune of the decomposition is in
[0x300-0x311].
On 2014/10/23 06:29:44, nigeltao wrote:
> Unless I'm missing some subtlety, I'd change "0x300" to "\u0300" or "U+0300"
> since for me, "0x" means bytes.

Done.

https://codereview.appspot.com/122420043/diff/240001/cases/casemap.go#newcode497
cases/casemap.go:497: // that is less than 230 (Above). We will insert the
0x0307, if
On 2014/10/23 06:29:44, nigeltao wrote:
> Ditto.

Done.

https://codereview.appspot.com/122420043/diff/240001/cases/context.go
File cases/context.go (right):

https://codereview.appspot.com/122420043/diff/240001/cases/context.go#newcode149
cases/context.go:149: func (c *context) hasPrefix(s string) bool {
On 2014/10/23 06:29:44, nigeltao wrote:
> Is s always going to be a one-rune string? There may be an opportunity to
> optimize and/or simplify, but I guess that that can be a follow-up CL.

Yes, but we need a decoded rune, hence the string. Possibly faster, indeed,
would be to pass b [4]byte, size uint8, but I would consider that option later.

Sign in to reply to this message.

r

LGTM For future big packages like this, please make it easier for the reviewers by ...

10 years, 3 months ago (2014-11-05 23:42:48 UTC) #26

mpvl

Thanks for the review. I did write down the gist of the design a while ...

10 years, 3 months ago (2014-11-06 09:25:06 UTC) #27

mpvl

10 years, 3 months ago (2014-11-10 16:26:26 UTC) #28

*** Submitted as
https://code.google.com/p/go/source/detail?r=f230eca7a943&repo=text ***

go.text/cases: first stab at cases package.

Overview
Includes language-specific casing. Title caser uses a close
approximation of the Unicode Word Breaking Algorithm. Most
information needed for case mapping is stored in a single
trie.

Core files:
cases.go:    API
casemap.go:  per-language, high-level case mappings
context.go:  trie lookup wrapper and state management

LGTM=nigeltao, r
R=nigeltao, r
CC=golang-codereviews
https://codereview.appspot.com/122420043

Sign in to reply to this message.

Issue 122420043: code review 122420043: go.text/cases: first stab at cases package. (Closed)

Description

Patch Set 1 #

Patch Set 2 : diff -r d65bffbc88a1 https://code.google.com/p/go.text #

Patch Set 3 : diff -r d65bffbc88a1 https://code.google.com/p/go.text #

Patch Set 4 : diff -r d65bffbc88a1 https://code.google.com/p/go.text #

Patch Set 5 : diff -r a18e7d21c06d https://code.google.com/p/go.text #

Patch Set 6 : diff -r 12288f41f508 https://code.google.com/p/go.text #

Patch Set 7 : diff -r 12288f41f508 https://code.google.com/p/go.text #

Patch Set 8 : diff -r 024681b033be https://code.google.com/p/go.text #

Patch Set 9 : diff -r 024681b033be https://code.google.com/p/go.text #

Patch Set 10 : diff -r 024681b033be https://code.google.com/p/go.text #

Patch Set 11 : diff -r 1ac75970ff9e https://code.google.com/p/go.text #

Patch Set 12 : diff -r 1ac75970ff9e https://code.google.com/p/go.text #

Patch Set 13 : diff -r f8db539672d0 https://code.google.com/p/go.text #

Patch Set 14 : diff -r f8db539672d0 https://code.google.com/p/go.text #

Patch Set 15 : diff -r f8db539672d0 https://code.google.com/p/go.text #

Patch Set 16 : diff -r f8db539672d0 https://code.google.com/p/go.text #

Patch Set 17 : diff -r b8406a47db8d https://code.google.com/p/go.text #

Messages