|
|
Created:
9 years, 6 months ago by mpvl Modified:
9 years, 4 months ago Reviewers:
CC:
golang-codereviews, nigeltao, r Visibility:
Public. |
Descriptiongo.text/runes: added Rewrite functionality.
A Rewriter allows for fast prototyping of Transformers. It will not be as
fast as writing it out oneself, but will automatically handle many of the
intricate details of writing a Transformer defined on runes.
Patch Set 1 #Patch Set 2 : diff -r f8db539672d0 https://code.google.com/p/go.text #Patch Set 3 : diff -r f8db539672d0 https://code.google.com/p/go.text #
MessagesTotal messages: 12
Hello golang-codereviews@googlegroups.com, I'd like you to review this change to https://code.google.com/p/go.text
Sign in to reply to this message.
Ah, neither r or I had seen this yet, since we weren't explicitly added to the R= line. My immediate reaction is that I'm not sure if this prototyping-helper belongs in the repo. Will anyone other than you use it? My second reaction is that this makes me less comfortable about a "runes" package as the right place for this and the other functionality. I don't think "runes.Context" is the right name. It's not about runes, it's about implementing transform.Transformer. I haven't looked at the actual code yet, sorry. As r just said in the https://codereview.appspot.com/122420043/ code review, it might be quicker in the long run to get agreement on design and API first before putting in all the work to send out a 100% feature-complete 100% tested implementation ex nihilo.
Sign in to reply to this message.
On Thu, Nov 6, 2014 at 12:50 AM, <nigeltao@golang.org> wrote: > Ah, neither r or I had seen this yet, since we weren't explicitly added > to the R= line. > > My immediate reaction is that I'm not sure if this prototyping-helper > belongs in the repo. Will anyone other than you use it? > I looked at what kind of things people want to do with text and how they are using existing libs like Cocoa and for which there exist no easy way to do it in go.text. The examples in this CL are based on such cases. Maybe prototyping is not the right word. The "prototype" will typically roughly run at 4 times the running time of an optimized version, which is probably fine for most people. Writing out the transformers for such things can still be quite tricky. > My second reaction is that this makes me less comfortable about a > "runes" package as the right place for this and the other functionality. > That's why I wanted to send this out before checking in the other CL as I thought it might. :) > I don't think "runes.Context" is the right name. Wasn't able to come up with something better. Thought about having a Source and Destination, but found that splitting it in two arguments was too cumbersome and contrived. ReadWriter and Buffers are also a possibility, but don't capture it precisely. > It's not about runes, > it's about implementing transform.Transformer. > It also doesn't really fit in transform. Transformers are not specific to processing Unicode (e.g. see the encoding packages). The Rewriter only supports handling Unicode and automatically rewrites illegal UTF-8, for example. unicode might have been a better name for the package, but that will obviously conflict considering it is supposed to be used in unison > > I haven't looked at the actual code yet, sorry. As r just said in the > https://codereview.appspot.com/122420043/ code review, it might be > quicker in the long run to get agreement on design and API first before > putting in all the work to send out a 100% feature-complete 100% tested > implementation ex nihilo. > In this case the code isn't big, though, right? Getting the API right was by far the trickiest part here. Writing the code was fairly easy in comparison. The runes package was also special a bit because performance influenced the design (as the iterations are so tiny). For cases this is different. > > https://codereview.appspot.com/157620043/ >
Sign in to reply to this message.
On Thu, Nov 6, 2014 at 6:54 PM, <mpvl@golang.org> wrote: > On Thu, Nov 6, 2014 at 12:50 AM, <nigeltao@golang.org> wrote: >> My immediate reaction is that I'm not sure if this prototyping-helper >> belongs in the repo. Will anyone other than you use it? > > I looked at what kind of things people want to do with text and how they are > using existing libs like Cocoa and for which there exist no easy way to do > it in go.text. The examples in this CL are based on such cases. I don't find these examples compelling. It seems just as easy to use nothing but the standard library: http://play.golang.org/p/KdrVMX-lzN Sure, I don't expect programmers to write their own Shift-JIS decoder, but still, not everything has to be a transform.Transformer.
Sign in to reply to this message.
On Tue, Nov 11, 2014 at 7:31 AM, Nigel Tao <nigeltao@golang.org> wrote: > On Thu, Nov 6, 2014 at 6:54 PM, <mpvl@golang.org> wrote: > > On Thu, Nov 6, 2014 at 12:50 AM, <nigeltao@golang.org> wrote: > >> My immediate reaction is that I'm not sure if this prototyping-helper > >> belongs in the repo. Will anyone other than you use it? > > > > I looked at what kind of things people want to do with text and how they > are > > using existing libs like Cocoa and for which there exist no easy way to > do > > it in go.text. The examples in this CL are based on such cases. > > I don't find these examples compelling. It seems just as easy to use > nothing but the standard library: > http://play.golang.org/p/KdrVMX-lzN Not the same, as it will be slower and doesn't allow for streaming. But might work for most. > Sure, I don't expect programmers to write their own Shift-JIS decoder, > but still, not everything has to be a transform.Transformer. > If we assume users will near to never need to write a transformer, than this is fine. No problem waiting with this and see if demand may turn up in the future.
Sign in to reply to this message.
Why not just use strings.Map? -rob
Sign in to reply to this message.
On 2014/11/12 00:54:47, r wrote: > Why not just use strings.Map? The point of this CL was to allow people easily making transformers. Transformers allow for streaming, including getting a Reader and Writer for free and they can be Chain-ed together. If you mean runes.Map, you need a mechanism to discard that last bit of processing if your at the end of buffer, nut not atEOF. I tried this with Map, but found it cumbersome. There are obvious needs for the Map and Replace transforms in the runes package. People will use this in Chain. Map and Remove are not well-suited for all kind of transformations, though, so Rewrite fills the gap there. I implemented some cases of what people might want to do that are hard or impossible to do with Map or Replace. Granted, though, I have not seen obvious cases where such more complex transforms are needed in transforms, like with Map and Reduce. Of course people could simply write their own transformer if they need one. This can be a bit tricky for UTF-8, though, and there are several subtleties people need to take into account (when is an illegal byte at the end short vs it needing be replaced, etc.) I can wait with this CL until such need arises. Another reason to put this CL out is to force the naming debate. :) Nigel mentioned runes.Context makes no sense. It may be possible to generalize the Rewrite transform to more than UTF-8 and put it in transform. transform.Context is arguably better. If the later is done, though, it may be more consistent to put runes.Map and runes.Replace in transform as well. The question is then what names they should get to not be unclear? Also what to do with the Transform wrapper in that case?
Sign in to reply to this message.
On Wed, Nov 12, 2014 at 8:02 PM, <mpvl@golang.org> wrote: > Another reason to put this CL out is to force the naming debate. :) > Nigel mentioned runes.Context makes no sense. It may be possible to > generalize the Rewrite transform to more than UTF-8 and put it in > transform. transform.Context is arguably better. A better package name could be golang.org/x/text/transform/utf8transform but I still feel like I'm missing the bigger picture. You mentioned https://docs.google.com/document/d/1Q64ktYh7XptpEI3L2G7xYqsusohOeLUft865Zd7fb... recently but parts of it are out of date. Also, that document presents *a* design, but I still don't know the motivating use cases. For example, your example code in this CL escapes "\u1234" as `\U00001234` and unescapes the other way. Is that really something that programmers want to do or is that merely a simple API example? Even with something like casing, do programmers usually want to lower-case strings or io.Readers? If they usually work on words, lines or sentences at a time, then a string-based API seems fine. If they want to lower-case documents at a time, then io.Reader seems more appropriate. With the non-UTF-8 Unicode encodings like Shift-JIS, I can believe that it's documents at a time, so io.Reader and transform.Transformer seems appropriate. But for casing, I don't have the i18n experience to know whether or not programmers usually lower-case phrases at a time or documents at a time, so I don't know if it even makes sense for Casers to implement transform.Transformer, or whether things like your runes.Context or runes.Rewrite proposal makes sense. Those are the sort of design questions I would like discussed, whether by e-mail or by that document, before being presented with a 5,000 line CL or a list of 20 packages under golang.org/x/text as a fait accompli.
Sign in to reply to this message.
On Thu, Nov 13, 2014 at 1:08 AM, Nigel Tao <nigeltao@golang.org> wrote: > On Wed, Nov 12, 2014 at 8:02 PM, <mpvl@golang.org> wrote: > > Another reason to put this CL out is to force the naming debate. :) > > Nigel mentioned runes.Context makes no sense. It may be possible to > > generalize the Rewrite transform to more than UTF-8 and put it in > > transform. transform.Context is arguably better. > > A better package name could be > golang.org/x/text/transform/utf8transform > but I still feel like I'm missing the bigger picture. You mentioned > > https://docs.google.com/document/d/1Q64ktYh7XptpEI3L2G7xYqsusohOeLUft865Zd7fb... > recently but parts of it are out of date. Also, that document presents > *a* design, but I still don't know the motivating use cases. For > example, your example code in this CL escapes "\u1234" as `\U00001234` > and unescapes the other way. Is that really something that programmers > want to do or is that merely a simple API example? > Just a simplified example. > Even with something like casing, do programmers usually want to > lower-case strings or io.Readers? If they usually work on words, lines > or sentences at a time, then a string-based API seems fine. If they > want to lower-case documents at a time, then io.Reader seems more > appropriate. It is mostly chunks of text. So having a String or Bytes method is quite crucial. However, there are situations where streaming is appropriate: - casing sometimes needs to be used in conjunctions with other transforms, in which case transform.Chain makes it easy to construct new transforms. - for case folding (a variant of lower-casing), for example, one could use to implement search on a large file. In such a case using a streaming approach is useful. - not sure yet whether to put width-changing operations in cases or a separate package, but for these one can easily envision a use for both streaming and more per-segment processing. On a higher level. ICU provides a general transform mechanism <http://userguide.icu-project.org/transforms/general> that looks quite powerful. One often sees chains of more complex operations with some simple tweaks. The Go implementation was inspired by this, even though the implementation is radically different: In ICU there is a language for chaining transformers, in Go this is all done programmatically (e.g. using transform.Chain). The Go approach is a bit more verbose, but, imo, easier to use and much easier to extend. As an example: The canonical example in ICU: NFD; [:Nonspacing Mark:] Remove; NFC In Go now: transform.Chain(norm.NFD, transform.RemoveFunc(func(r rune) bool { return unicode.In(r, unicode.Mn) }), norm.NFC) or with the runes package and unicode package additions: transform.Chain(norm.NFD, runes.Remove(unicode.Mn.In), norm.NFC) or without the runes package: transform.Chain(norm.NFD, transform.RemoveFunc(unicode.Mn.In), norm.NFC) Anyway, it is true that for some transforms it makes more sense to use them in a streaming context and for some more on a chunk basis. However, being able to combine them is very useful and the fact you can do both for all is just a consequence of fitting them in this general mechanism. Moreover, it is a bit grey. A transform that is normally only useful in a chunk-at-a-time setting can all of a sudden require a streaming implementation if it is used as a small tweak in a streaming transform. That is a more important point than figuring out whether it makes sense to use casing in a streaming fashion or not. It is hard to predict how the user will end up using it. If you look at the link above you can see also why a Rewriter would be handy (especially the cases at the bottom), although transform may be a better place for it. I don't mind holding off on it until the rest is more clear. My main reason to post the Rewriter CL now was because I knew it would make the rest of the runes package more questionable (at least it did for me) and figured it is better to have this discussion now rather than later. The Remove, Map and ReplaceIllFormed funcs are important, but Rewriter is not. > With the non-UTF-8 Unicode encodings like Shift-JIS, I > can believe that it's documents at a time, so io.Reader and > transform.Transformer seems appropriate. But for casing, I don't have > the i18n experience to know whether or not programmers usually > lower-case phrases at a time or documents at a time, so I don't know > if it even makes sense for Casers to implement transform.Transformer, > or whether things like your runes.Context or runes.Rewrite proposal > makes sense. Those are the sort of design questions I would like > discussed, whether by e-mail or by that document, before being > presented with a 5,000 line CL or a list of 20 packages under > golang.org/x/text as a fait accompli. > I don't mind reworking the API of presented packages quite a bit, even if they are near done. The runes package is relatively simple and didn't take much time to implement. The Rewriter was a bit tricky as it is hard to figure out an API that takes away the subtleties of writing a rune-handling transformer as much as possible. For casing by far the most work went in to getting the tables to be of an acceptable format.
Sign in to reply to this message.
On Thu, Nov 13, 2014 at 7:55 PM, <mpvl@golang.org> wrote: > I don't mind reworking the API of presented packages quite a bit, even if > they are near done. To be blunt, it's not about you, it's about me (and r). We find it far easier to review the design in a 500 line CL than a 5,000 line CL, and implementation follows from the design.
Sign in to reply to this message.
On Fri, Nov 14, 2014 at 12:24 AM, Nigel Tao <nigeltao@golang.org> wrote: > On Thu, Nov 13, 2014 at 7:55 PM, <mpvl@golang.org> wrote: > > I don't mind reworking the API of presented packages quite a bit, even if > > they are near done. > > To be blunt, it's not about you, it's about me (and r). We find it far > easier to review the design in a 500 line CL than a 5,000 line CL, and > implementation follows from the design. > Fair enough.
Sign in to reply to this message.
R=close To the author of this CL: The Go project has moved to Gerrit Code Review. If this CL should be continued, please see the latest version of https://golang.org/doc/contribute.html for instructions on how to set up Git and the Go project's Gerrit codereview plugin, and then create a new change with your current code. If there has been discussion on this CL, please give a link to it (golang.org/cl/157620043 is best) in the description in your new CL. Thanks very much.
Sign in to reply to this message.
|