|
|
Created:
11 years, 4 months ago by nigeltao Modified:
11 years, 3 months ago CC:
golang-dev Visibility:
Public. |
Descriptionexp/publicsuffix: new package.
The tables were generated by:
go run gen.go -version "publicsuffix.org's effective_tld_names.dat, hg revision 05b11a8d1ace (2012-11-09)" >table.go
go run gen.go -version "publicsuffix.org's effective_tld_names.dat, hg revision 05b11a8d1ace (2012-11-09)" -test >table_test.go
The input data is temporarily filtered to the .ao, .ar, .arpa, .uk and
.zw domains, so that code review is easier while still covering the
interesting * and ! rules. A follow-up changelist will check in the
unfiltered public suffix list.
Update issue 1960.
Patch Set 1 #Patch Set 2 : diff -r 751e8610bd5d https://code.google.com/p/go #Patch Set 3 : diff -r 751e8610bd5d https://code.google.com/p/go #Patch Set 4 : diff -r 751e8610bd5d https://code.google.com/p/go #Patch Set 5 : diff -r 751e8610bd5d https://code.google.com/p/go #Patch Set 6 : diff -r 751e8610bd5d https://code.google.com/p/go #
Total comments: 16
MessagesTotal messages: 23
Hello rsc@golang.org, dr.volker.dobler@gmail.com (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go
Sign in to reply to this message.
Just two remarks from a quick skim: nodes is an array? Would the GC have to scan this array or is it treated as one? I found the kyoto/kobe test in the testdata suite which can be found at http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/data/test_psl... often broke my work. Could you include those too? On 2012/12/05 00:47:08, nigeltao wrote: > Hello mailto:rsc@golang.org, mailto:dr.volker.dobler@gmail.com (cc: > mailto:golang-dev@googlegroups.com), > > I'd like you to review this change to > https://code.google.com/p/go
Sign in to reply to this message.
Current code breaks for the jp rules like *.kobe.jp which should match kobe.jp. But the fix seems trivial (see below). Your algorithm is very clean and nice. Maybe some more comments about how the rules are encoded into the tree of nodes might be useful for those who get confronted with this code for the first time. You named the table generator "gen" as does atom and md5 while unicode and norm named it "maketable(s)". Any chance of unifying these? Any rule when to pick which name? The generated table.go can be stripped further and seems of reasonable size to be checked in. The comments per node should be suppressed by default and turned on via a command line flag to gen (if checked in). It is okay if all this goes to go.net, but I do not think that the public suffixes are just some "internet database that comes along": Cookies belong to HTTP, they are used for security relevant tasks and their security depends (to some degree) on not allowing domain cookies on certain domains. Unfortunately this list is long but not locally accessible like time zone data. A "striped version" is useless, even dangerous. The main issue I see with a built-in list is, that the Go release cycle is much longer than the lists (or current browser) and users might rely on an outdated list (which might be as bad as none or a stripped list). There is just one solution here: fetch at runtime. Fetching at runtime would allow a long running server to updated its list. Fetch at runtime seems okay, as you won't need the list if you don't have internet connection. Building Go itself or any other project would still be possible without network connectivity. No large, quickly out-dating table needs to be checked in (neither the raw, nor some generated table.go). Fetch at runtime doesn't necessarily mean fetch the raw list from the net: a local server, local file or falling back from net to local file might be simple and usable. Anyway, even fetch at runtime needs an internal representation of the rules and the proposed solution is a very nice one. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... File src/pkg/exp/publicsuffix/gen.go (right): https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:83: res, err := http.Get(*url) It would be nice to be able to use a local file on disc too. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:215: fmt.Fprintf(w, "const text = ") Maybe a short doc comment for text too? https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:236: // [2] nodeType [1] wildcard [13] number of children [16] first child. A reversed layout of numChildren firstChild nodeType wildcard would result in an encoding with the two most significant bytes set to zero for all nodes without children. If output as %x instead of %08x, this would save some more bytes of checked in table data. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:241: var nodes = [...][2]uint32{ Using a uint64 instead of [2]uint32 reduces each line of the generated table by 6 byte (total of 22k or 6%) but might be slower on 32 bit hardware. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:277: children map[string]*node If you use a slice instead of a map for the children, you can sort it once and do not need to "collect keys, sort keys, iterate map in sorted keys order" as you do twice below in assignNodeIndex and printTable. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:325: fmt.Fprintf(w, "{0x%08x, 0x%08x}, // 0x%04x (%s) %s%c %s\n", Could use "0x%x" as format which would save some bytes in the generated table (while decreasing readability). https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... File src/pkg/exp/publicsuffix/list.go (right): https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... src/pkg/exp/publicsuffix/list.go:47: switch u { Empty nodes do match wildcard nodes. Adding case nodeTypeEmpty: if wildcard { suffix = 1 + dot } fixes all the testcases, including the kobe.jp ones. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... File src/pkg/exp/publicsuffix/list_test.go (right): https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... src/pkg/exp/publicsuffix/list_test.go:163: got := slowPublicSuffix(tc.domain) Very nice idea. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... src/pkg/exp/publicsuffix/list_test.go:242: // TODO(nigeltao): add the "Effective Top Level Domain Plus 1" tests from These are the ones which discovered the kobe.jp bug. Feel free to just copy my testcases.
Sign in to reply to this message.
Just an idea: it could live in the go.net subrepo. Punycode could also live there, I guess. That could push exp/cookiejar into go.net too, since it's not really useful without a public suffix list implementation. On Wed, Dec 5, 2012 at 3:50 PM, Russ Cox <rsc@golang.org> wrote: > How big is the source file form of the generated full list? The generated table.go is 390K, but half of that is comments, since most lines look like: {0x40100c95, 0x006a7f09}, // 0x0629 (0x0c95-0x0ca5) + yamaguchi {0x401c0ca5, 0x006aa709}, // 0x062a (0x0ca5-0x0cc1) + yamanashi {0x20010cc1, 0x006bb608}, // 0x062b (0x0cc1-0x0cc2) _* yokohama {0x40000000, 0x00018705}, // 0x062c (-------------) + aisai {0x40000000, 0x00027303}, // 0x062d (-------------) + ama {0x40000000, 0x00036a04}, // 0x062e (-------------) + anjo {0x40000000, 0x00059005}, // 0x062f (-------------) + asuke {0x40000000, 0x000d9306}, // 0x0630 (-------------) + chiryu {0x40000000, 0x000d9905}, // 0x0631 (-------------) + chita The node table has 5840 rows, and the "const text = `etc`" string is 27990 bytes, presumably the latter could get smaller if we dedupe it a la exp/html/atom's text string. The pkg/linux_amd64/exp/publicsuffix.a file is 562K. That package imports exp/cookiejar and strings. exp/cookiejar.a is 152K and strings.a is 334K. In comparison, unicode.a is 626K. The raw effective_tld_names.dat file has 6949 lines and weighs in at 99K. For me, effective_tld_names.dat.gz is 33K, so I don't know where bradfitz is getting his 96K number from. > How often does the publicsuffix.org list get updated? > Is there a version history? http://hg.mozilla.org/mozilla-central/filelog/e8f0504ccbb9/netwerk/dns/effect... suggests that there were 40 changes so far in 2012, although a couple of those were formatting changes.
Sign in to reply to this message.
I can't believe I'm saying this but what does IE do?
Sign in to reply to this message.
What are the security implications of the suffix list updates? If the list is checked into the main Go repo, does that mean people at Go 1.1.0 might be left with an insecure HTTP client if a change is made between 1.1.0 and 1.1.1? Andrew
Sign in to reply to this message.
On Thu, Dec 6, 2012 at 2:53 PM, Russ Cox <rsc@golang.org> wrote: > Although I am not worried about the size, I do like Florian's > suggestion of just not having the list at all. Is that tenable? Is it > only full web browsers that really need to care? One alternative would > be to implement a parser for the actual web format of the list and say > that if you care your client must supply the file content to use. It would be possible to parse at runtime, but you might have to care then about how efficient the table-building code is. It's tenable to provide a 'dev null' public suffix list that always returns the last label (the au in foo.com.au). It would mean that evil.com.au could affect the cookies for google.com.au, which is obviously bad for a full web browser but a simple web-scraping program might not care. A slightly smarter list is to return the z in x.y.z if z is in {com,edu,gov,int,mil,net,org}, otherwise return y.z. I'm told that the original Netscape cookie spec did this, and it works for a lot of the internet, including example.com, example.com.au and example.co.uk, but it isn't correct for example.fr, example.shibuya.tokyo.jp or example.pvt.k12.ma.us. I believe that Florian is right in that co-operating web sites can work around the publicsuffix restrictions on cookies, but I think the concern is more about malicious web sites. > I would also like to know what the deal is with punycode. Why does it > come up here? Why does it matter for cookies? Can we make the client > be in charge of dealing with punycode too instead of making the cookie > jar do it? Punycode matters because some elements of http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names... are IDNs (which I'm leaving as TODO for now). For example, grepping for "Hong Kong" finds: // xn--j6w193g ("Hong Kong" Chinese-Han) : HK // https://www2.hkirc.hk/register/rules.jsp 香港 I think the client should be responsible for punycoding the domain, but if we're generating the table ahead of time, we'll have to punycode at that time too.
Sign in to reply to this message.
On Wed, Dec 5, 2012 at 6:12 PM, <dr.volker.dobler@gmail.com> wrote: > nodes is an array? Would the GC have to scan > this array or is it treated as one? The GC skips the nodes array. For example, take this program: $ cat main.go package main var u = [...][2]uint32{ {0x0101, 0x0102}, {0x0201, 0x0202}, {0x0301, 0x0302}, } var p = [...][2]*uint32{ {&u[0][0], &u[0][1]}, {&u[1][0], &u[1][1]}, {&u[2][0], &u[2][1]}, } func main() { println(u[0][0]) println(p[0][0]) } The compiler output is: $ go tool 6g -SS -o /dev/null main.go | grep GLOBL 0040 (main.go:7) GLOBL u+0(SB),16,$24(AL*0) 0040 (main.go:13) GLOBL p+0(SB),$48(AL*0) [etc] Note that the u global variable is marked "16". This matches $GOROOT/src/cmd/6l/6.out.h:37: #define NOPTR (1<<4) which means that the value doesn't directly or indirectly contain any pointers, and thus doesn't need scanning during garbage collection. Keep grepping for NOPTR in $GOROOT/src/cmd if you're curious about the details (e.g. it leads to the haspointers function in $GOROOT/src/cmd/gc/reflect.c, which is where it's decided that [...][2]uint32 has no pointers). I'll look at the the kyoto/kobe test tomorrow.
Sign in to reply to this message.
We shouldn't check in the raw data file, even gzipped. Nothing but the table generator will ever read that file, and it can read it from the internet, just like Unicode's maketables.go does. The only thing we should be talking about checking in is the generated tables.go file, which cannot be compressed (or it can't be compiled). Russ
Sign in to reply to this message.
I don't want to move cookiejar code to go.net. I think it's embarrassing that we can't make HTTP requests with cookies because we don't have a cookiejar implementation, and I want to keep the pressure on to get one for Go 1.1. We might not make it, but I don't want to give up yet. I realize that the process is generating a lot of discussion, but for the most part I think it is helping us arrive a better API and implementation. I would like to continue iterating on the design of the table here, with an eye toward keeping both the in-memory and in-Go-file size as small as possible. Once Nigel and Volker are happy with both of those, I am not worried about the size of the generator program and the generated Go file (don't check in the original web version). In the current repository a few hundred K is not a significant difference. I would be more worried if we were checking in new tables every day, but we're not. However... Although I am not worried about the size, I do like Florian's suggestion of just not having the list at all. Is that tenable? Is it only full web browsers that really need to care? One alternative would be to implement a parser for the actual web format of the list and say that if you care your client must supply the file content to use. I would also like to know what the deal is with punycode. Why does it come up here? Why does it matter for cookies? Can we make the client be in charge of dealing with punycode too instead of making the cookie jar do it? Thanks. Russ
Sign in to reply to this message.
On Tue, Dec 4, 2012 at 9:37 PM, Russ Cox <rsc@golang.org> wrote: > We shouldn't check in the raw data file, even gzipped. Nothing but the > table generator will ever read that file, and it can read it from the > internet, just like Unicode's maketables.go does. > > The only thing we should be talking about checking in is the generated > tables.go file, which cannot be compressed (or it can't be compiled). I was under the impression this thread had turned into concerns about the repository's disk usage. It's not clear to me what this thread is even about: -- the web's grossness -- the necessity of more tables -- build system complexity -- build time -- network availability -- disk space ... and in what order.
Sign in to reply to this message.
This thread is about where and how to make this data accessible to Go programs. Does it need to be in the main repo? Does all of it need to be in the main repo? Would it be sufficient to put it in go.net? How should it be represented a -rob
Sign in to reply to this message.
The Marianas trench rebuttal is unwarranted: we've never required a network connection to run all.bash so far. This shouldn't be what breaks that. For comparison, we have 2 MB of images checked in to the repo. This is 96 KB compressed. With an domain-specific compression scheme we could make that even less. I don't care about this database much but it's hard to discount its importance. It's as important to the web as HTTP and an HTML5 parser is, which we seem cool with having in the main repo. On Tue, Dec 4, 2012 at 9:19 PM, Rob Pike <r@golang.org> wrote: > I think of this as more like the time zone data than the unicode > table. Regardless, it's data that most programs will not need and I > strongly object to checking in a version of every internet database > that comes along. > > There are going to be more of these. If you insist on building it in, > I'd prefer a subrepo to set the proper precedent. You don't need it > built in anyway, since you'll have the repo built when you're on the > plane. > > I don't know much about the data so I can't judge the merit of the > hybrid approach but it does seem to solve all problems: little churn, > small hit to repo, can work in in the Marianas trench, full > flexibility available for show-offs. > > -rob >
Sign in to reply to this message.
On Tue, Dec 4, 2012 at 10:03 PM, Rob Pike <r@golang.org> wrote: > This thread is about where and how to make this data accessible to Go > programs. > > Does it need to be in the main repo? Does all of it need to be in the main repo? Would it be sufficient to put it in go.net? > depends what the concerns are. what are your concerns? > How should it be represented? > How should it be maintained? > etc > > -rob >
Sign in to reply to this message.
This thread is about where and how to make this data accessible to Go programs. Does it need to be in the main repo? Does all of it need to be in the main repo? Would it be sufficient to put it in go.net? How should it be represented? How should it be maintained? -rob
Sign in to reply to this message.
> The input data is temporarily filtered to the .ao, .ar, .arpa, .uk and > .zw domains, so that code review is easier while still covering the > interesting * and ! rules. A follow-up changelist will check in the > unfiltered public suffix list. What's the technical need for this list? As far as I can tell, it's just a cargo cult. Few things need it (I can see that Noscript benefits from it, for instance). The cookie restrictions never made much sense to me because they are trivially bypassed with a couple of server round trips (among cooperating web sites).
Sign in to reply to this message.
Yeah, I thought tables.go was the only thing being considered for checking in.
Sign in to reply to this message.
* Russ Cox: > I can't believe I'm saying this but what does IE do? Here's a somewhat dated description, but I think the rationale still applies (but IE's behavior may have changed since): <http://blogs.msdn.com/b/ieinternals/archive/2009/09/19/private-domain-names-a...> It's hard to come up with a short summary. The bypass I mentioned involves having something like this in web pages: <script src="https://www.enyo.de/metrics/guid.js"> And the web server would look at a "GUID" cookie, set a new one if it does not exist, and return page content like this, reproducing the cookie value: enyoMetricsGUID = "90b57a93-4433-42b5-81cc-321b09edf868"; The web page including the script then has access to this global cookie, and could use XMLHttpRequest to report it back to its server. Regarding the protection from session fixation attacks, these have to be targeted at a specific web site (because a valid session cookie is required). Setting such a cookie globally therefore does not make much sense. But it seems that all web browsers support third party cookies by default (and I recently discovered that some Wordpress services rely on them), so you could directly set the cookie on the victim domain, public suffix list or not.
Sign in to reply to this message.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... File src/pkg/exp/publicsuffix/gen.go (right): https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:105: if s == "" || strings.HasPrefix(s, "//") || !isASCII(s) { Warning: This ignores the distinction between the effectively two lists that are reflected in effective_tld_names.dat - public suffixes and private domains. (denoted by "// ===BEGIN " and "// ===END" directives) The distinction and implication of these can be seen in the table at https://wiki.mozilla.org/Public_Suffix_List/Use_Cases If you're only intending to use this for a cookiejar implementation, then this logic is sufficient. However, if you feel the other uses cases should be met by this package, then you may wish to include the extra bit of information that comes from knowing whether it's an ICANN TLD or whether it's a "private" domain that should be treated as an effective TLD for purposes of cookies, but not for other purposes.
Sign in to reply to this message.
The plan is: * the cookiejar code stays in exp, and aims to be promoted to the standard library proper by Go 1.1. A "log in to the nytimes.com site and fetch the crossword" program can use a cookiejar with the 'dev null' public suffix list. * the publicsuffix package (based on publicsuffix.org's data) will live in go.net/publicsuffix, and will be compiled ahead of time. The generated source code can be updated more frequently than the Go release. Thus, this codereview change is obsolete. Please review https://codereview.appspot.com/6912045 instead. I have added text packing to the new change (merging e.g. "arpa" and "parliament" into "arparliament"). This is the same packing code as exp/html/atom/gen.go. On the full public suffix list, the text constant shrinks from 27990 bytes to 18018 bytes (-36%). The computation took around 10 minutes. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... File src/pkg/exp/publicsuffix/gen.go (right): https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:83: res, err := http.Get(*url) On 2012/12/05 13:41:55, volker.dobler wrote: > It would be nice to be able to use a local file on disc too. If the url flag is empty, it reads from stdin. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:105: if s == "" || strings.HasPrefix(s, "//") || !isASCII(s) { On 2012/12/07 03:48:14, Ryan Sleevi wrote: > Warning: This ignores the distinction between the effectively two lists that are > reflected in effective_tld_names.dat - public suffixes and private domains. > (denoted by "// ===BEGIN " and "// ===END" directives) Ack. I've added a TODO to think about this. The API is not frozen yet. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:236: // [2] nodeType [1] wildcard [13] number of children [16] first child. On 2012/12/05 13:41:55, volker.dobler wrote: > A reversed layout of > numChildren firstChild nodeType wildcard > would result in an encoding with the two most significant > bytes set to zero for all nodes without children. > If output as %x instead of %08x, this would save some more > bytes of checked in table data. I'm not so concerned about the size of table.go. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/ge... src/pkg/exp/publicsuffix/gen.go:277: children map[string]*node On 2012/12/05 13:41:55, volker.dobler wrote: > If you use a slice instead of a map for the children, > you can sort it once and do not need to "collect keys, > sort keys, iterate map in sorted keys order" as you do > twice below in assignNodeIndex and printTable. Done. https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... File src/pkg/exp/publicsuffix/list.go (right): https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... src/pkg/exp/publicsuffix/list.go:47: switch u { On 2012/12/05 13:41:55, volker.dobler wrote: > Empty nodes do match wildcard nodes. Adding > case nodeTypeEmpty: > if wildcard { suffix = 1 + dot } > fixes all the testcases, including the kobe.jp > ones. That's not quite the right fix: given these rules: jp *.kobe.jp !city.kobe.jp You wouldn't see it in real life, but the public suffix of "kobe.jp" is "jp", not "kobe.jp". The slowPublicSuffix implementation correctly returns "jp". I've fixed this algorithm to do that: case nodeTypeException: suffix = prevSuffix becomes case nodeTypeException: suffix = 1 + len(s) https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... File src/pkg/exp/publicsuffix/list_test.go (right): https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/li... src/pkg/exp/publicsuffix/list_test.go:242: // TODO(nigeltao): add the "Effective Top Level Domain Plus 1" tests from On 2012/12/05 13:41:55, volker.dobler wrote: > These are the ones which discovered the kobe.jp bug. > Feel free to just copy my testcases. Yeah, I'll add these testcases when I generate the full table, not just a subset.
Sign in to reply to this message.
*** Abandoned ***
Sign in to reply to this message.
On Fri, Dec 7, 2012 at 2:48 PM, <rsleevi@chromium.org> wrote: > Warning: This ignores the distinction between the effectively two lists > that are reflected in effective_tld_names.dat - public suffixes and > private domains. (denoted by "// ===BEGIN " and "// ===END" directives) ICANN domains vs private domains is addressed in https://codereview.appspot.com/7060046
Sign in to reply to this message.
|