Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(1585)

Issue 6971044: code review 6971044: exp/locale/collate: include composed characters into th... (Closed)

Can't Edit
Can't Publish+Mail
Start Review
Created:
12 years, 7 months ago by mpvl
Modified:
12 years, 7 months ago
Reviewers:
CC:
rsc, mpvl_google.com, golang-dev
Visibility:
Public.

Description

exp/locale/collate: include composed characters into the table. This eliminates the need to decompose characters for the majority of cases. This considerably speeds up collation while increasing the table size minimally. To detect non-normalized strings, rather than relying on exp/norm, the table now includes CCC information. The inclusion of this information does not increase table size. DETAILS - Raw collation elements are now a struct that includes the CCC, rather than a slice of ints. - Builder now ensures that NFD and NFC counterparts are included in the table. This also fixes a bug for Korean which is responsible for most of the growth of the table size. - As there is no more normalization step, code should now handle both strings and byte slices as input. Introduced source type to facilitate this. NOTES - This change does not handle normalization correctly entirely for contractions. This causes a few failures with the regtest. table_test.go contains a few uncommented tests that can be enabled once this is fixed. The easiest is to fix this once we have the new norm.Iter. - Removed a test cases in table_test that covers cases that are now guaranteed to not exist.

Patch Set 1 #

Patch Set 2 : diff -r e7cd0a82d669 https://go.googlecode.com/hg #

Total comments: 6

Patch Set 3 : diff -r a298f2d529ec https://go.googlecode.com/hg #

Unified diffs Side-by-side diffs Delta from patch set Stats (+783 lines, -219 lines) Patch
M src/pkg/exp/locale/collate/build/builder.go View 1 2 12 chunks +109 lines, -44 lines 0 comments Download
M src/pkg/exp/locale/collate/build/builder_test.go View 1 2 3 chunks +40 lines, -22 lines 0 comments Download
M src/pkg/exp/locale/collate/build/colelem.go View 1 2 10 chunks +58 lines, -32 lines 0 comments Download
M src/pkg/exp/locale/collate/build/colelem_test.go View 1 2 5 chunks +25 lines, -14 lines 0 comments Download
M src/pkg/exp/locale/collate/build/order.go View 1 2 8 chunks +37 lines, -16 lines 0 comments Download
M src/pkg/exp/locale/collate/build/order_test.go View 1 2 2 chunks +3 lines, -3 lines 0 comments Download
M src/pkg/exp/locale/collate/colelem.go View 1 2 6 chunks +44 lines, -13 lines 0 comments Download
M src/pkg/exp/locale/collate/colelem_test.go View 1 2 7 chunks +100 lines, -12 lines 0 comments Download
M src/pkg/exp/locale/collate/collate.go View 1 2 4 chunks +135 lines, -38 lines 0 comments Download
M src/pkg/exp/locale/collate/collate_test.go View 1 2 1 chunk +1 line, -0 lines 0 comments Download
M src/pkg/exp/locale/collate/contract.go View 1 2 3 chunks +60 lines, -1 line 0 comments Download
M src/pkg/exp/locale/collate/export_test.go View 1 2 3 chunks +3 lines, -3 lines 0 comments Download
M src/pkg/exp/locale/collate/table.go View 1 2 4 chunks +90 lines, -13 lines 0 comments Download
M src/pkg/exp/locale/collate/table_test.go View 1 2 4 chunks +17 lines, -8 lines 0 comments Download
M src/pkg/exp/locale/collate/trie.go View 1 2 1 chunk +61 lines, -0 lines 0 comments Download

Messages

Total messages: 5
mpvl
Hello rsc@golang.org (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://go.googlecode.com/hg
12 years, 7 months ago (2012-12-19 09:47:48 UTC) #1
rsc
LGTM Does the trie encoding make sure that you don't accept 4-byte sequences with value ...
12 years, 7 months ago (2012-12-21 20:12:52 UTC) #2
mpvl_google.com
Good point regarding trie. 4-byte values > 0x10FFFF are treated as unknown code points and ...
12 years, 7 months ago (2012-12-24 15:20:30 UTC) #3
mpvl
https://codereview.appspot.com/6971044/diff/2001/src/pkg/exp/locale/collate/build/builder.go File src/pkg/exp/locale/collate/build/builder.go (right): https://codereview.appspot.com/6971044/diff/2001/src/pkg/exp/locale/collate/build/builder.go#newcode662 src/pkg/exp/locale/collate/build/builder.go:662: log.Fatalf("%s:processContractions: unexpected length for '%X'; len=%d; want %d", o.id, ...
12 years, 7 months ago (2012-12-24 15:41:12 UTC) #4
mpvl
12 years, 7 months ago (2012-12-24 15:42:49 UTC) #5
*** Submitted as https://code.google.com/p/go/source/detail?r=4115d29db6bb ***

exp/locale/collate: include composed characters into the table. This eliminates
the need to decompose characters for the majority of cases.  This considerably
speeds up collation while increasing the table size minimally.

To detect non-normalized strings, rather than relying on exp/norm, the table
now includes CCC information. The inclusion of this information does not
increase table size.

DETAILS
 - Raw collation elements are now a struct that includes the CCC, rather
   than a slice of ints.
 - Builder now ensures that NFD and NFC counterparts are included in the table.
   This also fixes a bug for Korean which is responsible for most of the growth
   of the table size.
 - As there is no more normalization step, code should now handle both strings
   and byte slices as input. Introduced source type to facilitate this.

NOTES
 - This change does not handle normalization correctly entirely for
contractions.
   This causes a few failures with the regtest. table_test.go contains a few
   uncommented tests that can be enabled once this is fixed.  The easiest is to
   fix this once we have the new norm.Iter.
 - Removed a test cases in table_test that covers cases that are now guaranteed
   to not exist.

R=rsc, mpvl
CC=golang-dev
https://codereview.appspot.com/6971044
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b