Rietveld Code Review Tool
Help | Bug tracker | Discussion group | Source code | Sign in
(691)

Issue 330940043: RBBI, add caching, remove reverse rules.

Can't Edit
Can't Publish+Mail
Start Review
Created:
6 years, 5 months ago by andy.heninger
Modified:
6 years, 5 months ago
Reviewers:
mark, mark.edward.davis
Base URL:
svn+ssh://source.icu-project.org/repos/icu/trunk/
Visibility:
Public.

Description

combined changes.

Patch Set 1 #

Total comments: 14
Unified diffs Side-by-side diffs Delta from patch set Stats (+1706 lines, -3676 lines) Patch
M . View 0 chunks +0 lines, -0 lines 0 comments Download
M icu4c View 0 chunks +0 lines, -0 lines 0 comments Download
M icu4c/source View 0 chunks +0 lines, -0 lines 0 comments Download
M icu4c/source/common/Makefile.in View 1 chunk +1 line, -1 line 0 comments Download
M icu4c/source/common/brkeng.h View 4 chunks +4 lines, -9 lines 0 comments Download
M icu4c/source/common/brkeng.cpp View 3 chunks +11 lines, -18 lines 2 comments Download
M icu4c/source/common/common.vcxproj View 2 chunks +3 lines, -0 lines 0 comments Download
M icu4c/source/common/common.vcxproj.filters View 2 chunks +6 lines, -0 lines 0 comments Download
M icu4c/source/common/common_uwp.vcxproj View 2 chunks +3 lines, -0 lines 0 comments Download
M icu4c/source/common/dictbe.h View 8 chunks +10 lines, -12 lines 0 comments Download
M icu4c/source/common/dictbe.cpp View 8 chunks +13 lines, -35 lines 0 comments Download
M icu4c/source/common/rbbi.cpp View 47 chunks +196 lines, -678 lines 0 comments Download
M icu4c/source/common/rbbidata.h View 1 chunk +2 lines, -2 lines 0 comments Download
M icu4c/source/common/rbbidata.cpp View 3 chunks +13 lines, -5 lines 0 comments Download
M icu4c/source/common/rbbirb.cpp View 4 chunks +45 lines, -13 lines 0 comments Download
M icu4c/source/common/rbbiscan.cpp View 2 chunks +5 lines, -0 lines 0 comments Download
M icu4c/source/common/rbbisetb.h View 1 chunk +2 lines, -2 lines 0 comments Download
M icu4c/source/common/rbbisetb.cpp View 2 chunks +13 lines, -5 lines 0 comments Download
M icu4c/source/common/unicode/rbbi.h View 7 chunks +57 lines, -74 lines 0 comments Download
M icu4c/source/data/brkitr/rules/char.txt View 3 chunks +5 lines, -36 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line.txt View 5 chunks +5 lines, -221 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line_fi.txt View 5 chunks +5 lines, -225 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line_loose.txt View 5 chunks +6 lines, -225 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line_loose_cj.txt View 5 chunks +5 lines, -231 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line_loose_fi.txt View 5 chunks +5 lines, -228 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line_normal.txt View 5 chunks +5 lines, -222 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line_normal_cj.txt View 5 chunks +5 lines, -224 lines 0 comments Download
M icu4c/source/data/brkitr/rules/line_normal_fi.txt View 4 chunks +5 lines, -224 lines 0 comments Download
M icu4c/source/data/brkitr/rules/sent.txt View 5 chunks +6 lines, -15 lines 0 comments Download
M icu4c/source/data/brkitr/rules/sent_el.txt View 5 chunks +5 lines, -5 lines 0 comments Download
M icu4c/source/data/brkitr/rules/title.txt View 4 chunks +2 lines, -21 lines 0 comments Download
M icu4c/source/data/brkitr/rules/word.txt View 4 chunks +10 lines, -116 lines 0 comments Download
M icu4c/source/data/brkitr/rules/word_POSIX.txt View 7 chunks +14 lines, -120 lines 0 comments Download
M icu4c/source/test View 0 chunks +0 lines, -0 lines 0 comments Download
M icu4c/source/test/intltest View 0 chunks +0 lines, -0 lines 0 comments Download
M icu4c/source/test/intltest/rbbiapts.cpp View 3 chunks +4 lines, -4 lines 0 comments Download
M icu4c/source/test/intltest/rbbimonkeytest.h View 1 chunk +1 line, -0 lines 0 comments Download
M icu4c/source/test/intltest/rbbimonkeytest.cpp View 2 chunks +24 lines, -0 lines 0 comments Download
M icu4c/source/test/intltest/rbbitst.cpp View 3 chunks +4 lines, -1 line 0 comments Download
M icu4c/source/test/testdata/rbbitst.txt View 2 chunks +81 lines, -2 lines 0 comments Download
M icu4j View 0 chunks +0 lines, -0 lines 0 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/DictionaryBreakEngine.java View 4 chunks +26 lines, -19 lines 0 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/LanguageBreakEngine.java View 2 chunks +12 lines, -13 lines 0 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/RBBIDataWrapper.java View 2 chunks +10 lines, -1 line 0 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/RBBIRuleBuilder.java View 5 chunks +47 lines, -11 lines 2 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/RBBIRuleScanner.java View 4 chunks +7 lines, -3 lines 0 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/RBBISetBuilder.java View 1 chunk +8 lines, -3 lines 0 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java View 31 chunks +922 lines, -631 lines 10 comments Download
M icu4j/main/classes/core/src/com/ibm/icu/text/UnhandledBreakEngine.java View 1 chunk +4 lines, -11 lines 0 comments Download
M icu4j/main/shared/data/icudata.jar View Binary file 0 comments Download
M icu4j/main/shared/data/icutzdata.jar View Binary file 0 comments Download
M icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBIAPITest.java View 3 chunks +5 lines, -5 lines 0 comments Download
M icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBITestExtended.java View 1 chunk +3 lines, -3 lines 0 comments Download
M icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt View 2 chunks +81 lines, -2 lines 0 comments Download

Messages

Total messages: 3
mark.edward.davis
Code looks good. Just a few small questions. https://codereview.appspot.com/330940043/diff/1/icu4c/source/common/brkeng.cpp File icu4c/source/common/brkeng.cpp (right): https://codereview.appspot.com/330940043/diff/1/icu4c/source/common/brkeng.cpp#newcode84 icu4c/source/common/brkeng.cpp:84: int32_t ...
6 years, 5 months ago (2017-10-10 13:59:23 UTC) #1
andy.heninger
Suggested changes to comments made and checked in. I ran the RBBI tests with a ...
6 years, 5 months ago (2017-10-10 22:05:33 UTC) #2
mark_macchiato.com
6 years, 5 months ago (2017-10-11 05:10:26 UTC) #3
Thanks

{phone}

On Oct 10, 2017 15:05, <andy.heninger@gmail.com> wrote:

> Reviewers: mark_macchiato.com, mark.edward.davis,
>
> Message:
> Suggested changes to comments made and checked in.
>
> I ran the RBBI tests with a cache size of 8, everything was OK.
>
> Running with 4 fails; there's a spot in the code where, if the cache is
> full, I clear out 6 entries, with the expectation that they will soon to
> be used. It fails with an assert, which is fine.
>
>
> https://codereview.appspot.com/330940043/diff/1/icu4c/source
> /common/brkeng.cpp
> File icu4c/source/common/brkeng.cpp (right):
>
> https://codereview.appspot.com/330940043/diff/1/icu4c/source
> /common/brkeng.cpp#newcode84
> icu4c/source/common/brkeng.cpp:84: int32_t /* startPos */,
> On 2017/10/10 13:59:23, mark.edward.davis wrote:
>
>> don't understand the /* ... */ here.
>>
>
> The parameter startPos is unused. Commenting it out stops compiler
> warnings about unused parameters.
>
> This function is an override. The base class, with an abstract function
> declaration, is LanguageBreakEngine. class DictionaryBreakEngine has a
> real, non-stub, implementation that does need and use the startPos
> parameter.
>
> A more widely used C/C++ idiom for suppressing unused parameter warnings
> is this:
> (void)foundBreaks;
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RBBIRuleBuilder.java
> File icu4j/main/classes/core/src/com/ibm/icu/text/RBBIRuleBuilder.java
> (right):
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RBBIRuleBuilder.java#newcode178
> icu4j/main/classes/core/src/com/ibm/icu/text/RBBIRuleBuilder.java:178:
> int reverseTableSize = align8(fReverseTables.getTableSize());
> On 2017/10/10 13:59:23, mark.edward.davis wrote:
>
>> Why is the safeFwdTableSize removed, but not the reverseTableSize.
>>
> Seems
>
>> inconsistent with change below.
>>
>
> This goes back to the Rich Gillam era and old rule compatibility. In the
> source rules, if there are reverse rules but no safe rules, the reverse
> rules are actually safe reverse rules. Further down, around line 231, is
> code that deals with that.
>
> This is all going to be revisited for ICU 61, when I hope to synthesize
> safe rules from the forward rules.
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java
> File
> icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java
> (right):
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java#newcode458
> icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreakI
> terator.java:458:
> // Move requested offset to a code point start. It might be on a trail
> surrogate.
> On 2017/10/10 13:59:23, mark.edward.davis wrote:
>
>> Very minor: It might be between lead and trail surrogates.
>>
>
> (best to be consistent about offsets being "between" characters.)
>>
>
> Done.
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java#newcode481
> icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreakI
> terator.java:481:
> // Move requested offset to a code point start. It might be on a trail
> surrogate.
> On 2017/10/10 13:59:23, mark.edward.davis wrote:
>
>> Ditto vis a vis trail surrogates.
>>
>
> Done.
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java#newcode1149
> icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreakI
> terator.java:1149:
> ci.next();
> On 2017/10/10 13:59:23, mark.edward.davis wrote:
>
>> Didn't quite get this. Since it works with the tests it must behave
>>
> ok.
>
> I would expect it to go 1 forward IFF <text[index-1], text[index]> is
>>
> a
>
>> surrogate pair.
>>
>
> Looks like it says:
>>
>
> set the internal index to index (eg 5)
>> if the character following the index (logically text[5]) is a low
>>
> surrogate
>
>>    then if the character before that (logically text[4]) IS NOT a high
>>
> surrogate,
>
>> go forward 1.
>>
>
> I'd expect the test to be IS.
>>
>
> I think it's ok. It's doing what you say, except that the test
>
>   then if the character before that (logically text[4]) IS NOT a high
>>
> (lead) surrogate,
> also moves the CI back one, so its position is now 4, and the following
>
>    go forward 1.
>>
>
> then moves the position back to the original 5, for the case of an
> unpaired trail surrogate. If there was a proper surrogate pair, the
> index remains at 4, the lead of the pair.
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java#newcode1767
> icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreakI
> terator.java:1767:
> /*
> On 2017/10/10 13:59:23, mark.edward.davis wrote:
>
>> A few instances where /* should be /** (unless that was intentional)
>>
>
> Accidental. They're not public, but it should be consistent and it's
> not.
>
> https://codereview.appspot.com/330940043/diff/1/icu4j/main/
> classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java#newcode1867
> icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreakI
> terator.java:1867:
> static final int CACHE_SIZE = 128;
> On 2017/10/10 13:59:23, mark.edward.davis wrote:
>
>> 1. Have you tested with CACHE_SIZE = (say) 2, to make sure all the
>>
> boundary
>
>> conditions work?
>> 2. It feels like a cache of 256 would be safer. Did you test for
>>
> optimal sizes?
>
> I haven't tested with a tiny cache, but I've tested with text with well
> over 128 boundaries, causing the cache to circle around, and with a
> variety of access patterns.
>
> Testing with a small cache size is a good idea. Two might not work, I
> need to review the code, but I think I've got places where I assume I
> can get a two or three boundaries to fit. But 8 or 16 should be fine.
>
> The optimum cache size is an interesting question. For straight forward
> or reverse iteration, it's not going to matter much. For non-sequential
> access, I don't have any usage data to guide tuning. The biggest point
> of the cache, initially, is for reverse iteration, to buffer up the
> boundaries found going forwards after backing up to a safe point.
>
> Another interesting tuning parameter is how much to back up during
> reverse iteration. Too little, and excess time will be spent repeatedly
> applying the safe rules. Too much, and we may find and cache a bunch of
> extra boundaries that will never be retrieved by the user. I played with
> this a little. Again, the optimum depends on usage information, which we
> don't have. It might be possible to make it self-tuning.
>
> Description:
> combined changes.
>
> Please review this at https://codereview.appspot.com/330940043/
>
> Affected files (+1704, -3674 lines):
>    M    .
>    M    icu4c
>    M    icu4c/source
>   M     icu4c/source/common/Makefile.in
>   M     icu4c/source/common/brkeng.h
>   M     icu4c/source/common/brkeng.cpp
>   M     icu4c/source/common/common.vcxproj
>   M     icu4c/source/common/common.vcxproj.filters
>   M     icu4c/source/common/common_uwp.vcxproj
>   M     icu4c/source/common/dictbe.h
>   M     icu4c/source/common/dictbe.cpp
>   M     icu4c/source/common/rbbi.cpp
>   M     icu4c/source/common/rbbidata.h
>   M     icu4c/source/common/rbbidata.cpp
>   M     icu4c/source/common/rbbirb.cpp
>   M     icu4c/source/common/rbbiscan.cpp
>   M     icu4c/source/common/rbbisetb.h
>   M     icu4c/source/common/rbbisetb.cpp
>   M     icu4c/source/common/unicode/rbbi.h
>   M     icu4c/source/data/brkitr/rules/char.txt
>   M     icu4c/source/data/brkitr/rules/line.txt
>   M     icu4c/source/data/brkitr/rules/line_fi.txt
>   M     icu4c/source/data/brkitr/rules/line_loose.txt
>   M     icu4c/source/data/brkitr/rules/line_loose_cj.txt
>   M     icu4c/source/data/brkitr/rules/line_loose_fi.txt
>   M     icu4c/source/data/brkitr/rules/line_normal.txt
>   M     icu4c/source/data/brkitr/rules/line_normal_cj.txt
>   M     icu4c/source/data/brkitr/rules/line_normal_fi.txt
>   M     icu4c/source/data/brkitr/rules/sent.txt
>   M     icu4c/source/data/brkitr/rules/sent_el.txt
>   M     icu4c/source/data/brkitr/rules/title.txt
>   M     icu4c/source/data/brkitr/rules/word.txt
>   M     icu4c/source/data/brkitr/rules/word_POSIX.txt
>    M    icu4c/source/test
>    M    icu4c/source/test/intltest
>   M     icu4c/source/test/intltest/rbbiapts.cpp
>   M     icu4c/source/test/intltest/rbbimonkeytest.h
>   M     icu4c/source/test/intltest/rbbimonkeytest.cpp
>   M     icu4c/source/test/intltest/rbbitst.cpp
>   M     icu4c/source/test/testdata/rbbitst.txt
>    M    icu4j
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/DictionaryBrea
> kEngine.java
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/LanguageBreakE
> ngine.java
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/RBBIDataWrapper.java
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/RBBIRuleBuilder.java
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/RBBIRuleScanner.java
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/RBBISetBuilder.java
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/RuleBasedBreak
> Iterator.java
>   M     icu4j/main/classes/core/src/com/ibm/icu/text/UnhandledBreak
> Engine.java
>   M     icu4j/main/shared/data/icudata.jar
>   M     icu4j/main/shared/data/icutzdata.jar
>   M     icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBI
> APITest.java
>   M     icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/RBBI
> TestExtended.java
>   M     icu4j/main/tests/core/src/com/ibm/icu/dev/test/rbbi/rbbitst.txt
>
>
>
Sign in to reply to this message.

Powered by Google App Engine
RSS Feeds Recent Issues | This issue
This is Rietveld f62528b