Description: lexer.ll: Warn about non-UTF-8 characters
Making the warnings point to the exact bad byte rather than the
enclosing construct would be nice.
Patch Set 1
Total comments: 8
Patch Set 2: "Better error message locator"
Patch Set 3: If critics followed the same standards they ask of others they would have no time left to complain.
Total messages: 12
LGTM.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll
File lily/lexer.ll (right):

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1009
lily/lexer.ll:1009: case 0xc2:
Wouldn't it be more effective to create an array of 128 bytes (for 0x80-0xFF) which maps `p[i]' to the value of `more' (or 0 if there is more to do)?
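For readers who don't have the patch in front of them, the table suggested above might look something like this minimal sketch. It is not LilyPond code; the function name is invented, and it uses -1 rather than 0 to mark invalid lead bytes, a slight variation on the suggestion:

```cpp
#include <cstdint>

// Sketch of the suggested lookup table: for each byte 0x80-0xFF, give
// the number of continuation bytes a valid UTF-8 lead byte of that
// value requires, or -1 for bytes that cannot start a character
// (continuation bytes 0x80-0xBF, the over-long leads 0xC0/0xC1, and
// 0xF5-0xFF, which would encode beyond U+10FFFF).
// Precondition: c >= 0x80.
int
utf8_trailing (unsigned char c)
{
  static signed char table[128];
  static bool initialized = false;
  if (!initialized)
    {
      for (int i = 0; i < 128; i++)
        {
          unsigned b = 0x80 + i;
          if (b <= 0xBF || b == 0xC0 || b == 0xC1 || b >= 0xF5)
            table[i] = -1;   // invalid as a lead byte
          else if (b <= 0xDF)
            table[i] = 1;    // two-byte sequence
          else if (b <= 0xEF)
            table[i] = 2;    // three-byte sequence
          else
            table[i] = 3;    // four-byte sequence
        }
      initialized = true;
    }
  return table[c - 0x80];
}
```

Whether this beats a switch statement is exactly the codegen question debated later in the thread; the table form mainly buys compactness in the source.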
Works nicely. Showing the input location will probably be very helpful. We probably want to remove the similar message from lily/misc.cc, because both messages together are very noisy.

I wish I could think of a way to check the input with a canned regular expression like <http://flex.sourceforge.net/manual/Identifiers.html#Identifiers>, or better, one with comments: <http://www.w3.org/International/questions/qa-forms-utf-8>. Doing so seems to require backing up (which probably won't cause any harm), or maybe I'm just not seeing an easy way.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll
File lily/lexer.ll (right):

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode134
lily/lexer.ll:134: A [a-zA-Z\200-\377]
Non-ASCII characters are used internally as-read, tested below only to warn the user if the input is invalid as UTF-8.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode220
lily/lexer.ll:220: (void) YYText_utf8 ();
Complaining about comments doesn't help anybody.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
I suggest "unsupported file encoding, expected UTF-8". You could warn once per token, rather than once per character, by following this with break;
On 2012/01/01 02:01:11, Keith wrote:
> Showing the input location will probably be very helpful.

Just to clarify, showing the input location **as this patch does** will probably be helpful to confused users, because they can compare what Lily read with what they intended to write, while learning how to set the file encoding.
LGTM Carl
On 2012/01/01 02:01:11, Keith wrote:
> Works nicely.
>
> Showing the input location will probably be very helpful. We probably want to
> remove the similar message from lily/misc.cc, because both messages together
> are very noisy.

Depends on what the message does. This patch checks exclusively the input to the lexer/parser. There are ways of generating strings for the backend programmatically, however. I have decided to check strings, comments and file names here as well. This means that if you use literal strings as binary containers, or have to encode file names in non-UTF-8 because of other deficiencies in LilyPond, you'll get complaints.

> I wish I could think of a way to check the input with a canned regular
> expression like
> <http://flex.sourceforge.net/manual/Identifiers.html#Identifiers> or better
> one with comments <http://www.w3.org/International/questions/qa-forms-utf-8>
>
> Doing so seems to require backing up (which probably won't cause any harm) or
> maybe I'm just not seeing an easy way.

Our lexer has been written with the decision of using non-compressed tables and without backing up. I spent more than a day's worth of work on doing UTF-8 right in the grammar. That's pretty pointless. It also means that we would need to provide an error path for every item containing non-UTF-8 characters in order to get a UTF-8 related error message instead of something more mysterious. So I don't think it is really worth the trouble.
http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll
File lily/lexer.ll (right):

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode134
lily/lexer.ll:134: A [a-zA-Z\200-\377]
On 2012/01/01 02:01:11, Keith wrote:
> non-ASCII characters are used internally as-read, tested below only to warn
> the user if the input is invalid as UTF-8

You mean, you want a comment here? I actually don't know how comments look in this section of the Flex source.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode220
lily/lexer.ll:220: (void) YYText_utf8 ();
On 2012/01/01 02:01:11, Keith wrote:
> Complaining about comments doesn't help anybody.

There are editors and tools that decide which encoding to use based on an analysis of the whole file. Exceptions inside of comments can make them pick a "safe choice" like displaying every byte in binary. So I don't agree with "doesn't help anybody". Also, if you start with a file that contains non-ASCII non-UTF-8 characters in comments, the editor will not assume that you want to save in UTF-8: all additionally entered characters will be encoded compatibly with the previously detected file encoding. So I don't agree with your statement that basically says that files containing non-UTF-8 characters in comments can't cause trouble. I won't insist on throwing a warning here, but I won't remove it without being sure that this is based on an informed decision.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1009
lily/lexer.ll:1009: case 0xc2:
On 2011/12/31 18:28:00, lemzwerg wrote:
> Wouldn't it be more effective to create an array of 128 bytes (for 0x80-0xFF)
> which maps `p[i]' to the value of `more' (or 0 if there is more to do)?

How and when to break switch statements down into conditionals and lookup tables, depending on the individual processor's performance characteristics, is a pet peeve of compiler writers. I see little point meddling with that. Have you looked at the generated code?

I'd probably have exploded the case 0x00..0x7f into the switch as well, but did not like the resulting number of lines, and I don't know if clang++ or whatever other compilers we are using can deal with case ranges (like gcc can).

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
On 2012/01/01 02:01:11, Keith wrote:
> I suggest "unsupported file encoding, expected UTF-8"

I think that error message would be appropriate once per run, either because of making it a fatal error, or as a summary at the end. But when flagging the actual place where there is a problematic character?

> You could warn once per token, rather than once per character, by following
> this with break;

I think once per character is ok (after all, users don't appreciate fixing all the errors of one run, just to get fresh errors on the next). It does not help, however, if the error message points merely to the token (more accurately, the lexing unit, which does not need to result in a token). So the final version to commit should either point straight to the bad character, or give just one message per string. I'll change this to the latter before the final commit if nobody gives a good suggestion for how to make it do the former.
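To make the shape of the check under discussion concrete for readers without the patch, here is a standalone sketch, not LilyPond's actual lexer.ll code; the function name and return convention are invented for illustration. It walks a byte string and reports where the input stops being well-formed UTF-8, with the per-lead-byte special cases (0xC2, 0xE0, 0xED, 0xF0, 0xF4) that the switch in the patch is dealing with:

```cpp
#include <cstddef>

// Return the offset of the first byte at which `p' stops being
// well-formed UTF-8, or -1 if all `len' bytes are valid.  The
// tightened bounds on the second byte reject over-long encodings
// (lead bytes 0xC0/0xC1 are excluded outright; 0xE0/0xF0 require a
// large-enough second byte), UTF-16 surrogates (0xED followed by
// 0xA0-0xBF), and code points above U+10FFFF (0xF4 followed by
// 0x90 or more).
long
first_invalid_utf8 (const unsigned char *p, std::size_t len)
{
  for (std::size_t i = 0; i < len;)
    {
      unsigned char c = p[i];
      std::size_t more = 0;
      unsigned char lo = 0x80, hi = 0xBF;  // bounds for the second byte
      if (c <= 0x7F)
        { i++; continue; }                 // ASCII
      else if (c >= 0xC2 && c <= 0xDF) more = 1;
      else if (c == 0xE0) { more = 2; lo = 0xA0; }
      else if (c >= 0xE1 && c <= 0xEC) more = 2;
      else if (c == 0xED) { more = 2; hi = 0x9F; }  // no surrogates
      else if (c >= 0xEE && c <= 0xEF) more = 2;
      else if (c == 0xF0) { more = 3; lo = 0x90; }
      else if (c >= 0xF1 && c <= 0xF3) more = 3;
      else if (c == 0xF4) { more = 3; hi = 0x8F; }  // stay <= U+10FFFF
      else
        return (long) i;                   // 0x80-0xC1, 0xF5-0xFF
      // The first continuation byte obeys the tightened bounds; any
      // further ones must be in 0x80-0xBF.
      for (std::size_t j = 1; j <= more; j++)
        {
          unsigned char jlo = (j == 1) ? lo : 0x80;
          unsigned char jhi = (j == 1) ? hi : 0xBF;
          if (i + j >= len || p[i + j] < jlo || p[i + j] > jhi)
            return (long) i;
        }
      i += more + 1;
    }
  return -1;
}
```

The bounds here correspond to the constraints that the canned regular expressions mentioned earlier in the thread encode; whether the patch's switch matches them exactly is what the later messages argue about.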
On Sun, 01 Jan 2012 01:40:36 -0800, <dak@gnu.org> wrote:
> Our lexer has been written with the decision of using non-compressed
> tables and without backing up.

Off topic, but maybe interesting to you: I don't think that decision was ever implemented. I don't see any "%option full" or similar that would generate non-compressed tables. For at least a few years lexer.ll did generate backup states; I avoided them as part of the purge of warnings. Also, lexing takes a tiny fraction of LilyPond's execution time. If you ever want to use a method that requires backing up, I think you should.

> I spent more than a day's worth on doing
> utf-8 right in the grammar. That's pretty pointless. It also means
> that we need to provide an error path for every item containing
> non-UTF-8 characters in order to get a UTF-8 related error message
> instead of something more mysterious.

Okay, then. Consider a comment in your case/switch statement that points to some reference on the various types of UTF-8 validators.
On 2012/01/01 10:05:42, Keith wrote:
> On Sun, 01 Jan 2012 01:40:36 -0800, <dak@gnu.org> wrote:
> > Our lexer has been written with the decision of using non-compressed
> > tables and without backing up.
>
> Off topic, but maybe interesting to you: I don't think that decision was ever
> implemented.

I do.

> I don't see any "%option full" or similar that would generate
> non-compressed tables.

stepmake/stepmake/c++-rules.make:

    $(FLEX) -Cfe -p -p -o$@ $<

Look up -Cfe.

> Consider a comment in your case/switch statement that points to some
> reference on the various types of UTF-8 validators.

I don't understand.
On Sun, 01 Jan 2012 01:56:04 -0800, <dak@gnu.org> wrote:
> http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode134
> lily/lexer.ll:134: A [a-zA-Z\200-\377]
> On 2012/01/01 02:01:11, Keith wrote:
>> non-ASCII characters are used internally as-read, tested below only to
>> warn the user if the input is invalid as UTF-8
>
> You mean, you want a comment here?

Nope; I was writing what I understood the code to be doing, so you could complain if the innocent reader's understanding is different from your intent. I confirmed that a Latin1-encoded file using French variable names continues to work, with warnings.

> http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
> lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
> On 2012/01/01 02:01:11, Keith wrote:
>> I suggest "unsupported file encoding, expected UTF-8"
>
> I think that error message would be appropriate once per run, either
> because of making it a fatal error, or as a summary at the end. But
> when flagging the actual place where there is a problematic character?

My suggestion was to emphasize the encoding as the problem, rather than the character(s) as the problem, even if the error is output at each character for which the encoding makes a difference.
On 2012/01/01 10:16:47, Keith wrote:
> On Sun, 01 Jan 2012 01:56:04 -0800, <dak@gnu.org> wrote:
> > http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
> > lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
> > On 2012/01/01 02:01:11, Keith wrote:
> >> I suggest "unsupported file encoding, expected UTF-8"
> >
> > I think that error message would be appropriate once per run, either
> > because of making it a fatal error, or as a summary at the end. But
> > when flagging the actual place where there is a problematic character?
>
> My suggestion was to emphasize the encoding as the problem, rather than the
> character(s) as the problem, even if the error is output at each character
> for which the encoding makes a difference.

I actually considered doing an initial scan through the whole file (it is read into memory, after all) and refusing it point blank if there are non-UTF-8 parts in it. "non-UTF-8 input" would be a message not requiring mention of either file or characters, and I can change this accordingly. The problem I still see is that either version may just be gibberish to some people. I have no good idea for a better message.
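The initial-scan idea mentioned above could be sketched as follows. This is an illustration only, under the assumption that the whole file is in memory as a string; the function name is invented, and it deliberately uses a coarse lead/continuation check (it would not catch over-long encodings or surrogates) since its job is just to locate the first offending byte for a single "non-UTF-8 input" error:

```cpp
#include <cstddef>
#include <string>

// Scan `buf' once and report the 1-based line and column of the first
// byte that cannot appear in structurally valid UTF-8 at its position:
// a byte that is not ASCII, not a plausible lead byte (0xC2-0xF4), and
// not a continuation byte (0x80-0xBF) where one is expected.
// Returns false if the buffer passes this coarse check.
bool
find_bad_byte (const std::string &buf, std::size_t *line, std::size_t *col)
{
  std::size_t ln = 1, cl = 1;
  std::size_t i = 0;
  while (i < buf.size ())
    {
      unsigned char c = (unsigned char) buf[i];
      std::size_t more = 0;
      if (c < 0x80) more = 0;                    // ASCII
      else if (c >= 0xC2 && c <= 0xDF) more = 1;
      else if (c >= 0xE0 && c <= 0xEF) more = 2;
      else if (c >= 0xF0 && c <= 0xF4) more = 3;
      else { *line = ln; *col = cl; return true; }
      for (std::size_t j = 1; j <= more; j++)
        if (i + j >= buf.size ()
            || ((unsigned char) buf[i + j] & 0xC0) != 0x80)
          { *line = ln; *col = cl; return true; }
      if (c == '\n') { ln++; cl = 1; } else cl++;
      i += more + 1;
    }
  return false;
}
```

A rejection based on such a scan would give exactly one error per file, sidestepping the once-per-character versus once-per-token debate, at the cost of not reporting any further bad bytes.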
On 2012/01/01 10:12:27, dak wrote:
> > Consider a comment in your case/switch statement that points to
> > some reference on the various types of UTF-8 validators.
>
> I don't understand.

Sorry, I wasn't making much sense. As a reader I want to *recognize* what the switch/case is doing rather than having to figure it out. Maybe:

    // Test if these bytes are a UTF-8 encoding of a Unicode character,
    // and warn if not. Trap overly-long UTF-8 encodings, but we don't
    // need to worry about finer details like some filters do.

because your test is almost, but not quite, equivalent to the regex in the back of the Flex manual.
On 2012/01/01 22:06:52, Keith wrote:
> Sorry, I wasn't making much sense. As a reader I want to *recognize* what the
> switch/case is doing rather than having to figure it out. Maybe:
> // Test if these bytes are a UTF-8 encoding of a Unicode character,
> // and warn if not. Trap overly-long UTF-8 encodings, but we don't
> // need to worry about finer details like some filters do.

What are the finer details? UTF-16 surrogates? I was just too lazy to figure out the patterns for them.

> because your test is almost but not quite equivalent to the regex in the back
> of the Flex manual.

I can't find a regex there. Got a link or section name? Is this a released version?