Description: lexer.ll: Warn about non-UTF-8 characters
Making the warnings point to the exact bad byte rather than the
enclosing construct would be nice.
Patch Set 1
Total comments: 8
Patch Set 2: "Better error message locator"
Patch Set 3: If critics followed the same standards they ask of others they would have no time left to complain.
Total messages: 12
LGTM.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll
File lily/lexer.ll (right):

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1009
lily/lexer.ll:1009: case 0xc2:
Wouldn't it be more effective to create an array of 128 bytes (for 0x80-0xFF) which maps `p[i]' to the value of `more' (or 0 if there is more to do)?
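For readers who don't have the patch in front of them, the table suggested above might look something like this minimal sketch. It is not LilyPond code; the function name is invented, and it uses -1 rather than 0 to mark invalid lead bytes, a slight variation on the suggestion:

```cpp
#include <cstdint>

// Sketch of the suggested lookup table: for each byte 0x80-0xFF, give
// the number of continuation bytes a valid UTF-8 lead byte of that
// value requires, or -1 for bytes that cannot start a character
// (continuation bytes 0x80-0xBF, the over-long leads 0xC0/0xC1, and
// 0xF5-0xFF, which would encode beyond U+10FFFF).
// Precondition: c >= 0x80.
int
utf8_trailing (unsigned char c)
{
  static signed char table[128];
  static bool initialized = false;
  if (!initialized)
    {
      for (int i = 0; i < 128; i++)
        {
          unsigned b = 0x80 + i;
          if (b <= 0xBF || b == 0xC0 || b == 0xC1 || b >= 0xF5)
            table[i] = -1;   // invalid as a lead byte
          else if (b <= 0xDF)
            table[i] = 1;    // two-byte sequence
          else if (b <= 0xEF)
            table[i] = 2;    // three-byte sequence
          else
            table[i] = 3;    // four-byte sequence
        }
      initialized = true;
    }
  return table[c - 0x80];
}
```

Whether this beats a switch statement is exactly the codegen question debated later in the thread; the table form mainly buys compactness in the source.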
Works nicely. Showing the input location will probably be very helpful. We probably want to remove the similar message from lily/misc.cc, because both messages together are very noisy.

I wish I could think of a way to check the input with a canned regular expression like <http://flex.sourceforge.net/manual/Identifiers.html#Identifiers>, or better, one with comments: <http://www.w3.org/International/questions/qa-forms-utf-8>. Doing so seems to require backing up (which probably won't cause any harm), or maybe I'm just not seeing an easy way.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll
File lily/lexer.ll (right):

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode134
lily/lexer.ll:134: A [a-zA-Z\200-\377]
Non-ASCII characters are used internally as-read, tested below only to warn the user if the input is invalid as UTF-8.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode220
lily/lexer.ll:220: (void) YYText_utf8 ();
Complaining about comments doesn't help anybody.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
I suggest "unsupported file encoding, expected UTF-8". You could warn once per token, rather than once per character, by following this with break;
On 2012/01/01 02:01:11, Keith wrote:
> Showing the input location will probably be very helpful.

Just to clarify, showing the input location **as this patch does** will probably be helpful to confused users, because they can compare what Lily read with what they intended to write, while learning how to set the file encoding.
LGTM Carl
On 2012/01/01 02:01:11, Keith wrote:
> Works nicely.
>
> Showing the input location will probably be very helpful. We probably want to
> remove the similar message from lily/misc.cc, because both messages together
> are very noisy.

Depends on what the message does. This patch checks exclusively the input to the lexer/parser. There are ways of generating strings for the backend programmatically, however. I have decided to check strings, comments and file names here as well. This means that if you use literal strings as binary containers, or have to encode file names in non-UTF-8 because of other deficiencies in LilyPond, you'll get complaints.

> I wish I could think of a way to check the input with a canned regular
> expression like
> <http://flex.sourceforge.net/manual/Identifiers.html#Identifiers> or better
> one with comments <http://www.w3.org/International/questions/qa-forms-utf-8>
>
> Doing so seems to require backing up (which probably won't cause any harm) or
> maybe I'm just not seeing an easy way.

Our lexer has been written with the decision of using non-compressed tables and without backing up. I spent more than a day's worth of work on doing UTF-8 right in the grammar. That's pretty pointless. It also means that we would need to provide an error path for every item containing non-UTF-8 characters in order to get a UTF-8 related error message instead of something more mysterious. So I don't think it is really worth the trouble.
http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll
File lily/lexer.ll (right):

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode134
lily/lexer.ll:134: A [a-zA-Z\200-\377]
On 2012/01/01 02:01:11, Keith wrote:
> non-ASCII characters are used internally as-read, tested below only to warn
> the user if the input is invalid as UTF-8

You mean, you want a comment here? I actually don't know how comments look in this section of the Flex source.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode220
lily/lexer.ll:220: (void) YYText_utf8 ();
On 2012/01/01 02:01:11, Keith wrote:
> Complaining about comments doesn't help anybody.

There are editors and tools that decide which encoding to use based on an analysis of the whole file. Exceptions inside of comments can make them pick a "safe choice" like displaying every byte in binary. So I don't agree with "doesn't help anybody". Also, if you start with a file that contains non-ASCII non-UTF-8 characters in comments, the editor will not assume that you want to save in UTF-8: all additionally entered characters will be encoded compatibly with the previously detected file encoding. So I don't agree with your statement that basically says that files containing non-UTF-8 characters in comments can't cause trouble. I won't insist on throwing a warning here, but I won't remove it without being sure that this is based on an informed decision.

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1009
lily/lexer.ll:1009: case 0xc2:
On 2011/12/31 18:28:00, lemzwerg wrote:
> Wouldn't it be more effective to create an array of 128 bytes (for 0x80-0xFF)
> which maps `p[i]' to the value of `more' (or 0 if there is more to do)?

How and when to break switch statements down into conditionals and lookup tables, depending on the individual processor's performance characteristics, is a pet peeve of compiler writers. I see little point meddling with that. Have you looked at the generated code?

I'd probably have exploded the case 0x00..0x7f into the switch as well, but did not like the resulting number of lines, and I don't know if clang++ or whatever other compilers we are using can deal with case ranges (like gcc can).

http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
On 2012/01/01 02:01:11, Keith wrote:
> I suggest "unsupported file encoding, expected UTF-8"

I think that error message would be appropriate once per run, either because of making it a fatal error, or as a summary at the end. But when flagging the actual place where there is a problematic character?

> You could warn once per token, rather than once per character, by following
> this with break;

I think once per character is ok (after all, users don't appreciate fixing all the errors of one run, just to get fresh errors on the next). It does not help, however, if the error message points merely to the token (more accurately, the lexing unit, which does not need to result in a token). So the final version to commit should either point straight to the bad character, or give just one message per string. I'll change this to the latter before the final commit if nobody gives a good suggestion for how to make it do the former.
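To make the shape of the check under discussion concrete for readers without the patch, here is a standalone sketch, not LilyPond's actual lexer.ll code; the function name and return convention are invented for illustration. It walks a byte string and reports where the input stops being well-formed UTF-8, with the per-lead-byte special cases (0xC2, 0xE0, 0xED, 0xF0, 0xF4) that the switch in the patch is dealing with:

```cpp
#include <cstddef>

// Return the offset of the first byte at which `p' stops being
// well-formed UTF-8, or -1 if all `len' bytes are valid.  The
// tightened bounds on the second byte reject over-long encodings
// (lead bytes 0xC0/0xC1 are excluded outright; 0xE0/0xF0 require a
// large-enough second byte), UTF-16 surrogates (0xED followed by
// 0xA0-0xBF), and code points above U+10FFFF (0xF4 followed by
// 0x90 or more).
long
first_invalid_utf8 (const unsigned char *p, std::size_t len)
{
  for (std::size_t i = 0; i < len;)
    {
      unsigned char c = p[i];
      std::size_t more = 0;
      unsigned char lo = 0x80, hi = 0xBF;  // bounds for the second byte
      if (c <= 0x7F)
        { i++; continue; }                 // ASCII
      else if (c >= 0xC2 && c <= 0xDF) more = 1;
      else if (c == 0xE0) { more = 2; lo = 0xA0; }
      else if (c >= 0xE1 && c <= 0xEC) more = 2;
      else if (c == 0xED) { more = 2; hi = 0x9F; }  // no surrogates
      else if (c >= 0xEE && c <= 0xEF) more = 2;
      else if (c == 0xF0) { more = 3; lo = 0x90; }
      else if (c >= 0xF1 && c <= 0xF3) more = 3;
      else if (c == 0xF4) { more = 3; hi = 0x8F; }  // stay <= U+10FFFF
      else
        return (long) i;                   // 0x80-0xC1, 0xF5-0xFF
      // The first continuation byte obeys the tightened bounds; any
      // further ones must be in 0x80-0xBF.
      for (std::size_t j = 1; j <= more; j++)
        {
          unsigned char jlo = (j == 1) ? lo : 0x80;
          unsigned char jhi = (j == 1) ? hi : 0xBF;
          if (i + j >= len || p[i + j] < jlo || p[i + j] > jhi)
            return (long) i;
        }
      i += more + 1;
    }
  return -1;
}
```

The bounds here correspond to the constraints that the canned regular expressions mentioned earlier in the thread encode; whether the patch's switch matches them exactly is what the later messages argue about.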
On Sun, 01 Jan 2012 01:40:36 -0800, <dak@gnu.org> wrote:
> Our lexer has been written with the decision of using non-compressed
> tables and without backing up.

Off topic, but maybe interesting to you: I don't think that decision was ever implemented. I don't see any "%option full" or similar that would generate non-compressed tables. For at least a few years lexer.ll did generate backup states; I avoided them as part of the purge of warnings. Also, lexing takes a tiny fraction of LilyPond's execution time. If you ever want to use a method that requires backing up, I think you should.

> I spent more than a day's worth on doing
> utf-8 right in the grammar. That's pretty pointless. It also means
> that we need to provide an error path for every item containing
> non-UTF-8 characters in order to get a UTF-8 related error message
> instead of something more mysterious.

Okay, then. Consider a comment in your case/switch statement that points to some reference on the various types of UTF-8 validators.
On 2012/01/01 10:05:42, Keith wrote:
> On Sun, 01 Jan 2012 01:40:36 -0800, <dak@gnu.org> wrote:
> > Our lexer has been written with the decision of using non-compressed
> > tables and without backing up.
>
> Off topic, but maybe interesting to you: I don't think that decision was ever
> implemented.

I do.

> I don't see any "%option full" or similar that would generate
> non-compressed tables.

stepmake/stepmake/c++-rules.make:

    $(FLEX) -Cfe -p -p -o$@ $<

Look up -Cfe.

> Consider a comment in your case/switch statement that points to some
> reference on the various types of UTF-8 validators.

I don't understand.
On Sun, 01 Jan 2012 01:56:04 -0800, <dak@gnu.org> wrote:
> http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode134
> lily/lexer.ll:134: A [a-zA-Z\200-\377]
> On 2012/01/01 02:01:11, Keith wrote:
>> non-ASCII characters are used internally as-read, tested below only to
>> warn the user if the input is invalid as UTF-8
>
> You mean, you want a comment here?

Nope; I was writing what I understood the code to be doing, so you could complain if the innocent reader's understanding is different from your intent. I confirmed that a Latin1-encoded file using French variable names continues to work, with warnings.

> http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
> lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
> On 2012/01/01 02:01:11, Keith wrote:
>> I suggest "unsupported file encoding, expected UTF-8"
>
> I think that error message would be appropriate once per run, either
> because of making it a fatal error, or as a summary at the end. But
> when flagging the actual place where there is a problematic character?

My suggestion was to emphasize the encoding as the problem, rather than the character(s) as the problem, even if the error is output at each character for which the encoding makes a difference.
On 2012/01/01 10:16:47, Keith wrote:
> On Sun, 01 Jan 2012 01:56:04 -0800, <dak@gnu.org> wrote:
> > http://codereview.appspot.com/5505090/diff/1/lily/lexer.ll#newcode1083
> > lily/lexer.ll:1083: LexerWarning (_ ("non-UTF-8 characters").c_str ());
> > On 2012/01/01 02:01:11, Keith wrote:
> >> I suggest "unsupported file encoding, expected UTF-8"
> >
> > I think that error message would be appropriate once per run, either
> > because of making it a fatal error, or as a summary at the end. But
> > when flagging the actual place where there is a problematic character?
>
> My suggestion was to emphasize the encoding as the problem, rather than the
> character(s) as the problem, even if the error is output at each character
> for which the encoding makes a difference.

I actually considered doing an initial scan through the whole file (it is read into memory, after all) and refusing it point blank if there are non-UTF-8 parts in it. "non-UTF-8 input" would be a message not requiring mention of either file or characters, and I can change this accordingly. The problem I still see is that either version may just be gibberish to some people. I have no good idea for a better message.
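The initial-scan idea mentioned above could be sketched as follows. This is an illustration only, under the assumption that the whole file is in memory as a string; the function name is invented, and it deliberately uses a coarse lead/continuation check (it would not catch over-long encodings or surrogates) since its job is just to locate the first offending byte for a single "non-UTF-8 input" error:

```cpp
#include <cstddef>
#include <string>

// Scan `buf' once and report the 1-based line and column of the first
// byte that cannot appear in structurally valid UTF-8 at its position:
// a byte that is not ASCII, not a plausible lead byte (0xC2-0xF4), and
// not a continuation byte (0x80-0xBF) where one is expected.
// Returns false if the buffer passes this coarse check.
bool
find_bad_byte (const std::string &buf, std::size_t *line, std::size_t *col)
{
  std::size_t ln = 1, cl = 1;
  std::size_t i = 0;
  while (i < buf.size ())
    {
      unsigned char c = (unsigned char) buf[i];
      std::size_t more = 0;
      if (c < 0x80) more = 0;                    // ASCII
      else if (c >= 0xC2 && c <= 0xDF) more = 1;
      else if (c >= 0xE0 && c <= 0xEF) more = 2;
      else if (c >= 0xF0 && c <= 0xF4) more = 3;
      else { *line = ln; *col = cl; return true; }
      for (std::size_t j = 1; j <= more; j++)
        if (i + j >= buf.size ()
            || ((unsigned char) buf[i + j] & 0xC0) != 0x80)
          { *line = ln; *col = cl; return true; }
      if (c == '\n') { ln++; cl = 1; } else cl++;
      i += more + 1;
    }
  return false;
}
```

A rejection based on such a scan would give exactly one error per file, sidestepping the once-per-character versus once-per-token debate, at the cost of not reporting any further bad bytes.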
On 2012/01/01 10:12:27, dak wrote:
> > Consider a comment in your case/switch statement that points to
> > some reference on the various types of UTF-8 validators.
>
> I don't understand.

Sorry, I wasn't making much sense. As a reader I want to *recognize* what the switch/case is doing rather than having to figure it out. Maybe:

    // Test if these bytes are a UTF-8 encoding of a Unicode character,
    // and warn if not. Trap overly-long UTF-8 encodings, but we don't
    // need to worry about finer details like some filters do.

because your test is almost, but not quite, equivalent to the regex in the back of the Flex manual.
On 2012/01/01 22:06:52, Keith wrote:
> Sorry, I wasn't making much sense. As a reader I want to *recognize* what the
> switch/case is doing rather than having to figure it out. Maybe:
> // Test if these bytes are a UTF-8 encoding of a Unicode character,
> // and warn if not. Trap overly-long UTF-8 encodings, but we don't
> // need to worry about finer details like some filters do.

What are the finer details? UTF-16 surrogates? I was just too lazy to figure out the patterns for them.

> because your test is almost but not quite equivalent to the regex in the back
> of the Flex manual.

I can't find a regex there. Got a link or section name? Is this a released version?