|
|
Created:
10 years, 5 months ago by Dan Beam Modified:
10 years, 5 months ago Base URL:
https://chromium.googlesource.com/external/libaddressinput.git@master Visibility:
Public. |
DescriptionMake libaddressinput compile on Windows with MSVS.
Without this I get these errors while trying to integrate into Chrome (and
hence, compiling on Windows):
third_party\libaddressinput\src\cpp\src\region_data_constants.cc(470) :
warning C4566: character represented by universal-character-name u3012 cannot
be represented in the current code page (1252)
R=roubert@google.com
BUG=3
Patch Set 1 #Patch Set 2 : \x #Patch Set 3 : . #
Total comments: 2
Patch Set 4 : \x #Patch Set 5 : stupid vim #
Total comments: 2
Patch Set 6 : . #MessagesTotal messages: 19
i haven't been able to try this (windows build locked down at the moment), but maybe rouslan@ can when he gets in tomorrow (or i'll actually be able to build)
Sign in to reply to this message.
On 2013/11/20 07:46:54, dbeam wrote: > warning C4566: character represented by universal-character-name u3012 cannot > be represented in the current code page (1252) Using the L prefix for these string literals is not the correct way to fix this problem. The L prefix for a string literal denotes that the string literal is of type wchar_t. That is not what these string literals should be, they should all be of type char as you can see from the type declaration std::string (and they should be text in the UTF-8 encoding). C++11 introduces the u8 prefix to explicitly specify that a string literal should be of type char in the UTF-8 encoding, but as this code is supposed to be used also with older compilers, you can't use that method. Instead, I recommend that you investigate how to set this "current code page" for your compiler. It should not be using Windows code page 1252 (really, no-one should), it should be using the UTF-8 encoding. Then update libaddressinput.gyp to set the appropriate compiler flags under the appropriate conditions, to solve the build error. (The Windows code page number for the UTF-8 encoding is 65001, in case you need to specify the encoding in the form of code page number.) If you fail doing this, a possible work-around is to manually specify the UTF-8 encoding of \u00C5 and \u3012 as \xC3\x85 and \xE3\x80\x92 respectively, but that would best be avoided. Absolutely no-one can read string literals like \xC3\x85 or \xE3\x80\x92 and say what they're supposed to mean, so such code is bound to be broken by mistake sooner or later, and is utterly unmaintainable.
Sign in to reply to this message.
On 2013/11/20 08:08:19, roubert wrote: > On 2013/11/20 07:46:54, dbeam wrote: > > > warning C4566: character represented by universal-character-name u3012 cannot > > be represented in the current code page (1252) > > Using the L prefix for these string literals is not the correct way to fix this > problem. > > The L prefix for a string literal denotes that the string literal is of type > wchar_t. That is not what these string literals should be, they should all be of > type char as you can see from the type declaration std::string (and they should > be text in the UTF-8 encoding). > > C++11 introduces the u8 prefix to explicitly specify that a string literal > should be of type char in the UTF-8 encoding, but as this code is supposed to be > used also with older compilers, you can't use that method. Instead, I recommend > that you investigate how to set this "current code page" for your compiler. It > should not be using Windows code page 1252 (really, no-one should), it should be > using the UTF-8 encoding. Then update libaddressinput.gyp to set the appropriate > compiler flags under the appropriate conditions, to solve the build error. ^ i'll try this first. > > (The Windows code page number for the UTF-8 encoding is 65001, in case you need > to specify the encoding in the form of code page number.) > > If you fail doing this, a possible work-around is to manually specify the UTF-8 > encoding of \u00C5 and \u3012 as \xC3\x85 and \xE3\x80\x92 respectively, but > that would best be avoided. Absolutely no-one can read string literals like > \xC3\x85 or \xE3\x80\x92 and say what they're supposed to mean, so such code is > bound to be broken by mistake sooner or later, and is utterly unmaintainable. ^ yeah, that's not so pretty.
Sign in to reply to this message.
On 2013/11/20 08:33:44, dbeam wrote: > On 2013/11/20 08:08:19, roubert wrote: > > On 2013/11/20 07:46:54, dbeam wrote: > > > > > warning C4566: character represented by universal-character-name u3012 > cannot > > > be represented in the current code page (1252) > > > > Using the L prefix for these string literals is not the correct way to fix > this > > problem. > > > > The L prefix for a string literal denotes that the string literal is of type > > wchar_t. That is not what these string literals should be, they should all be > of > > type char as you can see from the type declaration std::string (and they > should > > be text in the UTF-8 encoding). > > > > C++11 introduces the u8 prefix to explicitly specify that a string literal > > should be of type char in the UTF-8 encoding, but as this code is supposed to > be > > used also with older compilers, you can't use that method. Instead, I > recommend > > that you investigate how to set this "current code page" for your compiler. It > > should not be using Windows code page 1252 (really, no-one should), it should > be > > using the UTF-8 encoding. Then update libaddressinput.gyp to set the > appropriate > > compiler flags under the appropriate conditions, to solve the build error. > > ^ i'll try this first. > > > > > (The Windows code page number for the UTF-8 encoding is 65001, in case you > need > > to specify the encoding in the form of code page number.) > > > > If you fail doing this, a possible work-around is to manually specify the > UTF-8 > > encoding of \u00C5 and \u3012 as \xC3\x85 and \xE3\x80\x92 respectively, but > > that would best be avoided. Absolutely no-one can read string literals like > > \xC3\x85 or \xE3\x80\x92 and say what they're supposed to mean, so such code > is > > bound to be broken by mistake sooner or later, and is utterly unmaintainable. > > ^ yeah, that's not so pretty. I suggest that you try the literal string from http://i18napis.appspot.com/address/data/AX instead: "%O%n%N%n%A%nAX-%Z %C%nÅLAND"
Sign in to reply to this message.
For Japan, the format from http://i18napis.appspot.com/address/data/JP is "〒%Z%n%S%C%n%A%n%O%n%N".
Sign in to reply to this message.
On 2013/11/20 19:12:10, Rouslan Solomakhin wrote: > I suggest that you try the literal string from > http://i18napis.appspot.com/address/data/AX instead: "%O%n%N%n%A%nAX-%Z > %C%nÅLAND" That is an excellent suggestion, but it could fail silently if the compiler chooses to compile "Å" to anything else but \xC3\x85. (A compiler that uses Windows code page 1252 internally would encode it as \xC5 instead of \xC3\x85.) So adding a unit test case to verify that the compiler got the encoding right would be a very good idea.
Sign in to reply to this message.
You're right. Visual Studio compiler silently fails to parse the literals. We can specify '#pramga execution_character_set("utf-8")', but this will work only with characters with at most 2 code points. (〒 is 3 code points.) I propose that we copy this function from Chrome: "std::string WideToUTF(const std::wstring& input)" and use L"" string literals in data for "AX" and "JP" region codes. Here are the original files that we would need to copy: - WideToUTF8(): https://code.google.com/p/chromium/codesearch#chromium/src/base/strings/utf_s... https://code.google.com/p/chromium/codesearch#chromium/src/base/strings/utf_s... - ReadUnicodeCharacter(), WriteUnicodeCharacter(), IsValidCodePoint(): https://code.google.com/p/chromium/codesearch#chromium/src/base/strings/utf_s... https://code.google.com/p/chromium/codesearch#chromium/src/base/strings/utf_s... - UTF utilities used in the conversion: https://code.google.com/p/chromium/codesearch#chromium/src/base/third_party/i... https://code.google.com/p/chromium/codesearch#chromium/src/base/third_party/i...
Sign in to reply to this message.
On 2013/11/21 20:08:24, Rouslan Solomakhin wrote: > (〒 is 3 code points.) Just to not make this more confusing than it already is: 〒 is one codepoint (U+3012) but three bytes (E3 80 92) in the UTF-8 encoding. > I propose that > we copy this function from Chrome: "std::string WideToUTF(const std::wstring& > input)" and use L"" string literals in data for "AX" and "JP" region codes. I vote against that. That's far too much unnecessary overhead just to work around one buggy compiler. If the compiler truly is unable to handle this (I have a hard time to believe this in the year 2013), then using hex-escaped UTF-8 in the source code (which is going to be autogenerated anyway) is a better workaround.
Sign in to reply to this message.
https://codereview.appspot.com/29520043/diff/40001/cpp/src/region_data_consta... File cpp/src/region_data_constants.cc (right): https://codereview.appspot.com/29520043/diff/40001/cpp/src/region_data_consta... cpp/src/region_data_constants.cc:94: "\"fmt\":\"%O%n%N%n%A%nAX-%Z %C%n\xC3\x85LAND\"" Please add an end of line comment explaining what \x bytes are.
Sign in to reply to this message.
On 2013/11/21 21:35:23, roubert wrote: > On 2013/11/21 20:08:24, Rouslan Solomakhin wrote: > > > (〒 is 3 code points.) > > Just to not make this more confusing than it already is: 〒 is one codepoint > (U+3012) but three bytes (E3 80 92) in the UTF-8 encoding. > > > I propose that > > we copy this function from Chrome: "std::string WideToUTF(const std::wstring& > > input)" and use L"" string literals in data for "AX" and "JP" region codes. > > I vote against that. That's far too much unnecessary overhead just to work > around one buggy compiler. If the compiler truly is unable to handle this (I > have a hard time to believe this in the year 2013), then using hex-escaped UTF-8 > in the source code (which is going to be autogenerated anyway) is a better > workaround. Done.
Sign in to reply to this message.
https://codereview.appspot.com/29520043/diff/40001/cpp/src/region_data_consta... File cpp/src/region_data_constants.cc (right): https://codereview.appspot.com/29520043/diff/40001/cpp/src/region_data_consta... cpp/src/region_data_constants.cc:94: "\"fmt\":\"%O%n%N%n%A%nAX-%Z %C%n\xC3\x85LAND\"" On 2013/11/21 21:56:14, Rouslan Solomakhin wrote: > Please add an end of line comment explaining what \x bytes are. Done.
Sign in to reply to this message.
https://codereview.appspot.com/29520043/diff/100001/cpp/src/region_data_const... File cpp/src/region_data_constants.cc (right): https://codereview.appspot.com/29520043/diff/100001/cpp/src/region_data_const... cpp/src/region_data_constants.cc:94: "\"fmt\":\"%O%n%N%n%A%nAX-%Z %C%n\xC3\x85LAND\"" // \xC3\x85 is Å. Two spaces before // please.
Sign in to reply to this message.
https://codereview.appspot.com/29520043/diff/100001/cpp/src/region_data_const... File cpp/src/region_data_constants.cc (right): https://codereview.appspot.com/29520043/diff/100001/cpp/src/region_data_const... cpp/src/region_data_constants.cc:94: "\"fmt\":\"%O%n%N%n%A%nAX-%Z %C%n\xC3\x85LAND\"" // \xC3\x85 is Å. On 2013/11/21 22:16:17, Rouslan Solomakhin wrote: > Two spaces before // please. Done.
Sign in to reply to this message.
Fredrik: I've verified that this compiles in Windows. I can't think of a unit test to verify that it's working. Does anything come to mind for you?
Sign in to reply to this message.
On 2013/11/21 22:20:41, Rouslan Solomakhin wrote: > Fredrik: I've verified that this compiles in Windows. I can't think of a unit > test to verify that it's working. Does anything come to mind for you? Nah, if the UTF-8 encoding is hard-coded, then I can't think of any useful test.
Sign in to reply to this message.
LGTM
Sign in to reply to this message.
Sign in to reply to this message.
On 2013/11/22 18:49:27, Rouslan Solomakhin wrote: > Committed: https://code.google.com/p/libaddressinput/source/detail?r=166 Thanks!
Sign in to reply to this message.
On 2013/11/22 20:59:07, dbeam wrote: > On 2013/11/22 18:49:27, Rouslan Solomakhin wrote: > > Committed: https://code.google.com/p/libaddressinput/source/detail?r=166 > > Thanks! Dan: Can you please close this CL? Only you have the power.
Sign in to reply to this message.
|