|
|
Created:
16 years, 6 months ago by GvR Modified:
15 years, 6 months ago Base URL:
http://svn.python.org/view/*checkout*/python/branches/py3k/ Visibility:
Public. |
Descriptionhttp://bugs.python.org/issue3300
Patch Set 1 #Patch Set 2 : parse.py.patch8 #Patch Set 3 : patch 9 #Patch Set 4 : patch 10 #
Downloaded from: http://bugs.python.org/file11111/parse.py.patch10
MessagesTotal messages: 17
Hi Matt, Here's a code review of your patch. I'm leaning more and more towards wanting this for 3.0, but I have some API design issues and also some style nits. I'm cross-linking this with the Python tracker issue, through the subject. http://codereview.appspot.com/2827/diff/1/2 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/1/2#newcode198 Line 198: replaced by a placeholder character. I don't think that's a good default. I'd rather see it default to strict -- that's what encoding translates to everywhere else. I believe that lenient error handling by default can cause subtle security problems too, by hiding problem characters from validation code. http://codereview.appspot.com/2827/diff/1/2#newcode215 Line 215: An alias for :func:`quote`, intended for use with a :class:`bytes` object I'd rather see this as a wrapper that raises TypeError if the argument isn't a bytes or bytearray instance. Otherwise it's needless redundancy. http://codereview.appspot.com/2827/diff/1/2#newcode223 Line 223: Replace ``%xx`` escapes by their single-character equivalent. Should add what the argument type is -- I vote for str or bytes/bytearray. http://codereview.appspot.com/2827/diff/1/2#newcode242 Line 242: .. function:: unquote_to_bytes(string) Again, add what the argument type is. http://codereview.appspot.com/2827/diff/1/4 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/1/4#newcode224 Line 224: except: An unqualified except clause is unacceptable here. Why do you need this anyway? http://codereview.appspot.com/2827/diff/1/5 File Lib/test/test_http_cookiejar.py (right): http://codereview.appspot.com/2827/diff/1/5#newcode1450 Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5", I'm guessing this test broke otherwise? Given that this references an RFC, is it correct to just fix it this way? http://codereview.appspot.com/2827/diff/1/3 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/1/3#newcode10 Line 10: "urlsplit", "urlunsplit"] Please add all the quote/unquote versions here too. (They were there in 2.5, but somehow got dropped from 3.0. http://codereview.appspot.com/2827/diff/1/3#newcode265 Line 265: # Maps lowercase and uppercase variants (but not mixed case). That sounds like a disaster. Why would %aa and %AA be correct but not %aA and %Aa? (Even though the old code had the same problem.) http://codereview.appspot.com/2827/diff/1/3#newcode283 Line 283: def unquote(s, encoding = "utf-8", errors = "replace"): Please no spaces around the '=' when used for an argument default (or for a keyword arg). Also see my comment about defaulting to 'replace' in the doc file. Finally -- let's be consistent about quotes. It seems most of this file uses single quotes, so lets stick to that (except docstrings always use double quotes). And more: what should a None value for encoding or errors mean? IMO it should mean "use the default". http://codereview.appspot.com/2827/diff/1/3#newcode382 Line 382: safe = safe.encode('ascii', 'ignore') Using errors='ignore' seems like a mistake -- it will hide errors. I also wonder why safe should be limited to ASCII though. http://codereview.appspot.com/2827/diff/1/3#newcode399 Line 399: if ' ' in s: This test means that it won't work if the input is bytes. E.g. urllib.parse.quote_plus(b"abc def") raises a TypeError.
Sign in to reply to this message.
Hi Guido, Thanks very much for this very detailed review. I've replied to the comments. I will make the changes as described below and send a new patch to the tracker. http://codereview.appspot.com/2827/diff/1/2 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/1/2#newcode198 Line 198: replaced by a placeholder character. Okay. I did errors='replace' because URLs often come from users (eg. typed into a web browser) so it's most likely we're just going to want to ignore errors. I have no strong opinion on this though. http://codereview.appspot.com/2827/diff/1/2#newcode215 Line 215: An alias for :func:`quote`, intended for use with a :class:`bytes` object Yup, OK. I posted exactly that to the mailing list awhile back. I'll put that in the next patch. http://codereview.appspot.com/2827/diff/1/2#newcode223 Line 223: Replace ``%xx`` escapes by their single-character equivalent. Currently I think unquote only accepts a str. Should it also accept a bytes/bytearray? http://codereview.appspot.com/2827/diff/1/4 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/1/4#newcode224 Line 224: except: I *think* it's to catch the case where charset=None (the default). It's an awful way to do it, I agree. There are no test cases for this function, so it's hard to tell either way. It's difficult to tell how to handle this. I think the string should be encoded using charset (this would be true with or without this patch - as evidenced by line 301 which decodes with charset). But I'm not sure what to do with charset=None. Currently it defaults to UTF-8. For the next patch, I will just change it to use an if statement rather than a try/except, but it will have this same functionality. http://codereview.appspot.com/2827/diff/1/5 File Lib/test/test_http_cookiejar.py (right): http://codereview.appspot.com/2827/diff/1/5#newcode1450 Line 1450: "%3c%3c%0Anew%C3%A5/%C3%A5", As far as I can tell, it's any arbitrary URL allowed here. The RFC says nothing about encoding. Therefore, I assume an implementation can decide to encode a string one way or another, as long as it results in a valid URL which is used as the part of the cookie identifier. http://codereview.appspot.com/2827/diff/1/3 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/1/3#newcode10 Line 10: "urlsplit", "urlunsplit"] OK. http://codereview.appspot.com/2827/diff/1/3#newcode265 Line 265: # Maps lowercase and uppercase variants (but not mixed case). Yes, I know!! I was going to suggest this as a separate fix, since it's a separate bug. But since you bring it up, I'll correct it in this patch. http://codereview.appspot.com/2827/diff/1/3#newcode283 Line 283: def unquote(s, encoding = "utf-8", errors = "replace"): All noted. http://codereview.appspot.com/2827/diff/1/3#newcode382 Line 382: safe = safe.encode('ascii', 'ignore') > Using errors='ignore' seems like a mistake -- it will hide errors. All this means is that non-ASCII characters get dropped from the "safe" set. I can see the issue with this though - however the change below means this will probably go away. > I also wonder why safe should be limited to ASCII though. This was a bit of a problem. I thought it was a conceptual/fundamendal issue, but now that I think about it more, it's just a technical problem. The first thing quote does is call s.encode(). Therefore, from that point on, we can only deal in bytes, and can't distinguish one character from another. I think this needs to change; that will mean rather than encoding right away, encoding is done in Quoter.__call__ on a per-character basis. This could mess with the ability to accept either bytes or str. I'll give it a go. http://codereview.appspot.com/2827/diff/1/3#newcode399 Line 399: if ' ' in s: Good point.
Sign in to reply to this message.
http://codereview.appspot.com/2827/diff/1/4 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/1/4#newcode224 Line 224: except: On 2008/08/07 12:01:31, mgiuca wrote: > But I'm not sure what to do with > charset=None. Currently it defaults to UTF-8. From a brief look at RFC 2231, charset=None should imply ascii because the return string is not decorated with any charset, and email headers by default contain only ascii values (RFC 2231 is designed to circumvent that). So you probably can remove the try/except and replace it with: s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii') http://codereview.appspot.com/2827/diff/1/8 File Lib/test/test_urllib.py (right): http://codereview.appspot.com/2827/diff/1/8#newcode435 Line 435: result = urllib.parse.quote(given, encoding="latin-1") Wouldn't it be better to throw a TypeError when an encoding is given for a bytes string?
Sign in to reply to this message.
parse.py.patch8
Sign in to reply to this message.
http://codereview.appspot.com/2827/diff/17/28 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/17/28#newcode275 Line 275: except KeyError: If this is supposed to catch non-hex characters, it should be ValueError because it's what bytes.fromhex() raises, not KeyError. http://codereview.appspot.com/2827/diff/17/28#newcode335 Line 335: """safe: May be either a string or bytes object.""" This is contradicted by the fact that you expect iteration on safe to yield numbers, not strings.
Sign in to reply to this message.
New code review. Also please address Antoine's comments. http://codereview.appspot.com/2827/diff/1/2 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/1/2#newcode198 Line 198: replaced by a placeholder character. On 2008/08/07 12:01:31, mgiuca wrote: > Okay. I did errors='replace' because URLs often come from users (eg. typed into > a web browser) so it's most likely we're just going to want to ignore errors. > > I have no strong opinion on this though. I do. :-) Like everywhere else, we should default to strict, and let the application change the setting if they have a use for a more lenient setting. Fail-fast is Pythonic behavior. (It's bad enough that unrecognized % escapes don't raise an exception, but that's always been like that. I doubt if it's useful though.) http://codereview.appspot.com/2827/diff/1/2#newcode223 Line 223: Replace ``%xx`` escapes by their single-character equivalent. On 2008/08/07 12:01:31, mgiuca wrote: > Currently I think unquote only accepts a str. Should it also accept a > bytes/bytearray? Since the encoded form is ASCII, and there is uniform confusion about the input and output types possible anyway, I think so. http://codereview.appspot.com/2827/diff/1/8 File Lib/test/test_urllib.py (right): http://codereview.appspot.com/2827/diff/1/8#newcode435 Line 435: result = urllib.parse.quote(given, encoding="latin-1") On 2008/08/07 12:49:11, Antoine Pitrou wrote: > Wouldn't it be better to throw a TypeError when an encoding is given for a bytes > string? > (Please reply.) http://codereview.appspot.com/2827/diff/17/29 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/17/29#newcode222 Line 222: s = urllib.parse.quote(s, safe='', encoding=charset) Antoine's suggestion was to use encoding=(charset or 'ascii') http://codereview.appspot.com/2827/diff/17/28 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/17/28#newcode275 Line 275: except KeyError: On 2008/08/07 17:29:40, Antoine Pitrou wrote: > If this is supposed to catch non-hex characters, it should be ValueError because > it's what bytes.fromhex() raises, not KeyError. > Right, the KeyError was from the previous incarnation which used a dict. Please fix.
Sign in to reply to this message.
Oh dear, it seems I never published my comments. I made all of these before I submitted patch 8. Very sorry! http://codereview.appspot.com/2827/diff/1/2 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/1/2#newcode198 Line 198: replaced by a placeholder character. On 2008/08/07 12:01:31, mgiuca wrote: > Okay. I did errors='replace' because URLs often come from users (eg. typed into > a web browser) so it's most likely we're just going to want to ignore errors. > > I have no strong opinion on this though. Done. http://codereview.appspot.com/2827/diff/1/2#newcode215 Line 215: An alias for :func:`quote`, intended for use with a :class:`bytes` object On 2008/08/07 12:01:31, mgiuca wrote: > Yup, OK. I posted exactly that to the mailing list awhile back. I'll put that in > the next patch. Done. http://codereview.appspot.com/2827/diff/1/2#newcode242 Line 242: .. function:: unquote_to_bytes(string) On 2008/08/06 21:39:34, GvR wrote: > Again, add what the argument type is. Done. http://codereview.appspot.com/2827/diff/1/4 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/1/4#newcode224 Line 224: except: If it's going to be ASCII, might as well be UTF-8, I think. In that case, I've removed the try/except block and just pass charset (quote can now handle charset=None). If you really think it should be ASCII-only, then your code-wise suggestion is good. http://codereview.appspot.com/2827/diff/1/8 File Lib/test/test_urllib.py (right): http://codereview.appspot.com/2827/diff/1/8#newcode435 Line 435: result = urllib.parse.quote(given, encoding="latin-1") I think you're right, it probably should throw an error. Not fixed in upcoming patch (8), since I'm expecting changes to quote anyway. http://codereview.appspot.com/2827/diff/1/3 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/1/3#newcode10 Line 10: "urlsplit", "urlunsplit"] On 2008/08/07 12:01:31, mgiuca wrote: > OK. Done. http://codereview.appspot.com/2827/diff/1/3#newcode265 Line 265: # Maps lowercase and uppercase variants (but not mixed case). On 2008/08/07 12:01:31, mgiuca wrote: > Yes, I know!! I was going to suggest this as a separate fix, since it's a > separate bug. But since you bring it up, I'll correct it in this patch. Done. http://codereview.appspot.com/2827/diff/1/3#newcode283 Line 283: def unquote(s, encoding = "utf-8", errors = "replace"): On 2008/08/07 12:01:31, mgiuca wrote: > All noted. Done. http://codereview.appspot.com/2827/diff/1/3#newcode382 Line 382: safe = safe.encode('ascii', 'ignore') Oh, I remembered the reasoning behind it (as explained in test_urllib.py) - if we allow non-ASCII characters to be escaped, then we allow quote to generate invalid URIs (URIs are only allowed to have ASCII characters). It's one thing for unquote to accept such URIs, but I think we shouldn't be producing them. Albeit, it only produces an invalid URI if you explicitly request it. So I'm happy to make the change to allow any character to be safe, but I'll let it go to discussion first. http://codereview.appspot.com/2827/diff/1/3#newcode399 Line 399: if ' ' in s: On 2008/08/07 12:01:31, mgiuca wrote: > Good point. Done. http://codereview.appspot.com/2827/diff/17/29 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/17/29#newcode222 Line 222: s = urllib.parse.quote(s, safe='', encoding=charset) encoding=charset falls back to UTF-8, which covers ASCII. I'm happy to explicitly fall back to ASCII though.
Sign in to reply to this message.
http://codereview.appspot.com/2827/diff/17/28 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/17/28#newcode275 Line 275: except KeyError: Good catch. No test case caught it either! Will amend that.
Sign in to reply to this message.
http://codereview.appspot.com/2827/diff/1/2 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/1/2#newcode223 Line 223: Replace ``%xx`` escapes by their single-character equivalent. I disagree further now. If you read my reply here: http://bugs.python.org/msg70958 While it's hazy what the output of unquote/input of quote should be, it's VERY clear that the input of unquote/output of quote (a URI) should be a str. Allowing a bytes as input is just perpetuating a misunderstanding that a URI is a byte sequence. Also then you have to make another abitrary decision about how to encode the bytes to a str (ascii? latin-1? utf-8? use the encoding parameter?)
Sign in to reply to this message.
http://codereview.appspot.com/2827/diff/17/28 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/17/28#newcode275 Line 275: except KeyError: On 2008/08/10 05:53:52, mgiuca wrote: > Good catch. No test case caught it either! Will amend that. Done. http://codereview.appspot.com/2827/diff/17/28#newcode335 Line 335: """safe: May be either a string or bytes object.""" On 2008/08/07 17:29:40, Antoine Pitrou wrote: > This is contradicted by the fact that you expect iteration on safe to yield > numbers, not strings. > Done.
Sign in to reply to this message.
patch 9
Sign in to reply to this message.
Here's finally my code review. Most stuff was already discussed in the tracker however. http://codereview.appspot.com/2827/diff/82/83 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/82/83#newcode185 Line 185: .. function:: quote(string[, safe[, encoding[, errors]]]) Note that the argument is called 's' in the code. It's preferable to keep the names synced, since they *can* be used as keyword args. I also wonder if we shouldn't use the new '*' feature to force encoding and errors to be keyword args in the call syntax. http://codereview.appspot.com/2827/diff/82/83#newcode219 Line 219: Example: ``quote_from_bytes(b'/El Ni\xc3\xb1o/')`` yields I wonder if it wouldn't be better to use a Latin-1 example here, making it absolutely clear that there is no UTF-8 requirement. http://codereview.appspot.com/2827/diff/82/83#newcode220 Line 220: ``'/El%20Ni%C3%B1o/'``. Perhaps it's useful to explain that quote(s, enc, err) == quote_from_bytes(s.encode(enc, err)), with some handwaving about default values? (I think this would benefit the implementation too...) http://codereview.appspot.com/2827/diff/82/85 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/82/85#newcode277 Line 277: s = urllib.parse.unquote(s, encoding="latin-1") Can you ask Barry to comment on this? http://codereview.appspot.com/2827/diff/82/84 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/82/84#newcode265 Line 265: def unquote_to_bytes(s): Based on Bill's latest email quoting Erik van der Poel, I think we'll also need to have a unquote_bytes_to_bytes() API, because IE apparently can send unquoted non-ASCII characters, and assuming these represent UTF8 is not correct. Servers wishing to support this must extract the URL as bytes and then deal with them as bytes-to-bytes. http://codereview.appspot.com/2827/diff/82/84#newcode269 Line 269: res = s.split('%') If you make this res = s.encode('utf-8').split() then you won't have to encode each piece separately. It also makes is easier to move the encoding into a separate function. http://codereview.appspot.com/2827/diff/82/84#newcode275 Line 275: res[i] = bytes.fromhex(item[:2]) + item[2:].encode('utf-8') If you are following my advice above, you have a minor problem here. Since you can't call bytes.fromhex() with bytes input, you would have to use bytes(int(item[:2], 16)) -- fortunately int() *does* support bytes input. http://codereview.appspot.com/2827/diff/82/84#newcode292 Line 292: # pct_sequence: contiguous sequence of percent-encoded bytes, decoded I wonder if it wouldn't be better to replace the implementation by something like return unquote_to_bytes(s).encode(encoding, errors) http://codereview.appspot.com/2827/diff/82/84#newcode335 Line 335: class Quoter: As discussed in the tracker, please replace this with something based on defaultdict. http://codereview.appspot.com/2827/diff/82/84#newcode381 Line 381: errors='strict' (unsupported characters raise a UnicodeEncodeError). strict -> replace. http://codereview.appspot.com/2827/diff/82/84#newcode388 Line 388: s = s.encode(encoding or 'utf-8', errors or 'strict') strict -> replace. Also, I'm not super comfortable with interpreting '' as the default. IMO these should just error out, like they do in other places, as the invalid encoding / errors value they are. http://codereview.appspot.com/2827/diff/82/84#newcode390 Line 390: if encoding: if encoding is not None: http://codereview.appspot.com/2827/diff/82/84#newcode392 Line 392: if errors: if errors is not None: http://codereview.appspot.com/2827/diff/82/84#newcode394 Line 394: try: I'd replace the rest with a call to quote_from_bytes(). http://codereview.appspot.com/2827/diff/82/84#newcode413 Line 413: def quote_from_bytes(s, safe='/'): Needs a docstring. http://codereview.appspot.com/2827/diff/82/84#newcode426 Line 426: return ''.join(res) I'd collapse the latter two lines into return ''.join(map(quoter, s)) especially since map() returns an iterator, not a list!
Sign in to reply to this message.
BTW, I also had a look at Bill's patch. The one good idea I hope you can steal from him is to have some "work-horse" functions that implement the basic algorithms using bytes (in most cases) and without defaults, and then to simply define quote, quote_plus, unquote, unquote_plus in terms of those combined with an encode() or decode() call. This looks pretty clean.
Sign in to reply to this message.
Hi guys, thanks for these detailed reviews. Here are my comments to accompany patch 10. (Sorry Guido, I just submitted patch 10 to the tracker again; I think me creating a new Rietveld issue at this stage would just make things more confusing). Matt http://codereview.appspot.com/2827/diff/82/83 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/82/83#newcode185 Line 185: .. function:: quote(string[, safe[, encoding[, errors]]]) OK I will change the argument to "string" (in the code). We could force them to be kwargs, but note that the string.encode method forces them as positional args (you can't supply keywords), so this would be directly the opposite of str.encode (and I think you are wanting them to be the same interface). http://codereview.appspot.com/2827/diff/82/83#newcode214 Line 214: .. function:: quote_from_bytes(bytes[, safe]) Note that here (in the code), I didn't change the arg name to "bytes" - just "b" (to avoid conflict with global bytes). Could hack around it, but it seemed more trouble than it was worth. http://codereview.appspot.com/2827/diff/82/83#newcode219 Line 219: Example: ``quote_from_bytes(b'/El Ni\xc3\xb1o/')`` yields Do you mean just for the quote_from_bytes example? (Leaving the quote example as utf-8 to show the default in action)? Perhaps this example should be some random byte string instead, to show that it isn't in fact characters at all. (Done). http://codereview.appspot.com/2827/diff/82/83#newcode220 Line 220: ``'/El%20Ni%C3%B1o/'``. Yes. (Done). http://codereview.appspot.com/2827/diff/82/85 File Lib/email/utils.py (right): http://codereview.appspot.com/2827/diff/82/85#newcode277 Line 277: s = urllib.parse.unquote(s, encoding="latin-1") Well I wasn't *entirely* sure what was happening here, but I think this string eventually gets used by collapse_rfc2231_value as raw bytes. By adding the encoding="latin-1", this should behave *precisely* as it did before the 3300 patch. So I shouldn't be breaking anything. I sent Barry an email asking for clarification. http://codereview.appspot.com/2827/diff/82/84 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/82/84#newcode265 Line 265: def unquote_to_bytes(s): Hm. OK we can support this by simply allowing unquote_to_bytes to accept a bytes OR a str, as we were originally talking about doing for unquote. I still don't think it makes sense for unquote to accept a bytes (since you then need to DEcode those bytes somehow, since it's bytes->str), but it does make more sense for unquote_to_bytes to be bytes->bytes, and it's a very easy change. Note that this is a compelling reason to use unquote, since it preserves unquoted Unicode characters without this issue ever coming up. (Done). http://codereview.appspot.com/2827/diff/82/84#newcode269 Line 269: res = s.split('%') On 2008/08/13 17:50:14, GvR wrote: > If you make this > > res = s.encode('utf-8').split() > > then you won't have to encode each piece separately. It also makes is easier to > move the encoding into a separate function. Done. http://codereview.appspot.com/2827/diff/82/84#newcode275 Line 275: res[i] = bytes.fromhex(item[:2]) + item[2:].encode('utf-8') Has to be: bytes([int(item[:2], 16)]) Done. http://codereview.appspot.com/2827/diff/82/84#newcode280 Line 280: def unquote(s, encoding='utf-8', errors='strict'): errors='replace' now, on both unquote and unquote_plus, as discussed on tracker. ("bogus data is better than an exception") http://codereview.appspot.com/2827/diff/82/84#newcode292 Line 292: # pct_sequence: contiguous sequence of percent-encoded bytes, decoded Ah yes, I originally had something like that, but this makes it impossible to get BOTH a) allow arbitrary unescaped non-ASCII characters to be unchanged, AND b) allow non-UTF-8 encodings. (The reason being that if you unquote_to_bytes, it encodes the unescaped characters in UTF-8, and un-percent-encodes the percent-encoded characters in "encoding" - so there is a mismatch unless encoding="utf-8". If you encode unescaped characters with "encoding" instead, you can't represent characters outside of encoding's repertoire). I originally had what you suggested, but realised this and refactored to what is admittedly more complicated code. Is there a simpler way to achieve both the above goals? (This implementation is very careful to never encode the unescaped characters, so there's no chance of them getting mangled). http://codereview.appspot.com/2827/diff/82/84#newcode335 Line 335: class Quoter: On 2008/08/13 17:50:14, GvR wrote: > As discussed in the tracker, please replace this with something based on > defaultdict. Done. (See tracker for speed tests). http://codereview.appspot.com/2827/diff/82/84#newcode381 Line 381: errors='strict' (unsupported characters raise a UnicodeEncodeError). > strict -> replace. You mean to put this on unquote, right? Not quote. (As discussed on the tracker). That's done. Leaving this as strict. http://codereview.appspot.com/2827/diff/82/84#newcode388 Line 388: s = s.encode(encoding or 'utf-8', errors or 'strict') On 2008/08/13 17:50:14, GvR wrote: > strict -> replace. Leaving strict. > Also, I'm not super comfortable with interpreting '' as the default. IMO these > should just error out, like they do in other places, as the invalid encoding / > errors value they are. Done (only None causes default behaviour). http://codereview.appspot.com/2827/diff/82/84#newcode390 Line 390: if encoding: On 2008/08/13 17:50:14, GvR wrote: > if encoding is not None: Done. http://codereview.appspot.com/2827/diff/82/84#newcode392 Line 392: if errors: On 2008/08/13 17:50:14, GvR wrote: > if errors is not None: Done. http://codereview.appspot.com/2827/diff/82/84#newcode394 Line 394: try: On 2008/08/13 17:50:14, GvR wrote: > I'd replace the rest with a call to quote_from_bytes(). Done. http://codereview.appspot.com/2827/diff/82/84#newcode413 Line 413: def quote_from_bytes(s, safe='/'): On 2008/08/13 17:50:14, GvR wrote: > Needs a docstring. Done. http://codereview.appspot.com/2827/diff/82/84#newcode426 Line 426: return ''.join(res) On 2008/08/13 17:50:14, GvR wrote: > I'd collapse the latter two lines into > > return ''.join(map(quoter, s)) > > especially since map() returns an iterator, not a list! Done.
Sign in to reply to this message.
patch 10
Sign in to reply to this message.
Hey Matt, I have a bunch of tiny nits, but there's no need for you to address these; I'll fix these as I check it in. Have you submitted a contributor form to the PSF yet? If not, please do so, using these instructions: http://python.org/psf/contrib/ http://codereview.appspot.com/2827/diff/82/83 File Doc/library/urllib.parse.rst (right): http://codereview.appspot.com/2827/diff/82/83#newcode185 Line 185: .. function:: quote(string[, safe[, encoding[, errors]]]) On 2008/08/14 15:29:00, mgiuca wrote: > OK I will change the argument to "string" (in the code). > > We could force them to be kwargs, but note that the string.encode method forces > them as positional args (you can't supply keywords), so this would be directly > the opposite of str.encode (and I think you are wanting them to be the same > interface). OK, leave them as positional. http://codereview.appspot.com/2827/diff/82/83#newcode219 Line 219: Example: ``quote_from_bytes(b'/El Ni\xc3\xb1o/')`` yields On 2008/08/14 15:29:00, mgiuca wrote: > Do you mean just for the quote_from_bytes example? (Leaving the quote example as > utf-8 to show the default in action)? Perhaps this example should be some random > byte string instead, to show that it isn't in fact characters at all. > > (Done). Perfect. http://codereview.appspot.com/2827/diff/79/103 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/79/103#newcode277 Line 277: if item == b'': raise ValueError You don't need this line. int(b'', 16) will raise ValueError for you, at no extra cost. ;-) http://codereview.appspot.com/2827/diff/79/103#newcode302 Line 302: if item == '': raise ValueError if not item: ... http://codereview.appspot.com/2827/diff/79/103#newcode307 Line 307: if len(rest) == 0: if not rest: http://codereview.appspot.com/2827/diff/79/103#newcode317 Line 317: if len(pct_sequence) > 0: if pct_sequence: http://codereview.appspot.com/2827/diff/79/103#newcode319 Line 319: # res[-1] will always be empty if pct_sequence != [] Express this with an assert? E.g. assert not res[-1] http://codereview.appspot.com/2827/diff/79/103#newcode332 Line 332: always_safe = frozenset( Now that this is really a constant, why not name it with a leading underscore and all caps? I.e. _ALWAYS_SAFE http://codereview.appspot.com/2827/diff/79/103#newcode340 Line 340: A mapping from bytes (in range(0,256)) to strings. I'd prefer to stick to the PEP-8 docstring formatting rules. http://codereview.appspot.com/2827/diff/79/103#newcode381 Line 381: s and safe may be either str or bytes objects. encoding must not be s -> string here and in the next line http://codereview.appspot.com/2827/diff/79/103#newcode416 Line 416: not perform string-to-bytes encoding. It always returns an ASCII string. http://codereview.appspot.com/2827/diff/79/103#newcode430 Line 430: return ''.join(quoter[c] for c in bs) return ''.join(map(quoter.__getitem__, bs))
Sign in to reply to this message.
Submitted as r65838, with these changes. http://codereview.appspot.com/2827/diff/79/103 File Lib/urllib/parse.py (right): http://codereview.appspot.com/2827/diff/79/103#newcode277 Line 277: if item == b'': raise ValueError On 2008/08/18 20:20:32, GvR wrote: > You don't need this line. int(b'', 16) will raise ValueError for you, at no > extra cost. ;-) Done. http://codereview.appspot.com/2827/diff/79/103#newcode302 Line 302: if item == '': raise ValueError On 2008/08/18 20:20:32, GvR wrote: > if not item: ... Done. http://codereview.appspot.com/2827/diff/79/103#newcode307 Line 307: if len(rest) == 0: On 2008/08/18 20:20:32, GvR wrote: > if not rest: Done. http://codereview.appspot.com/2827/diff/79/103#newcode317 Line 317: if len(pct_sequence) > 0: On 2008/08/18 20:20:32, GvR wrote: > if pct_sequence: Done. http://codereview.appspot.com/2827/diff/79/103#newcode319 Line 319: # res[-1] will always be empty if pct_sequence != [] On 2008/08/18 20:20:32, GvR wrote: > Express this with an assert? E.g. > > assert not res[-1] Done. http://codereview.appspot.com/2827/diff/79/103#newcode332 Line 332: always_safe = frozenset( On 2008/08/18 20:20:32, GvR wrote: > Now that this is really a constant, why not name it with a leading underscore > and all caps? I.e. _ALWAYS_SAFE Done. http://codereview.appspot.com/2827/diff/79/103#newcode340 Line 340: A mapping from bytes (in range(0,256)) to strings. On 2008/08/18 20:20:32, GvR wrote: > I'd prefer to stick to the PEP-8 docstring formatting rules. Done. http://codereview.appspot.com/2827/diff/79/103#newcode381 Line 381: s and safe may be either str or bytes objects. encoding must not be On 2008/08/18 20:20:32, GvR wrote: > s -> string here and in the next line Done. http://codereview.appspot.com/2827/diff/79/103#newcode416 Line 416: not perform string-to-bytes encoding. On 2008/08/18 20:20:32, GvR wrote: > It always returns an ASCII string. Done. http://codereview.appspot.com/2827/diff/79/103#newcode430 Line 430: return ''.join(quoter[c] for c in bs) On 2008/08/18 20:20:32, GvR wrote: > return ''.join(map(quoter.__getitem__, bs)) Done.
Sign in to reply to this message.
|