Issue 2362041: email5 bytes parsing patch

Can't Edit
Can't Publish+Mail
Start Review

Created:
14 years, 10 months ago by r.david.murray

Modified:
14 years, 9 months ago

Reviewers:
Antoine Pitrou, barry

Base URL:
http://svn.python.org/view/*checkout*/python/branches/py3k/

Visibility:
Public.

Patch Set 1 #

Total comments: 26

Created: 14 years, 10 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Stats (+515 lines, -78 lines)			Patch
M	Doc/library/email.rst	View	2 chunks	+33 lines, -1 line	0 comments	Download
M	Doc/library/email.generator.rst	View	2 chunks	+19 lines, -0 lines	0 comments	Download
M	Doc/library/email.message.rst	View	2 chunks	+15 lines, -3 lines	0 comments	Download
M	Doc/library/email.parser.rst	View	3 chunks	+23 lines, -4 lines	2 comments	Download
M	Lib/email/__init__.py	View	2 chunks	+9 lines, -1 line	0 comments	Download
M	Lib/email/generator.py	View	12 chunks	+126 lines, -43 lines	8 comments	Download
M	Lib/email/message.py	View	6 chunks	+72 lines, -25 lines	11 comments	Download
M	Lib/email/parser.py	View	1 chunk	+11 lines, -0 lines	0 comments	Download
M	Lib/email/test/test_email.py	View	3 chunks	+207 lines, -1 line	5 comments	Download

Messages

Total messages: 4

Expand All Messages | Collapse All Messages

Antoine Pitrou

This is a quick review. http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py File Lib/email/message.py (right): http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py#newcode29 Lib/email/message.py:29: has_surrogates = re.compile( Do ...

14 years, 10 months ago (2010-10-03 00:28:19 UTC) #1

barry

http://codereview.appspot.com/2362041/diff/1/Doc/library/email.parser.rst File Doc/library/email.parser.rst (right): http://codereview.appspot.com/2362041/diff/1/Doc/library/email.parser.rst#newcode81 Doc/library/email.parser.rst:81: error handler; this data will be recognized as binary ...

14 years, 10 months ago (2010-10-04 17:34:35 UTC) #2

r.david.murray

14 years, 9 months ago (2010-10-06 01:20:39 UTC) #3

http://codereview.appspot.com/2362041/diff/1/Doc/library/email.parser.rst
File Doc/library/email.parser.rst (right):

http://codereview.appspot.com/2362041/diff/1/Doc/library/email.parser.rst#new...
Doc/library/email.parser.rst:81: error handler; this data will be recognized as
binary data by the model.
On 2010/10/04 17:34:35, barry wrote:
> Do you want to add a "new in python 3.2" note for this and other changes to
> email 5.1?

Yes.  Though in this case I'm going to make a BytesFeedParser instead of
documenting surrogateescape.

http://codereview.appspot.com/2362041/diff/1/Lib/email/generator.py
File Lib/email/generator.py (right):

http://codereview.appspot.com/2362041/diff/1/Lib/email/generator.py#newcode153
Lib/email/generator.py:153:
self.write(v.encode(maxlinelen=self._maxheaderlen)+'\n')
On 2010/10/04 17:34:35, barry wrote:
> It might be useful to explain in a comment why '\n' is used here (shouldn't it
> be global NL in any case?) instead of self._NL.  IIUC, the reason is that
> headers must be strings but payload can be bytes (hence the parameterization
for
> the BytesGenerator subclass).
> 
> Since it's not obvious from reading the code, a few comments would be helpful
to
> other readers.

I added a block comment at the start of the class explaining the class constants
and how write/_fp.write are used.  The actual distinction is whether we are
working with already-written data or are writing new data.  Making that
distinction leads to the smallest change set for the code, but *not* the
clearest code.  This should be rewritten for clarity in email6 :)

http://codereview.appspot.com/2362041/diff/1/Lib/email/generator.py#newcode214
Lib/email/generator.py:214: self.write(msg.preamble + '\n')
On 2010/10/04 17:34:35, barry wrote:
> s/'\n'/NL/g

Yes, did that.

http://codereview.appspot.com/2362041/diff/1/Lib/email/generator.py#newcode284
Lib/email/generator.py:284: # This used to be a a module level function, we use
a classmethod for
On 2010/10/04 17:34:35, barry wrote:
> Why is it a classmethod now?  (I wanted the comment to explain that, but it
> actually doesn't ;)

So that the module level _make_boundary = Generator._make_boundary works and
provides the backward compatibility.  Added that the comment.

http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py
File Lib/email/message.py (right):

http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py#newcode29
Lib/email/message.py:29: has_surrogates = re.compile(
On 2010/10/03 00:28:19, Antoine Pitrou wrote:
> Do you want to make it public? Or would it be better as _has_surrogates?

Yes, it would be.

http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py#newcode30
Lib/email/message.py:30:
'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search
On 2010/10/03 00:28:19, Antoine Pitrou wrote:
> I think I don't understand the last part of the regex:
> ([^\udc00-\udfff]|\Z)
> 
> It shouldn't matter what comes after the lone surrogate, although it does no
> harm (if there's another surrogate just after, it will be matched inside of
the
> first one).
> 
> Also, in the context of surrogateescape, you can narrow the middle search to:
> [\udc80-\udcff]

I have very little understanding of what this regex does, I pieced it together
from things that you and someone else said on IRC.  Please suggest a correct
regex and I will use it.

http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py#newcode43
Lib/email/message.py:43: return original_bytes.decode('ascii',
'replace').replace('�', '?')
On 2010/10/04 17:34:35, barry wrote:
> What's that black-diamond-surrounding-? look like in the original code?  Looks
> like a non-ascii literal to me.  If it's a Rietveld artifact fine, but if not
> please use the \x escape instead.

Right, you can't tell that it's literally the character used by the replace
error handler :)  Fixed.

http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py#newcode238
Lib/email/message.py:238: payload =
bpayload.decode(str(self.get_param('charset', 'ascii')), 'replace')
On 2010/10/03 00:28:19, Antoine Pitrou wrote:
> Is the str() call superfluous?

Ah yes, you pointed out that I'd forgotten to remove that before and I forgot to
fix it again.

http://codereview.appspot.com/2362041/diff/1/Lib/email/test/test_email.py
File Lib/email/test/test_email.py (right):

http://codereview.appspot.com/2362041/diff/1/Lib/email/test/test_email.py#new...
Lib/email/test/test_email.py:2719: self.assertEqual(msg.get_payload(),
"pöstal\n")
On 2010/10/03 00:28:19, Antoine Pitrou wrote:
> Don't you want to test get_payload(decode=True) as well?
> (same in the other test methods)

Yes, you are correct.

http://codereview.appspot.com/2362041/diff/1/Lib/email/test/test_email.py#new...
Lib/email/test/test_email.py:2719: self.assertEqual(msg.get_payload(),
"pöstal\n")
On 2010/10/03 00:28:19, Antoine Pitrou wrote:
> Don't you want to test get_payload(decode=True) as well?
> (same in the other test methods)

As far as the code is concerned they are mostly redundant, but they might as
well be there for completeness.

http://codereview.appspot.com/2362041/diff/1/Lib/email/test/test_email.py#new...
Lib/email/test/test_email.py:2870: lines = out.getvalue().split(b'\n')
On 2010/10/03 00:28:19, Antoine Pitrou wrote:
> Just a question, is it normal to get '\n' rather than '\r\n'?

Yes.  This is one of those things email6 is going to fix :)

Currently other python code (ex: smtplib) "fixes" the linendings when writing to
the wire (which is has to do for data produced naively by user code), so email
produces \n.  There's an open issue calling for documenting this design
decision; but  email6 will provide a way to produce \r\n directly from a
Generator when desired.  (There are a couple of open issues about this, by the
way.)

Antoine Pitrou

14 years, 9 months ago (2010-10-06 15:41:23 UTC) #4

http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py
File Lib/email/message.py (right):

http://codereview.appspot.com/2362041/diff/1/Lib/email/message.py#newcode30
Lib/email/message.py:30:
'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search
On 2010/10/06 01:20:39, r.david.murray wrote:
> I have very little understanding of what this regex does, I pieced it together
> from things that you and someone else said on IRC.  Please suggest a correct
> regex and I will use it.

I think the following would be enough:

([^\ud800-\udbff]|\A)[\udc80-\udcff]

Expand All Messages | Collapse All Messages