Created: 13 years, 10 months ago by vineethrp
Modified: 13 years, 10 months ago
Reviewers: Mike Schwartz
CC: gs-team_google.com, boto-users+noreply_googlegroups.com
Visibility: Public.
Description: This changeset adds chunked transfer support to boto, along with an API for uploading a data stream to the cloud.
A new API, set_contents_from_stream(), is added. This method accepts an fp that is a stream, which is usually not seekable and whose total size is not known beforehand. As of now, Google Storage and OpenStack Swift support chunked transfer. The method checks for chunked transfer support and then sends the stream to the cloud with 'Transfer-Encoding: chunked'.
The test suite is also updated to test chunked transfer.
Patch Set 1 #
Total comments: 29
Patch Set 2: Minor fix #
Patch Set 3: New patch set with changes for code-review comments #
Total comments: 2
Patch Set 4: Remove set_contents_from_stream from GS Key class #

Messages
Total messages: 8
Hi all,

This is a review request for chunked transfer support for boto. There are requirements where we need to back up data streams to the cloud, where the stream size cannot be calculated beforehand. Google Storage and OpenStack Swift currently support chunked transfer. S3 doesn't support it as of now, but hopefully will. This changeset adds a new API to transfer a data stream to the cloud if the provider supports chunked transfer. More details are in the description section of the issue.

Please let me know your comments on the same!

Thanks,
Vineeth
Hi - it looks like your review request only includes one file. If you have any newly added files you'll need to do something like this to make them show up in the review: pick the most current github revision ID, then use it in the upload command:

~/bin/upload.py -i 4515170 --rev=8afe641774385ddd5cd0949ac748ff2645864799
Hi Mike,

There are two patch sets in the review. The second patch set is a minor fix that I missed during the initial pass. It is showing both patch sets for me; patch set 1 contains all the major changes. Can you please recheck and confirm?

Thanks,
Vineeth

On Wed, Jun 1, 2011 at 3:52 PM, <mfschwartz@google.com> wrote:
> Hi - it looks like your review request only includes one file. If you
> have any newly added files you'll need to do something like this to make
> them show up in the review:
>
> Pick the most current github revision ID, then use it in the upload
> command:
> ~/bin/upload.py -i 4515170
> --rev=8afe641774385ddd5cd0949ac748ff2645864799
>
> http://codereview.appspot.com/4515170/

--
Cheers,
~Vineeth
## "Its not the load that breaks you, but the way u carry it!" ##
Thanks for the CL, Vineeth. Did you run all the boto tests (including the resumable upload test and also the S3 tests) on the code with your mods in place?

http://codereview.appspot.com/4515170/diff/1/boto/connection.py
File boto/connection.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/connection.py#newcode145
boto/connection.py:145: # chunked Transfer-Encoding should act only on PUT request.
You're silently ignoring what looks to me like a caller error. Why not raise an exception in this case instead of just removing the offending header?

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py
File boto/gs/key.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py#newcode110
boto/gs/key.py:110: def set_contents_from_stream(self, fp, headers=None, replace=True,
I don't understand why you have this method defined both in the gs Key subclass and in the parent class. It looks like the parent implementation handles everything and the current method isn't needed.

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py#newcode111
boto/gs/key.py:111: cb=None, num_cb=10, policy=None, query_args=None):
Can you please wrap at 80 chars? (throughout)

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py#newcode114
boto/gs/key.py:114: Google Storage and the contents of the data stream pointed to by 'fp' as
The "in Google Storage" part is implicit in the fact that this is a method in the gs Key class.

http://codereview.appspot.com/4515170/diff/1/boto/provider.py
File boto/provider.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/provider.py#newcode209
boto/provider.py:209: if self.name == 'aws':
Is AWS really the only one that doesn't support chunked transfer encoding? It seems safer to use a whitelist than a blacklist approach - i.e., check if the provider is one of the ones that you know does support it. (I'm guessing there are other providers that use boto that we don't even know about.)

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py
File boto/s3/key.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode458
boto/s3/key.py:458: m = md5()
You could move this under the 'if chunked_transfer' block (line 463), since it's only used for that case (which I think would then make it clearer why you need it -- compute md5 on the fly for streams).

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode464
boto/s3/key.py:464: fp.seek(0)
Should this be "if not chunked_transfer"? You can't seek on a stream, and also with this mod you'll never seek for regular files.

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode475
boto/s3/key.py:475: # For chunked Transfer, if size is not known, we call
Did you mean "For chunked transfer, size is not known so we call..."? As it's written it sounds like there's a case where the size might be known (and the code below doesn't have any such logic).

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode489
boto/s3/key.py:489: http_conn.send('%x;\r\n' %len(l))
Could you please add a space after the second "%"?

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode519
boto/s3/key.py:519: response.getheader('location') and seekable:
Do you need parens? I think the logic is:

if ((response.status == 500 or response.status == 503 or
     response.getheader('location')) and seekable):

(and with parens you can drop the line-end backslash)

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode520
boto/s3/key.py:520: # we'll try again if fp is seekable
I suggest dropping the "if fp is seekable" part, since the actual condition is more involved and is clearly expressed in the "if" statement above.

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode595
boto/s3/key.py:595: The stream object is usually not seekable.
Is it ever?

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode596
It strikes me that there's an additional advantage and disadvantage of streaming uploads, which might be worth mentioning in this comment block:

Advantage: no long delay at the start of uploading while computing the MD5, as happens with non-streaming uploads (I've seen it be hours in the case of multi-TB uploads).

Disadvantage: there's no way to make an integrity check to determine whether an uploaded file changed during the upload.

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode636
boto/s3/key.py:636: raise BotoClientError("Provider does not support chunked transfer")
How about: "%s does not support chunked transfer" % provider.get_provider_name()

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode640
boto/s3/key.py:640: raise BotoClientError("Cannot determine the name of the stream")
How about "Cannot determine the destination object name for the given stream"?
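[Editor's note] For readers following the boto/s3/key.py:489 comment, the bytes on the wire are HTTP/1.1 chunked transfer-encoding frames, with the MD5 computed incrementally because a stream's total size is unknown up front. The sketch below is a standalone illustration of that framing, not the patch's actual code (which writes each frame straight to the live connection); the function name `chunk_stream` is hypothetical:

```python
from hashlib import md5

def chunk_stream(fp, chunk_size=8192):
    """Frame a stream as HTTP/1.1 chunked transfer-encoding.

    Each chunk is its size in hex, CRLF, the data, CRLF; a final
    zero-length chunk terminates the body.  (The ';' mirrors the
    patch's '%x;\r\n' format string.)  The MD5 is updated on the
    fly since streams cannot be pre-hashed.
    """
    digest = md5()
    frames = []
    data = fp.read(chunk_size)
    while data:
        digest.update(data)
        frames.append(b'%x;\r\n' % len(data) + data + b'\r\n')
        data = fp.read(chunk_size)
    frames.append(b'0\r\n\r\n')  # terminating zero-length chunk
    return b''.join(frames), digest.hexdigest()
```

For example, an 11-byte body sent with a 4-byte chunk size produces three data frames plus the terminator.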
Hi Mike,

I have incorporated all the comments from you and tested the mods. I have done the following tests:

# python setup.py test -s tests.s3.test_connection
# python setup.py test -s tests.s3.test_gsconnection
# python tests/s3/test_resumable_downloads.py
# python tests/s3/test_resumable_uploads.py

I am unable to do the rest of the tests as I do not have the MFA device. The new changes are in patch set 3.

Thanks,
Vineeth

http://codereview.appspot.com/4515170/diff/1/boto/connection.py
File boto/connection.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/connection.py#newcode145
boto/connection.py:145: # chunked Transfer-Encoding should act only on PUT request.
Actually, this logic is in place keeping in mind that users can also specify 'Transfer-Encoding: chunked' using '-h'. When they do so, gsutil sometimes does multiple requests (for example POST) before actually doing the PUT, so I wanted to eliminate this header for every request except PUT. If we decide not to let the user specify this option, then I can change this to raise an exception.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> You're silently ignoring what looks to me like a caller error.
> Why not raise an exception in this case instead of just removing the offending
> header?

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py
File boto/gs/key.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py#newcode110
boto/gs/key.py:110: def set_contents_from_stream(self, fp, headers=None, replace=True,
I wanted this function to be a placeholder for now. You had mentioned earlier that there is a way to do resumable uploads for streams. Once we have that logic in boto, we would need this method in the gs Key class to support it. I shall remove this if you feel it is redundant.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> I don't understand why you have this method defined both in the gs Key subclass
> and in the parent class. It looks like the parent implementation handles
> everything and the current method isn't needed.

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py#newcode111
boto/gs/key.py:111: cb=None, num_cb=10, policy=None, query_args=None):
Done.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> can you please wrap at 80 chars?
> (throughout)

http://codereview.appspot.com/4515170/diff/1/boto/gs/key.py#newcode114
boto/gs/key.py:114: Google Storage and the contents of the data stream pointed to by 'fp' as
Changed.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> the "in Google Storage" part is implicit in the fact that this is a method in
> the gs Key class.

http://codereview.appspot.com/4515170/diff/1/boto/provider.py
File boto/provider.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/provider.py#newcode209
boto/provider.py:209: if self.name == 'aws':
I have changed the logic here to a more scalable one.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> is AWS really the only one that doesn't support chunked transfer encoding?
>
> It seems safer to use a whitelist than a blacklist approach - i.e., check if the
> provider is one of the ones that you know does support it.
>
> (I'm guessing there are other providers that use boto that we don't even know
> about)

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py
File boto/s3/key.py (right):

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode458
boto/s3/key.py:458: m = md5()
Done.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> you could move this under the 'if chunked_transfer' block (line 463), since it's
> only used for that case (which I think would then make it clearer why you need
> it -- compute md5 on the fly for streams)

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode464
boto/s3/key.py:464: fp.seek(0)
Changed.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> Should this be if not chunked_transfer? You can't seek on a stream, and also
> with this mod you'll never seek for regular files

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode475
boto/s3/key.py:475: # For chunked Transfer, if size is not known, we call
The comments are reminiscent of my older design, where users could specify chunked transfer for files as well. Changed the comment now.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> Did you mean "For chunked transfer, size is not known so we call..." ?
> As it's written it sounds like there's a case where the size might be known
> (and the code below doesn't have any such logic)

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode489
boto/s3/key.py:489: http_conn.send('%x;\r\n' %len(l))
Done.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> could you please add a space after the second "%" ?

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode519
boto/s3/key.py:519: response.getheader('location') and seekable:
Thanks for noting this. Changed.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> Do you need parens? I think the logic is:
>
> if ((response.status == 500 or response.status == 503 or
>      response.getheader('location')) and seekable):
>
> (and with parens you can drop the line-end backslash)

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode520
boto/s3/key.py:520: # we'll try again if fp is seekable
Done.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> I suggest dropping the "if fp is seekable" part, since the actual condition is
> more involved and is clearly expressed in the "if" statement above.

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode595
boto/s3/key.py:595: The stream object is usually not seekable.
Again, the comment is a leftover of my old design. Changed the comment.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> is it ever?

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode636
boto/s3/key.py:636: raise BotoClientError("Provider does not support chunked transfer")
Done.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> How about:
>
> "%s does not support chunked transfer" % provider.get_provider_name()

http://codereview.appspot.com/4515170/diff/1/boto/s3/key.py#newcode640
boto/s3/key.py:640: raise BotoClientError("Cannot determine the name of the stream")
Done.

On 2011/06/02 04:53:28, Mike Schwartz wrote:
> How about "Cannot determine the destination object name for the given stream"
http://codereview.appspot.com/4515170/diff/5002/boto/connection.py
File boto/connection.py (right):

http://codereview.appspot.com/4515170/diff/5002/boto/connection.py#newcode144
boto/connection.py:144: self.headers = headers.copy()
Is it possible that headers is None at this point? There's no default value...

http://codereview.appspot.com/4515170/diff/5002/boto/gs/key.py
File boto/gs/key.py (right):

http://codereview.appspot.com/4515170/diff/5002/boto/gs/key.py#newcode111
boto/gs/key.py:111: cb=None, num_cb=10, policy=None, query_args=None):
Were you going to remove this copy of the code, since the parent class already handles it?
Hi Mike,

Removed set_contents_from_stream() from the GS Key class. I had gone through the code briefly to figure out whether HTTPRequest ever gets None for headers, but could not find any such case. But as you suggested, it would be good to add the check, since future callers might pass in None headers, so I have added the check as well.

Patch Set 4 contains the two changes mentioned above. Please let me know!

Many Thanks,
Vineeth

On 2011/06/03 01:15:51, Mike Schwartz wrote:
> http://codereview.appspot.com/4515170/diff/5002/boto/connection.py
> File boto/connection.py (right):
>
> http://codereview.appspot.com/4515170/diff/5002/boto/connection.py#newcode144
> boto/connection.py:144: self.headers = headers.copy()
> Is it possible that headers is None at this point? There's no default value...
>
> http://codereview.appspot.com/4515170/diff/5002/boto/gs/key.py
> File boto/gs/key.py (right):
>
> http://codereview.appspot.com/4515170/diff/5002/boto/gs/key.py#newcode111
> boto/gs/key.py:111: cb=None, num_cb=10, policy=None, query_args=None):
> Were you going to remove this copy of the code, since the parent class already
> handles it?
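[Editor's note] The headers-None guard discussed for boto/connection.py:144 amounts to a one-line defensive copy. The class below is a reduced sketch (the real HTTPRequest constructor takes more arguments than shown here):

```python
class HTTPRequest(object):
    """Reduced sketch of the constructor pattern under review."""

    def __init__(self, method, path, headers=None):
        self.method = method
        self.path = path
        # Copy so that later mutation (e.g. stripping Transfer-Encoding
        # on non-PUT requests) never aliases the caller's dict, and
        # tolerate headers=None from future callers.
        self.headers = headers.copy() if headers is not None else {}
```

The copy matters because the connection code mutates self.headers; without it, removing the chunked Transfer-Encoding header would also silently modify the dict the caller passed in.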