|
|
Created:
6 years, 4 months ago by onager Modified:
6 years, 3 months ago Reviewers:
Joachim Metz CC:
kiddi, log2timeline-dev_googlegroups.com Visibility:
Public. |
Description[dfvfs] Support multiple-member gzip files
Patch Set 1 #
Total comments: 26
Patch Set 2 : Changes after review #
Total comments: 14
Patch Set 3 : Changes after review #Patch Set 4 : Changes after review #
Total comments: 30
Patch Set 5 : Changes after review #Patch Set 6 : Cache decompressor state to improve sequential read perf #
Total comments: 25
Patch Set 7 : Changed after review. #
Total comments: 24
Patch Set 8 : Changes after review #Patch Set 9 : Changes after review #
Total comments: 7
Patch Set 10 : Changes after review #
MessagesTotal messages: 29
Some initial comments, I need to have a closer look at the member caching and offset calculations. https://codereview.appspot.com/331420043/diff/1/dfvfs/compression/zlib_decomp... File dfvfs/compression/zlib_decompressor.py (right): https://codereview.appspot.com/331420043/diff/1/dfvfs/compression/zlib_decomp... dfvfs/compression/zlib_decompressor.py:34: - white line https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/compressed_stre... File dfvfs/file_io/compressed_stream_io.py (right): https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/compressed_stre... dfvfs/file_io/compressed_stream_io.py:179: print(exception) please remove this debugging catch and print https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:17: """Implementation of a single gzip member. "Implementation of a single gzip member" => "Gzip member" https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:20: comment (str|None): comment stored in the member trailing dot IMHO we could leave out None. "(str|None)" => "(str)" https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:103: def GetCacheSize(self): What about splitting the concern of a cache value and that of a gzip member? Also why have the cache in the member not in the file? https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:127: def read(self, offset, size=None): Let's not redefine a read() in a different way. Either use: "def read(self, size=None):" Or rename the method https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:288: The gzip file format is defined in RFC1952: http://www.zlib.org/rfc-gzip.html + Attributes: uncompressed_data_size https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:309: """The original filenames stored in the gzip file.""" list[str]: original filenames stored in the gzip file. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:314: """The modification times stored in the gzip file.""" + type https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:319: """The operating system values stored in the gzip file.""" + type https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:324: """The comments in the gzip file.""" + type https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:344: member.uncompressed_data_size: Don't use a trailing \ instead use bounding (). if (offset < member.uncompressed_data_offset + member.uncompressed_data_size): https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:356: IOError: if the seek failed or the file has not been opened.. - trailing dot
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/1/dfvfs/compression/zlib_decomp... File dfvfs/compression/zlib_decompressor.py (right): https://codereview.appspot.com/331420043/diff/1/dfvfs/compression/zlib_decomp... dfvfs/compression/zlib_decompressor.py:34: On 2017/12/08 07:04:02, Joachim Metz wrote: > - white line Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/compressed_stre... File dfvfs/file_io/compressed_stream_io.py (right): https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/compressed_stre... dfvfs/file_io/compressed_stream_io.py:179: print(exception) On 2017/12/08 07:04:02, Joachim Metz wrote: > please remove this debugging catch and print Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:17: """Implementation of a single gzip member. On 2017/12/08 07:04:03, Joachim Metz wrote: > "Implementation of a single gzip member" => "Gzip member" Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:20: comment (str|None): comment stored in the member On 2017/12/08 07:04:02, Joachim Metz wrote: > trailing dot > > IMHO we could leave out None. > > "(str|None)" => "(str)" > Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:103: def GetCacheSize(self): On 2017/12/08 07:04:02, Joachim Metz wrote: > What about splitting the concern of a cache value and that of a gzip member? > > Also why have the cache in the member not in the file? Cache is in the member because all members need to be read during the gzip file open. I'm not sure that implementing a separate cache object is going to make this simpler, but I'll keep thinking about it. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:127: def read(self, offset, size=None): On 2017/12/08 07:04:02, Joachim Metz wrote: > Let's not redefine a read() in a different way. > > Either use: > "def read(self, size=None):" > > Or rename the method Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:288: The gzip file format is defined in RFC1952: http://www.zlib.org/rfc-gzip.html On 2017/12/08 07:04:02, Joachim Metz wrote: > + Attributes: > uncompressed_data_size Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:309: """The original filenames stored in the gzip file.""" On 2017/12/08 07:04:03, Joachim Metz wrote: > list[str]: original filenames stored in the gzip file. Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:314: """The modification times stored in the gzip file.""" On 2017/12/08 07:04:02, Joachim Metz wrote: > + type Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:319: """The operating system values stored in the gzip file.""" On 2017/12/08 07:04:03, Joachim Metz wrote: > + type Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:324: """The comments in the gzip file.""" On 2017/12/08 07:04:02, Joachim Metz wrote: > + type Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:344: member.uncompressed_data_size: On 2017/12/08 07:04:02, Joachim Metz wrote: > Don't use a trailing \ instead use bounding (). > > if (offset < member.uncompressed_data_offset + > member.uncompressed_data_size): Done. https://codereview.appspot.com/331420043/diff/1/dfvfs/file_io/gzip_file_io.py... dfvfs/file_io/gzip_file_io.py:356: IOError: if the seek failed or the file has not been opened.. On 2017/12/08 07:04:02, Joachim Metz wrote: > - trailing dot Done.
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
Please also issue a PR for this change. https://codereview.appspot.com/331420043/diff/20001/dfvfs/compression/zlib_de... File dfvfs/compression/zlib_decompressor.py (right): https://codereview.appspot.com/331420043/diff/20001/dfvfs/compression/zlib_de... dfvfs/compression/zlib_decompressor.py:32: """Data passed in past the end of the compressed data.""" start docstring with type, "bytes: " https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:17: """Single gzip member. "Single" IMHO add no additional information. https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:97: # Initialize the member with data add a trailing dot (for consistency) or move this into method _InitializeData() or equiv to make the comment obsolete https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:114: """Convenience method to check whether the uncompressed data cache is full. "Convenience method to check" => "Checks" https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:144: IOError: if the read failed. + ValueError https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:221: """Reads the file header. file => member https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:340: uncompressed data. + 4 space indent
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/20001/dfvfs/compression/zlib_de... File dfvfs/compression/zlib_decompressor.py (right): https://codereview.appspot.com/331420043/diff/20001/dfvfs/compression/zlib_de... dfvfs/compression/zlib_decompressor.py:32: """Data passed in past the end of the compressed data.""" On 2017/12/25 09:15:03, Joachim Metz wrote: > start docstring with type, "bytes: " Done. https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:17: """Single gzip member. On 2017/12/25 09:15:03, Joachim Metz wrote: > "Single" IMHO add no additional information. Acknowledged. https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:97: # Initialize the member with data On 2017/12/25 09:15:04, Joachim Metz wrote: > add a trailing dot (for consistency) or move this into method _InitializeData() > or equiv to make the comment obsolete Done. https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:114: """Convenience method to check whether the uncompressed data cache is full. On 2017/12/25 09:15:04, Joachim Metz wrote: > "Convenience method to check" => "Checks" Done. https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:144: IOError: if the read failed. On 2017/12/25 09:15:04, Joachim Metz wrote: > + ValueError Done. https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:221: """Reads the file header. On 2017/12/25 09:15:04, Joachim Metz wrote: > file => member Done. https://codereview.appspot.com/331420043/diff/20001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:340: uncompressed data. On 2017/12/25 09:15:04, Joachim Metz wrote: > + 4 space indent Done.
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/60001/dfvfs/compression/zlib_de... File dfvfs/compression/zlib_decompressor.py (right): https://codereview.appspot.com/331420043/diff/60001/dfvfs/compression/zlib_de... dfvfs/compression/zlib_decompressor.py:32: """bytes: Data passed in past the end of the compressed data.""" consistency nit: Data -> data maybe also reword to: bytes: data past the end of the compressed data https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:63: """Initializes a file-like object. => Initializes a gzip member. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:67: the beginning of the Gzip Member. consistency nit: Gzip Member => gzip member the the => the also explain why "positioned to the the beginning of the Gzip Member" is important or why not pass the offset of the member instead and make the class api rely less on the state of the file-like object? https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:91: self.member_start_offset = file_object.tell() consistency nit: tell() => get_offset() https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:104: """Gets the size of the uncompressed data cache. Gets => Retrieves or Determines https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:158: if not size: how should a size of 0 be handled? https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:159: return self._cache[cache_offset:] Shouldn't self._cache_start_offset be updated? https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:177: minimum_offset (init): offset into this member's uncompressed data at (init) => (int) https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:179: read_all_data (bool): whether to read all the compressed data from the => True if all the compressed data should be read from the member. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:182: # We always need to start from the beginning of the compressed data. move this into the docstring and maybe say why you need to do this. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:187: while not (self.IsCacheFull() and not read_all_data): => while not self.IsCacheFull() or read_all_data: also why not calculate the "to cache" offset first, read and ignore the data upto that point, then read the remaining data into the cache? https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:212: # If there's no more data in the member, the unused_data value is maybe mention this in the docstring that this method will always read the data in increments of of deflate compressed blocks https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:283: self.member_end_offset = file_object.tell() consistency nit: tell() => get_offset() https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:348: if (offset < member.uncompressed_data_offset + you could make "member.uncompressed_data_offset + member.uncompressed_data_size" the key and _members a dict so you only have to calculate it once https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:462: while self._gzip_file_object.tell() < self._gzip_file_object.get_size(): note: keep this tell() seeing its a different type of object
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
I've addressed the comments, but as I was doing so, I realized there's a potential optimization for sequential reads on files with large gzip members. I'm refactoring this now, so please stand by. https://codereview.appspot.com/331420043/diff/60001/dfvfs/compression/zlib_de... File dfvfs/compression/zlib_decompressor.py (right): https://codereview.appspot.com/331420043/diff/60001/dfvfs/compression/zlib_de... dfvfs/compression/zlib_decompressor.py:32: """bytes: Data passed in past the end of the compressed data.""" On 2017/12/27 20:06:35, Joachim Metz wrote: > consistency nit: Data -> data > > maybe also reword to: > bytes: data past the end of the compressed data Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:63: """Initializes a file-like object. On 2017/12/27 20:06:36, Joachim Metz wrote: > => Initializes a gzip member. Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:67: the beginning of the Gzip Member. On 2017/12/27 20:06:36, Joachim Metz wrote: > consistency nit: Gzip Member => gzip member > > the the => the > > also explain why "positioned to the the beginning of the Gzip Member" is > important > > or why not pass the offset of the member instead and make the class api rely > less on the state of the file-like object? There's no index, so the only way to find the offsets for the start of each of the members is to read each one sequentially. That's why this member class exists, to encapsulate the reading through a single member in a gzip file with 1 or more members. I've added some more information to the class docstring, but the argument description here is to allow someone to make a valid call to the method, so is the wrong place for a discussion of why things are implemented in this way. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:91: self.member_start_offset = file_object.tell() On 2017/12/27 20:06:36, Joachim Metz wrote: > consistency nit: tell() => get_offset() Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:104: """Gets the size of the uncompressed data cache. On 2017/12/27 20:06:36, Joachim Metz wrote: > Gets => Retrieves or Determines Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:158: if not size: On 2017/12/27 20:06:35, Joachim Metz wrote: > how should a size of 0 be handled? Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:159: return self._cache[cache_offset:] On 2017/12/27 20:06:35, Joachim Metz wrote: > Shouldn't self._cache_start_offset be updated? No, the cache isn't changing here. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:177: minimum_offset (init): offset into this member's uncompressed data at On 2017/12/27 20:06:36, Joachim Metz wrote: > (init) => (int) Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:179: read_all_data (bool): whether to read all the compressed data from the On 2017/12/27 20:06:36, Joachim Metz wrote: > => True if all the compressed data should be read from the member. Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:182: # We always need to start from the beginning of the compressed data. On 2017/12/27 20:06:36, Joachim Metz wrote: > move this into the docstring and maybe say why you need to do this. Changed the comment here, but this is just to clarify why the seek is here, I don't think it's necessary to add that to the docstring. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:187: while not (self.IsCacheFull() and not read_all_data): On 2017/12/27 20:06:36, Joachim Metz wrote: > => while not self.IsCacheFull() or read_all_data: > > also why not calculate the "to cache" offset first, read and ignore the data > upto that point, then read the remaining data into the cache? Not sure what you mean - minimum_offset is the "to cache" offset, and this method does in fact read and discard data up to that point. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:212: # If there's no more data in the member, the unused_data value is On 2017/12/27 20:06:36, Joachim Metz wrote: > maybe mention this in the docstring that this method will always read the data > in increments of of deflate compressed blocks I'm not sure what you mean - unused_data is only populated at the end of the gzip member. Unused data and the extra compressed data are different things. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:283: self.member_end_offset = file_object.tell() On 2017/12/27 20:06:35, Joachim Metz wrote: > consistency nit: tell() => get_offset() Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:348: if (offset < member.uncompressed_data_offset + On 2017/12/27 20:06:35, Joachim Metz wrote: > you could make "member.uncompressed_data_offset + member.uncompressed_data_size" > the key and _members a dict so you only have to calculate it once Done. https://codereview.appspot.com/331420043/diff/60001/dfvfs/file_io/gzip_file_i... dfvfs/file_io/gzip_file_io.py:462: while self._gzip_file_object.tell() < self._gzip_file_object.get_size(): On 2017/12/27 20:06:36, Joachim Metz wrote: > note: keep this tell() seeing its a different type of object Acknowledged.
Sign in to reply to this message.
> I'm refactoring this now, so please stand by. Ack, drop me a PTAL when you're done
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
On 2017/12/28 13:29:29, onager wrote: > Code updated. PTAL
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:11: from dfvfs.resolver import resolver nit: alphabetical ordering https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:14: + 1 white line https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:16: """Deflate decompressor wrapper. class name says Gzip docstring says Deflate? https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:31: """Initializes a gzip member decompressor state object.""" - object - member + Args What does stream_start None represent? https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:41: file_object (FileIO): parent file object. - parent file object that contains the compressed stream https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:57: """Retrieves any uncompressed data read by the decompressor. uncompressed data ? or is the compressed data not decompressed ? https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:64: the gzip file. 4 space continuation indentation https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:67: + 1 white line https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:121: gzip member. . => , since this method will start with reading the member header. > I've added some more information to the class docstring, but the argument description here is to allow someone to make a valid call to the method, so is the wrong place for a discussion of why things are implemented in this way. I'm not asking for a discussion. Now you implicitly rely on file_object to point to the member offset, but you are not explaining why, namely that you'll read the header at the current offset. An alternative is to pass member_start_offset as an explicit argument. This provides for a more explicit interface. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:158: self._compressed_data_start) What does self._compressed_data_start (stream_start) None represent? https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:213: if size is not None and size <= 0: Why is read of size 0 invalid? I opt to properly handle it instead.
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:11: from dfvfs.resolver import resolver On 2017/12/28 15:08:40, Joachim Metz wrote: > nit: alphabetical ordering Done. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:14: On 2017/12/28 15:08:40, Joachim Metz wrote: > + 1 white line Done. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:16: """Deflate decompressor wrapper. On 2017/12/28 15:08:40, Joachim Metz wrote: > class name says Gzip docstring says Deflate? > Yes, this class is for handling deflate-compressed data in a gzip member. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:31: """Initializes a gzip member decompressor state object.""" On 2017/12/28 15:08:41, Joachim Metz wrote: > - object > - member > > + Args > > What does stream_start None represent? Done. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:41: file_object (FileIO): parent file object. On 2017/12/28 15:08:41, Joachim Metz wrote: > - parent > > file object that contains the compressed stream Done. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:57: """Retrieves any uncompressed data read by the decompressor. On 2017/12/28 15:08:40, Joachim Metz wrote: > uncompressed data ? or is the compressed data not decompressed ? Done. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:64: the gzip file. On 2017/12/28 15:08:41, Joachim Metz wrote: > 4 space continuation indentation Done. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:67: On 2017/12/28 15:08:41, Joachim Metz wrote: > + 1 white line Done. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:121: gzip member. On 2017/12/28 15:08:41, Joachim Metz wrote: > . => , since this method will start with reading the member header. > > > I've added some more information to the class docstring, but the argument > description here is to allow someone to make a valid call to the method, so is > the wrong place for a discussion of why things are implemented in this way. > > I'm not asking for a discussion. Now you implicitly rely on file_object to point > to the member offset, but you are not explaining why, namely that you'll read > the header at the current offset. > > An alternative is to pass member_start_offset as an explicit argument. This > provides for a more explicit interface. Acknowledged. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:158: self._compressed_data_start) On 2017/12/28 15:08:41, Joachim Metz wrote: > What does self._compressed_data_start (stream_start) None represent? I'm not sure what you're asking for here - please be explicit. https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:213: if size is not None and size <= 0: On 2017/12/28 15:08:40, Joachim Metz wrote: > Why is read of size 0 invalid? I opt to properly handle it instead. "Read no data from this member" seems like a strange, likely erroneous to ask for, hence the error. A 0 length reads seems to be mostly supported in the rest of the codebase though, so changed this.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:158: self._compressed_data_start) This was the same remark as line 31. I see now that self._compressed_data_start is set in line 302 why not set it in this method after self._ReadHeader(file_object). IMHO this will make it easier to observe the usage of this value.
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:158: self._compressed_data_start) On 2017/12/28 15:58:22, Joachim Metz wrote: > This was the same remark as line 31. > > I see now that self._compressed_data_start is set in line 302 why not set it in > this method after self._ReadHeader(file_object). IMHO this will make it easier > to observe the usage of this value. Done.
Sign in to reply to this message.
Some minor remaining nits, but the logic is easier to follow now https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/100001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:213: if size is not None and size <= 0: a read of 0 byte is commonly allowed, though strange, so I opt keep the behavior of read consistent. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:16: class _GzipDecompressorState(object): Seeing that Gzip in theory support different compression methods I opt to rename this class. => _GzipDeflateDecompressorState https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:17: """Deflate decompressor wrapper. Deflate decompressor wrapper for reading a gzip member. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:36: file object. 4 space continuation indentation. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:39: ValueError: if an invalid stream start value is provided. please remove the need for a ValueError. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:42: if stream_start is None: please remove this check, I was confused because the offset was set somewhere unrelated. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:57: file_object.seek(self.last_read) consistency nit: + , os.SEEK_SET https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:72: Unused data is present when data containing the end of a compressed stream => Unused data can be any bytes after a Deflate compressed block (or chunk). https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:164: # Offset to the beginning of the compressed data in the file object. Move this and next line to after line 170 and replace (what currently is) line 165 by line 322 self._compressed_data_start = file_object.get_offset() https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:233: if size == 0: or offset >= self.uncompressed_data_size: https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:322: self._compressed_data_start = file_object.get_offset() Move this into __init__, also see earlier comment about this https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:374: self.member_end_offset = file_object.get_offset() Move this into __init__ https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:557: while self._gzip_file_object.tell() < self._gzip_file_object.get_size(): while next_member_offset < self._gzip_file_object.get_size():
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:16: class _GzipDecompressorState(object): On 2017/12/28 16:23:23, Joachim Metz wrote: > Seeing that Gzip in theory support different compression methods I opt to rename > this class. > > => _GzipDeflateDecompressorState > I don't think this is worthwhile, since there hasn't been a non-DEFLATE compression added to gzip in the last 21 years. There's unlikely to be one added, imo. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:17: """Deflate decompressor wrapper. On 2017/12/28 16:23:23, Joachim Metz wrote: > Deflate decompressor wrapper for reading a gzip member. Done. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:36: file object. On 2017/12/28 16:23:23, Joachim Metz wrote: > 4 space continuation indentation. Done. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:39: ValueError: if an invalid stream start value is provided. On 2017/12/28 16:23:22, Joachim Metz wrote: > please remove the need for a ValueError. Done. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:42: if stream_start is None: On 2017/12/28 16:23:23, Joachim Metz wrote: > please remove this check, I was confused because the offset was set somewhere > unrelated. Done. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:57: file_object.seek(self.last_read) On 2017/12/28 16:23:23, Joachim Metz wrote: > consistency nit: > > + , os.SEEK_SET Done. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:72: Unused data is present when data containing the end of a compressed stream On 2017/12/28 16:23:23, Joachim Metz wrote: > => Unused data can be any bytes after a Deflate compressed block (or chunk). Done. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:164: # Offset to the beginning of the compressed data in the file object. On 2017/12/28 16:23:23, Joachim Metz wrote: > Move this and next line to after line 170 and replace (what currently is) line > 165 by line 322 > > self._compressed_data_start = file_object.get_offset() Acknowledged. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:233: if size == 0: On 2017/12/28 16:23:23, Joachim Metz wrote: > or offset >= self.uncompressed_data_size: Done. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:322: self._compressed_data_start = file_object.get_offset() On 2017/12/28 16:23:23, Joachim Metz wrote: > Move this into __init__, also see earlier comment about this Acknowledged. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:374: self.member_end_offset = file_object.get_offset() On 2017/12/28 16:23:23, Joachim Metz wrote: > Move this into __init__ Acknowledged. https://codereview.appspot.com/331420043/diff/110008/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:557: while self._gzip_file_object.tell() < self._gzip_file_object.get_size(): On 2017/12/28 16:23:23, Joachim Metz wrote: > while next_member_offset < self._gzip_file_object.get_size(): Done.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:162: self._compressed_data_start = self._ReadHeader(file_object) This does not improve the readability. 1. It is not clear from ReadHeader that it would return the offset of the compressed data 2. read functions in Python typically return data Please change also for ReadFooter. https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:222: if size is not None and size < 0: Add: if offset < 0: raise ValueError('Invalid size value smaller than zero.') https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:223: raise ValueError('Invalid size value smaller than one.') smaller than one => smaller than zero
Sign in to reply to this message.
Code updated.
Sign in to reply to this message.
https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:162: self._compressed_data_start = self._ReadHeader(file_object) On 2017/12/28 17:18:35, Joachim Metz wrote: > This does not improve the readability. > > 1. It is not clear from ReadHeader that it would return the offset of the > compressed data > 2. read functions in Python typically return data > > Please change also for ReadFooter. Done, I think. 1) Changed names and refactored. 2) These functions were called Read* before I touched them. https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:222: if size is not None and size < 0: On 2017/12/28 17:18:36, Joachim Metz wrote: > Add: > > if offset < 0: > raise ValueError('Invalid size value smaller than zero.') Acknowledged. https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:223: raise ValueError('Invalid size value smaller than one.') On 2017/12/28 17:18:35, Joachim Metz wrote: > smaller than one => smaller than zero Acknowledged.
Sign in to reply to this message.
LGTM https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... File dfvfs/file_io/gzip_file_io.py (right): https://codereview.appspot.com/331420043/diff/150001/dfvfs/file_io/gzip_file_... dfvfs/file_io/gzip_file_io.py:162: self._compressed_data_start = self._ReadHeader(file_object) Ack, primary was referring to the return of offset, but alternative name LGTM.
Sign in to reply to this message.
Changes have been merged with master branch. To close the review and clean up the feature branch you can run: python ./utils/review.py close gzip
Sign in to reply to this message.
|