Created: 13 years, 7 months ago by mpl
Modified: 13 years, 3 months ago
CC: golang-dev
Visibility: Public
Description: first draft for a Lempel-Ziv-Welch compression package.

Patch Set 1
Patch Sets 2-6: code review 2510041: first draft for a Lempel-Ziv-Welch compression package.
Total comments: 15
Patch Set 7: diff -r 8280f07d4272 https://go.googlecode.com/hg/
Patch Set 8: diff -r 8280f07d4272 https://go.googlecode.com/hg/
Total comments: 8

Total messages: 31
Hello golang-dev@googlegroups.com (cc: golang-dev@googlegroups.com), I'd like you to review this change.
I know there's room for improvement. In particular, it could do some buffering, and the table lookups could be smarter. I'd like to know if the overall design is ok though; i.e. if I'm on the right track with the reader/writer way, and if it's a good idea to have one goroutine to encode/decode communicating with another one to read/write the data in the right format.
Do you know whether there are multiple file formats that use the same encoding, to make it worth having a separate package? (For example, compress/flate is used by zip, zlib, gzip, png, ...)

Wikipedia suggests that gif, tiff, and pdf all support lzw but that the bit order differs. At the least it seems you'd want to allow for that somehow. (It would be fine to make a pass over the data at the end to do a table-based substitution. GIF is probably the common case since I believe tiff and pdf more commonly use flate.)

Russ
As a first step, you should run

    hg gofmt
    hg upload 2510041

There are also various ways in which you could be using Go more effectively or efficiently. Just scanning through:

* the map[string]uint16 is overkill
* each Read and Write should not spawn two goroutines
* the buffering is not right: if I do single-byte Writes, I will get no compression at all
* consider using io.Pipe to address the last two
* the use of bytes.Add in the writer causes unnecessary mallocs
* the writer uses range over a map testing k == foo instead of writing m[foo]
* the reader should find a way to avoid the range scans too, probably by reversing the key and elem in dict or replacing it with an array
* when prev is a string, you can write prev[i] instead of []byte(prev)[i], avoiding the expensive copying of the entire string that []byte(prev) does

You might find it useful to read the source code for the compress/flate package, which must handle many of the same issues.

Russ
I agree with everything Russ said, and addressing his concerns would probably involve re-writing the code that the comments below apply to, but modulo that, here are a few more style issues, in no particular order:

A method's receiver name is usually just one letter: w instead of lzww. Similarly, if you implement io.Reader and io.Writer, the []byte argument is typically called p and not out or data.

Delete the commented-out prints. They're fine for your own debugging but they're not for submission or review.

The package doc comment is erroneously duplicated in all four .go files. Also, there should be no blank line between it and the package line. Run "godoc compress/lzw" to check it.

Consecutive var statements can be put in a block, so:

    var i int
    var s string
    var b bool

can be written as

    var (
        i int
        s string
        b bool
    )

and the result usually looks less cluttered, especially if the variables are related.

A cleaner way to write

    var words []byte = make([]byte, 2)

is just

    words := make([]byte, 2)

or, if the size is constant, just

    var words [2]byte

I'd drop the "= 0" out of "var shift uint = 0". Variables are zero-initialized by default.

Both dictSizeIni and endSignal should be constants, not variables.

Imports should be sorted alphabetically.

"for ;; { ... }" can be just "for { ... }", but since you're really iterating over a channel, use "for word := range lzwr.c { ... }" and close the channel rather than sending the endSignal sentinel.

Finally, if you're adding a new package, your change should also modify src/pkg/Makefile.
Thanks for the comments, I'm on it. I've got MSB packing order working as well now and I'm using an exposed setter function to let the user set the packing order. Should the variable representing that be a global var or rather a member of the reader/writer struct?

Thanks,
Mathieu
On 24 October 2010 03:13, <mathieu.lonjaret@gmail.com> wrote:
> I've got MSB packing order working as well now and I'm using an exposed
> setter function to let the user set the packing order. Should the
> variable representing that be a global var or rather a member of the
> reader/writer struct?

Since GIF and PDF use different packing orders, and it's not inconceivable that a program could write both sorts of output, this should not be a global variable.
Hello nigeltao_gnome, rsc (cc: golang-dev@googlegroups.com), Please take another look.
This is a new version of the reader which, I think, addresses all of your comments. The writer is in the works. I have two other questions:

1) For the same reason you gave me about the packing order variable, I think I should change the wordsize variable as well to be a member of the reader instead of a global var, right?

2) At no point do I set/return an error, but I bet there are some cases where the read operations should return one. Any advice on that?

Cheers,
Mathieu
Hello nigeltao_gnome, rsc (cc: golang-dev@googlegroups.com), Please take another look.
I have many concerns about the basic efficiency of your implementation and general programming style (e.g. lots of duplicated code). I would strongly recommend that you try to study the compress/{flate,gzip,zlib} code until you understand how it works. Their general design is pretty solid and should work equally well for the LZW algorithm.

http://codereview.appspot.com/2510041/diff/32001/src/pkg/compress/lzw/Makefile
File src/pkg/compress/lzw/Makefile (right):

src/pkg/compress/lzw/Makefile:1: # Copyright 2009 The Go Authors. All rights reserved.
s/2009/2010/

http://codereview.appspot.com/2510041/diff/32001/src/pkg/compress/lzw/reader.go
File src/pkg/compress/lzw/reader.go (right):

src/pkg/compress/lzw/reader.go:16: const dictSizeIni uint16 = 256
Drop the uint16. Constants are typically ideal numbers. Similarly for writer.go.

src/pkg/compress/lzw/reader.go:22: err os.Error
r.err is never re-used between different methods. Why is it a field and not a local variable?

src/pkg/compress/lzw/reader.go:23: dict map[uint16]string
If dict is a dense map with entries from 0 up to some n, then a map is overkill and inefficient. Just use a []string and append.

src/pkg/compress/lzw/reader.go:30: func NewReader(r io.Reader, ws uint8, order string) (io.ReadCloser, os.Error) {
All public functions need comments. For example, what do ws and order mean? What are their valid values?

src/pkg/compress/lzw/reader.go:31: lzwr := new(reader)
Rather than assigning each field in a separate statement, do

    lzwr := &reader{
        r:        r,
        wordsize: ws,
        // etcetera.
    }

src/pkg/compress/lzw/reader.go:58: func (r *reader) Read(p []byte) (n int, err os.Error) {
Rather than reader implementing ReadCloser, it would be simpler if NewReader just returned the read end of the pipe, exactly the same as what compress/flate does. You don't need to use csync to signal the write end that the read end is ready. The io.Pipe already does that. The writer goroutine should just write to the pipe, and it will block until the reader goroutine is ready.

src/pkg/compress/lzw/reader.go:75: input []byte = make([]byte, 1)
Reading from a Reader one byte at a time is terribly inefficient. Use a bufio.Reader and ReadByte.

src/pkg/compress/lzw/reader.go:122: func (r *reader) readBitsMSB() {
There is a lot of copy-and-paste between readBitsMSB and readBitsLSB. I think there's a lot of opportunity to refactor out some duplicated code.

src/pkg/compress/lzw/reader.go:191: b := make([]byte, toRead)
Allocating a new buffer each time is needless garbage. Just allocate one buffer (possibly a bytes.Buffer) and re-use it.

src/pkg/compress/lzw/reader.go:245: entry = prev + temp[0:1]
Growing a string one character at a time is O(N^2), which seems needlessly expensive. You should really be using []byte instead of string, since each conversion between one and the other involves an allocation, a copy, and a garbage cost. In fact, I don't think you should be using strings at all.

File src/pkg/compress/lzw/reader_test.go (right):

src/pkg/compress/lzw/reader_test.go:13: func TestDecompressorLSB_11_whole(t *testing.T) {
Rather than having a separate TestFooBar function for each different case, with a lot of copy-and-pasted code, I would prefer a data-driven test with one TestReader function that ranged over a test suite. Look at the tests in compress/gzip and compress/zlib for what I mean.

src/pkg/compress/lzw/reader_test.go:34: func TestDecompressorLSB_11_byte(t *testing.T) {
I don't think the _whole versus _byte distinction is winning you much. I'd much rather have inputs of different sizes to give me confidence that buffer boundaries are tested.

File src/pkg/compress/lzw/writer.go (right):

src/pkg/compress/lzw/writer.go:5: package lzw
I haven't looked closely at writer.go, but a lot of the general comments I made about reader.go apply. For example efficiency, copy-and-pasted code, the need for comments, simpler use of the io.Pipe...

File src/pkg/compress/lzw/writer_test.go (right):

src/pkg/compress/lzw/writer_test.go:12: func TestCompressorLSB_11_whole(t *testing.T) {
IIUC there can be more than one correct encoding of any particular string. Rather than testing that this implementation returns a particular golden value, I'd rather test that wrapping a lzw.Writer and lzw.Reader around the two ends of an io.Pipe is a no-op. Again, look at the tests in compress/{gzip,zlib}/*_test.go for what I mean.
Thanks for the comments, I'll work on them. There's one point I'm not sure about though:

On 2010/11/23 07:14:32, nigeltao wrote:
> src/pkg/compress/lzw/reader.go:122: func (r *reader) readBitsMSB() {
> There is a lot of copy-and-paste between readBitsMSB and readBitsLSB. I think
> there's a lot of opportunity to refactor out some duplicated code.

I agree some of their parts are similar, but I felt they were already complicated and different enough that trying to merge them would result in something quite unreadable. Maybe I can at least isolate a common denominator and put it in another function though. I'll see what I can do, but I'm not confident about that one.
Regarding Nigel's comment about duplication, please pick one byte order - the most common one, so probably what GIF uses - and hard-code that one. Code that wants the opposite byte order can run the bytes through a [256]byte lookup table as a post-processing step (for compression) or during the reading loop (for decompression).

Russ
I remember you already mentioned such a technique, so I thought about it for a while, but I didn't really understand what you meant. Like, when and from what does one build this lookup table in the first place? I'll think about it some more then, thanks.
On Mon, Nov 29, 2010 at 11:19, <mathieu.lonjaret@gmail.com> wrote:
> I remember you already mentioned such a technique, so I thought about
> it for a while but I didn't really understand what you meant. Like, when
> and from what does one build this lookup table in the first place?
> I'll think about it some more then, thanks.

The table is in src/pkg/compress/flate/reverse_bits.go. If b is msb-first, then reverseByte[b] is lsb-first and vice versa.

Russ
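The compress/flate table Russ points at is precomputed in the source; an equivalent table can be built at startup with a few lines. This sketch constructs it and is not the flate code itself:

```go
package main

import "fmt"

// reverseBits returns b with its 8 bits in the opposite order,
// e.g. 00000001 -> 10000000.
func reverseBits(b byte) byte {
	var r byte
	for i := 0; i < 8; i++ {
		r = r<<1 | b&1 // shift the lowest bit of b into r
		b >>= 1
	}
	return r
}

// reverseByte plays the role of the flate table: if b is msb-first,
// then reverseByte[b] is lsb-first, and vice versa.
var reverseByte [256]byte

func init() {
	for i := range reverseByte {
		reverseByte[i] = reverseBits(byte(i))
	}
}

func main() {
	fmt.Printf("%08b -> %08b\n", 0x2c, reverseByte[0x2c]) // 00101100 -> 00110100
}
```

Because bit reversal is an involution, applying the table twice gives back the original byte, which is what makes it usable as a post-processing step in either direction.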
On Mon, Nov 29, 2010 at 6:04 PM, Russ Cox <rsc@golang.org> wrote:
> The table is in src/pkg/compress/flate/reverse_bits.go.
> If b is msb-first, then reverseByte[b] is lsb-first and vice versa.

Sorry, I don't get it. Maybe I understood it wrongly, but from what I've read, if we have, say (encoded on 11-bit words)

    foo = 00000101 01101001

and then

    bar = 00000011 10110011

to pack, the stream will be:

    with LSB: 01101001 10011|101 XX011101 XXXXXXXX
    with MSB: 10101101 001|01110 110011XX XXXXXXXX

(the pipe is placed at the junction of foo and bar)

I don't think reversing the bits will allow going from one to the other, at least not directly in one step after compression. What am I missing here?
Ah. You are saying that the MSB/LSB setting affects both the order in which bits are inserted into the bit stream and the order in which 8-bit chunks of that bit stream are packed into bytes. I expected it to affect only the latter. I have no idea which is the actual behavior. The thing to do is to make a tiny GIF LZW stream and make a tiny PDF LZW stream and make sure you can decode both. Russ
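The LSB half of the 11-bit example above can be reproduced in a few lines. This packLSB is a hypothetical helper written for illustration, not code from the patch; it inserts each code's bits into the stream least significant bit first, which yields exactly the LSB byte sequence Mathieu wrote out (with zero padding in place of the X bits).

```go
package main

import "fmt"

// packLSB packs the low `width` bits of each code into a byte stream,
// least significant bit first. Pending bits accumulate in the low end
// of acc and are emitted a byte at a time.
func packLSB(codes []uint16, width uint) []byte {
	var out []byte
	var acc uint32 // pending bits, low bits are the oldest
	var nbits uint // how many bits of acc are valid
	for _, c := range codes {
		acc |= uint32(c) << nbits
		nbits += width
		for nbits >= 8 {
			out = append(out, byte(acc))
			acc >>= 8
			nbits -= 8
		}
	}
	if nbits > 0 {
		out = append(out, byte(acc)) // final partial byte, zero-padded
	}
	return out
}

func main() {
	// foo and bar from the example: 00000101 01101001 and 00000011 10110011.
	for _, b := range packLSB([]uint16{0x569, 0x3b3}, 11) {
		fmt.Printf("%08b ", b)
	}
	fmt.Println()
}
```

Running it prints 01101001 10011101 00011101, matching the "with LSB" line of the example, which supports the reading that the MSB/LSB setting affects bit insertion order, not just byte packing.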
Hello,

Another question: when the writer is created with a wordsize (and hence a max dictionary size) smaller than the optimum, do you think I should abort and report it as an error (so that the caller can retry with a larger size), or rather switch to a mode where the algorithm (when it has reached the max dict size) keeps using the existing codes without adding new entries to the dict? I think this is kind of a non-issue with an advanced enough gif implementation, since it's supposed to use a variable word/dict size, but I'm not there yet.

Mathieu
> when the writer is created with a wordsize (and hence a max dictionary
> size) smaller than the optimum,

I don't quite understand. What is the optimum? How would the caller know?

Russ
On 2011/01/11 16:48:24, rsc wrote:
> I don't quite understand.
> What is the optimum?
> How would the caller know?

Say your input is such that the encoding would generate 600 codes for it. If your codes start at 0 and run continuously up to 599, they all fit in a 10-bit word, so you can pack them on the final stream as 10-bit words (and not waste 6 bits per word). 11 would be a waste for all words, and 9 not enough because some of the words would not fit, hence in that case 10 would be the best.

With the variable-wordsize improvement in gifs, you would start with a very small wordsize, say 5, and increase it in steps of one as much as you need. But I think that's an improvement that can come later, and the fixed-wordsize case has to be there anyway since variable is not always used (I think).

The caller indeed can't know in advance if the chosen wordsize is too small for the input text. That's why I'm proposing to either report a meaningful error (so the wordsize can be increased by the caller), or to keep going without creating additional codes when the dictionary gets full. This last solution can of course be pretty inefficient in terms of compression, and I can't think of a case where anyone would intentionally want that behaviour.

As to the contrary (wordsize bigger than the optimum, hence bits wasted when packing the words), I'm not sure how to cope with that. I could make an automatic mode with a first full run of the encoder only to determine the optimum wordsize, and then a normal run of encoder+bitswriter. How about that?

Mathieu
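The arithmetic in the 600-codes example (codes 0..599 all fit in 10 bits) reduces to "how many bits does the largest code need". A hypothetical helper, named minWordSize here only for illustration, would look like:

```go
package main

import "fmt"

// minWordSize returns the smallest word size, in bits, that can hold
// every code from 0 through maxCode. For 600 codes (0..599) this is
// 10: 9 bits top out at 511, while 10 bits reach 1023.
func minWordSize(maxCode uint) uint {
	bits := uint(1)
	for maxCode>>bits != 0 {
		bits++
	}
	return bits
}

func main() {
	fmt.Println(minWordSize(599)) // the example from the thread
}
```

This is also why knowing the optimum requires a full encoding pass first: maxCode is only known once the dictionary has stopped growing.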
After some more reading, it seems like the original LZW used 12-bit words from the start, and simply stopped writing to the dictionary when its 4096 entries were filled, just using it as is for lookups. So I'm going to leave the possibility for the caller to set another wordsize, but I'm going to have it default to 12 and not error when the dict is filled, but rather keep on going with the dict as is. That's for a start. Then possible improvements (apart from variable wordsize) are to simply flush the dict entries when it's filled, or, even better, discard entries that are not used often. Does that sound ok?
On 20 January 2011 20:36, <mathieu.lonjaret@gmail.com> wrote:
> after some more reading, it seems like the original LZW used 12 bits
> words from the start, and simply stopped writing to the dictionary when
> its 4096 entries are filled, just using it as is for lookups.
> So I'm going to leave the possibility for the caller to set another
> wordsize but I'm going to have it default to 12 and not error when the
> dict is filled but rather keep on going with the dict as is.

That sounds OK to me. It might be worth investigating what other open source lzw (or gif or pdf) libraries do.
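The "keep going with a full dictionary" behavior agreed on above amounts to a guard on the table update. This is a small sketch of that policy; the encoder type and addEntry name are invented for illustration, not taken from the patch.

```go
package main

import "fmt"

// With 12-bit words there are codes 0..4095; once they are all
// assigned, the encoder stops adding entries and keeps emitting
// codes from the frozen table.
const maxEntries = 1 << 12

type encoder struct {
	dict     map[string]uint16
	nextCode uint16
}

func (e *encoder) addEntry(key string) {
	if int(e.nextCode) < maxEntries {
		e.dict[key] = e.nextCode
		e.nextCode++
	}
	// else: dictionary full; existing codes are still used for lookups.
}

func main() {
	// Start one short of full to show the freeze kicking in.
	e := &encoder{dict: make(map[string]uint16), nextCode: 4095}
	e.addEntry("ab") // takes the last free code, 4095
	e.addEntry("cd") // dictionary is full; silently ignored
	fmt.Println(len(e.dict), e.nextCode)
}
```

The later improvements mentioned (flushing the table, or evicting rarely used entries) would replace the empty else branch with a reset or an eviction policy.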
Hello nigeltao_gnome, nigeltao (cc: golang-dev@googlegroups.com), Please take another look.
On 2010/11/23 07:14:32, nigeltao wrote:
> src/pkg/compress/lzw/reader.go:16: const dictSizeIni uint16 = 256
> Drop the uint16. Constants are typically ideal numbers.
> Similarly for writer.go.

done.

> src/pkg/compress/lzw/reader.go:22: err os.Error
> r.err is never re-used between different methods. Why is it a field and not a
> local variable?

done.

> src/pkg/compress/lzw/reader.go:23: dict map[uint16]string
> If dict is a dense map with entries from 0 up to some n, then a map is overkill
> and inefficient. Just use a []string and append.

done.

> src/pkg/compress/lzw/reader.go:30: func NewReader(r io.Reader, ws uint8, order string) (io.ReadCloser, os.Error) {
> All public functions need comments. For example, what does ws and order mean?
> What are their valid values?

done.

> src/pkg/compress/lzw/reader.go:31: lzwr := new(reader)
> Rather than assigning each field in a separate statement, do
>     lzwr := &reader{
>         r: r,
>         wordsize: ws,
>         // etcetera.
>     }

done.

> src/pkg/compress/lzw/reader.go:58: func (r *reader) Read(p []byte) (n int, err os.Error) {
> Rather than reader implementing ReadCloser, it would be simpler if NewReader
> just returned the read end of the pipe, exactly the same as what compress/flate
> does. You don't need to use csync to signal the write end that the read end is
> ready. The io.Pipe already does that. The writer goroutine should just write to
> the pipe, and it will block until the reader goroutine is ready.

I tried going that way, but I needed to implement an explicit Read/Write to do the "breaking" of large inputs into reasonably sized chunks (but maybe there's a better way?). So NewReader/NewWriter still returns a *reader/*writer.

> src/pkg/compress/lzw/reader.go:75: input []byte = make([]byte, 1)
> Reading from a Reader one byte at a time is terribly inefficient. Use a
> bufio.Reader and ReadByte.

done.

> src/pkg/compress/lzw/reader.go:122: func (r *reader) readBitsMSB() {
> There is a lot of copy-and-paste between readBitsMSB and readBitsLSB. I think
> there's a lot of opportunity to refactor out some duplicated code.

done. Merging the two implied adding a lot of branches (switch), so it's probably (a bit?) slower than what I had before, but it is indeed more readable now.

> src/pkg/compress/lzw/reader.go:191: b := make([]byte, toRead)
> Allocating a new buffer each time is needless garbage. Just allocate one buffer
> (possibly a bytes.Buffer) and re-use it.

done.

> src/pkg/compress/lzw/reader.go:245: entry = prev + temp[0:1]
> Growing a string one character at a time is O(N^2), which seems needlessly
> expensive. You should really be using []byte instead of string, since each
> conversion between one and the other involves an allocation, a copy, and a
> garbage cost. In fact, I don't think you should be using strings at all.

done. I'm now using []byte whenever possible. Please let me know if it can be improved further regarding speed and allocations.

> src/pkg/compress/lzw/reader_test.go:13: func TestDecompressorLSB_11_whole(t *testing.T) {
> Rather than having a separate TestFooBar function for each different case, with
> a lot of copy-and-pasted code, I would prefer a data-driven test with one
> TestReader function that ranged over a test suite. Look at the tests in
> compress/gzip and compress/zlib for what I mean.

> src/pkg/compress/lzw/writer_test.go:12: func TestCompressorLSB_11_whole(t *testing.T) {
> IIUC there can be more than one correct encoding of any particular string.
> Rather than testing that this implementation returns a particular golden value,
> I'd rather test that wrapping a lzw.Writer and lzw.Reader around the two ends of
> an io.Pipe is a no-op.

done. And I also mimicked a lot of the behavior of the other compression packages' tests, as you advised.
Hello nigeltao_gnome, nigeltao (cc: golang-dev@googlegroups.com), Please take another look.
http://codereview.appspot.com/2510041/diff/57001/src/pkg/compress/lzw/Makefile
File src/pkg/compress/lzw/Makefile (right):

src/pkg/compress/lzw/Makefile:1: # Copyright 2010 The Go Authors. All rights reserved.
s/2010/2011/ and similarly in other files in this change.

File src/pkg/compress/lzw/lzw_test.go (right):

src/pkg/compress/lzw/lzw_test.go:89: _, err = compressor.Write(tt.raw)
It would be cleaner if you always used io.Copy, with the Reader either being a file reader or a bytes.Buffer.

src/pkg/compress/lzw/lzw_test.go:126: func TestIdentity(t *testing.T) {
You have some tests that read(write(x)) is the identity function, but I'd also like some reading-only tests of binary data, similar to the other test cases in compress/{flate,gzip,zlib}. For example, only testing for identity doesn't catch the case where both reading and writing have the same bug.

File src/pkg/compress/lzw/reader.go (right):

src/pkg/compress/lzw/reader.go:7: compressed with the Lempel–Ziv–Welch algorithm.
Is there a canonical reference for the algorithm? I did a quick search and could not find an RFC. The Welch (1984) "A Technique for High-Performance Data Compression" paper might be the best reference, but I didn't look too deeply.

src/pkg/compress/lzw/reader.go:33: c chan uint16
Sending the uint16 values one at a time over a channel sounds inefficient to me. Batching up slices of uint16 would be better. The analogy is that an io.Pipe is based on a chan []byte, not a chan byte.

src/pkg/compress/lzw/reader.go:107: <-r.c
Why do you have to wait? Also, I don't like how both goroutines are both writing to and reading from r.c. It would feel a lot cleaner if one end only writes, and the other end only reads.

src/pkg/compress/lzw/reader.go:132: case lsb:
Switching on a string will be expensive inside an inner loop. You should do the order == "LSB" comparison once, and save the result as a bool variable.

src/pkg/compress/lzw/reader.go:254: if b.Len() >= bufSize {
Wouldn't it be easier for b to be a bufio.Writer around r.pw, instead of managing the buffering yourself!?
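The last point above, wrapping the pipe's write end in a bufio.Writer rather than tracking a bufSize threshold by hand, can be sketched as follows. pipeThrough is an illustrative function written for this example, not code from the patch.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// pipeThrough sends many small writes through an io.Pipe, letting a
// bufio.Writer batch them; the producer never manages its own buffer
// or flush threshold.
func pipeThrough(chunks []string) []byte {
	pr, pw := io.Pipe()
	go func() {
		bw := bufio.NewWriter(pw)
		for _, s := range chunks {
			bw.WriteString(s) // tiny writes accumulate in bufio's buffer
		}
		bw.Flush() // push whatever remains into the pipe
		pw.Close() // the read end then sees io.EOF
	}()
	out, _ := io.ReadAll(pr)
	return out
}

func main() {
	fmt.Printf("%s\n", pipeThrough([]string{"to", "be", "or", "not"}))
}
```

bufio.Writer only touches the underlying pipe when its internal buffer fills or Flush is called, which is exactly the batching behavior the hand-rolled bufSize check was approximating.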
After a bit of reading, it appears that both GIF and PDF use a maximum word size of 12 bits, since it enables an extremely efficient hash table representation as a []uint32, where each uint32 holds the 12 bit code, the 12 bit code of the immediate prefix, and the 8 bit byte that differentiates what the two codes represent. It is certainly feasible to do something similar in Go. Compared to that, using a map[string]uint16 to encode and a [][]byte to decode would involve many string and []byte allocations (and conversions between the two), and be very costly. I think that practicality trumps generality, and I think it's worth capping the word size at 12 bits, instead of 16 bits. I might have a crack at writing a GIF decoder later this week, which would help me understand what I would want from a LZW package.
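The packed-uint32 dictionary entry described above can be sketched in a few lines. The exact bit layout below (own code in the top 12 bits, prefix code in the middle 12, extending byte in the low 8) is an assumption for illustration; the message doesn't specify one.

```go
package main

import "fmt"

// pack stores a 12-bit code, its 12-bit prefix code, and the 8-bit
// byte that extends the prefix, all in one uint32:
// bits 20-31 = code, bits 8-19 = prefix, bits 0-7 = byte.
func pack(code, prefix uint16, b byte) uint32 {
	return uint32(code&0xfff)<<20 | uint32(prefix&0xfff)<<8 | uint32(b)
}

// unpack recovers the three fields from a packed entry.
func unpack(e uint32) (code, prefix uint16, b byte) {
	return uint16(e >> 20), uint16(e>>8) & 0xfff, byte(e)
}

func main() {
	e := pack(300, 258, 'a')
	code, prefix, b := unpack(e)
	fmt.Println(code, prefix, string(b))
}
```

A table of such entries is a flat []uint32, so the encoder's hash lookups and the decoder's prefix walks involve no string or []byte allocation at all, which is the efficiency argument being made for capping the word size at 12 bits.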
On 2011/02/15 11:56:01, nigeltao_gnome wrote:
> After a bit of reading, it appears that both GIF and PDF use a maximum
> word size of 12 bits, since it enables an extremely efficient hash
> table representation as a []uint32, where each uint32 holds the 12 bit
> code, the 12 bit code of the immediate prefix, and the 8 bit byte that
> differentiates what the two codes represent. It is certainly feasible
> to do something similar in Go. Compared to that, using a
> map[string]uint16 to encode and a [][]byte to decode would involve
> many string and []byte allocations (and conversions between the two),
> and be very costly. I think that practicality trumps generality, and I
> think it's worth capping the word size at 12 bits, instead of 16 bits.

This is especially true in that case since I don't see lzw being used for many things other than pdf and gif (although I saw at least one audio format, I think; I'll try to dig it back up).

> I might have a crack at writing a GIF decoder later this week, which
> would help me understand what I would want from a LZW package.

Ok, I'll wait for some more input from you then, thanks. I'm still gonna try and work on your previous comments in the meanwhile though, in case some of the code can be kept.
On Tue, Feb 15, 2011 at 14:45, <mathieu.lonjaret@gmail.com> wrote:
> This is especially true in that case since I don't see lzw being used
> for many things other than pdf and gif. (although I saw at least one
> audio format I think, I'll try to dig it back).

tiff! I am writing a tiff decoder, and I am waiting for the LZW package to be added so that we will be able to read LZW-compressed tiff images.

--Benny.