doc/articles/gobs_of_data.html - Issue 5834043: code review 5834043: doc: add Gobs of data article

Side by Side Diff: doc/articles/gobs_of_data.html

Issue 5834043: code review 5834043: doc: add Gobs of data article (Closed)

Patch Set: diff -r 59b7d79b616f https://code.google.com/p/go Created 13 years ago


OLD	NEW
(Empty)
	1 <!--{

	2 "Title": "Gobs of data",

	3 "Template": true

	4 }-->

	5

	6 <p>

	7 To transmit a data structure across a network or to store it in a file, it must

	8 be encoded and then decoded again. There are many encodings available, of

	9 course: <a href="http://www.json.org/">JSON</a>,

	10 <a href="http://www.w3.org/XML/">XML</a>, Google's

	11 <a href="http://code.google.com/p/protobuf">protocol buffers</a>, and more.

	12 And now there's another, provided by Go's <a href="/pkg/encoding/gob/">gob</a>

	13 package.

	14 </p>

	15

	16 <p>

	17 Why define a new encoding? It's a lot of work and redundant at that. Why not

	18 just use one of the existing formats? Well, for one thing, we do! Go has

	19 <a href="/pkg/">packages</a> supporting all the encodings just mentioned (the

	20 <a href="http://code.google.com/p/goprotobuf">protocol buffer package</a> is in

	21 a separate repository but it's one of the most frequently downloaded). And for

	22 many purposes, including communicating with tools and systems written in other

	23 languages, they're the right choice.

	24 </p>

	25

	26 <p>

	27 But for a Go-specific environment, such as communicating between two servers

	28 written in Go, there's an opportunity to build something much easier to use and

	29 possibly more efficient.

	30 </p>

	31

	32 <p>

	33 Gobs work with the language in a way that an externally-defined,

	34 language-independent encoding cannot. At the same time, there are lessons to be

	35 learned from the existing systems.

	36 </p>

	37

	38 <p>

	39 <b>Goals</b>

	40 </p>

	41

	42 <p>

	43 The gob package was designed with a number of goals in mind.

	44 </p>

	45

	46 <p>

	47 First, and most obvious, it had to be very easy to use. Because Go has

	48 reflection, there is no need for a separate interface definition language or

	49 "protocol compiler". The data structure itself is all the package should need

	50 to figure out how to encode and decode it. On the other hand, this approach

	51 means that gobs will never work as well with other languages, but that's OK:

	52 gobs are unashamedly Go-centric.

	53 </p>

	54

	55 <p>

	56 Efficiency is also important. Textual representations, exemplified by XML and

	57 JSON, are too slow to put at the center of an efficient communications network.

	58 A binary encoding is necessary.

	59 </p>

	60

	61 <p>

	62 Gob streams must be self-describing. Each gob stream, read from the beginning,

	63 contains sufficient information that the entire stream can be parsed by an

	64 agent that knows nothing a priori about its contents. This property means that

	65 you will always be able to decode a gob stream stored in a file, even long

	66 after you've forgotten what data it represents.

	67 </p>

	68

	69 <p>

	70 There were also some things to learn from our experiences with Google protocol

	71 buffers.

	72 </p>

	73

	74 <p>

	75 <b>Protocol buffer misfeatures</b>

	76 </p>

	77

	78 <p>

	79 Protocol buffers had a major effect on the design of gobs, but have three

	80 features that were deliberately avoided. (Leaving aside the property that

	81 protocol buffers aren't self-describing: if you don't know the data definition

	82 used to encode a protocol buffer, you might not be able to parse it.)

	83 </p>

	84

	85 <p>

	86 First, protocol buffers only work on the data type we call a struct in Go. You

	87 can't encode an integer or array at the top level, only a struct with fields

	88 inside it. That seems a pointless restriction, at least in Go. If all you want

	89 to send is an array of integers, why should you have to put it into a

	90 struct first?

	91 </p>

	92

	93 <p>

	94 Next, a protocol buffer definition may specify that fields <code>T.x</code> and

	95 <code>T.y</code> are required to be present whenever a value of type

	96 <code>T</code> is encoded or decoded. Although such required fields may seem

	97 like a good idea, they are costly to implement because the codec must maintain a

	98 separate data structure while encoding and decoding, to be able to report when

	99 required fields are missing. They're also a maintenance problem. Over time, one

	100 may want to modify the data definition to remove a required field, but that may

	101 cause existing clients of the data to crash. It's better not to have them in the

	102 encoding at all. (Protocol buffers also have optional fields. But if we don't

	103 have required fields, all fields are optional and that's that. There will be

	104 more to say about optional fields a little later.)

	105 </p>

	106

	107 <p>

	108 The third protocol buffer misfeature is default values. If a protocol buffer

	109 omits the value for a "defaulted" field, then the decoded structure behaves as

	110 if the field were set to that value. This idea works nicely when you have

	111 getter and setter methods to control access to the field, but is harder to

	112 handle cleanly when the container is just a plain idiomatic struct. Default

	113 values are also tricky to implement: where does one define the default values,

	114 what types do they have (is text UTF-8? uninterpreted bytes? how many bits in a

	115 float?) and despite the apparent simplicity, there were a number of

	116 complications in their design and implementation for protocol buffers. We

	117 decided to leave them out of gobs and fall back to Go's trivial but effective

	118 defaulting rule: unless you set something otherwise, it has the "zero value"

	119 for that type - and it doesn't need to be transmitted.

	120 </p>

	121

	122 <p>

	123 So gobs end up looking like a sort of generalized, simplified protocol buffer.

	124 How do they work?

	125 </p>

	126

	127 <p>

	128 <b>Values</b>

	129 </p>

	130

	131 <p>

	132 The encoded gob data isn't about <code>int8</code>s and <code>uint16</code>s.
	adg 2012/03/15 05:56:19: <code>int8</code>s <code>uint16</code>s
	fss 2012/03/15 14:26:46: Done.
	133 Instead, somewhat analogous to constants in Go, its integer values are abstract,

	134 sizeless numbers, either signed or unsigned. When you encode an

	135 <code>int8</code>, its value is transmitted as an unsized, variable-length

	136 integer. When you encode an <code>int64</code>, its value is also transmitted as

	137 an unsized, variable-length integer. (Signed and unsigned are treated

	138 distinctly, but the same unsized-ness applies to unsigned values too.) If both

	139 have the value 7, the bits sent on the wire will be identical. When the receiver

	140 decodes that value, it puts it into the receiver's variable, which may be of

	141 arbitrary integer type. Thus an encoder may send a 7 that came from an

	142 <code>int8</code>, but the receiver may store it in an <code>int64</code>. This

	143 is fine: the value is an integer and as long as it fits, everything works. (If

	144 it doesn't fit, an error results.) This decoupling from the size of the variable

	145 gives some flexibility to the encoding: we can expand the type of the integer

	146 variable as the software evolves, but still be able to decode old data.

	147 </p>

	148

	149 <p>

	150 This flexibility also applies to pointers. Before transmission, all pointers are

	151 flattened. Values of type <code>int8</code>, <code>*int8</code>,

	152 <code>**int8</code>, etc. are all transmitted as an

	153 integer value, which may then be stored in <code>int</code> of any size, or

	154 <code>*int</code>, or <code>*****int</code>, etc. Again, this allows for

	155 flexibility.

	156 </p>
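<p>
Pointer flattening can be sketched as follows. The function names
<code>roundTrip</code> and <code>flatten</code> are hypothetical; only
<code>gob.NewEncoder</code>, <code>gob.NewDecoder</code>, <code>Encode</code> and
<code>Decode</code> come from the package.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// roundTrip encodes src and decodes the result into dst, which must be a pointer.
// (Hypothetical helper, for illustration only.)
func roundTrip(src, dst interface{}) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(src); err != nil {
		return err
	}
	return gob.NewDecoder(&buf).Decode(dst)
}

// flatten sends a **int8 and a plain int8, and receives them as an int64
// and a *int respectively.
func flatten() (int64, *int, error) {
	x := int8(7)
	p := &x
	pp := &p // **int8

	// Two levels of indirection on the sending side, none on the receiving side.
	var plain int64
	if err := roundTrip(pp, &plain); err != nil {
		return 0, nil, err
	}

	// No indirection on the sending side, one level on the receiving side.
	// The decoder allocates the int that ptr ends up pointing to.
	var ptr *int
	if err := roundTrip(x, &ptr); err != nil {
		return 0, nil, err
	}
	return plain, ptr, nil
}

func main() {
	plain, ptr, err := flatten()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(plain, *ptr)
}
```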

	157

	158 <p>

	159 Flexibility also happens because, when decoding a struct, only those fields

	160 that are sent by the encoder are stored in the destination. Given the value

	161 </p>

	162

	163 {{code "/doc/progs/gobs1.go" `/type T/` `/STOP/`}}

	164

	165 <p>

	166 the encoding of <code>t</code> sends only the 7 and 8. Because it's zero, the

	167 value of <code>Y</code> isn't even sent; there's no need to send a zero value.

	168 </p>

	169

	170 <p>

	171 The receiver could instead decode the value into this structure:

	172 </p>

	173

	174 {{code "/doc/progs/gobs1.go" `/type U/` `/STOP/`}}

	175

	176 <p>

	177 and acquire a value of <code>u</code> with only <code>X</code> set (to the

	178 address of an <code>int8</code> variable set to 7); the <code>Z</code> field is

	179 ignored - where would you put it? When decoding structs, fields are matched by

	180 name and compatible type, and only fields that exist in both are affected. This

	181 simple approach finesses the "optional field" problem: as the type

	182 <code>T</code> evolves by adding fields, out of date receivers will still

	183 function with the part of the type they recognize. Thus gobs provide the

	184 important result of optional fields - extensibility - without any additional

	185 mechanism or notation.

	186 </p>
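<p>
The declarations of <code>T</code> and <code>U</code> live in
<code>/doc/progs/gobs1.go</code>, which is not reproduced on this page. The sketch
below assumes, from the prose alone, that <code>T</code> has three <code>int</code>
fields <code>X</code>, <code>Y</code>, <code>Z</code> and that <code>U</code> has
two <code>*int8</code> fields <code>X</code> and <code>Y</code>; the function
<code>decodeTIntoU</code> is a hypothetical name.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is the sender's type (assumed layout, see the note above).
type T struct{ X, Y, Z int }

// U is the receiver's type (assumed layout): it has no Z field and its
// fields are pointers to a smaller integer type.
type U struct{ X, Y *int8 }

// decodeTIntoU encodes a T value and decodes the stream into a U.
func decodeTIntoU() (U, error) {
	var buf bytes.Buffer
	t := T{X: 7, Y: 0, Z: 8}
	if err := gob.NewEncoder(&buf).Encode(t); err != nil {
		return U{}, err
	}
	var u U
	err := gob.NewDecoder(&buf).Decode(&u)
	return u, err
}

func main() {
	u, err := decodeTIntoU()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// X was sent and matches by name; Y was zero and never sent, so it
	// stays nil; Z has nowhere to go and is dropped.
	fmt.Println(*u.X, u.Y == nil)
}
```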

	187

	188 <p>

	189 From integers we can build all the other types: bytes,
	adg 2012/03/15 05:56:19: drop the <code> on these
	fss 2012/03/15 14:26:46: Done.
	190 strings, arrays, slices,

	191 maps, even floats. Floating-point values are

	192 represented by their IEEE 754 floating-point bit pattern, stored as an integer,

	193 which works fine as long as you know their type, which we always do. By the way,

	194 that integer is sent in byte-reversed order because common values of

	195 floating-point numbers, such as small integers, have a lot of zeros at the low

	196 end that we can avoid transmitting.

	197 </p>

	198

	199 <p>

	200 One nice feature of gobs that Go makes possible is that they allow you to define

	201 your own encoding by having your type satisfy the

	202 <a href="/pkg/encoding/gob/#GobEncoder">GobEncoder</a> and

	203 <a href="/pkg/encoding/gob/#GobDecoder">GobDecoder</a> interfaces, in a manner

	204 analogous to the <a href="/pkg/encoding/json/">JSON</a> package's

	205 <a href="/pkg/encoding/json/#Marshaler">Marshaler</a> and

	206 <a href="/pkg/encoding/json/#Unmarshaler">Unmarshaler</a> and also to the

	207 <a href="/pkg/fmt/#Stringer">Stringer</a> interface from

	208 <a href="/pkg/fmt/">package fmt</a>. This facility makes it possible to

	209 represent special features, enforce constraints, or hide secrets when you

	210 transmit data. See the <a href="/pkg/encoding/gob/">documentation</a> for

	211 details.

	212 </p>
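<p>
As a sketch of the "hide secrets" case, the hypothetical type
<code>Account</code> below implements both interfaces so that only its name ever
reaches the wire. The method signatures are those of <code>GobEncoder</code> and
<code>GobDecoder</code>; the type, its fields, and <code>sendAccount</code> are
assumptions made for the example.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Account is a hypothetical type with a field that must never be transmitted.
type Account struct {
	Name     string
	password string
}

// GobEncode implements gob.GobEncoder: only the name is written.
func (a Account) GobEncode() ([]byte, error) {
	return []byte(a.Name), nil
}

// GobDecode implements gob.GobDecoder: it restores the name and leaves
// the password empty.
func (a *Account) GobDecode(data []byte) error {
	a.Name = string(data)
	a.password = ""
	return nil
}

// sendAccount round-trips an Account through a gob stream.
func sendAccount(in Account) (Account, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return Account{}, err
	}
	var out Account
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	out, err := sendAccount(Account{Name: "gopher", password: "hunter2"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("name=%q password=%q\n", out.Name, out.password)
}
```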

	213

	214 <p>

	215 <b>Types on the wire</b>

	216 </p>

	217

	218 <p>

	219 The first time you send a given type, the gob package includes in the data

	220 stream a description of that type. In fact, what happens is that the encoder is

	221 used to encode, in the standard gob encoding format, an internal struct that

	222 describes the type and gives it a unique number. (Basic types, plus the layout

	223 of the type description structure, are predefined by the software for

	224 bootstrapping.) After the type is described, it can be referenced by its type

	225 number.

	226 </p>

	227

	228 <p>

	229 Thus when we send our first type <code>T</code>, the gob encoder sends a

	230 description of <code>T</code> and tags it with a type number, say 127. All

	231 values, including the first, are then prefixed by that number, so a stream of

	232 <code>T</code> values looks like:

	233 </p>

	234

	235 <pre>

	236 ("define type id" 127, definition of type T)(127, T value)(127, T value), ...

	237 </pre>
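<p>
One way to observe the type description being sent only once is to compare the
number of bytes written by the first and second <code>Encode</code> calls on the
same encoder. The struct <code>T</code> and the function
<code>messageSizes</code> are assumptions made for this sketch.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is an example struct type (assumed layout, for illustration only).
type T struct{ X, Y, Z int }

// messageSizes encodes the same value twice on one encoder and reports how
// many bytes each call added to the stream.
func messageSizes() (first, second int, err error) {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)

	// First send: type definition plus value.
	if err = enc.Encode(T{X: 7, Z: 8}); err != nil {
		return 0, 0, err
	}
	first = buf.Len()

	// Second send: the type is already known to the stream, so only the
	// type number and the value are written.
	if err = enc.Encode(T{X: 7, Z: 8}); err != nil {
		return 0, 0, err
	}
	second = buf.Len() - first
	return first, second, nil
}

func main() {
	first, second, err := messageSizes()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("first message: %d bytes, second message: %d bytes\n", first, second)
}
```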

	238

	239 <p>

	240 These type numbers make it possible to describe recursive types and send values

	241 of those types. Thus gobs can encode types such as trees:

	242 </p>

	243

	244 {{code "/doc/progs/gobs1.go" `/type Node/` `/STOP/`}}

	245

	246 <p>

	247 (It's an exercise for the reader to discover how the zero-defaulting rule makes

	248 this work, even though gobs don't represent pointers.)

	249 </p>
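<p>
A small round trip of such a tree, as a sketch. The declaration of
<code>Node</code> comes from <code>/doc/progs/gobs1.go</code>, which is not shown
here, so the layout below (an integer value plus left and right child pointers)
is an assumption, as is the helper name <code>roundTripTree</code>.
</p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Node is a binary tree node (assumed layout, see the note above).
type Node struct {
	Value       int
	Left, Right *Node
}

// roundTripTree encodes a small tree and decodes it into a fresh Node.
func roundTripTree() (Node, error) {
	tree := Node{
		Value: 1,
		Left:  &Node{Value: 2},
		Right: &Node{Value: 3, Left: &Node{Value: 4}},
	}
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(tree); err != nil {
		return Node{}, err
	}
	var out Node
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	out, err := roundTripTree()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Missing children come back as nil pointers on the receiving side.
	fmt.Println(out.Value, out.Left.Value, out.Right.Value, out.Right.Left.Value)
	fmt.Println(out.Left.Left == nil, out.Right.Right == nil)
}
```

<p>
Note that this works for tree-shaped data only; values containing cycles are a
different matter and are not covered by this sketch.
</p>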

	250

	251 <p>

	252 With the type information, a gob stream is fully self-describing except for the

	253 set of bootstrap types, which is a well-defined starting point.

	254 </p>

	255

	256 <p>

	257 <b>Compiling a machine</b>

	258 </p>

	259

	260 <p>

	261 The first time you encode a value of a given type, the gob package builds a

	262 little interpreted machine specific to that data type. It uses reflection on

	263 the type to construct that machine, but once the machine is built it does not

	264 depend on reflection. The machine uses package unsafe and some trickery to

	265 convert the data into the encoded bytes at high speed. It could use reflection

	266 and avoid unsafe, but would be significantly slower. (A similar high-speed

	267 approach is taken by the protocol buffer support for Go, whose design was

	268 influenced by the implementation of gobs.) Subsequent values of the same type

	269 use the already-compiled machine, so they can be encoded right away.

	270 </p>

	271

	272 <p>

	273 Decoding is similar but harder. When you decode a value, the gob package holds

	274 a byte slice representing a value of a given encoder-defined type to decode,

	275 plus a Go value into which to decode it. The gob package builds a machine for

	276 that pair: the gob type sent on the wire crossed with the Go type provided for

	277 decoding. Once that decoding machine is built, though, it's again a

	278 reflectionless engine that uses unsafe methods to get maximum speed.

	279 </p>

	280

	281 <p>

	282 <b>Use</b>

	283 </p>

	284

	285 <p>

	286 There's a lot going on under the hood, but the result is an efficient,

	287 easy-to-use encoding system for transmitting data. Here's a complete example

	288 showing differing encoded and decoded types. Note how easy it is to send and

	289 receive values; all you need to do is present values and variables to the

	290 <a href="/pkg/encoding/gob/">gob package</a> and it does all the work.

	291 </p>

	292

	293 {{code "/doc/progs/gobs2.go" `/package main/` `/STOP/`}}

	294

	295 <p>

	296 You can compile and run this example code in the
	adg 2012/03/15 05:56:19: s/by copying it into/in/
	fss 2012/03/15 14:26:46: Done.
	297 <a href="http://play.golang.org/p/_-OJV-rwMq">Go Playground</a>.

	298 </p>

	299

	300 <p>

	301 The <a href="/pkg/net/rpc/">rpc package</a> builds on gobs to turn this

	302 encode/decode automation into transport for method calls across the network.

	303 That's a subject for another article.
	adg 2012/03/15 05:56:19: s/post/article/
	fss 2012/03/15 14:26:46: Done.
	304 </p>

	305

	306 <p>

	307 <b>Details</b>

	308 </p>

	309

	310 <p>

	311 The <a href="/pkg/encoding/gob/">gob package documentation</a>, especially the

	312 file <a href="/src/pkg/encoding/gob/doc.go">doc.go</a>, expands on many of the

	313 details described here and includes a full worked example showing how the

	314 encoding represents data. If you are interested in the innards of the gob

	315 implementation, that's a good place to start.

	316 </p>
