OLD | NEW |
---|---|
(Empty) | |
1 <!--{ | |
2 "Title": "Gobs of data", | |
3 "Template": true | |
4 }--> | |
5 | |
6 <p> | |
7 To transmit a data structure across a network or to store it in a file, it must | |
8 be encoded and then decoded again. There are many encodings available, of | |
9 course: <a href="http://www.json.org/">JSON</a>, | |
10 <a href="http://www.w3.org/XML/">XML</a>, Google's | |
11 <a href="http://code.google.com/p/protobuf">protocol buffers</a>, and more. | |
12 And now there's another, provided by Go's <a href="/pkg/encoding/gob/">gob</a> | |
13 package. | |
14 </p> | |
15 | |
16 <p> | |
17 Why define a new encoding? It's a lot of work and redundant at that. Why not | |
18 just use one of the existing formats? Well, for one thing, we do! Go has | |
19 <a href="/pkg/">packages</a> supporting all the encodings just mentioned (the | |
20 <a href="http://code.google.com/p/goprotobuf">protocol buffer package</a> is in | |
21 a separate repository but it's one of the most frequently downloaded). And for | |
22 many purposes, including communicating with tools and systems written in other | |
23 languages, they're the right choice. | |
24 </p> | |
25 | |
26 <p> | |
27 But for a Go-specific environment, such as communicating between two servers | |
28 written in Go, there's an opportunity to build something much easier to use and | |
29 possibly more efficient. | |
30 </p> | |
31 | |
32 <p> | |
33 Gobs work with the language in a way that an externally-defined, | |
34 language-independent encoding cannot. At the same time, there are lessons to be | |
35 learned from the existing systems. | |
36 </p> | |
37 | |
38 <p> | |
39 <b>Goals</b> | |
40 </p> | |
41 | |
42 <p> | |
43 The gob package was designed with a number of goals in mind. | |
44 </p> | |
45 | |
46 <p> | |
47 First, and most obvious, it had to be very easy to use. Because Go has | |
48 reflection, there is no need for a separate interface definition language or | |
49 "protocol compiler". The data structure itself is all the package should need | |
50 to figure out how to encode and decode it. On the other hand, this approach | |
51 means that gobs will never work as well with other languages, but that's OK: | |
52 gobs are unashamedly Go-centric. | |
53 </p> | |
54 | |
55 <p> | |
56 Efficiency is also important. Textual representations, exemplified by XML and | |
57 JSON, are too slow to put at the center of an efficient communications network. | |
58 A binary encoding is necessary. | |
59 </p> | |
60 | |
61 <p> | |
62 Gob streams must be self-describing. Each gob stream, read from the beginning, | |
63 contains sufficient information that the entire stream can be parsed by an | |
64 agent that knows nothing a priori about its contents. This property means that | |
65 you will always be able to decode a gob stream stored in a file, even long | |
66 after you've forgotten what data it represents. | |
67 </p> | |
68 | |
69 <p> | |
70 There were also some things to learn from our experiences with Google protocol | |
71 buffers. | |
72 </p> | |
73 | |
74 <p> | |
75 <b>Protocol buffer misfeatures</b> | |
76 </p> | |
77 | |
78 <p> | |
79 Protocol buffers had a major effect on the design of gobs, but have three | |
80 features that were deliberately avoided. (Leaving aside the property that | |
81 protocol buffers aren't self-describing: if you don't know the data definition | |
82 used to encode a protocol buffer, you might not be able to parse it.) | |
83 </p> | |
84 | |
85 <p> | |
86 First, protocol buffers only work on the data type we call a struct in Go. You | |
87 can't encode an integer or array at the top level, only a struct with fields | |
88 inside it. That seems a pointless restriction, at least in Go. If all you want | |
89 to send is an array of integers, why should you have to put it into a | |
90 struct first? | |
91 </p> | |
92 | |
93 <p> | |
94 Next, a protocol buffer definition may specify that fields <code>T.x</code> and | |
95 <code>T.y</code> are required to be present whenever a value of type | |
96 <code>T</code> is encoded or decoded. Although such required fields may seem | |
97 like a good idea, they are costly to implement because the codec must maintain a | |
98 separate data structure while encoding and decoding, to be able to report when | |
99 required fields are missing. They're also a maintenance problem. Over time, one | |
100 may want to modify the data definition to remove a required field, but that may | |
101 cause existing clients of the data to crash. It's better not to have them in the | |
102 encoding at all. (Protocol buffers also have optional fields. But if we don't | |
103 have required fields, all fields are optional and that's that. There will be | |
104 more to say about optional fields a little later.) | |
105 </p> | |
106 | |
107 <p> | |
108 The third protocol buffer misfeature is default values. If a protocol buffer | |
109 omits the value for a "defaulted" field, then the decoded structure behaves as | |
110 if the field were set to that value. This idea works nicely when you have | |
111 getter and setter methods to control access to the field, but is harder to | |
112 handle cleanly when the container is just a plain idiomatic struct. Default | |
113 values are also tricky to implement: where does one define them, and | |
114 what types do they have (is text UTF-8? uninterpreted bytes? how many bits in a | |
115 float?). Despite their apparent simplicity, there were a number of | |
116 complications in their design and implementation for protocol buffers. We | |
117 decided to leave them out of gobs and fall back to Go's trivial but effective | |
118 defaulting rule: unless you set something otherwise, it has the "zero value" | |
119 for that type - and it doesn't need to be transmitted. | |
120 </p> | |
121 | |
122 <p> | |
123 So gobs end up looking like a sort of generalized, simplified protocol buffer. | |
124 How do they work? | |
125 </p> | |
126 | |
127 <p> | |
128 <b>Values</b> | |
129 </p> | |
130 | |
131 <p> | |
132 The encoded gob data isn't about <code>int8s</code> and <code>uint16s</code>. | |
adg
2012/03/15 05:56:19
<code>int8</code>s
<code>uint16</code>s
fss
2012/03/15 14:26:46
Done.
| |
133 Instead, somewhat analogous to constants in Go, its integer values are abstract, | |
134 sizeless numbers, either signed or unsigned. When you encode an | |
135 <code>int8</code>, its value is transmitted as an unsized, variable-length | |
136 integer. When you encode an <code>int64</code>, its value is also transmitted as | |
137 an unsized, variable-length integer. (Signed and unsigned are treated | |
138 distinctly, but the same unsized-ness applies to unsigned values too.) If both | |
139 have the value 7, the bits sent on the wire will be identical. When the receiver | |
140 decodes that value, it puts it into the receiver's variable, which may be of | |
141 arbitrary integer type. Thus an encoder may send a 7 that came from an | |
142 <code>int8</code>, but the receiver may store it in an <code>int64</code>. This | |
143 is fine: the value is an integer and as long as it fits, everything works. (If | |
144 it doesn't fit, an error results.) This decoupling from the size of the variable | |
145 gives some flexibility to the encoding: we can expand the type of the integer | |
146 variable as the software evolves, but still be able to decode old data. | |
147 </p> | |
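The size-independence described in the paragraph above can be checked with a small program. This is only a sketch: the helper names `encodeValue`, `widen` and `narrow` are invented for illustration and are not part of the article or of the gob package.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// encodeValue gob-encodes v into a fresh buffer (hypothetical helper).
func encodeValue(v interface{}) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		return nil, err
	}
	return &buf, nil
}

// widen sends an int8 and receives it into an int64.
func widen(v int8) (int64, error) {
	buf, err := encodeValue(v)
	if err != nil {
		return 0, err
	}
	var out int64
	err = gob.NewDecoder(buf).Decode(&out)
	return out, err
}

// narrow sends an int64 and tries to receive it into an int8.
// It returns an error when the value does not fit.
func narrow(v int64) (int8, error) {
	buf, err := encodeValue(v)
	if err != nil {
		return 0, err
	}
	var out int8
	err = gob.NewDecoder(buf).Decode(&out)
	return out, err
}

func main() {
	w, err := widen(7)
	fmt.Println("int8 -> int64:", w, err)

	n, err := narrow(7)
	fmt.Println("int64 -> int8:", n, err)

	_, err = narrow(300) // 300 does not fit in an int8
	fmt.Println("overflow reported:", err != nil)
}
```

The receiving variable's size only matters at decode time, when the value either fits or produces an error.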
148 | |
149 <p> | |
150 This flexibility also applies to pointers. Before transmission, all pointers are | |
151 flattened. Values of type <code>int8</code>, <code>*int8</code>, | |
152 <code>**int8</code>, <code>****int8</code>, etc. are all transmitted as an | |
153 integer value, which may then be stored in <code>int</code> of any size, or | |
154 <code>*int</code>, or <code>******int</code>, etc. Again, this allows for | |
155 flexibility. | |
156 </p> | |
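A minimal sketch of pointer flattening, assuming nothing beyond the standard `encoding/gob` API: a `**int8` is encoded and the value is received through a `**int64`. The function name `flatten` is made up for this illustration.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// flatten encodes a value reached through two levels of pointers and
// decodes it into a differently sized integer behind two other pointers.
func flatten(v int8) (int64, error) {
	p := &v
	pp := &p // **int8 on the sending side

	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(pp); err != nil {
		return 0, err
	}

	// The destination starts out nil; the decoder allocates as needed.
	var out **int64
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		return 0, err
	}
	return **out, nil
}

func main() {
	got, err := flatten(7)
	fmt.Println(got, err)
}
```

Only the integer travels on the wire; the pointer structure on each side is a local matter.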
157 | |
158 <p> | |
159 Flexibility also happens because, when decoding a struct, only those fields | |
160 that are sent by the encoder are stored in the destination. Given the value | |
161 </p> | |
162 | |
163 {{code "/doc/progs/gobs1.go" `/type T/` `/STOP/`}} | |
164 | |
165 <p> | |
166 the encoding of <code>t</code> sends only the 7 and 8. Because it's zero, the | |
167 value of <code>Y</code> isn't even sent; there's no need to send a zero value. | |
168 </p> | |
169 | |
170 <p> | |
171 The receiver could instead decode the value into this structure: | |
172 </p> | |
173 | |
174 {{code "/doc/progs/gobs1.go" `/type U/` `/STOP/`}} | |
175 | |
176 <p> | |
177 and acquire a value of <code>u</code> with only <code>X</code> set (to the | |
178 address of an <code>int8</code> variable set to 7); the <code>Z</code> field is | |
179 ignored - where would you put it? When decoding structs, fields are matched by | |
180 name and compatible type, and only fields that exist in both are affected. This | |
181 simple approach finesses the "optional field" problem: as the type | |
182 <code>T</code> evolves by adding fields, out of date receivers will still | |
183 function with the part of the type they recognize. Thus gobs provide the | |
184 important result of optional fields - extensibility - without any additional | |
185 mechanism or notation. | |
186 </p> | |
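The field matching described above can be sketched as follows. The declarations of `T` and `U` are assumptions reconstructed from the prose (the included file `gobs1.go` is not shown in this diff), and `decodeTIntoU` is a hypothetical helper.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is the sender's type (assumed shape, based on the text).
type T struct{ X, Y, Z int }

// U is the receiver's type (assumed shape): it has no Z field,
// and its fields are pointers to a smaller integer type.
type U struct{ X, Y *int8 }

// decodeTIntoU encodes a T and decodes the result into a U.
func decodeTIntoU(t T) (U, error) {
	var buf bytes.Buffer
	var u U
	if err := gob.NewEncoder(&buf).Encode(t); err != nil {
		return u, err
	}
	err := gob.NewDecoder(&buf).Decode(&u)
	return u, err
}

func main() {
	u, err := decodeTIntoU(T{X: 7, Y: 0, Z: 8})
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("X set:", u.X != nil && *u.X == 7)
	fmt.Println("Y left nil (zero value was never sent):", u.Y == nil)
}
```

Because `Y` is zero it is not transmitted, so the receiver's `Y` pointer is left untouched, and `Z` is dropped because `U` has nowhere to put it.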
187 | |
188 <p> | |
189 From integers we can build all the other types: <code>bytes</code>, | |
adg
2012/03/15 05:56:19
drop the <code> on these
fss
2012/03/15 14:26:46
Done.
| |
190 <code>strings</code>, <code>arrays</code>, <code>slices</code>, | |
191 <code>maps</code>, even <code>floats</code>. Floating-point values are | |
192 represented by their IEEE 754 floating-point bit pattern, stored as an integer, | |
193 which works fine as long as you know their type, which we always do. By the way, | |
194 that integer is sent in byte-reversed order because common values of | |
195 floating-point numbers, such as small integers, have a lot of zeros at the low | |
196 end that we can avoid transmitting. | |
197 </p> | |
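The benefit of byte reversal can be illustrated with the standard `math` and `math/bits` packages. This is not the gob package's internal code, only a sketch of the idea; `rawBits`, `reversedBits` and `significantBytes` are names invented here.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// rawBits returns the IEEE 754 bit pattern of f as an integer.
func rawBits(f float64) uint64 {
	return math.Float64bits(f)
}

// reversedBits returns the same pattern with its bytes reversed,
// which moves the trailing zero bytes of "simple" values to the top.
func reversedBits(f float64) uint64 {
	return bits.ReverseBytes64(math.Float64bits(f))
}

// significantBytes reports how many bytes are needed to hold u
// once leading zero bytes are dropped.
func significantBytes(u uint64) int {
	return (bits.Len64(u) + 7) / 8
}

func main() {
	f := 17.0
	fmt.Printf("raw:      %#016x needs %d bytes\n", rawBits(f), significantBytes(rawBits(f)))
	fmt.Printf("reversed: %#016x needs %d bytes\n", reversedBits(f), significantBytes(reversedBits(f)))
}
```

For 17.0 the raw pattern occupies all eight bytes as an integer, while the reversed pattern fits in two, which is what makes a variable-length integer encoding pay off.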
198 | |
199 <p> | |
200 One nice feature of gobs that Go makes possible is that they allow you to define | |
201 your own encoding by having your type satisfy the | |
202 <a href="/pkg/encoding/gob/#GobEncoder">GobEncoder</a> and | |
203 <a href="/pkg/encoding/gob/#GobDecoder">GobDecoder</a> interfaces, in a manner | |
204 analogous to the <a href="/pkg/encoding/json/">JSON</a> package's | |
205 <a href="/pkg/encoding/json/#Marshaler">Marshaler</a> and | |
206 <a href="/pkg/encoding/json/#Unmarshaler">Unmarshaler</a> and also to the | |
207 <a href="/pkg/fmt/#Stringer">Stringer</a> interface from | |
208 <a href="/pkg/fmt/">package fmt</a>. This facility makes it possible to | |
209 represent special features, enforce constraints, or hide secrets when you | |
210 transmit data. See the <a href="/pkg/encoding/gob/">documentation</a> for | |
211 details. | |
212 </p> | |
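As a sketch of the "hide secrets" use mentioned above, the following type implements `GobEncoder` and `GobDecoder` so that only one of its fields is transmitted. The `Credentials` type and the `roundTrip` helper are hypothetical and exist only for this illustration.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Credentials is a hypothetical type whose Password must never be sent.
type Credentials struct {
	User     string
	Password string
}

// GobEncode implements gob.GobEncoder: only the user name goes on the wire.
func (c Credentials) GobEncode() ([]byte, error) {
	return []byte(c.User), nil
}

// GobDecode implements gob.GobDecoder: it restores the user name and
// leaves the password empty.
func (c *Credentials) GobDecode(data []byte) error {
	c.User = string(data)
	c.Password = ""
	return nil
}

// roundTrip encodes c and decodes it into a fresh value.
func roundTrip(c Credentials) (Credentials, error) {
	var buf bytes.Buffer
	var out Credentials
	if err := gob.NewEncoder(&buf).Encode(c); err != nil {
		return out, err
	}
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	out, err := roundTrip(Credentials{User: "gopher", Password: "hunter2"})
	fmt.Printf("%+v %v\n", out, err)
}
```

A real type would more likely use a structured payload than a bare byte copy of one field; the point here is only that the type, not the package, decides what is transmitted.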
213 | |
214 <p> | |
215 <b>Types on the wire</b> | |
216 </p> | |
217 | |
218 <p> | |
219 The first time you send a given type, the gob package includes in the data | |
220 stream a description of that type. In fact, what happens is that the encoder is | |
221 used to encode, in the standard gob encoding format, an internal struct that | |
222 describes the type and gives it a unique number. (Basic types, plus the layout | |
223 of the type description structure, are predefined by the software for | |
224 bootstrapping.) After the type is described, it can be referenced by its type | |
225 number. | |
226 </p> | |
227 | |
228 <p> | |
229 Thus when we send our first type <code>T</code>, the gob encoder sends a | |
230 description of <code>T</code> and tags it with a type number, say 127. All | |
231 values, including the first, are then prefixed by that number, so a stream of | |
232 <code>T</code> values looks like: | |
233 </p> | |
234 | |
235 <pre> | |
236 ("define type id" 127, definition of type T)(127, T value)(127, T value), ... | |
237 </pre> | |
238 | |
239 <p> | |
240 These type numbers make it possible to describe recursive types and send values | |
241 of those types. Thus gobs can encode types such as trees: | |
242 </p> | |
243 | |
244 {{code "/doc/progs/gobs1.go" `/type Node/` `/STOP/`}} | |
245 | |
246 <p> | |
247 (It's an exercise for the reader to discover how the zero-defaulting rule makes | |
248 this work, even though gobs don't represent pointers.) | |
249 </p> | |
250 | |
251 <p> | |
252 With the type information, a gob stream is fully self-describing except for the | |
253 set of bootstrap types, which is a well-defined starting point. | |
254 </p> | |
255 | |
256 <p> | |
257 <b>Compiling a machine</b> | |
258 </p> | |
259 | |
260 <p> | |
261 The first time you encode a value of a given type, the gob package builds a | |
262 little interpreted machine specific to that data type. It uses reflection on | |
263 the type to construct that machine, but once the machine is built it does not | |
264 depend on reflection. The machine uses package unsafe and some trickery to | |
265 convert the data into the encoded bytes at high speed. It could use reflection | |
266 and avoid unsafe, but would be significantly slower. (A similar high-speed | |
267 approach is taken by the protocol buffer support for Go, whose design was | |
268 influenced by the implementation of gobs.) Subsequent values of the same type | |
269 use the already-compiled machine, so they can be encoded right away. | |
270 </p> | |
271 | |
272 <p> | |
273 Decoding is similar but harder. When you decode a value, the gob package holds | |
274 a byte slice representing a value of a given encoder-defined type to decode, | |
275 plus a Go value into which to decode it. The gob package builds a machine for | |
276 that pair: the gob type sent on the wire crossed with the Go type provided for | |
277 decoding. Once that decoding machine is built, though, it's again a | |
278 reflectionless engine that uses unsafe methods to get maximum speed. | |
279 </p> | |
280 | |
281 <p> | |
282 <b>Use</b> | |
283 </p> | |
284 | |
285 <p> | |
286 There's a lot going on under the hood, but the result is an efficient, | |
287 easy-to-use encoding system for transmitting data. Here's a complete example | |
288 showing differing encoded and decoded types. Note how easy it is to send and | |
289 receive values; all you need to do is present values and variables to the | |
290 <a href="/pkg/encoding/gob/">gob package</a> and it does all the work. | |
291 </p> | |
292 | |
293 {{code "/doc/progs/gobs2.go" `/package main/` `/STOP/`}} | |
294 | |
295 <p> | |
296 You can compile and run this example code by copying it into the | |
adg
2012/03/15 05:56:19
s/by copying it into/in/
fss
2012/03/15 14:26:46
Done.
| |
297 <a href="http://play.golang.org/p/_-OJV-rwMq">Go Playground</a>. | |
298 </p> | |
299 | |
300 <p> | |
301 The <a href="/pkg/net/rpc/">rpc package</a> builds on gobs to turn this | |
302 encode/decode automation into transport for method calls across the network. | |
303 That's a subject for another post. | |
adg
2012/03/15 05:56:19
s/post/article/
fss
2012/03/15 14:26:46
Done.
| |
304 </p> | |
305 | |
306 <p> | |
307 <b>Details</b> | |
308 </p> | |
309 | |
310 <p> | |
311 The <a href="/pkg/encoding/gob/">gob package documentation</a>, especially the | |
312 file <a href="/src/pkg/encoding/gob/doc.go">doc.go</a>, expands on many of the | |
313 details described here and includes a full worked example showing how the | |
314 encoding represents data. If you are interested in the innards of the gob | |
315 implementation, that's a good place to start. | |
316 </p> | |