 Zooko O'Whielacronx
November 2002
edited for clarity February 2007
Corrected November 2009
human-oriented base32 encoding
INTRO
The base32 encoding implemented in this library differs from that described in
RFC 3548 [1] in several ways. This document describes why we made each
different choice, and also includes a section at the end on "COMPATIBILITY AND
INTEROPERATION".
This encoding is implemented in a project named zbase32 [2].
This is version 0.9.4.6 of this document. The latest version should always be
available at:
http://zooko.com/repos/zbase32/base32/DESIGN
RATIONALE
The rationale for base32 encoding in RFC 3548 [1] is as written therein: "The
Base 32 encoding is designed to represent arbitrary sequences of octets in a
form that needs to be case insensitive but need not be humanly readable."
The rationale for our encoding is different: it is to represent arbitrary
sequences of octets in a form that is as convenient as possible for human
users to manipulate. In particular, zbase32 was created in order to serve
the Mnet project [3], where 30-octet cryptographic values are encoded into
URIs for humans to manipulate. Anticipated uses of these URIs include
cut-and-paste, text editing (e.g. in HTML files), manual transcription via a
keyboard, manual transcription via pen-and-paper, vocal transcription over
phone or radio, etc.
The desiderata for such an encoding are:
* minimizing transcription errors, e.g. the well-known problem of confusing
`0' with `O'
* embedding into other structures, e.g. search engines, structured or
marked-up text, file systems, command shells
* brevity: shorter URLs are better than longer ones.
* ergonomics: human users (especially non-technical ones) should find the
URIs as easy and pleasant as possible. The uglier the URI looks, the worse.
DESIGN
Base
The first decision we made was to use base32 instead of base64. An earlier
version of this project used base64, but a discussion on the p2p-hackers
mailing list [4] convinced us that the added length of base32 encoding was
worth the added clarity provided by: case-insensitivity, the absence of
non-alphanumeric characters, and the ability to omit a few of the most
troublesome alphanumeric characters.
In particular, we realized that it would probably be faster and more comfortable
to vocally read off a base32 encoded 30-octet string (48 characters,
case-insensitive, no non-alphanumeric characters) than a base64 encoded one
(40 characters, case-sensitive, plus two non-alphanumeric characters).
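The character counts quoted above follow from dividing the number of data bits by the bits carried per character. A minimal sketch of that arithmetic, where the helper name `encoded_length` is hypothetical and padding characters are not counted:

```python
def encoded_length(num_octets: int, bits_per_char: int) -> int:
    """Characters needed to carry num_octets of data, excluding any padding."""
    total_bits = num_octets * 8
    return -(-total_bits // bits_per_char)  # ceiling division

print(encoded_length(30, 5))  # base32: 48
print(encoded_length(30, 6))  # base64: 40
```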
Alphabet
There are 26 alphabet characters and 10 digits, for a total of 36 characters
available. We need only 32 characters for our base32 alphabet, so we can
choose four characters to exclude. This is where we part company with
traditional base32 encodings. For example, [1] eliminates `0', `1', `8', and
`9'. This choice eliminates two characters that are relatively unambiguous
(`8' and `9') while retaining others that are potentially confusing. Others
have suggested eliminating `0', `1', `O', and `L', which is likewise suboptimal.
Our choice of confusing characters to eliminate is: `0', `l', `v', and `2'. Our
reasoning is that `0' is potentially mistaken for `o', that `l' is potentially
mistaken for `1' or `i', that `v' is potentially mistaken for `u' or `r'
(especially in handwriting) and that `2' is potentially mistaken for `z'
(especially in handwriting).
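As a quick check of the choices described above, the excluded characters can be recovered by subtracting each alphabet from the 36 available characters. This is only an illustrative sketch: the two alphabet strings are the ones listed in the "Order of Alphabet" section of this document, and the helper name `excluded` is hypothetical.

```python
import string

RFC3548_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # lower-cased for comparison
ZBASE32_ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"

def excluded(alphabet: str) -> list:
    """Return the letters and digits that the given alphabet leaves out."""
    available = string.ascii_lowercase + string.digits  # 26 + 10 = 36 characters
    return sorted(set(available) - set(alphabet))

print(excluded(RFC3548_ALPHABET))  # ['0', '1', '8', '9']
print(excluded(ZBASE32_ALPHABET))  # ['0', '2', 'l', 'v']
```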
Note that we choose to focus on typed and written transcription more than on
vocal, since humans already have a well-established system of disambiguating
spoken alphanumerics, such as the United States military's "Alpha Bravo Charlie
Delta" and telephone operators' "Is that 'd' as in 'dog'?".
Order of Alphabet
Some of the alphabet characters will appear more frequently than others in the
final position of the encoded strings, assuming an even distribution of binary
inputs. (This is true whether the lengths of your inputs are evenly distributed
over all integer numbers of bits *or* evenly distributed over all integer
numbers of octets.) Here is a table showing which length-in-bits (modulo 5)
results in which possible trailing characters in the
"abcdefghijklmnopqrstuvwxyz234567" encoding:
1: aq
2: aqiy
3: aqiyemu4
4: aqiyemu4cgkosw26
0: abcdefghijklmnopqrstuvwxyz234567
Here is the same table for our alphabet:
1: yo
2: yoea
3: yoearcwh
4: yoearcwhngkq1s46
0: ybndrfg8ejkmcpqxot1uwisza345h769 (the whole alphabet)
We have permuted the alphabet to make the more commonly occurring characters also
be those that we think are easier to read, write, speak, and remember.
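The two tables above can be reproduced with a short sketch. It assumes that the unused low-order bits of the final quintet are filled with zeros, which is what the tables imply; the function name `trailing_chars` is hypothetical. The characters come out in alphabet-index order, so the order within a row may differ from the tables while the set of characters is the same.

```python
RFC3548_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # lower-cased, as in the table
ZBASE32_ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"

def trailing_chars(alphabet: str, bits_mod_5: int) -> str:
    """Characters that can end an encoding with this length-in-bits (mod 5)."""
    used = bits_mod_5 % 5 or 5   # data bits carried by the final character
    step = 1 << (5 - used)       # assumption: the unused low bits are always zero
    return "".join(alphabet[i] for i in range(0, 32, step))

for n in (1, 2, 3, 4, 0):
    print(n, trailing_chars(RFC3548_ALPHABET, n), trailing_chars(ZBASE32_ALPHABET, n))
```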
Length Encoding and Sub-Octet Data
Suppose you have 10 bits of data to transmit, and the recipient (the decoder) is
expecting 10 bits of data. All previous base32 encoding schemes assume that
the binary data to be encoded is in 8-bit octets, so you would have to pad the
data out to 2 octets and encode it in base32, resulting in a string
4 characters long. The decoder will decode that into 2 octets (16 bits),
ignoring the least significant 4 bits of the encoded string, and then ignore the
least significant 6 bits of the decoded data.
In the base32 encoding described here, if the encoder and decoder both know the
exact length of the data in bits (modulo 40), then they can use this shared
information to optimize the size of the transmitted (encoded) string. In the
example where you have 10 bits of data to transmit, zbase32 allows you to
transmit the optimal encoded string: two characters. If the encoder and
decoder aren't both designed with the requirement that they will know the
exact length of the data in bits, then they can of course assume that it is
"the smallest number of octets which would have required this many quintets to
encode", which is simple and unambiguous but can occasionally cost an extra
byte or two of encoded size.
You can always use this encoding the same way you would use the other
encodings: with an "input is in 8-bit octets" assumption. This would be
appropriate if the length in bits is always a multiple of 8, if both sides are
not sure of the length in bits modulo 40, or if this encoding is being used in a
way that optimizing one or two characters out of the encoded string isn't worth
the potential confusion.
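The following is a minimal sketch of the length-aware scheme described in this section, not the reference implementation. The names `encode_bits` and `decode_bits` are hypothetical, and two details are assumptions chosen to agree with the EXAMPLES table: bits are taken most significant first, and the final quintet is zero-filled. When the bit length is not shared, the decoder falls back to whole octets.

```python
from typing import Optional

ZBASE32_ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"

def encode_bits(data: bytes, num_bits: int) -> str:
    """Encode the leading num_bits bits of data, most significant bit first."""
    if not 0 <= num_bits <= len(data) * 8:
        raise ValueError("num_bits must fit within data")
    value = int.from_bytes(data, "big") >> (len(data) * 8 - num_bits)
    num_chars = -(-num_bits // 5)              # ceil(num_bits / 5)
    value <<= num_chars * 5 - num_bits         # zero-fill the final quintet
    return "".join(
        ZBASE32_ALPHABET[(value >> (5 * i)) & 31]
        for i in reversed(range(num_chars))
    )

def decode_bits(text: str, num_bits: Optional[int] = None) -> bytes:
    """Decode text; if num_bits is unknown, assume whole octets."""
    total = len(text) * 5
    if num_bits is None:
        num_bits = total // 8 * 8
    if not 0 <= num_bits <= total:
        raise ValueError("num_bits does not fit the encoded text")
    value = 0
    for ch in text:
        value = (value << 5) | ZBASE32_ALPHABET.index(ch)  # ValueError on illegal characters
    value >>= total - num_bits                 # drop the filler bits
    num_octets = -(-num_bits // 8)
    value <<= num_octets * 8 - num_bits        # left-align within the final octet
    return value.to_bytes(num_octets, "big")

ten_bits = bytes([0b10000000, 0b10000000])     # 1000000010, left-aligned in two octets
print(encode_bits(ten_bits, 10))               # on
print(encode_bits(ten_bits, 16))               # onyy
print(decode_bits("on", 10) == ten_bits)       # True
```

With the bit length shared, the 10-bit value costs two characters; under the "input is in 8-bit octets" assumption the same data costs four.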
Padding
Traditionally base32 encodings have specified trailing padding to round out the
number of characters to an even multiple of 8. This is apparently intended as
an error detection code, but we do not consider the error detection capabilities
of this code to be worth the increased length of the encoded strings, so we do
not do this.
Letter case
Lower case is easier to read. That's why people have been using it
preferentially since around the 9th century CE. Isn't it about time that
software engineers took advantage of mankind's millennia-old knowledge of
typography instead of thoughtlessly aping the hardware limitations of
terminals used in the second half of the 20th century?
EXAMPLES
#bits  base2                           base32    base64    zbase32
-----  ------------------------------  --------  --------  -------
1      0                               AA======  AA==      y
1      1                               QA======  gA==      o
2      01                              IA======  QA==      e
2      11                              YA======  wA==      a
10     0000000000                      AAAA====  AAA=      yy
10     1000000010                      QCAA====  gIA=      on
20     10001011100010001000            BC4IQ===  CLiI      tqre
24     111100001011111111000111        6C74O===  8L/H      6n9hq
24     110101000111101000000100        2R5AI===  1HoE      4t7ye
30     111101010101011110111101000011  HVK66QY=  PVXvQw==  6im54d
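The base32 and base64 columns can be cross-checked for whole-octet inputs with Python's standard `base64` module, which implements the RFC encodings. The sketch below does this for the first 24-bit row of the table; the variable names are illustrative only.

```python
import base64

# The first 24-bit row: 11110000 10111111 11000111
data = bytes([0b11110000, 0b10111111, 0b11000111])

rfc_base32 = base64.b32encode(data).decode("ascii")
rfc_base64 = base64.b64encode(data).decode("ascii")

print(rfc_base32)  # 6C74O===
print(rfc_base64)  # 8L/H
```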
A NOTE ON COMPATIBILITY AND INTEROPERATION
If your application could interoperate with an extant standard, then you
should use RFC 3548 base32 in order to facilitate interoperation by encoding
semantically identical objects into syntactically identical representations.
For example, many current systems include the SHA1 hash of the contents of a
file, and this hash value can be represented for user or programmatic sharing
in base32 encoded form [5, 6, 7, 8]. These four systems all use RFC 3548
base32 encoding. If your system will expose the SHA1 hash of the contents
of a file, then you should make sure those hash values are easily exchangeable
with those systems by using the same encoding (including base, alphabet,
permutation of alphabet, length-encoding, padding, treatment of illegal
characters and linebreaks).
If, however, the semantic meaning of the objects that you are exposing is not
something that can be understood by another extant system, due to semantic
differences, then you gain nothing with regard to interoperation by using the
same ASCII encoding, and in fact by doing so you *hamper* interoperation by
making it impossible for the applications to use syntactic features to
disambiguate between semantic features.
Lucas Gonze has suggested [9] that different schemes could in fact
deliberately add characters which would be illegal in another scheme in order
to enable syntactic differentiation. (This would be morally similar to the
"check digit" included in most credit card numbers.)
Obviously the more reliable semantic differentiation is an unambiguous one
that is transmitted out-of-band (outside of the encoded string, that is), such
as URI scheme names (e.g.: SHA1:blahblahblah or mnet://blahblahblah).
However, users might not always pay the cost to preserve those.
REFERENCES
[1] http://www.faqs.org/rfcs/rfc3548.html
[2] http://zooko.com/repos/zbase32
[3] http://mnetproject.org/
[4] http://zgp.org/pipermail/p2phackers/2001October/
[5] Gnutella [need URL for SHA1 and base32 encoding stuff]
[6] Bitzi [need URL for specification stuff]
[7] CAW [need URL]
[8] http://opencontent.net/specs/draftjchapweskethex01.html
[9] http://zgp.org/pipermail/p2phackers/2002November/000924.html
[10] http://zgp.org/pipermail/p2phackers/2002November/000927.html
NEEDED TO ADD
* possible new design element: optional check characters (e.g. Luhn-like
algorithm)
* Full spec, including the issues named by draft-josefsson-base-encoding-04.txt
such as treatment of illegal chars, etc.
+ also issues of length-encoding, scheme-encoding (e.g. URIs), etc.
* Explanation of why we avoid non-alphanumerics.
* Mention of the myriad other clarity issues such as those Gojomo posted?
