GPG Web Key Directory

2024-02-15 (updated 2024-02-29)

Warning

This article was last updated on 2024-02-29; its content may be out of date.

How WKD computes the 32-character string for the mapped local part of an email address

I was trying to work out why the WKD URL for 0@ftls.xyz is https://ftls.xyz/.well-known/openpgpkey/hu/s3cj9timbzrn3hjyu8ehfiykzgkqooec

WKD service draft: https://datatracker.ietf.org/doc/draft-koch-openpgp-webkey-service/

The request URL looks like
https://example.org/.well-known/openpgpkey/hu/XXXX
With XXXX being a 32-char long string constructed of the mapped local part of the email, SHA-1 hashed and z-Base-32 encoded.
The reason for using this encoding instead of a standard hex encoding is to visually distinguish such an item from a fingerprint. Furthermore, in contrast to Base-64 and other Base-32 encodings, z-Base-32 has been optimized for easier human use.

from https://wiki.gnupg.org/EasyGpg2016/PubkeyDistributionConcept

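According to this description, it is enough to take the local part of 0@ftls.xyz (here, 0), compute its SHA-1 digest, and z-Base-32 encode that digest.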

python

import hashlib
import zbase32
# pip install z-base-32
# https://pypi.org/project/z-base-32/
# https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt

# The string to hash and encode: the local part of 0@ftls.xyz
input_string = "0"

# Compute the SHA-1 digest as raw bytes
sha1_hash = hashlib.sha1(input_string.encode("utf-8")).digest()
print("SHA-1 Hash (Hex):", sha1_hash.hex())

# Encode the raw digest bytes as z-base-32
encoded_data = zbase32.encode(sha1_hash)
print("Encoded:", encoded_data)
# s3cj9timbzrn3hjyu8ehfiykzgkqooec
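
The same value can also be reproduced with only the Python standard library. The sketch below is an illustration under stated assumptions, not a reference implementation. It relies on the observation that a 20-byte SHA-1 digest is exactly 32 groups of 5 bits, so for this input z-base-32 differs from RFC 3548 base32 only in its alphabet, and `base64.b32encode` plus a character translation is enough. The helper names `wkd_hash` and `wkd_direct_url` are made up for this example, and the ASCII-only lowercasing is my reading of the "mapped local part" wording in the draft.

```python
import base64
import hashlib

RFC_BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"
ZBASE32_ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"
_TO_ZBASE32 = str.maketrans(RFC_BASE32_ALPHABET, ZBASE32_ALPHABET)


def wkd_hash(local_part: str) -> str:
    """SHA-1 the local part (ASCII letters lowercased), then z-base-32 encode."""
    # bytes.lower() only touches ASCII letters; non-ASCII bytes are left alone.
    mapped = local_part.encode("utf-8").lower()
    digest = hashlib.sha1(mapped).digest()
    # 20 octets = 160 bits = 32 quintets, so b32encode emits no "=" padding here.
    return base64.b32encode(digest).decode("ascii").translate(_TO_ZBASE32)


def wkd_direct_url(email: str) -> str:
    """Build a URL in the form quoted above (hypothetical helper)."""
    local_part, _, domain = email.rpartition("@")
    return f"https://{domain.lower()}/.well-known/openpgpkey/hu/{wkd_hash(local_part)}"


print(wkd_direct_url("0@ftls.xyz"))
# https://ftls.xyz/.well-known/openpgpkey/hu/s3cj9timbzrn3hjyu8ehfiykzgkqooec
```

For inputs whose length is not a multiple of 5 bytes, `b32encode` appends `=` padding, which z-base-32 does not use, so the padding would have to be stripped before translating.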

z-base-32 algorithm reference, from https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt

txt

                                                             Zooko O'Whielacronx
                                                                   November 2002
                                                edited for clarity February 2007
                                                         Corrected November 2009



                    human-oriented base-32 encoding

INTRO

The base-32 encoding implemented in this library differs from that described in 
RFC 3548 [1] in several ways.  This document describes why we made each 
different choice, and also includes a section at the end on "COMPATIBILITY AND 
INTEROPERATION".

This encoding is implemented in a project named z-base-32 [2].

This is version 0.9.4.6 of this document.  The latest version should always be 
available at:

http://zooko.com/repos/z-base-32/base32/DESIGN


RATIONALE

The rationale for base-32 encoding in RFC 3548 [1] is as written therein: "The 
Base 32 encoding is designed to represent arbitrary sequences of octets in a 
form that needs to be case insensitive but need not be humanly readable.".

The rationale for our encoding is different -- it is to represent arbitrary 
sequences of octets in a form that is as convenient as possible for human 
users to manipulate.  In particular, z-base-32 was created in order to serve 
the Mnet project [3], where 30-octet cryptographic values are encoded into 
URIs for humans to manipulate.  Anticipated uses of these URIs include cut-
and-paste, text editing (e.g. in HTML files), manual transcription via a 
keyboard, manual transcription via pen-and-paper, vocal transcription over 
phone or radio, etc.

The desiderata for such an encoding are:

 * minimizing transcription errors -- e.g. the well-known problem of confusing 
   `0' with `O'
 * embedding into other structures -- e.g. search engines, structured or 
   marked-up text, file systems, command shells
 * brevity -- Shorter URLs are better than longer ones.
 * ergonomics -- Human users (especially non-technical ones) should find the 
   URIs as easy and pleasant as possible.  The uglier the URI looks, the worse.


DESIGN

Base

The first decision we made was to use base-32 instead of base-64.  An earlier 
version of this project used base-64, but a discussion on the p2p-hackers 
mailing list [4] convinced us that the added length of base-32 encoding was 
worth the added clarity provided by: case-insensitivity, the absence of non- 
alphanumeric characters, and the ability to omit a few of the most troublesome 
alphanumeric characters.

In particular, we realized that it would probably be faster and more comfortable 
to vocally read off a base-32 encoded 30-octet string (48 characters, case- 
insensitive, no non-alphanumeric characters) than a base-64 encoded one 
(40 characters, case-sensitive, plus two non-alphanumeric characters).

Alphabet

There are 26 alphabet characters and 10 digits, for a total of 36 characters 
available.  We need only 32 characters for our base-32 alphabet, so we can 
choose four characters to exclude.  This is where we part company with 
traditional base-32 encodings.  For example [1] eliminates `0', `1', `8', and 
`9'.  This choice eliminates two characters that are relatively unambiguous 
(`8' and `9') while retaining others that are potentially confusing.  Others 
have suggested eliminating `0', `1', `O', and `L', which is likewise suboptimal.

Our choice of confusing characters to eliminate is: `0', `l', `v', and `2'.  Our 
reasoning is that `0' is potentially mistaken for `o', that `l' is potentially 
mistaken for `1' or `i', that `v' is potentially mistaken for `u' or `r' 
(especially in handwriting) and that `2' is potentially mistaken for `z' 
(especially in handwriting).

Note that we choose to focus on typed and written transcription more than on 
vocal, since humans already have a well-established system of disambiguating 
spoken alphanumerics, such as the United States military's "Alpha Bravo Charlie 
Delta" and telephone operators' "Is that 'd' as in 'dog'?".

Order of Alphabet

Some of the alphabet characters will appear more frequently than others in the 
final position of the encoded strings, assuming an even distribution of binary 
inputs.  (This is true whether the lengths of your inputs are evenly distributed 
over all integer numbers of bits *or* evenly distributed over all integer 
numbers of octets.)  Here is a table showing which length-in-bits (modulo 5) 
results in which possible trailing characters in the 
"abcdefghijklmnopqrstuvwxyz234567" encoding:

 1:  aq
 2:  aqiy
 3:  aqiyemu4
 4:  aqiyemu4cgkosw26
 0:  abcdefghijklmnopqrstuvwxyz234567

Here is the same table for our alphabet:

 1:  yo
 2:  yoea
 3:  yoearcwh
 4:  yoearcwhngkq1s46
 0:  ybndrfg8ejkmcpqxot1uwisza345h769 (the whole alphabet)

We have permuted the alphabet to make the more commonly occurring characters also 
be those that we think are easier to read, write, speak, and remember.

Length Encoding and Sub-Octet Data

Suppose you have 10 bits of data to transmit, and the recipient (the decoder) is 
expecting 10 bits of data.  All previous base-32 encoding schemes assume that 
the binary data to be encoded is in 8-bit octets, so you would have to pad the 
data out to 2 octets and encode it in base-32, resulting in a string 
4-characters long.  The decoder will decode that into 2 octets (16 bits), 
ignoring the least significant 4 bits of the encoded string, and then ignore the 
least significant 6 bits of the decoded data.

In the base-32 encoding described here, if the encoder and decoder both know the 
exact length of the data in bits (modulo 40), then they can use this shared 
information to optimize the size of the transmitted (encoded) string.  In the 
example that you have 10 bits of data to transmit, z-base-32 allows you to 
transmit the optimal encoded string: two characters.  If the encoder and 
decoder aren't both designed with the requirement that they will know the 
exact length of the data in bits, then they can of course assume that it is 
"the smallest number of octets which would have required this many quintets to 
encode", which is simple and unambiguous but can occasionally cost an extra 
byte or two of encoded size.

You can always use this encoding the same way you would use the other 
encodings -- with an "input is in 8-bit octets" assumption.  This would be 
appropriate if the length in bits is always a multiple of 8, if both sides are 
not sure of the length in bits modulo 40, or if this encoding is being used in a 
way that optimizing one or two characters out of the encoded string isn't worth 
the potential confusion.

Padding

Traditionally base-32 encodings have specified trailing padding to round out the 
number of characters to an even multiple of 8.  This is apparently intended as 
an error detection code, but we do not consider the error detection capabilities 
of this code to be worth the increased length of the encoded strings, so we do 
not do this.

Letter case

Lower case is easier to read.  That's why people have been using it 
preferentially since around the 9th century CE.  Isn't it about time that 
software engineers took advantage of mankind's millennia-old knowledge of 
typography instead of thoughtlessly aping the hardware limitations of 
terminals used in the second half of the 20th century?


EXAMPLES

#bits base-2                           base32     base64     z-base-32
----- ------                           ------     ------     ---------
1     0                                AA======   AA==       y
1     1                                QA======   gA==       o
2     01                               IA======   QA==       e
2     11                               YA======   wA==       a
10    0000000000                       AAAA====   AAA=       yy
10    1000000010                       QCAA====   gIA=       on
20    10001011100010001000             BC4IQ===   CLiI       tqre
24    111100001011111111000111         6C74O===   8L/H       6n9hq
24    110101000111101000000100         2R5AI===   1HoE       4t7ye
30    111101010101011110111101000011   HVK66QY=   PVXvQw==   6im54d


A NOTE ON COMPATIBILITY AND INTEROPERATION

If your application could interoperate with an extant standard, then you 
should use RFC 3548 base-32 in order to facilitate interoperation by encoding 
semantically identical objects into syntactically identical representations.  
For example, many current systems include the SHA-1 hash of the contents of a 
file, and this hash value can be represented for user or programmatic sharing 
in base-32 encoded form [5, 6, 7, 8].  These four systems all use RFC 3548 
base-32 encoding.  If your system will expose the SHA-1 hash of the contents 
of a file, then you should make sure those hash values are easily exchangeable 
with those systems by using the same encoding (including base, alphabet, 
permutation of alphabet, length-encoding, padding, treatment of illegal 
characters and line-breaks).

If, however, the semantic meaning of the objects that you are exposing is not 
something that can be understood by another extant system, due to semantic 
differences, then you gain nothing with regard to interoperation by using the 
same ASCII encoding, and in fact by doing so you *hamper* interoperation by 
making it impossible for the applications to use syntactic features to 
disambiguate between semantic features. 

Lucas Gonze has suggested [9] that different schemes could in fact 
deliberately add characters which would be illegal in another scheme in order 
to enable syntactic differentiation.  (This would be morally similar to the 
"check digit" included in most credit card numbers.)

Obviously the more reliable semantic differentiation is an unambiguous one 
that is transmitted out-of-band (outside of the encoded string, that is), such 
as URI scheme names (e.g.: SHA1:blahblahblah or mnet://blahblahblah).  
However, users might not always pay the cost to preserve those.


REFERENCES

[1]  http://www.faqs.org/rfcs/rfc3548.html
[2]  http://zooko.com/repos/z-base-32
[3]  http://mnetproject.org/
[4]  http://zgp.org/pipermail/p2p-hackers/2001-October/
[5]  Gnutella [need URL for SHA1 and base-32 encoding stuff]
[6]  Bitzi [need URL for specification stuff]
[7]  CAW [need URL]
[8]  http://open-content.net/specs/draft-jchapweske-thex-01.html
[9]  http://zgp.org/pipermail/p2p-hackers/2002-November/000924.html
[10] http://zgp.org/pipermail/p2p-hackers/2002-November/000927.html

NEEDED TO ADD

 * possible new design element: optional check characters  (e.g. Luhn-like 
    algorithm)

 * Full spec, including the issues named by draft-josefsson-base-encoding-04.txt 
    such as treatment of illegal chars, etc.
   + also issues of length-encoding, scheme-encoding (e.g. URIs), etc.

 * Explanation of why we avoid non-alphanumerics.

 * Mention of the myriad other clarity issues such as those Gojomo posted?
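
To check my understanding of the bit-oriented encoding and the trailing-character tables in the document above, here is a small sketch. It assumes the input is given as a string of '0'/'1' characters, most significant bit first, and that the last group is padded with zero bits on the right. `zbase32_encode_bits` and `trailing_chars` are hypothetical names for this example, not part of any library.

```python
import string

ZBASE32_ALPHABET = "ybndrfg8ejkmcpqxot1uwisza345h769"


def zbase32_encode_bits(bits: str) -> str:
    """Encode a string of '0'/'1' characters, one output character per 5 bits."""
    # Pad on the right with zero bits so the length is a multiple of 5.
    padded = bits + "0" * (-len(bits) % 5)
    return "".join(
        ZBASE32_ALPHABET[int(padded[i:i + 5], 2)]
        for i in range(0, len(padded), 5)
    )


def trailing_chars(alphabet: str, bits_mod_5: int) -> set:
    """Characters that can end an encoding whose bit length is bits_mod_5 (mod 5)."""
    used = bits_mod_5 or 5   # 0 means the final quintet is completely filled
    shift = 5 - used         # the unused low bits of the final quintet are zero
    return {alphabet[value << shift] for value in range(2 ** used)}


# The four characters left out of the 36 lowercase letters and digits.
excluded = set(string.ascii_lowercase + string.digits) - set(ZBASE32_ALPHABET)
print("excluded:", "".join(sorted(excluded)))

# A few rows of the EXAMPLES table.
for bits in ("0", "1", "01", "11", "1000000010", "10001011100010001000"):
    print(len(bits), bits, zbase32_encode_bits(bits))

# Possible trailing characters for each bit length modulo 5.
for k in (1, 2, 3, 4):
    print(k, "".join(sorted(trailing_chars(ZBASE32_ALPHABET, k))))
```

With this reading, 10 bits come out as two characters and 20 bits as four, which is the saving over the octet-oriented encodings that the "Length Encoding and Sub-Octet Data" section describes.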