Investigation of Existing Encoding Formats from Garret Rieger on 2021-01-27 (public-webfonts-wg@w3.org from January 2021)

From: Garret Rieger <grieger@google.com>
Date: Wed, 27 Jan 2021 14:09:22 -0700
To: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
Message-ID: <CAM=OCWZxs+yyHEG9sVEwx98e9_SoPe0Zsx_7Tw0tGs+jvBaPUQ@mail.gmail.com>
The patch subset progressive font enrichment method operates by sending
encoded messages between the server and client. These messages can be
represented as key value pairs. For transfer over the network these
messages need to be encoded into bytes. Prior to standardization we need to
pick a specific data serialization format to use. There are many existing
serialization formats, so this document describes the specific requirements
for serialization in patch subset and then evaluates a number of
serialization formats against those requirements. Finally, it recommends
which format to use for standardization.

Requirements

   - Can encode objects built from key value pairs. Ideally keys should be
     allowed to be integers. This allows a more compact representation than
     string keys.

   - Compact encoding: one of the primary goals of PFE is to save bytes
     transferred over the network. So it follows that the serialization format
     we use should be as compact as possible.

   - Supports byte arrays: response messages will need to contain a large
     byte array with the font patch. So the serialization format should be able
     to encode byte arrays directly.

   - Standardized: since the serialization format will be referenced from a
     standard, it will itself need to be standardized through a standards body.

   - Messages encoded by a newer version of the message schema should still
     be decodable by a decoder with an older version of the message schema.

   - Performant: font loading is render blocking so message encoding and
     decoding should be fast.
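
As a minimal sketch of the schema version requirement above, assuming
messages are maps with integer keys: an older decoder can stay compatible
with newer messages by skipping any key it does not recognize. The field ids
and names below are hypothetical and are not taken from the patch subset
schema.

```python
# Hypothetical field ids known to an "old" decoder (illustration only).
KNOWN_FIELDS_V1 = {0: "protocol_version", 1: "patch"}


def decode_known_fields(message):
    """Keep the fields this decoder version understands and skip the rest."""
    return {
        KNOWN_FIELDS_V1[key]: value
        for key, value in message.items()
        if key in KNOWN_FIELDS_V1
    }


# A message from a newer schema that added field 7.
newer_message = {0: 2, 1: b"\x00\x01", 7: "added in a later schema version"}
decoded = decode_known_fields(newer_message)
```

This only works if the wire format lets a decoder step over a value it has
no schema entry for, which is why self-describing or length-prefixed values
matter for this requirement.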

Investigated Encoding Formats

Protocol Buffers

Protocol Buffers <https://developers.google.com/protocol-buffers>

A compact serialization protocol that uses schemas for the data types.
This is what we used in the prototype version of patch subset. It meets all
of the requirements, except that it is not standardized.

Decision: can’t use, not standardized.

JSON

JSON <https://tools.ietf.org/html/rfc7159>

JSON is commonly used on the web to serialize messages. It’s a text based
encoding of key value objects. Because it is text based it has a few
drawbacks:

   - Keys are strings.

   - The encoding is not compact. Even with compression applied it’s still
     larger than binary encodings.

   - Byte arrays can’t be encoded directly. They have to be converted to
     text first, for example with base64, which adds further overhead.

   - Slower performance compared to binary encodings. For example
     numeric values need to be parsed as text and then converted to binary.
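
To illustrate the byte array drawback, here is a rough sketch in Python.
JSON has no binary type, so a patch has to travel as text. Base64 is assumed
here as the text encoding, and it grows the payload by about one third
before compression. The message layout and field names are hypothetical.

```python
import base64
import json

# Stand-in for a binary font patch (1024 bytes of arbitrary data).
patch = bytes(range(256)) * 4

# Hypothetical message layout, for illustration only.
message = {
    "patch_format": 1,
    "patch": base64.b64encode(patch).decode("ascii"),
}
encoded = json.dumps(message, separators=(",", ":")).encode("utf-8")

raw_size = len(patch)                 # bytes of actual patch data
base64_size = len(message["patch"])   # same data once it is text
json_size = len(encoded)              # whole message on the wire

# Decoding has to undo both layers: JSON text first, then base64.
decoded_patch = base64.b64decode(json.loads(encoded)["patch"])
```

In this sketch the 1024 byte patch becomes 1368 base64 characters, and the
JSON message is larger still once the keys and punctuation are added.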


Decision: not compact enough and no binary support. Don’t use.

BSON

BSON <http://bsonspec.org/>

A binary version of JSON which eliminates some of the drawbacks of the text
based encoding. However, it still fails a few of the requirements:

   - Keys must be strings.

   - While it is more compact than JSON, it’s not as compact as some of the
     other binary encodings. In particular it does not have support for variable
     length integers.

   - It’s not standardized.
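
The cost of string keys and fixed width integers can be seen in a small
hand-rolled sketch. It follows the layout given in the BSON specification
for a flat document of int32 values; the helper name and the sample field
are hypothetical, and a real implementation would use a BSON library.

```python
import struct


def bson_int32_document(fields):
    """Encode a flat BSON document whose values are all int32."""
    elements = b""
    for name, value in fields.items():
        elements += b"\x10"                          # element type 0x10: int32
        elements += name.encode("utf-8") + b"\x00"   # key: NUL-terminated string
        elements += struct.pack("<i", value)         # value: always 4 bytes
    # Total size counts the 4 byte length prefix and the trailing NUL.
    total = 4 + len(elements) + 1
    return struct.pack("<i", total) + elements + b"\x00"


# Even the smallest key ("1") and a small value (5) take 12 bytes.
encoded = bson_int32_document({"1": 5})
```

For comparison, the equivalent CBOR map with an integer key, {1: 5},
encodes to 3 bytes (a1 01 05).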


Decision: can’t use, not standardized.

UBJSON

UBJSON <https://ubjson.org/>

Another variant of binary JSON. It fails to meet the requirements just like
BSON:

   - Keys must be strings.

   - While it is more compact than JSON, it’s not as compact as some of the
     other binary encodings. In particular it does not have support for variable
     length integers.

   - It’s not standardized.


Decision: can’t use, not standardized.

CBOR (Concise Binary Object Representation)

CBOR - Wikipedia <https://en.wikipedia.org/wiki/CBOR>, rfc8949
<https://tools.ietf.org/html/rfc8949>

Each value starts with a single control byte that encodes its type and
either its length directly or the size of the length field that follows.
The encoding is compact, on par with protobuf.


   - Supports key value maps. Keys can be any type.

   - Has variable length integers (length prefix via the control byte).

   - Messages are fully decodable without a schema.

   - Standardized via IETF: https://tools.ietf.org/html/rfc8949
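
The control byte scheme can be sketched with a small hand-rolled encoder.
This is a minimal illustration based on RFC 8949, covering only unsigned
integers, byte strings and maps. The function names are my own, and a real
implementation would use an existing CBOR library.

```python
import struct


def encode_head(major_type, argument):
    """Encode the initial byte (3 bit type, 5 bit info) plus any length bytes."""
    prefix = major_type << 5
    if argument < 24:
        return bytes([prefix | argument])            # argument fits in the byte
    if argument < 2**8:
        return bytes([prefix | 24]) + struct.pack(">B", argument)
    if argument < 2**16:
        return bytes([prefix | 25]) + struct.pack(">H", argument)
    if argument < 2**32:
        return bytes([prefix | 26]) + struct.pack(">I", argument)
    if argument < 2**64:
        return bytes([prefix | 27]) + struct.pack(">Q", argument)
    raise ValueError("argument does not fit in 64 bits")


def encode(value):
    """Encode an unsigned int, a byte string, or a map of such values."""
    if type(value) is int and value >= 0:
        return encode_head(0, value)                 # major type 0: unsigned int
    if isinstance(value, bytes):
        return encode_head(2, len(value)) + value    # major type 2: byte string
    if isinstance(value, dict):
        out = encode_head(5, len(value))             # major type 5: map
        for key, item in value.items():
            out += encode(key) + encode(item)
        return out
    raise TypeError("type not covered by this sketch")


# Hypothetical message: integer keys, one small integer, one byte array.
message = encode({0: 1, 1: bytes([1, 2, 3])})
```

Here the whole message is 8 bytes (a2 00 01 01 43 01 02 03): one byte for
the map header, one byte per integer key, one byte for the small integer,
and a single byte of overhead in front of the 3 byte array.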


Decision: we can use, meets all requirements.

Message Pack

Message Pack Specification
<https://github.com/msgpack/msgpack/blob/master/spec.md>

Similar to CBOR (CBOR was inspired by Message Pack). Should have
compactness similar to CBOR. However, Message Pack is not standardized.

Decision: can’t use, not standardized.

Custom Encoding

One last option is to develop our own serialization format specifically for
use in patch subset. The developed encoding would be designed to meet the
above requirements and would be standardized as part of the PFE standard.

This should only be used as a fallback option if no other existing encoding
can be found which meets our requirements. Developing and standardizing a
new encoding format will require extra specification work.

Decision: don’t use, existing format CBOR meets our requirements.

Final Recommendation

CBOR looks to be a very good fit. It’s a straightforward encoding and meets
all of our requirements. It’s unlikely we’d be able to significantly reduce
encoding size with a custom encoding.

So I recommend that we use CBOR encoding.


I'm going to draft a third version of the protocol design document based on
CBOR and send that out soon.
Received on Wednesday, 27 January 2021 21:09:55 UTC