Re: Exploring how to encode code point sets from Behdad Esfahbod on 2019-07-25 (public-webfonts-wg@w3.org from July 2019)

From: Behdad Esfahbod <behdad@fb.com>
Date: Thu, 25 Jul 2019 18:13:13 +0000
To: Garret Rieger <grieger@google.com>, "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
Message-ID: <BYAPR15MB2310AA4E59B81F7D503CC29DBFC10@BYAPR15MB2310.namprd15.prod.outlook.com>

Hi Garret,

Thanks for the document! Here are my thoughts:

1. I suggest avoiding generic compression at this level. It would be nice if the entire request/response were compressed automatically, but I suggest we design without relying on that. Browsers either already have Brotli compression code or they don't, and I don't think we should require it for the codepoint set since, as you discovered, it is not a huge win anyway given the nature of the data and the fact that we can design the encoding to be efficient.

2. Since random access is not required, one can use a variable-length multibyte encoding, which should make the delta list pack much better. I suggest just using the UTF-8 encoding.

3. ICU keeps such lists as an alternating "in-out" list. I.e., if the list is "5,8,9,14", it will encode it as "5,6,8,10,14,15". One can think of this as a list of ranges with exclusive ends: (5,6),(8,10),(14,15). You can try doing it this way and then take the deltas. This will address the range use-case. You can also try to come up with a hybrid encoding that can encode ranges efficiently without significantly increasing the cost for sparse sets.
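
To make point 3 concrete, here is a minimal sketch in Python of the alternating in-out representation followed by delta-coding. The function names (to_in_out, deltas) are made up for illustration and are not ICU's API, and the sketch assumes the input is a sorted list of distinct code points.

```python
def to_in_out(codepoints):
    """Convert a sorted list of distinct code points into an alternating
    in-out list: [start0, end0, start1, end1, ...] with exclusive ends."""
    result = []
    for cp in codepoints:
        if result and result[-1] == cp:
            # cp is adjacent to the current range: push its exclusive end forward.
            result[-1] = cp + 1
        else:
            # cp starts a new range covering just itself.
            result.extend([cp, cp + 1])
    return result


def deltas(values):
    """Replace each value with its difference from the previous value.
    The first value is kept as a delta from 0."""
    out = []
    prev = 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


in_out = to_in_out([5, 8, 9, 14])
print(in_out)          # [5, 6, 8, 10, 14, 15]
print(deltas(in_out))  # [5, 1, 2, 2, 4, 1]
```

A long contiguous run costs two entries no matter how many code points it covers, while an isolated code point also costs two entries, which is the extra cost for sparse sets that a hybrid encoding would try to avoid.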

I think doing the above should get you an encoding that is very simple to encode and decode, and fairly efficient.
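
As a rough illustration of the multibyte packing from point 2, the sketch below writes a list of deltas using UTF-8's byte layout (1 to 4 bytes per value). It is hand-rolled rather than built on a strict UTF-8 codec, on the assumption that deltas falling in the surrogate range should still be encodable. The function names are hypothetical, and the decoder assumes well-formed input.

```python
def encode_utf8_style(values):
    """Pack non-negative integers below 0x200000 using UTF-8's byte layout.

    Unlike a strict UTF-8 codec, surrogate values are not rejected,
    because the inputs are deltas rather than characters."""
    out = bytearray()
    for v in values:
        if v < 0x80:
            out.append(v)
        elif v < 0x800:
            out += bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
        elif v < 0x10000:
            out += bytes([0xE0 | (v >> 12),
                          0x80 | ((v >> 6) & 0x3F),
                          0x80 | (v & 0x3F)])
        elif v < 0x200000:
            out += bytes([0xF0 | (v >> 18),
                          0x80 | ((v >> 12) & 0x3F),
                          0x80 | ((v >> 6) & 0x3F),
                          0x80 | (v & 0x3F)])
        else:
            raise ValueError("value too large for a 4-byte sequence")
    return bytes(out)


def decode_utf8_style(data):
    """Inverse of encode_utf8_style; assumes well-formed input."""
    values = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte determines how many continuation bytes follow.
        if lead < 0x80:
            v, extra = lead, 0
        elif lead < 0xE0:
            v, extra = lead & 0x1F, 1
        elif lead < 0xF0:
            v, extra = lead & 0x0F, 2
        else:
            v, extra = lead & 0x07, 3
        for b in data[i + 1:i + 1 + extra]:
            v = (v << 6) | (b & 0x3F)
        values.append(v)
        i += 1 + extra
    return values


example_deltas = [5, 1, 2, 2, 4, 1, 0x4E00]
packed = encode_utf8_style(example_deltas)
print(len(packed))                                  # 9
print(decode_utf8_style(packed) == example_deltas)  # True
```

Small deltas, which dominate in dense sets, take one byte each, and only the occasional large jump pays for three or four bytes.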

Cheers,
b
________________________________
From: Garret Rieger <grieger@google.com>
Sent: Wednesday, July 24, 2019 2:14 PM
To: w3c-webfonts-wg (public-webfonts-wg@w3.org) <public-webfonts-wg@w3.org>
Subject: Exploring how to encode code point sets

Recently I've been thinking about the specific design of the protocol for the subset and patch method since we'll need that for the analysis. One of the most important pieces is how to efficiently encode the code point sets that are transferred from the client to server on each request. If an inefficient encoding is used it could add a material amount of overhead to the requests.

So I came up with a list of potential methods for encoding the sets and tested them out on simulated code point sets. An overview of the analysis and the results can be found here<https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_19K5MCElyjdUZknoxHepcC3s7tc-2Di4I8yK2M1Eo2IXFw_edit-3Fusp-3Dsharing&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=P9JUMpOWw22-3xIiv7QgGg&m=tJs0aisgmekqSE2yg_K1iPyNoOfI5-XadZh3YwA0d9w&s=UVBYUiwFxUG3uxu5OolSWKidRkGDY8_sBZWhO_t1uII&e=>.

Does anyone have other ideas on techniques/thoughts for efficiently encoding sets of codepoints?

Received on Thursday, 25 July 2019 18:15:16 UTC