Re: Exploring how to encode code point sets

1. Agreed; for the actual implementation we'd just let the HTTP transport
layer apply compression to the entire payload.
2. I was thinking the same thing; I'll give that a try.
3. My idea for ranges is to combine a union of ranges with the sparse bit
set: first encode a set of ranges (say, any runs of code points longer than
some threshold), then encode everything left over using a sparse bit set.
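As a rough sketch of that split (Python; the helper name and the run-length
threshold are just placeholders, not part of any proposed format):

```python
def split_ranges(codepoints, min_run=16):
    """Partition a set of code points into long consecutive runs
    (to be encoded as ranges) and the sparse leftovers (to be
    encoded as a sparse bit set).  min_run is an arbitrary
    illustrative threshold."""
    cps = sorted(codepoints)
    ranges, leftovers = [], []
    i = 0
    while i < len(cps):
        # Extend j to the end of the consecutive run starting at i.
        j = i
        while j + 1 < len(cps) and cps[j + 1] == cps[j] + 1:
            j += 1
        if (j - i + 1) >= min_run:
            ranges.append((cps[i], cps[j]))   # inclusive range
        else:
            leftovers.extend(cps[i:j + 1])    # goes into the bit set
        i = j + 1
    return ranges, leftovers
```

For example, Basic Latin plus a few scattered punctuation marks comes out as
one range covering U+0020..U+007E and three leftover code points for the
sparse bit set.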

On Thu, Jul 25, 2019 at 11:13 AM Behdad Esfahbod <behdad@fb.com> wrote:

> Hi Garret,
>
> Thanks for the document!  Here's my thoughts:
>
> 1. I suggest avoiding generic compression at this level.  It would be nice
> if the entire request/response were compressed automatically, but I suggest
> we design without it.  Browsers either already have Brotli compression code
> or they don't.  I don't think we should require it for the codepoint set
> since, as you discovered, it is not a huge win anyway given the nature of
> the data and the fact that we can design the encoding to be efficient.
>
> 2. Since random-access is not required, one can use a multibyte encoding,
> which should make the delta-list pack much better.  I suggest just using
> the UTF-8 encoding.
>
> 3. ICU keeps such lists as an alternating "in-out" list.  I.e., if the list
> is "5,8,9,14", it will encode it as "5,6,8,10,14,15".  One can think of
> this as a list of ranges: (5,6),(8,10),(14,15).  You can try doing it this
> way and then take the deltas.  This will address the range use-case.  You
> can also try to come up with a hybrid encoding that can encode ranges
> efficiently without increasing cost for sparse sets significantly.
>
> I think doing the above should get you a very simple-to-encode,
> simple-to-decode, and fairly efficient encoding.
>
> Cheers,
> b
> ------------------------------
> *From:* Garret Rieger <grieger@google.com>
> *Sent:* Wednesday, July 24, 2019 2:14 PM
> *To:* w3c-webfonts-wg (public-webfonts-wg@w3.org) <
> public-webfonts-wg@w3.org>
> *Subject:* Exploring how to encode code point sets
>
> Recently I've been thinking about the specific design of the protocol for
> the subset and patch method since we'll need that for the analysis. One of
> the most important pieces is how to efficiently encode the code point sets
> that are transferred from the client to the server on each request. If an
> inefficient encoding is used, it could add a material amount of overhead to
> the requests.
>
> So I came up with a list of potential methods for encoding the sets and
> tested them out on simulated code point sets. An overview of the analysis
> and the results can be found here:
> <https://docs.google.com/document/d/19K5MCElyjdUZknoxHepcC3s7tc-i4I8yK2M1Eo2IXFw/edit?usp=sharing>
>
> Does anyone have other ideas on techniques/thoughts for efficiently
> encoding sets of codepoints?
>

Received on Thursday, 25 July 2019 20:15:14 UTC