Re: Comparison of compression algorithms from Gunnar Andersson on 2020-09-01 (public-automotive@w3.org from September 2020)

From: Gunnar Andersson <gandersson@genivi.org>
Date: Tue, 1 Sep 2020 12:18:26 +0000
To: Ulf Bjorkengren <ulfbjorkengren@geotab.com>, public-automotive <public-automotive@w3.org>
Message-ID: <01010174499aa8f4-335dd729-c4b9-4984-a4f6-958ac385bc20-000000@us-west-2.amazonse>

On Wed, 2020-08-26 at 11:25 +0200, Ulf Bjorkengren wrote:
> I tried some online compression tools to see what kind of compression
> standard compression algorithms can achieve on a typical Gen2 response
> payload, shown below. 
> The results show that they do not perform well on this type of short
> payloads, and cannot compete with a tailormade algorithm. 
> As a comparison, version two of the proprietary algorithm I mentioned in
> the presentation will compress the same payload to 17 bytes. 
> If there is interest for it, this algorithm will be implemented and
> available on 
> https://github.com/MEAE-GOT/W3C_VehicleSignalInterfaceImpl 
> in both a Go impl and a JS impl. 
> 
> Payload:  
> {“action”:”get”, “timestamp”:”2020-08-25T13:37:00Z”, “value”:”123”,
> “requestId”:”999”}
> The above payload is 86 chars. 
> 
> http://www.txtwizard.net/compression  
> GZ: Execution time: 11875 us Compression ratio: 112 % Original size: 118
> bytes Result size: 105 bytes

I noticed that here it says the original size is 118, but above it is 86.
It doesn't change anything fundamental, just pointing it out.

It could be because something happened when pasting into a web tool.  There
might be some other encoding of the text going on.  I actually noticed when
I copied the example from your HTML-formatted email, that the above was
using left-and-right leaning " characters instead of a plain ASCII ", and I
then get a 120 byte file with UTF-8 encoding, which suggests there were
some multi-byte characters that snuck in.

After fixing that and running gzip on the command line on pure ASCII, gzip
causes an increase of the size from 86 to 100 bytes.

But none of those details really matter and the results are expected
behavior on very small files - we already agreed to that yesterday.

I don't want to waste time on the exact number of bytes in that comparison.
It is well known that much better compression is possible with any method
where a predefined dictionary exists than with a general-purpose
compression that is not allowed to agree on a lookup dictionary beforehand
(which you have done for the keywords, and kind of indirectly also for the
shortened UUID).

One thing that might still be useful, just to see if sticking to plain
HTTP-supported compression is an option (which would be A LOT easier), is
to perform a comparison of a large response to a large query.  I'm thinking
that is where compression is also most important?

If the goal is to truly minimize transfers, both large and small, then I
wonder why the approach is not to use a full binary encoding instead.  (Or
if there are other goals, let's discuss them?).  As you probably understand
I am sceptical to the idea of creating a custom "compression algorithm"
without clarifying what the point of that is.

It's might be a matter of definition, but I already tend to think about
what you created as an /alternative encoding/ more than compression, and
I'm thinking that a mind-shift towards that term might uncover even better
alternatives?

A super-tailored encoding for any task could be made optimal, but building
on standards is worthwhile.

It doesn't hurt to use a formal language to describe the message schema
anyway.  (The Gen2- specification might ought to do that for the original
JSON too?).  And as I mentioned on our previous call, Avro and Protobuf has
languages to describe such schemas.  If you then use the associated tooling
then the resulting binary encoding could be studied.  There are /MANY/
other options out there, many of which were also discussed in a previous
GENIVI project named "Generic Communication Protocols Evaluation".   I'm
sure this is not a new discussion in W3C either and I may have heard Ted
mention that also.

Sincerely,
- Gunnar

Received on Tuesday, 1 September 2020 12:18:41 UTC