Re: Comparison of compression algorithms from Ted Guild on 2020-09-01 (public-automotive@w3.org from September 2020)

From: Ted Guild <ted@w3.org>
Date: Tue, 01 Sep 2020 11:28:48 -0400
To: Gunnar Andersson <gandersson@genivi.org>, Ulf Bjorkengren <ulfbjorkengren@geotab.com>, public-automotive <public-automotive@w3.org>
Message-ID: <769badca5f15fdf6b2738d3f82647d366ff7a35e.camel@w3.org>
On the call I wondered about compression options being used on
WebSockets.

An outdated Stack Overflow thread points to a few different extensions
to WebSockets done or under discussion (at the time) over at IETF.
Some seem abandoned.

https://stackoverflow.com/questions/19298651/how-does-websocket-compress-messages

https://mailarchive.ietf.org/arch/msg/hybi/_dWnwQrfIu2xdSI1WQI5Sx6zfZY/

Looking at WebSocket implementations (client and server),
libwebsockets and Boost.Beast have permessage-deflate.

https://en.wikipedia.org/wiki/Comparison_of_WebSocket_implementations

I'm not sure, and am asking colleagues, but it doesn't seem that
permessage-bzip2 and the other extensions that were under discussion
went anywhere. Several servers (e.g. nginx) can serve pre-gzipped files
and can gzip on the fly, which we already know eats CPU and isn't as
effective on smaller responses.
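
As a rough illustration of what permessage-deflate does to one small
message, here is a minimal Python sketch following the framing described
in RFC 7692 (raw DEFLATE, sync flush, trailing 4 bytes removed). It
assumes no context takeover between messages, and the function names
pmd_compress and pmd_decompress are made up for this example; it is not
taken from libwebsockets or Boost.Beast.

```python
import zlib

# Sample Gen2-style response from this thread, with plain ASCII quotes.
PAYLOAD = b'{"action":"get", "timestamp":"2020-08-25T13:37:00Z", "value":"123", "requestId":"999"}'

# A sync flush ends the output with an empty stored block: 00 00 ff ff.
TAIL = b"\x00\x00\xff\xff"


def pmd_compress(message: bytes) -> bytes:
    """Compress one message as raw DEFLATE and drop the 4-byte flush tail."""
    comp = zlib.compressobj(wbits=-zlib.MAX_WBITS)  # negative wbits = no zlib/gzip header
    data = comp.compress(message) + comp.flush(zlib.Z_SYNC_FLUSH)
    return data[:-4]


def pmd_decompress(data: bytes) -> bytes:
    """Re-append the tail and inflate the raw DEFLATE data."""
    decomp = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
    return decomp.decompress(data + TAIL)


if __name__ == "__main__":
    wire = pmd_compress(PAYLOAD)
    assert pmd_decompress(wire) == PAYLOAD
    print(len(PAYLOAD), "bytes raw ->", len(wire), "bytes on the wire")
```

Running it on other payload sizes should give a feel for where the
extension starts to pay off.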

We can/should also explore the alternate formats Gunnar suggested:

https://en.wikipedia.org/wiki/Apache_Avro
https://en.wikipedia.org/wiki/Protocol_Buffers

Since the messages being transmitted are fairly small to begin with and
the in-vehicle use case will have extremely low latency, I also want to
understand how and where this would be useful, as encoding/compressing
and decoding will cost time and, depending on the method, non-negligible
CPU. If the client app is just sampling to offboard and won't unpack
(decode), then we probably should look at Extended Vehicle and other
solutions being used for off-boarding in addition to formats. What
problem[s] are we trying to solve here?
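
To put a number on the time cost, something like the following Python
sketch could be run on the target hardware. It compares a plain JSON
round trip with one that also compresses and decompresses using zlib.
The helper names are made up for this example, and the figures will
vary by machine, so none are quoted here.

```python
import json
import timeit
import zlib

MESSAGE = {
    "action": "get",
    "timestamp": "2020-08-25T13:37:00Z",
    "value": "123",
    "requestId": "999",
}


def plain_round_trip() -> dict:
    """Serialize to JSON text and parse it back."""
    return json.loads(json.dumps(MESSAGE))


def compressed_round_trip() -> dict:
    """Same as above, with zlib compression and decompression in between."""
    wire = zlib.compress(json.dumps(MESSAGE).encode("utf-8"))
    return json.loads(zlib.decompress(wire))


def per_call_us(func, number: int = 20000) -> float:
    """Average wall-clock time per call, in microseconds."""
    return timeit.timeit(func, number=number) / number * 1e6


if __name__ == "__main__":
    print(f"plain JSON:  {per_call_us(plain_round_trip):.2f} us per round trip")
    print(f"JSON + zlib: {per_call_us(compressed_round_trip):.2f} us per round trip")
```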

On Tue, 2020-09-01 at 12:18 +0000, Gunnar Andersson wrote:
> On Wed, 2020-08-26 at 11:25 +0200, Ulf Bjorkengren wrote:
> > I tried some online compression tools to see what kind of compression
> > standard compression algorithms can achieve on a typical Gen2 response
> > payload, shown below.
> > The results show that they do not perform well on this type of short
> > payload, and cannot compete with a tailor-made algorithm.
> > As a comparison, version two of the proprietary algorithm I mentioned
> > in the presentation will compress the same payload to 17 bytes.
> > If there is interest in it, this algorithm will be implemented and
> > available at
> > https://github.com/MEAE-GOT/W3C_VehicleSignalInterfaceImpl
> > in both a Go impl and a JS impl.
> > 
> > Payload:  
> > {“action”:”get”, “timestamp”:”2020-08-25T13:37:00Z”, “value”:”123”,
> > “requestId”:”999”}
> > The above payload is 86 chars. 
> > 
> > http://www.txtwizard.net/compression
> > GZ: Execution time: 11875 us, Compression ratio: 112 %,
> > Original size: 118 bytes, Result size: 105 bytes
> 
> I noticed that here it says the original size is 118, but above it is 86.
> It doesn't change anything fundamental, just pointing it out.
> 
> It could be because something happened when pasting into a web tool.
> There might be some other encoding of the text going on.  I actually
> noticed, when I copied the example from your HTML-formatted email, that
> the above was using left- and right-leaning (curly) quote characters
> instead of the plain ASCII double quote, and I then get a 120-byte file
> with UTF-8 encoding, which suggests that some multi-byte characters
> snuck in.
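
The size difference can be checked with a few lines of Python. This is
only a sketch: it assumes every double quote in the payload was replaced
by a curly quote (U+201C or U+201D, both of which take 3 bytes in UTF-8).

```python
# Payload with plain ASCII double quotes, as intended.
ascii_payload = '{"action":"get", "timestamp":"2020-08-25T13:37:00Z", "value":"123", "requestId":"999"}'

# Assumption: every quote was turned into a curly quote by the mail client.
# U+201D is used for all of them here; U+201C has the same UTF-8 length.
curly_payload = ascii_payload.replace('"', "\u201d")

quote_count = ascii_payload.count('"')
ascii_size = len(ascii_payload.encode("utf-8"))
curly_size = len(curly_payload.encode("utf-8"))

print(quote_count)  # 16
print(ascii_size)   # 86
print(curly_size)   # 118 = 86 + 16 * 2 extra bytes
```

That matches the 118 bytes the web tool reported as the original size.
The 120-byte file might simply have had a line break or two on top of
that.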
> 
> After fixing that and running gzip on the command line on the pure-ASCII
> version, the size increases from 86 to 100 bytes.
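
Part of that growth is fixed container overhead: a gzip member carries
at least a 10-byte header and an 8-byte trailer (RFC 1952), and the
command-line tool normally stores the file name as well. A minimal
Python sketch to separate the container from the DEFLATE data itself,
assuming the ASCII payload from this thread (exact sizes can differ
slightly from the command-line result):

```python
import gzip
import zlib

PAYLOAD = b'{"action":"get", "timestamp":"2020-08-25T13:37:00Z", "value":"123", "requestId":"999"}'

# gzip container: header + DEFLATE data + CRC32/length trailer.
gz = gzip.compress(PAYLOAD)

# Raw DEFLATE data only (negative wbits means no header or trailer).
comp = zlib.compressobj(level=9, wbits=-zlib.MAX_WBITS)
raw = comp.compress(PAYLOAD) + comp.flush()

print("original:   ", len(PAYLOAD), "bytes")
print("gzip:       ", len(gz), "bytes")
print("raw deflate:", len(raw), "bytes")
```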
> 
> But none of those details really matter and the results are expected
> behavior on very small files - we already agreed to that yesterday.
> 
> I don't want to waste time on the exact number of bytes in that
> comparison.  It is well known that any method where a predefined
> dictionary exists can compress much better than a general-purpose
> compression that is not allowed to agree on a lookup dictionary
> beforehand (which you have done for the keywords, and kind of indirectly
> also for the shortened UUID).
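
The effect of a dictionary agreed on beforehand can be tried with
standard zlib, which accepts a preset dictionary. The sketch below is in
Python; the dictionary content (ZDICT) is a hypothetical template chosen
for this example, and this is not the proprietary algorithm discussed in
this thread.

```python
import zlib

PAYLOAD = b'{"action":"get", "timestamp":"2020-08-25T13:37:00Z", "value":"123", "requestId":"999"}'

# Hypothetical shared dictionary: the parts of a response that both sides
# could know in advance (keys, punctuation, a typical timestamp shape).
ZDICT = b'{"action":"get", "timestamp":"2020-01-01T00:00:00Z", "value":"", "requestId":""}'


def compress(message: bytes, zdict: bytes = b"") -> bytes:
    """zlib-compress a message, optionally priming it with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(message) + comp.flush()


def decompress(data: bytes, zdict: bytes = b"") -> bytes:
    """Inverse of compress(); the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(data) + decomp.flush()


if __name__ == "__main__":
    plain = compress(PAYLOAD)
    primed = compress(PAYLOAD, ZDICT)
    assert decompress(primed, ZDICT) == PAYLOAD
    print("no dictionary:  ", len(plain), "bytes")
    print("with dictionary:", len(primed), "bytes")
```

Both ends have to ship the same dictionary, which is the same kind of
prior agreement a tailor-made encoding relies on.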
> 
> One thing that might still be useful, just to see if sticking to plain
> HTTP-supported compression is an option (which would be A LOT easier),
> is to perform a comparison of a large response to a large query.  I'm
> thinking that is where compression is also most important?
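
Such a comparison is easy to script. The Python sketch below gzips the
small payload from this thread and a made-up large response. The shape
of the large response (a "data" array of value/timestamp pairs) and the
function build_large_response are assumptions for illustration, not
taken from the Gen2 specification.

```python
import gzip
import json

SMALL = b'{"action":"get", "timestamp":"2020-08-25T13:37:00Z", "value":"123", "requestId":"999"}'


def build_large_response(samples: int = 500) -> bytes:
    """Build a hypothetical large response with many samples of one signal."""
    points = [
        {
            "value": str(100 + i % 50),
            "timestamp": f"2020-08-25T13:{(i // 60) % 60:02d}:{i % 60:02d}Z",
        }
        for i in range(samples)
    ]
    response = {"action": "get", "requestId": "999", "data": points}
    return json.dumps(response).encode("utf-8")


LARGE = build_large_response()

if __name__ == "__main__":
    for name, body in (("small", SMALL), ("large", LARGE)):
        packed = gzip.compress(body)
        print(f"{name}: {len(body)} -> {len(packed)} bytes")
```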
> 
> If the goal is to truly minimize transfers, both large and small, then I
> wonder why the approach is not to use a full binary encoding instead.
> (Or if there are other goals, let's discuss them?)  As you probably
> understand, I am sceptical of the idea of creating a custom "compression
> algorithm" without clarifying what the point of that is.
> 
> It might be a matter of definition, but I already tend to think about
> what you created as an /alternative encoding/ more than compression, and
> I'm thinking that a mind-shift towards that term might uncover even
> better alternatives?
> 
> A super-tailored encoding for any task could be made optimal, but
> building on standards is worthwhile.
> 
> It doesn't hurt to use a formal language to describe the message schema
> anyway.  (The Gen2 specification probably ought to do that for the
> original JSON too?)  And as I mentioned on our previous call, Avro and
> Protobuf have languages to describe such schemas.  If you then use the
> associated tooling, the resulting binary encoding could be studied.
> There are /MANY/ other options out there, many of which were also
> discussed in a previous GENIVI project named "Generic Communication
> Protocols Evaluation".  I'm sure this is not a new discussion in W3C
> either, and I may have heard Ted mention that also.
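
To get a feel for what a schema-driven binary encoding buys, here is a
hand-rolled Python sketch using only the standard struct module. It is
neither Avro nor Protobuf, and not the proprietary algorithm from this
thread. The field layout, the ACTIONS list and the assumption that
"value" and "requestId" fit in 16-bit unsigned integers are all
hypothetical.

```python
import struct
from datetime import datetime, timezone

# Hypothetical fixed schema: action (1 byte enum), timestamp (uint32 epoch
# seconds), value (uint16), requestId (uint16), in network byte order.
ACTIONS = ["get", "set", "subscribe", "unsubscribe"]
LAYOUT = struct.Struct("!BIHH")
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def encode(msg: dict) -> bytes:
    """Pack a JSON-style message into the fixed binary layout."""
    ts = datetime.strptime(msg["timestamp"], TS_FORMAT).replace(tzinfo=timezone.utc)
    return LAYOUT.pack(
        ACTIONS.index(msg["action"]),
        int(ts.timestamp()),
        int(msg["value"]),
        int(msg["requestId"]),
    )


def decode(data: bytes) -> dict:
    """Unpack the binary layout back into the JSON-style message."""
    action, secs, value, request_id = LAYOUT.unpack(data)
    ts = datetime.fromtimestamp(secs, tz=timezone.utc)
    return {
        "action": ACTIONS[action],
        "timestamp": ts.strftime(TS_FORMAT),
        "value": str(value),
        "requestId": str(request_id),
    }


MSG = {"action": "get", "timestamp": "2020-08-25T13:37:00Z", "value": "123", "requestId": "999"}

if __name__ == "__main__":
    wire = encode(MSG)
    assert decode(wire) == MSG
    print(len(wire), "bytes")  # 9
```

A real schema language adds field tags, optional fields and versioning
on top of this, so its output would likely be somewhat larger, but the
tooling and interoperability come for free.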
> 
> Sincerely,
> - Gunnar
-- 
Ted Guild <ted@w3.org>
W3C Automotive Lead
https://www.w3.org/auto
Received on Tuesday, 1 September 2020 15:28:55 UTC