Re: CBOR-LD from Simon Steyskal on 2019-02-20 (public-json-ld-wg@w3.org from February 2019)

From: Simon Steyskal <simon.steyskal@wu.ac.at>
Date: Wed, 20 Feb 2019 09:57:15 +0100
To: Ivan Herman <ivan@w3.org>
Cc: W3C JSON-LD Working Group <public-json-ld-wg@w3.org>
Message-ID: <d51aaa4a0fc70585c2182cd1a4c968f5@wu.ac.at>

Hi!

Thanks for the summary Ivan!

> I was wondering about the compression ratio of CBOR; I have not found
> real data. I have tested some of my JSON-LD files and, on average,
> the compression of the JSON data was around 50%. But my files are
> small, i.e., this may not be significant. Does anyone have some bigger
> data to test with?

FWIW, I combined the scripts in [1] and [2], replaced the cbor package 
with cbor2, and used a JSON dump as the test dict to benchmark various 
serialization formats on a real-world ~100 MB JSON dump of ours 
(alternatively, you can generate a synthetic dump via 
http://www.json-generator.com). The results and the Python script can 
be found at [3].

TL;DR: the CBOR encoding of ~100 MB of plain JSON with lots of strings 
was still at 80% of the original file size.
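
For anyone who wants to try this kind of size comparison without 
installing anything, here is a minimal sketch using only the Python 
standard library. It is not the script from [3]: instead of cbor2 it 
uses a small hand-rolled encoder that covers only the JSON data types, 
and the names cbor_encode, _head and size_report as well as the 
synthetic sample document are made up for illustration. The ratios it 
prints depend entirely on the input data.

```python
import gzip
import json
import struct


def _head(major, n):
    """Initial byte(s): 3-bit major type plus the shortest encoding of n."""
    if n < 24:
        return bytes([(major << 5) | n])
    widths = ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
              (26, ">I", 1 << 32), (27, ">Q", 1 << 64))
    for info, fmt, limit in widths:
        if n < limit:
            return bytes([(major << 5) | info]) + struct.pack(fmt, n)
    raise OverflowError("integer too large for this sketch")


def cbor_encode(obj):
    """Encode a JSON-compatible value as CBOR (small subset of RFC 7049)."""
    if obj is None:
        return b"\xf6"
    if obj is True:
        return b"\xf5"
    if obj is False:
        return b"\xf4"
    if isinstance(obj, int):
        # Major type 0 for non-negative, major type 1 for negative integers.
        return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
    if isinstance(obj, float):
        # Always a 64-bit float here; real encoders may pick shorter forms.
        return b"\xfb" + struct.pack(">d", obj)
    if isinstance(obj, str):
        # Text is stored as plain UTF-8 bytes behind a length prefix.
        data = obj.encode("utf-8")
        return _head(3, len(data)) + data
    if isinstance(obj, list):
        return _head(4, len(obj)) + b"".join(cbor_encode(v) for v in obj)
    if isinstance(obj, dict):
        body = b"".join(cbor_encode(k) + cbor_encode(v)
                        for k, v in obj.items())
        return _head(5, len(obj)) + body
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def size_report(doc):
    """Return each encoding's size as a fraction of pretty-printed JSON."""
    pretty = json.dumps(doc, indent=2, ensure_ascii=False).encode("utf-8")
    minified = json.dumps(doc, separators=(",", ":"),
                          ensure_ascii=False).encode("utf-8")
    sizes = {
        "pretty JSON": len(pretty),
        "minified JSON": len(minified),
        "CBOR": len(cbor_encode(doc)),
        "gzip(pretty JSON)": len(gzip.compress(pretty)),
    }
    return {name: size / len(pretty) for name, size in sizes.items()}


if __name__ == "__main__":
    # Synthetic, string-heavy stand-in for a real dump (an assumption).
    sample = {
        "@context": "https://schema.org/",
        "@graph": [
            {
                "@id": f"http://example.org/person/{i}",
                "@type": "Person",
                "name": f"Person number {i}",
                "age": 20 + i % 50,
            }
            for i in range(1000)
        ],
    }
    for name, ratio in size_report(sample).items():
        print(f"{name:>20}: {ratio:6.1%}")
```

To run it on a real dump, replace the sample with 
json.load(open(path)). Since text strings are copied into the CBOR 
output byte for byte, string-heavy data mostly saves only on 
punctuation and whitespace.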


hth, simon


[1] https://gist.github.com/cactus/4073643/ (python 2)
[2] https://gist.github.com/crhan/f04ddb7373533bae4d051271497e080e (python 3)
[3] https://gist.github.com/simonstey/a499b98c088efce3f32264426fd2f495

---
DDipl.-Ing. Simon Steyskal
Institute for Information Business, WU Vienna

www: http://www.steyskal.info/  twitter: @simonsteys

On 2019-02-19 18:05, Ivan Herman wrote:
> Because we discussed it at the F2F meeting, I was looking at the CBOR
> spec[1] and some other documents to see what it would mean for us to
> write a note. The answer is: it is probably trivial:-). The simplest
> approach is not to refer to the abstract data model and concepts but
> start from the JSON serialization. Indeed, the spec includes a section
> on JSON<->CBOR conversion (section 4) and it would be foolish not to
> use that. (The RFC text says it is non-normative, but that is not a
> real problem for us, because we are considering a note only…)
> 
> [1] https://www.rfc-editor.org/rfc/rfc7049.txt
> 
> ## Conversions
> 
> ### Converting CBOR to JSON
> 
> This is the bit which is not 100% obvious, because CBOR is a superset
> of JSON in terms of expressivity. It allows the storage of binary
> data, allows for non-string and possibly repeated keys for
> dictionaries (maps, as they call them), etc. However, the section
> describes a possible strategy to take care of each of those,
> and we can just simply say 'do what is in the RFC!'. (E.g., binary
> data is base64url encoded and stored as a string, non-string keys are
> dropped, etc.)
> 
> ### Converting JSON to CBOR
> 
> There are some notes there on how numbers should be stored; my
> impression is that there is nothing special for our case, the issues
> are more about how to choose among semantically equivalent
> representations of numbers.
> 
> ## Canonical CBOR
> 
> There is a concept of "canonical CBOR" (section 3.9): "…two encoder
> implementations starting with the same input data will produce the
> same CBOR output". (E.g., choose a specific number representation,
> order the keys, etc.)
> 
> I am not sure this is important for us, although maybe there are
> corner cases where roundtripping may require it (although nothing
> comes to my mind right now).
> 
> ## CBOR-LD as Binary RDF?
> 
> I was wondering about the compression ratio of CBOR; I have not found
> real data. I have tested some of my JSON-LD files and, on average,
> the compression of the JSON data was around 50%. But my files are
> small, i.e., this may not be significant. Does anyone have some bigger
> data to test with?
> 
> As a comparison, a minified version for the same JSON files was about
> 70% of the original but a simple gzip was around 25%. (As far as I
> could see, Unicode character strings remain unchanged in the CBOR
> encoded file, which may explain this.) I.e., CBOR is not all that
> great in terms of compression; the noted goal in the CBOR spec is that
> they have put a higher priority on being able to write a very light
> coder/decoder that would require a very small processing footprint,
> even if that made the compression less efficient. I guess this would
> be of interest for our WoT friends, but may not make CBOR-LD very
> interesting for those who want to achieve better compression for
> JSON-LD data storage...
> 
> I am not sure where we would go from here…
> 
> Ivan
> 
> ----
> Ivan Herman, W3C
> Publishing@W3C Technical Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: https://orcid.org/0000-0003-0782-2704
Received on Wednesday, 20 February 2019 08:57:49 UTC