- From: Simon Steyskal <simon.steyskal@wu.ac.at>
- Date: Wed, 20 Feb 2019 09:57:15 +0100
- To: Ivan Herman <ivan@w3.org>
- Cc: W3C JSON-LD Working Group <public-json-ld-wg@w3.org>
Hi!

Thanks for the summary Ivan!

> I was wondering about the compression ratio of CBOR; I have not found
> real data. I have tested some of my JSON-LD files and, on the average,
> the compression of the JSON data was around 50%. But my files are
> small, ie, this may not be significant. Anyone has some bigger data
> that one can test with?

fwiw, I used a combination of [1-2] where I replaced cbor with cbor2 and
used a json dump as test dict to benchmark various serialization formats
on a real-world ~100MB JSON dump of ours (alternatively, you can generate
a synthetic dump via http://www.json-generator.com). Results + Python
script can be found at [3].

TL;DR: cbor compression on ~100MB plain JSON with lots of strings was
still at 80% of the original file size.

hth,
simon

[1] https://gist.github.com/cactus/4073643/ (python 2)
[2] https://gist.github.com/crhan/f04ddb7373533bae4d051271497e080e (python 3)
[3] https://gist.github.com/simonstey/a499b98c088efce3f32264426fd2f495

---
DDipl.-Ing. Simon Steyskal
Institute for Information Business, WU Vienna

www: http://www.steyskal.info/
twitter: @simonsteys

On 2019-02-19 18:05, Ivan Herman wrote:
> Because we discussed it at the F2F meeting, I was looking at the CBOR
> spec[1] and some other documents to see what it would mean for us to
> write a note. The answer is: it is probably trivial:-). The simplest
> approach is not to refer to the abstract data model and concepts but
> start from the JSON serialization. Indeed, the spec includes a section
> on JSON<->CBOR conversion (section 4) and it would be foolish not to
> use that. (The RFC text says it is non-normative, but that is not a
> real problem for us, because we are considering a note only…)
>
> [1] https://www.rfc-editor.org/rfc/rfc7049.txt
>
> ## Conversions
>
> ### Converting CBOR to JSON
>
> This is the bit which is not 100% obvious, because CBOR is a superset
> of JSON in terms of expressivity. It allows the storage of binary
> data, allows for non-string and possibly repeated keys for
> dictionaries (maps, as they call it), etc. However, the section
> describes a possible strategy to take care of each of those,
> and we can just simply say 'do what is in the RFC!'. (E.g., binary
> data is base64url encoded and stored as a string, non-string keys are
> dropped, etc.)
>
> ### Converting JSON to CBOR
>
> There are some notes there on how numbers should be stored; my
> impression is that there is nothing special for our case, the issues
> are more how to choose among semantically equivalent representations
> of numbers.
>
> ## Canonical CBOR
>
> There is a concept of "canonical CBOR" (section 3.9): "…two encoder
> implementations starting with the same input data will produce the
> same CBOR output". (E.g., choose a specific number representation,
> order the keys, etc.)
>
> I am not sure this is important for us, although maybe there are
> corner cases where roundtripping may require it (although nothing
> comes to my mind right now).
>
> ## CBOR-LD as Binary RDF?
>
> I was wondering about the compression ratio of CBOR; I have not found
> real data. I have tested some of my JSON-LD files and, on the average,
> the compression of the JSON data was around 50%. But my files are
> small, ie, this may not be significant. Anyone has some bigger data
> that one can test with?
>
> As a comparison, a minified version for the same JSON files was about
> 70% of the original but a simple gzip was around 25%. (As far as I
> could see, Unicode character strings remain unchanged in the CBOR
> encoded file, which may explain this.) I.e., CBOR is not all that
> great in terms of compression; the noted goal in the CBOR spec is that
> they have put a higher priority on being able to write a very light
> coder/decoder that would require a very small processing footprint,
> even if that made the compression less efficient. I guess this would
> be of interest for our WoT friends, but may not make CBOR-LD very
> interesting for those who want to achieve better compression for
> JSON-LD data storage...
>
> I am not sure where we would go from here…
>
> Ivan
>
> ----
> Ivan Herman, W3C
> Publishing@W3C Technical Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: https://orcid.org/0000-0003-0782-2704
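
For anyone who wants to reproduce the size comparison discussed in this thread without installing anything, here is a rough sketch in plain Python. It is not the script from the gists above: the `encode` function is a hand-written, minimal encoder for the JSON subset of CBOR (following the initial-byte layout in RFC 7049 section 2.1), and the `doc` sample is a made-up JSON-LD-like document, so the absolute numbers mean little. It mainly shows that text strings are carried over as raw UTF-8 with a short length prefix, which is presumably why string-heavy data shrinks so little.

```python
import gzip
import json
import struct


def _head(major: int, n: int) -> bytes:
    """Initial byte(s): 3-bit major type plus the length/value argument."""
    if n < 24:
        return bytes([major << 5 | n])
    for extra, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([major << 5 | extra]) + n.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def encode(value) -> bytes:
    """Encode a JSON-compatible Python value as CBOR (illustrative subset only)."""
    if value is None:
        return b"\xf6"
    if value is True:
        return b"\xf5"
    if value is False:
        return b"\xf4"
    if isinstance(value, int):
        # Major type 0 for non-negative, major type 1 encodes -1 - n.
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, float):
        # Always a 64-bit float here; real encoders may pick shorter forms.
        return b"\xfb" + struct.pack(">d", value)
    if isinstance(value, str):
        data = value.encode("utf-8")  # the string bytes themselves are unchanged
        return _head(3, len(data)) + data
    if isinstance(value, list):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return _head(5, len(value)) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical sample document, only for illustration.
doc = {
    "@context": "http://schema.org/",
    "@type": "Person",
    "name": "Jane Doe",
    "url": "http://www.janedoe.com",
    "knows": [{"@type": "Person", "name": f"Friend {i}"} for i in range(50)],
}

pretty = json.dumps(doc, indent=2).encode("utf-8")
minified = json.dumps(doc, separators=(",", ":")).encode("utf-8")
sizes = {
    "json (pretty)": len(pretty),
    "json (minified)": len(minified),
    "json (pretty) + gzip": len(gzip.compress(pretty)),
    "cbor": len(encode(doc)),
    "cbor + gzip": len(gzip.compress(encode(doc))),
}

for label, size in sizes.items():
    print(f"{label:22s} {size:6d} bytes  {100 * size / len(pretty):5.1f}% of pretty JSON")
```

To try this on real data, replace `doc` with the result of `json.load` on your own file. For anything beyond a quick check, a maintained library such as cbor2 (the one used in the benchmark above) is the safer choice, since this sketch ignores tags, byte strings, and the shorter float encodings.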
Received on Wednesday, 20 February 2019 08:57:49 UTC