Re: Stream-based processing!? from Olivier Grisel on 2011-10-03 (public-linked-json@w3.org from October 2011)

From: Olivier Grisel <olivier.grisel@ensta.org>
Date: Mon, 3 Oct 2011 13:51:19 +0200
To: Ivan Herman <ivan@w3.org>
Cc: Markus Lanthaler <markus.lanthaler@gmx.net>, public-linked-json@w3.org
Message-ID: <CAFvE7K4ehwZjnVzcRk9o19zBpiLojeCpeCPCn-cMuNGtRZVL8w@mail.gmail.com>

2011/10/3 Ivan Herman <ivan@w3.org>:
>
> On Oct 2, 2011, at 22:33 , Markus Lanthaler wrote:
>>
>>
>>>> We could also require serializations ensure that @context is listed
>>>> first. If it isn't listed first, the processor has to save each
>>>> key-value pair until the @context is processed. This creates a memory
>>>> and complexity burden for one-pass processors.
>>
>> Agree. I think that would make a lot of sense since you can see the context
>> as a kind of header anyway.
>
> I must admit I do not really understand that, but that probably shows my ignorance of the wider JSON world.
>
> However... the standard JSON parser in Python parses a JSON object into a dictionary. However, at least in Python, you cannot rely on the order of the keys within the dictionary (it is determined by some hashing algorithm, if I am not mistaken, but that is internal to the interpreter anyway). Ie, whether @context appears first or last does not make any difference.
>
> Worse: if you then use such a structure to generate JSON using again the 'dump' feature of the standard Python parser, there is no way to control the order of those keys. In other words, if we impose such an order in JSON-LD, that means that a Python programmer must bypass the standard JSON library module and do the dump by hand. I do not think that would be acceptable...

In Python 2.7 and 3.2+ it is possible to have a deterministic order by
using the collections.OrderedDict class from the standard library. In
that case json.dump will respect that order. At parsing time it is
now possible to pass the OrderedDict class as "object_pairs_hook" to
avoid losing the ordering information.

  http://docs.python.org/library/json.html
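
A minimal sketch of the round trip described above, using only the
standard library. The document content (a FOAF "name" term) is made up
for illustration:

```python
import json
from collections import OrderedDict

# Build the document with @context inserted first; OrderedDict keeps
# keys in insertion order.
doc = OrderedDict()
doc["@context"] = {"name": "http://xmlns.com/foaf/0.1/name"}
doc["name"] = "Olivier"

# json.dumps / json.dump write the keys in the order the mapping yields them.
text = json.dumps(doc)

# object_pairs_hook receives each object's (key, value) pairs in document
# order, so passing OrderedDict preserves that order when parsing.
parsed = json.loads(text, object_pairs_hook=OrderedDict)

print(text)
print(list(parsed.keys()))
```

Note that this only controls the order within the Python process; a
consumer that parses into a plain unordered mapping still loses it.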

So I don't think it is such a huge deal to enforce the @context node
in first position. But that will require a bit of communication effort
for documenting and advertising such good practices to JSON-LD library
developers.

IMHO it is very interesting to be able to do one pass / streaming
processing of huge JSON-LD dumps without having to load the payload in
memory.
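
To illustrate why the position of @context matters for a one-pass
processor, here is a toy sketch. It is not a JSON-LD implementation:
resolve_terms is a hypothetical helper, the input is assumed to be the
(key, value) pairs of a single object in document order, and "resolving"
a key is reduced to a plain lookup in the context.

```python
def resolve_terms(pairs):
    """Map keys through @context in one pass over (key, value) pairs.

    Returns the resolved pairs and the largest number of pairs that had
    to be held back while waiting for @context to show up.
    """
    context = None
    pending = []        # pairs seen before @context
    max_buffered = 0
    resolved = []
    for key, value in pairs:
        if key == "@context":
            context = value
            # The context finally arrived: flush everything held back.
            for k, v in pending:
                resolved.append((context.get(k, k), v))
            pending = []
        elif context is None:
            # Cannot interpret the key yet, so it has to be kept in memory.
            pending.append((key, value))
            max_buffered = max(max_buffered, len(pending))
        else:
            resolved.append((context.get(key, key), value))
    # No @context at all: emit the held-back pairs unchanged.
    resolved.extend(pending)
    return resolved, max_buffered


context = {"name": "http://xmlns.com/foaf/0.1/name"}
context_first = [("@context", context), ("name", "Olivier")]
context_last = [("name", "Olivier"), ("@context", context)]

print(resolve_terms(context_first))
print(resolve_terms(context_last))
```

With @context first nothing is buffered; with @context last every
preceding pair has to be kept, which for a multi-GB object is exactly
the memory burden discussed above.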

For instance I would really like to be able to have JSON-LD dumps of
the full DBpedia that I could pre-filter in one pass before loading
them into a CouchDB database or an ElasticSearch fulltext index. Such
a JSON-LD dump would be several tens of GB uncompressed and would
probably not fit in the main memory of today's computers.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Received on Monday, 3 October 2011 11:52:10 UTC