Re: JSON-LD Algorithms from Gregg Kellogg on 2013-01-04 (public-linked-json@w3.org from January 2013)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Fri, 4 Jan 2013 12:58:06 -0500
To: Markus Lanthaler <markus.lanthaler@gmx.net>
CC: "public-linked-json@w3.org" <public-linked-json@w3.org>
Message-ID: <7B2B6BD9-8EBD-4895-BE90-3820D014124B@greggkellogg.net>
On Jan 4, 2013, at 7:35 AM, Markus Lanthaler <markus.lanthaler@gmx.net> wrote:

> On Friday, January 04, 2013 1:24 AM, Gregg Kellogg wrote:
> 
>> I've spent some time going over the updated algorithms and updating my
>> parser. I have to say, that I'm distressed at how much the core
>> algorithms have changed. In some cases (e.g., Inverse Context) it
>> seems that the motivation was for performance, at the expense of being
>> able to walk through the algorithms; I really think this is the wrong
>> idea, and pretty much requires that an implementation exactly
>> implement the algorithms to have a chance at conforming. This is also
>> quite de-stabilizing, as we are trying to get to LC soon. IMO, the
>> purpose of algorithm definitions in W3C specs is to allow them to be
>> unambiguously understood by an implementor as to what the intent of
>> the algorithms is, rather than the specific way in which they must be
>> implemented.
> 
> I assume you are mostly talking about compaction here because the other
> algorithms haven't been changed much (apart from the re-labeling of blank
> nodes in expansion).

To my eye, Expansion, Context Processing, IRI Expansion, Value Expansion, Compaction, IRI Compaction, and Value Compaction are a fair bit different; sufficiently so that I gave up on doing a broad re-implementation of my versions of these algorithms. As I said, I'm able to achieve the same results with more minimal changes to the original algorithms.

> We all agreed that compaction was already too complex
> before the introduction of property generators and the new container types.
> I looked at various ways to simplify it and the inverse context approach was
> by far the simplest and most performant way to implement it. It was also the
> only scalable solution I could come up with. I do agree that the pure
> algorithms may not be enough to easily understand what's going on - we
> should add some more high-level prose; I just haven't had the time yet. So
> let me try to explain it in this email and see if it is understandable if
> explained at a higher level.
> 
> The inverse context allows to more or less directly access the best-matching
> term for a IRI-value pair (were the value is optional). All terms are put in
> a multi-dimensional array (or nested object structure if you prefer) whose
> structure looks as follow:
> 
> IRI    Container    Type/Lang?  Value        PropGen?   Term(s)
> ---------------------------------------------------------------
>       @null
>       @set         @null       @null        term
> IRI    @list        @type       type IRI     propGens   term(s)
>       @annotation  @language   lang. tag
>       @language
> 
>       term        term (shortest term to use for IRIs without value)
> 
> What's worth noting is that the inverse context is sorted, e.g., by IRI
> (longest first). So, if you need to create a compact IRI, you simply loop
> over the IRIs, at the first match you take invCtx[IRI][term] and you are
> done.
> 
> If you need to compact a IRI-value pair, e.g., IRI: { "@value": "hi",
> "@language": "en" }, you first look at invCtx[IRI][@language]... to see if
> there's a language container, then at
> invCtx[IRI][@set][@language][en][propGens] as property generators and sets
> are preferred. If there are no potential property generators, you look at
> invCtx[IRI][@set][@language][en][term], then you try it at
> invCtx[IRI][@null][@language]... (@null stands for no container) etc. If you
> have the inverse context in front of you it is trivial to see which term is
> chosen. You just need to know the fallback rules, e.g., for containers:
> @language -> @set -> @null.
> 
> If there weren't property generators the algorithms would be a whole lot
> simpler as you wouldn't need to collect candidates.

This still seems like an optimization to me; you could achieve similar results by iterating through terms in the context. Maintaining such an inverse table perhaps creates a more efficient way to do this, but I don't think that's the purpose of the algorithms, as I said.

In fact, property generator term/value compaction is left to a separate step that I had outlined in the issue, one that took place before processing each key/value in element (indeed, requiring two steps).

>> Looking at the record [1], the group decided to adopt the algorithm I
>> had specified (with some modifications) as an update to the current
>> compaction algorithm, however the new compaction algorithm bears
>> little resemblance to the old algorithm.
> 
> If you look more closely at the algorithms you will see that the algorithms
> are doing exactly that:
> 
> - when expanding unlabeled nodes are automatically labeled with blank node
> identifiers (this is also the reason why we need to re-label all blank nodes
> in expansion; as decided in [1])

Yes, I include this as part of IRI Expansion, if a node relabling method is passed.

> - when compacting, all potential property generators are evaluated. The
> algorithms check if duplicates can be found for *all* IRIs using deep
> comparison, if so, they are eliminated and the property generator term is
> used.

The difference seems to be that your algorithm folds this into the main key/value processing step, and depends on Find and Remove Property Generator Duplicates to remove data from the object as it is being processed, which would certainly be impossible with a functional implementation. Even though it takes an extra pass, doing this first simplifies everything that comes later.

>> Looking at IRI compaction, the sense of the group definitely seemed to
>> be to stick as close as we can to the current algorithms [2]. However,
>> the IRI compaction algorithm is also substantially different than
>> before.
> 
> That was just a proposal without a resolution. I announced in the 2012-12-11
> telecon [2] (not transcribed, 1:27:00 onwards in the audio file) that I will
> introduce the notion of a inverse context if I'm going to update the
> algorithms. 

I wouldn't take silence as acquiescence; it's just that we don't all have the time to spend on these things at the moment, which is why I'm responding now.

>> My current implementation sticks closely to the original algorithms,
>> with relatively minor updates to Term Rank, Expansion and Compaction
>> to support language maps, annotations and property generators. I
>> haven't completed my property generator compaction implementation, but
>> I'm confident that the modification I specified relative to the
>> previous algorithm can work (with some minor tweaks, no doubt).
> 
> Do you pass the whole test suite? There were a lot of minor things (mostly
> corner cases) that weren't handled properly in the old algorithms. I added I
> number of tests but I'm not sure I captured them all in the tests.

I pass all the expansion tests (except 0039, as I've complained about before), and all the compaction tests, except those for property generators, which I'm working on now. I'm failing most of the framing tests, and I'm not sure why, this is an algorithm that may have changed as well, but I wasn't going to look at that until I had completed compaction. Framing tests work as well for me as the did before, and of course, I pass all the RDF tests.

>> I recommend that we go back to the previous algorithms and look for
>> the minimal changes that are necessary to specify the desired
>> behavior. I can certainly do this based on the work I've done to bring
>> my implementation up-to-date.
>> 
>> Modifications to those algorithms should try to make the meaning of
>> them clear when reading, rather than relying on emergent behavior. For
>> example, knowing what term to choose when compaction should be
>> something you can determine by walking through the IRI Compaction and
>> Term Rank algorithms given specific input.
> 
> And if you have an inverse context you can pick it from a "table". I think
> we all agree that the previous algorithms weren't simple at all. It's
> debatable whether the new ones are simpler - but at least I tried hard to
> simplify them.
> 
> 
>> Lastly, I'm concerned that the tests depend too much on specifics of
>> Blank Node naming. I can see why this is desirable, but it again
>> limits implementations.
> 
> Well, that's only true if you want to do a simple deep comparison. In that
> case I can't see how it could be solved in another way. A more sophisticated
> test suite could ignore the blank node identifiers completely and check for
> graph equivalence - I doubt that that would be simpler.

I'm not as concerned if the test methodology isn't as easy, as if the core algorithms aren't. Indeed, I think we will do an online test suite, similar to what we did for RDFa, where developers can simply point the test processor at an endpoint to determine if they pass or fail. I use graph isomorphism in my own tests for toRDF anyway.

>> One way to do this would be to transform the
>> resulting JSON-LD into RDF and use well-understood graph isomorphism.
> 
> Nothing prevents you from doing that. But since our target group are JSON
> developers that are (at least not directly) interested in RDF I think that
> would be a step in the wrong direction.
> 
> 
>> To allow us to continue to use JSON object equivalence as the result
>> test, we could add a "preserve blank node identifiers" flag to be used
>> for the bulk of operations that involve expansion. It's only in the
>> case that blank nodes are renamed that we run into problems, and some
>> graph isomorphism might be necessary.
> 
> We discussed such a flag when we discussed the renaming of blank nodes in
> expansion but decided to not do it.

As a general flag, no, but as something that might aide in testing, it might be useful.

> I completely understand your concerns and the fact that you already invested
> a lot of time in the old algorithms but I think reverting the algorithms at
> this point is not a productive way forward. Both specs still need quite some
> editorial edits and also the test suite should be extended with more corner
> cases. We should invest our time in doing that instead of continuing
> discussing the algorithms - especially considering that there are no
> technical reasons to do so. Simplicity/complexity is - at least to a certain
> degree - always subjective.

Well, I've made my opinion clear. When others update their implementations they can determine for themselves the best course to follow. I won't be updating my implementation broadly to implement these algorithms, and I'll push back on tests I feel are too prescriptive. There is no requirement that implementations actually implement the core algorithms as stated, only that they conform to the normative requirements, but I was previously trying to stick pretty close to the documented algorithms.

FWIW, I applaud you for the initiative you took in moving this forward, when Manu and I did not have as much time to devote to it, but I think that as editors, we can do better in coordinating the work so that we don't end up at odds. Perhaps we should be quicker to have editorial meetings outside of the regular Tuesday meetings so that the work remains a collaborative effort.

Gregg

> [1] http://json-ld.org/minutes/2012-12-18/#resolution-1
> [2] http://json-ld.org/minutes/2012-12-11/
> 
> 
> Cheers,
> Markus
> 
> 
> 
> --
> Markus Lanthaler
> @markuslanthaler
> 
>
Received on Friday, 4 January 2013 17:58:49 UTC