RE: JSON-LD Algorithms from Markus Lanthaler on 2013-01-04 (public-linked-json@w3.org from January 2013)

From: Markus Lanthaler <markus.lanthaler@gmx.net>
Date: Fri, 4 Jan 2013 16:35:21 +0100
To: <public-linked-json@w3.org>
Message-ID: <014d01cdea91$1c327a00$54976e00$@lanthaler@gmx.net>
On Friday, January 04, 2013 1:24 AM, Gregg Kellogg wrote:

> I've spent some time going over the updated algorithms and updating my
> parser. I have to say, that I'm distressed at how much the core
> algorithms have changed. In some cases (e.g., Inverse Context) it
> seems that the motivation was for performance, at the expense of being
> able to walk through the algorithms; I really think this is the wrong
> idea, and pretty much requires that an implementation exactly
> implement the algorithms to have a chance at conforming. This is also
> quite de-stabilizing, as we are trying to get to LC soon. IMO, the
> purpose of algorithm definitions in W3C specs is to allow them to be
> unambiguously understood by an implementor as to what the intent of
> the algorithms is, rather than the specific way in which they must be
> implemented.

I assume you are mostly talking about compaction here because the other
algorithms haven't been changed much (apart from the re-labeling of blank
nodes in expansion). We all agreed that compaction was already too complex
before the introduction of property generators and the new container types.
I looked at various ways to simplify it and the inverse context approach was
by far the simplest and most performant way to implement it. It was also the
only scalable solution I could come up with. I do agree that the pure
algorithms may not be enough to easily understand what's going on - we
should add some more high-level prose; I just haven't had the time yet. So
let me try to explain it in this email and see if it is understandable if
explained at a higher level.

The inverse context allows to more or less directly access the best-matching
term for a IRI-value pair (were the value is optional). All terms are put in
a multi-dimensional array (or nested object structure if you prefer) whose
structure looks as follow:

IRI    Container    Type/Lang?  Value        PropGen?   Term(s)
---------------------------------------------------------------
       @null
       @set         @null       @null        term
IRI    @list        @type       type IRI     propGens   term(s)
       @annotation  @language   lang. tag
       @language

       term        term (shortest term to use for IRIs without value)

What's worth noting is that the inverse context is sorted, e.g., by IRI
(longest first). So, if you need to create a compact IRI, you simply loop
over the IRIs, at the first match you take invCtx[IRI][term] and you are
done.

If you need to compact a IRI-value pair, e.g., IRI: { "@value": "hi",
"@language": "en" }, you first look at invCtx[IRI][@language]... to see if
there's a language container, then at
invCtx[IRI][@set][@language][en][propGens] as property generators and sets
are preferred. If there are no potential property generators, you look at
invCtx[IRI][@set][@language][en][term], then you try it at
invCtx[IRI][@null][@language]... (@null stands for no container) etc. If you
have the inverse context in front of you it is trivial to see which term is
chosen. You just need to know the fallback rules, e.g., for containers:
@language -> @set -> @null.

If there weren't property generators the algorithms would be a whole lot
simpler as you wouldn't need to collect candidates.


> Looking at the record [1], the group decided to adopt the algorithm I
> had specified (with some modifications) as an update to the current
> compaction algorithm, however the new compaction algorithm bears
> little resemblance to the old algorithm.

If you look more closely at the algorithms you will see that the algorithms
are doing exactly that:

- when expanding unlabeled nodes are automatically labeled with blank node
identifiers (this is also the reason why we need to re-label all blank nodes
in expansion; as decided in [1])

- when compacting, all potential property generators are evaluated. The
algorithms check if duplicates can be found for *all* IRIs using deep
comparison, if so, they are eliminated and the property generator term is
used.


> Looking at IRI compaction, the sense of the group definitely seemed to
> be to stick as close as we can to the current algorithms [2]. However,
> the IRI compaction algorithm is also substantially different than
> before.

That was just a proposal without a resolution. I announced in the 2012-12-11
telecon [2] (not transcribed, 1:27:00 onwards in the audio file) that I will
introduce the notion of a inverse context if I'm going to update the
algorithms. 


> My current implementation sticks closely to the original algorithms,
> with relatively minor updates to Term Rank, Expansion and Compaction
> to support language maps, annotations and property generators. I
> haven't completed my property generator compaction implementation, but
> I'm confident that the modification I specified relative to the
> previous algorithm can work (with some minor tweaks, no doubt).

Do you pass the whole test suite? There were a lot of minor things (mostly
corner cases) that weren't handled properly in the old algorithms. I added I
number of tests but I'm not sure I captured them all in the tests.


> I recommend that we go back to the previous algorithms and look for
> the minimal changes that are necessary to specify the desired
> behavior. I can certainly do this based on the work I've done to bring
> my implementation up-to-date.
> 
> Modifications to those algorithms should try to make the meaning of
> them clear when reading, rather than relying on emergent behavior. For
> example, knowing what term to choose when compaction should be
> something you can determine by walking through the IRI Compaction and
> Term Rank algorithms given specific input.

And if you have an inverse context you can pick it from a "table". I think
we all agree that the previous algorithms weren't simple at all. It's
debatable whether the new ones are simpler - but at least I tried hard to
simplify them.


> Lastly, I'm concerned that the tests depend too much on specifics of
> Blank Node naming. I can see why this is desirable, but it again
> limits implementations.

Well, that's only true if you want to do a simple deep comparison. In that
case I can't see how it could be solved in another way. A more sophisticated
test suite could ignore the blank node identifiers completely and check for
graph equivalence - I doubt that that would be simpler.


> One way to do this would be to transform the
> resulting JSON-LD into RDF and use well-understood graph isomorphism.

Nothing prevents you from doing that. But since our target group are JSON
developers that are (at least not directly) interested in RDF I think that
would be a step in the wrong direction.


> To allow us to continue to use JSON object equivalence as the result
> test, we could add a "preserve blank node identifiers" flag to be used
> for the bulk of operations that involve expansion. It's only in the
> case that blank nodes are renamed that we run into problems, and some
> graph isomorphism might be necessary.

We discussed such a flag when we discussed the renaming of blank nodes in
expansion but decided to not do it.

I completely understand your concerns and the fact that you already invested
a lot of time in the old algorithms but I think reverting the algorithms at
this point is not a productive way forward. Both specs still need quite some
editorial edits and also the test suite should be extended with more corner
cases. We should invest our time in doing that instead of continuing
discussing the algorithms - especially considering that there are no
technical reasons to do so. Simplicity/complexity is - at least to a certain
degree - always subjective.


[1] http://json-ld.org/minutes/2012-12-18/#resolution-1
[2] http://json-ld.org/minutes/2012-12-11/


Cheers,
Markus



--
Markus Lanthaler
@markuslanthaler
Received on Friday, 4 January 2013 15:35:52 UTC