Re: JSON-LD Algorithms

On Jan 17, 2013, at 6:57 PM, Markus Lanthaler <markus.lanthaler@gmx.net> wrote:

> I've spent some more time thinking about our algorithms and how we should
> describe them in the spec. I do agree with Gregg that the algorithms
> should be more descriptive so that they can be understood without actually
> implementing them. My intention when introducing the notion of an inverse
> context was just that. Unfortunately, it seems that the inverse context
> failed to make IRI compaction easier to understand, and at least I still
> believe that the Term Ranking algorithm is not much better in that regard.
> In fact, I had problems understanding it myself without actually implementing
> it (the original version; the current one is probably even more complex due
> to property generators).
> 
> I think the main problem with both algorithms is the level of abstraction we
> have chosen. It might be possible to describe the algorithm in a much
> simpler way by slightly increasing the level of abstraction. So let's first
> look at the problem we are trying to solve to ensure we are all on the same
> page.

Agreed; term ranking was always just an expedient way to express the results, and restating it in a more descriptive way is a good idea.

> --- Problem Description ---
> 
> IRI compaction tries to find the best matching term or compact IRI in an
> active context for a) just an IRI or b) an IRI-value pair.
> 
> Case a) is used only when compacting values of @type (previously also for
> @id). The result will be a term, a compact IRI, or a @vocab-abbreviated
> IRI (it's not @vocab-relative), or, if nothing matches, the original IRI (we
> don't want to compact this to relative IRIs, do we?)
> 
> Case b) is a bit more complex as, apart from the IRI itself, a term's
> definition also has to correspond to the passed value (container,
> type/language). The result will either be a single term, compact IRI, or
> relative/absolute IRI, or, if a property generator matches, a list of
> candidate property generators plus a single term, compact IRI, or
> relative/absolute IRI to fall back to if it turns out that no property
> generator matches.
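
To keep us on the same page, a concrete example might help (the context, the
terms, and the Python-dict shape are all made up for illustration):

    # A hypothetical active context:
    context = {
        "name":  {"@id": "http://xmlns.com/foaf/0.1/name"},
        "names": {"@id": "http://xmlns.com/foaf/0.1/name",
                  "@container": "@list"},
    }

    # Case a): compact a bare IRI (e.g., a value of @type):
    #   "http://xmlns.com/foaf/0.1/name"          -> "name"
    # Case b): compact an IRI-value pair; the value's shape matters:
    #   (".../name", {"@list": ["G", "K"]})       -> "names"
    #   (".../name", {"@value": "Gregg"})         -> "name"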
> 
> 
> --- Proposed Algorithm ---
> 
> Let's again consider a) and b) separately.
> 
> Case a) is rather trivial:
> 
> - Filter the active context's term definitions by IRI and,
>  if the result contains more than one term,
>    - Choose the shortest and lexicographically least
> 
> - Otherwise, for every term in the active context, check whether
>  its IRI partially matches the target IRI; if so, construct
>  a compact IRI. Out of all potential compact IRIs, choose the
>  shortest and lexicographically least
> 
> - Otherwise, if @vocab is set, use it to shorten the IRI if possible

Meaning, if the result doesn't match a term in the active context.

> - Otherwise, return the IRI as is
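
If it helps, here is case a) as I understand it, as compact Python (the helper
shapes are my own; treating the active context as a plain dict of term ->
definition is an oversimplification):

    def compact_iri(iri, active_context, vocab=None):
        # 1. Terms whose definition maps exactly to the target IRI;
        #    pick the shortest, breaking ties lexicographically.
        matches = [t for t, d in active_context.items()
                   if d.get("@id") == iri]
        if matches:
            return min(matches, key=lambda t: (len(t), t))

        # 2. Otherwise, build candidate compact IRIs from partial matches.
        candidates = []
        for term, d in active_context.items():
            prefix = d.get("@id")
            if isinstance(prefix, str) and iri.startswith(prefix) \
                    and iri != prefix:
                candidates.append(term + ":" + iri[len(prefix):])
        if candidates:
            return min(candidates, key=lambda c: (len(c), c))

        # 3. Otherwise, fall back to @vocab if it is set and applies.
        if vocab and iri.startswith(vocab):
            return iri[len(vocab):]

        # 4. Otherwise, return the IRI as is.
        return iri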
> 
> 
> Case b) is a bit more complex:
> 
> - Look at the value and extract the target container and type/language mapping.
>  I'll spare you the details. The result should be something like
>  container=@list, type=http://... or container=@set, language=en
> 
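Presumably that value introspection is something along these lines (a sketch
only; it glosses over @annotation handling and lists of mixed values):

    def target_shape(value, default_language=None):
        # Derive (container, type/language) from one expanded value;
        # None stands for "no container".
        if isinstance(value, dict):
            if "@list" in value:
                return "@list", None
            if "@language" in value:
                return None, ("@language", value["@language"])
            if "@type" in value:
                return None, ("@type", value["@type"])
            if "@id" in value:
                return None, ("@type", "@id")
        # A plain scalar picks up the default language, if any.
        return None, ("@language", default_language)
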
> Then we have to filter the active context multiple times, first looking for
> perfect matches, then falling back to weaker matches using relaxed filter
> criteria:
> 
> - Filter the active context by container

By this, I presume you mean filter to terms in the active context whose container exactly matches, or is compatible with, that of the value.

>   - Filter the active context by type or language if set

By this, I presume you mean filter to terms which exactly match either the type or the language. Need to take the default language into consideration too, I suppose. Presumably, this has some weight for scalar values too.

>     - If property generators are in the result set,
>       order them (shortest and lexicographically least first),
>       and append them to the candidates

By result set, I presume you mean the list of terms that have so far passed through the filter. This would be the first time we've added anything to the candidates.

>     - If terms which aren't property generators are in the result set,
>       append the shortest and lexicographically least to the candidates
>       and return all candidates (if no term was found, continue; we
>       need to add all potential property generators until a term is
>       found)

All potential property generators, or all potential terms, including property generator terms?

>     - Retry, but this time look for terms without type/language
> - Relax container criteria and try again
> 
> - If no candidates have been found, try to create a compact IRI or use
> @vocab just as for case a)
> 
> Relaxing the container criteria means to use the following container
> fallback rules:
> 
> @list -> no container
> @language -> @set
> @annotation -> @set
> @set -> no container
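
Putting the whole of case b) together, I read it as roughly the following
(Python again; the term-definition shape and helper names are mine, and
compact_iri is the case a) sketch above):

    CONTAINER_FALLBACK = {
        "@list": None,          # no container
        "@language": "@set",
        "@annotation": "@set",
        "@set": None,
    }

    def is_property_generator(defn):
        # In the drafts, a property generator maps one term to several IRIs.
        return isinstance(defn.get("@id"), list)

    def maps_to(defn, iri):
        ids = defn.get("@id")
        return iri in ids if isinstance(ids, list) else ids == iri

    def type_or_language(defn):
        if "@type" in defn:
            return ("@type", defn["@type"])
        if "@language" in defn:
            return ("@language", defn["@language"])
        return None

    def compact_iri_with_value(iri, container, typ_lang, active_context):
        candidates = []
        passes = (typ_lang, None) if typ_lang is not None else (None,)
        while True:
            terms = {t: d for t, d in active_context.items()
                     if maps_to(d, iri) and d.get("@container") == container}
            # First look for exact type/language matches, then for
            # terms that carry no type/language at all.
            for want in passes:
                matched = [t for t, d in terms.items()
                           if type_or_language(d) == want]
                gens = sorted((t for t in matched
                               if is_property_generator(terms[t])),
                              key=lambda t: (len(t), t))
                candidates.extend(g for g in gens if g not in candidates)
                plain = [t for t in matched
                         if not is_property_generator(terms[t])]
                if plain:
                    candidates.append(min(plain, key=lambda t: (len(t), t)))
                    return candidates        # a plain term ends the search
            if container is None:            # nothing left to relax
                break
            container = CONTAINER_FALLBACK.get(container)
        # No plain term matched: append the fall-back (compact IRI /
        # @vocab / plain IRI, as in case a) in case none of the collected
        # property generators matches either.
        candidates.append(compact_iri(iri, active_context))
        return candidates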
> 
> ------
> 
> 
> Is this easier to understand than the two current algorithms we have?
> Does someone have another idea how to describe those algorithms in simple
> "prose"?
> Are there other algorithms that are currently difficult to understand?

I think this is a promising direction. Of course, we've made the problem particularly difficult by anticipating that there may be more than one term that matches a property IRI. If this weren't the case, the whole algorithm would be trivial. I wonder how important it is that we have such complex procedures to deal with multiple matches deterministically, when we could instead just prohibit this behavior, or create simpler rules to deal with duplicates. For example, if we simply took the first term based on length and lexical order and used that, it would be simple and consistent. I don't know if these multiple-term matches are real use cases, or just ones we've come up with ourselves.
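
To be concrete about that last suggestion, the deterministic pick would be
nothing more than (Python, with a made-up list of tied terms):

    terms = ["homepage", "page", "url"]           # hypothetical ties
    best = min(terms, key=lambda t: (len(t), t))  # -> "url"

which any implementation can reproduce without a ranking algorithm.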

Of the other algorithms, context expansion is difficult, and as a result so is IRI expansion, due to their potentially recursive nature. We could simplify this by being less prescriptive about how terms resolve to absolute IRIs, and by advising against circular definitions.

At least for the versions of Expand and Compact I've worked from, I think they're relatively easy to get through now.

Gregg

> Thanks,
> Markus
> 
> 
> 
> --
> Markus Lanthaler
> @markuslanthaler
> 
> 

Received on Thursday, 17 January 2013 20:17:17 UTC