Review of alternate2.html

Hi folks,

As promised, I reviewed the changes in alternate2.html. I was happy to see
that overall there are only very subtle differences from the algorithms in
index.html. Consequently, some of the comments in this review are really
nitpicks, but I think we are at a stage where we should pay attention to
those little details. It's quite a long review with a lot of details, so
let me try to briefly summarize the main points.

The expansion part of the algorithms is in very good shape and I think the
algorithms are quite understandable even to someone not that familiar with
JSON-LD. They are basically the same; the only real difference is that
keyword aliases are treated specially in alternate2.html whereas in
index.html they are treated just like every other term mapping. Some steps
were moved from one algorithm to another, but the result is the same. I'm
not convinced that splitting the Context Processing algorithm into two
(Context Processing and the Create Term Definition Subalgorithm) is really
such a good idea - especially considering that the Create Term Definition
Subalgorithm alone is just as long as the complete Context Processing
algorithm in index.html. The cost of having two algorithms is that, when
reading, you have to jump between the two (probably first reading the intro
of the second and, in the meantime, forgetting where you were before).

The compaction algorithms look very similar at first sight, but the devil
lies in the details. I've found a number of bugs for which I added tests to
the test suite, and I saw that Dave already fixed at least one of them in
his implementation. Most problems have to do with property generators. They
are solvable, but I don't know how much sense it makes to do so instead of
just sticking to the algorithms in index.html and improving their
description. They are not so different after all. Alternatively, we could
also try to get rid of the Inverse Context altogether, as I proposed on the
mailing list some time ago, see:
http://lists.w3.org/Archives/Public/public-linked-json/2013Jan/0047.html

What follows is a detailed review of the changes in alternate2.html. I hope
you have a chance to at least skim over it so that we can discuss how to
proceed in tomorrow's telecon. Alternatively, I would suggest we discuss
this next week and try to resolve the remaining open issues tomorrow.

Please note that the algorithm reviews are in a slightly different order
than in the spec. The reason is that I reviewed them in the order they are
used, i.e., in the order of their dependencies.


2. Features
===========================

Considering the discussion on the mailing list, I would propose to keep
using FOAF in these examples.

2.1 Expansion
---------------------------
1st paragraph: Emphasize removal of context, not switching between contexts

Under Example 2: "The difference is in their context information" - that's
wrong, the contexts are the same.
I would thus propose replacing the paragraph with the intro from the
corresponding paragraph in index.html.

2.3 Flattening
---------------------------
I would like to mention that all blank nodes will be labeled. This is an
integral part of flattening. It makes the examples slightly more difficult
but certainly helps to understand the algorithms later on.
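A minimal, made-up example of what I mean, in expanded form (the IRIs are
invented for illustration): after flattening, every node object gets a
label, and embedded nodes are replaced by references to their new labels.

    input_doc = {
        "http://example.com/knows": [
            {"http://example.com/name": [{"@value": "Dave"}]}
        ]
    }
    flattened = [
        {"@id": "_:b0",
         "http://example.com/knows": [{"@id": "_:b1"}]},
        {"@id": "_:b1",
         "http://example.com/name": [{"@value": "Dave"}]}
    ]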


5. Algorithms
===========================

5.2 Remote Context Resolution
---------------------------
I really like the idea in general but would change it slightly. Instead of
pre-processing the whole document, I would invoke the algorithm as the
first step whenever a local context is being processed; it would thus
become a sub-algorithm of context processing. It should also be described
more formally and include error handling, mention that all other data in
the remote context document is ignored, etc. Slightly adapting the steps
under Context Processing 2.2 in index.html will do the job.
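Something along these lines (just a rough sketch in Python; all helper and
error names are made up):

    import json
    import urllib.request

    class JsonLdError(Exception):
        pass

    def process_context(active_context, local_context, remote_contexts=None):
        # resolving remote contexts becomes the first step of context
        # processing instead of a document-wide pre-processing pass
        remote_contexts = remote_contexts if remote_contexts is not None else []
        local_context = resolve_remote_contexts(local_context, remote_contexts)
        # ... continue with the normal context processing steps ...
        return active_context

    def resolve_remote_contexts(local_context, remote_contexts):
        contexts = local_context if isinstance(local_context, list) else [local_context]
        resolved = []
        for context in contexts:
            if not isinstance(context, str):
                resolved.append(context)
                continue
            if context in remote_contexts:  # error handling: recursive inclusion
                raise JsonLdError("recursive context inclusion: " + context)
            remote_contexts.append(context)
            with urllib.request.urlopen(context) as response:
                document = json.load(response)
            if not isinstance(document, dict) or "@context" not in document:
                raise JsonLdError("invalid remote context: " + context)
            # all other data in the remote context document is ignored
            resolved.append(document["@context"])
        return resolved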

5.3 Context Processing
---------------------------
Steps 3.4.4 and 3.5.4 are unclear. Instead of removing the keys, they are
set to true. This is confusing until you read the Create Term Definition
Subalgorithm.

5.4 Create Term Definition Subalgorithm
---------------------------
I'm not sure that it's a good idea to separate these steps from Context
Processing. I also find the introduction quite confusing. Is it talking
about the term (the left-hand side) or its IRI mapping? What if both sides
contain compact IRIs? Step 3 seems to suggest that it is the left-hand side,
but it is not clear how to proceed after that step.
Step 5 suggests that keyword aliases are managed separately. Why, then, is
just the alias removed and not a previous IRI mapping as well?
Step 8.1: Remove @preserve.
Step 8.2: vocabRelative must be set to true; the result might be a keyword,
but it won't be handled properly as a keyword alias as in 8.1.
Step 8.4: is empty.
Step 11: really? Including the type mapping etc.? What's the point?
Step 12.2.2.2: Property generators should just expand to IRIs, not @type;
this should also be checked.
Step 12.3: Why a special case for @type? The vocabRelative flag is missing
again.
Step 14: Where does the prefix mapping come from? The active context or the
local context?

The order of the algorithms is a bit confusing IMHO. It clearly makes sense
to put context processing first, but then the IRI expansion algorithm should
directly follow it. Alternatively, the order of index.html could be used,
i.e., the order of their dependencies: expansion -> context processing ->
IRI expansion.

5.6 IRI Expansion
---------------------------
The defined map may be more efficient, but I believe it's easier to
understand how cycles are detected if you pass an array of the terms
currently being processed, e.g., [ A, B, C ]; if C then requires B, you can
detect the recursion simply because B is already in the array. Don't know,
maybe it's also just a personal preference. I don't have a strong opinion
about this but wanted to mention it.
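A sketch of what I mean, reduced to the cycle detection aspect (all names
are made up, and the IRI mapping itself is grossly simplified):

    class JsonLdError(Exception):
        pass

    def create_term_definition(local_context, active_context, term, path):
        # path holds the terms currently being processed, e.g. ["A", "B", "C"];
        # if C then requires B, B is already in the path and we found a cycle
        if term in path:
            raise JsonLdError("cyclic IRI mapping: " + " -> ".join(path + [term]))
        path = path + [term]
        value = local_context[term]
        prefix = value.split(":", 1)[0] if isinstance(value, str) else None
        # a prefix that is defined in the local context but not yet
        # processed must be processed first
        if prefix in local_context and prefix not in active_context:
            create_term_definition(local_context, active_context, prefix, path)
        active_context[term] = value  # grossly simplified IRI mapping

    # create_term_definition({"A": "B:x", "B": "A:y"}, {}, "A", [])
    # raises: cyclic IRI mapping: A -> B -> A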

Problem description: The algorithm doesn't check whether the value is
already an absolute IRI; it tries to expand it nevertheless. The given
value might be an absolute IRI, but at this point we don't know whether it
is.

General solution: I would shorten it, maybe just keeping the first
paragraph. The rest is just as complicated to read as the algorithm itself.

Algorithm
Step 1: Keywords should also be returned as-is.
Steps 2 & 3: This should be made a bit clearer; it's quite difficult to
understand what's going on. The point is that terms/prefixes in the local
context that haven't been processed yet are processed first.
Step 4: No need to say "explicitly ignore value", this is confusing. Just
return null.
Step 6 onwards: isAbsoluteIri is confusing; the value could be a blank node
identifier as well.
Step 7: Why not just return the mapping (given vocabRelative is true)?
The whole branching around isAbsoluteIri is really confusing.
Maybe it's simpler to use the algorithm of index.html and invoke the Create
Term Definition algorithm there instead of recursing into itself (even
though I find the recursive version simpler since the algorithm is
self-contained).

5.5 Expansion
---------------------------
Step 3.1: The @context member is not removed.
Step 3.2: This wouldn't be needed if @graph were handled accordingly under
3.4.3 (see index.html 4.3.2.8); that would also simplify 3.4.6, 3.9, and
4.1.
Step 3.4.2: ... or a blank node identifier.
Steps 3.4.3.x: IMHO it would be much simpler to expand the values directly
here (just as in steps 4.3.2.x of index.html) instead of having to expand
the property again in value expansion to find out that it is @id and then
expand using the IRI expansion algorithm. All keywords would then be
completely processed after this step.
Step 3.4.4.2: "and value language value in value" is confusing.
Step 3.4.6: might need some clarification; it might be easier to put it
under 3.4.3.
Step 3.4.11: this step, too, would be simpler if everything keyword-related
were done under 3.4.3.
Step 3.5.3: already verified in 3.4.3.5, but @type must be a string and not
an array.
Step 3.9: could be simplified if 3.2 is changed (see index.html 4.6).

The insideList flag can be eliminated by looking instead at the active
property's container mapping or checking whether the active property
equals @list.

5.7 Value Expansion
---------------------------
As already said above, I think it would be simpler to handle @id and @type
expansion directly in the expansion algorithm. Fewer jumps, fewer branches,
shorter algorithm(s) - not a deal breaker though.

Step 5: not needed, it's done again in step 9.
Step 10: the default language is missing.
Step 12: missing; return result.

5.8 Label Blank Nodes Subalgorithm
---------------------------
The isId flag isn't needed; @type is relabeled already in IRI expansion.
When will step 4 ever be reached (and do something other than reassigning
the same bnodeId the value already has)?

5.9 Generate Blank Node Identifier
---------------------------
Proposal: Get rid of the configurable prefix, we never change it anyway.
Just hardcode _:t or, even better because it's easier to remember, _:b.

Briefly describe what the identifier map is used for, i.e., keeping track of
blank node reassignments.
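With those changes the whole thing collapses to something like this (a
sketch; the class and method names are made up):

    class BlankNodeIssuer:
        """Issues blank node identifiers with the hardcoded prefix _:b.
        The identifier map keeps track of reassignments: the same old
        identifier is always mapped to the same new identifier."""

        def __init__(self):
            self.counter = 0
            self.identifier_map = {}

        def issue(self, old_identifier=None):
            if old_identifier in self.identifier_map:
                return self.identifier_map[old_identifier]
            identifier = "_:b" + str(self.counter)
            self.counter += 1
            if old_identifier is not None:
                self.identifier_map[old_identifier] = identifier
            return identifier

    # issuer = BlankNodeIssuer()
    # issuer.issue("_:x")  ->  "_:b0"
    # issuer.issue()       ->  "_:b1"
    # issuer.issue("_:x")  ->  "_:b0" again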

5.10 Compaction Algorithm
---------------------------
I didn't mention this for the other algorithms, but I think it would be
simpler to reorder the branches so that the longest comes last - in this
case, first process scalars, then arrays, and finally objects (instead of
array, object, scalar, which results in a lonely one-liner at the end).
Another advantage is that the steps of the object-processing branch would
move up a level: the if-block is no longer needed because there is no other
case left once the two previous branches have returned (see steps 2-3 in
index.html).
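Roughly like this (a sketch with made-up names, just to show the branch
order; compact_arrays stands for the compactArrays option):

    def compact(active_context, active_property, element, compact_arrays=True):
        # 1. scalars: the former lonely one-liner comes first now
        if not isinstance(element, (dict, list)):
            return element
        # 2. arrays
        if isinstance(element, list):
            result = [compact(active_context, active_property, item)
                      for item in element]
            if compact_arrays and len(result) == 1:
                return result[0]
            return result
        # 3. element is a JSON object - no if-block needed anymore, the
        # two previous branches have already returned
        result = {}
        # ... the actual object-processing steps, one nesting level higher ...
        return result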

The algorithm needs to be passed the inverse context; otherwise it can't
pass it to the subalgorithms (or they have to re-create it constantly from
the active context).

Step 1.3: The compactArrays option is not used.
Step 2.2: I don't think we need to explain how a shallow copy is created,
and we certainly don't need to distinguish between array values and
non-array values. The point is that each value is copied but the references
contained in that copy remain the same.
Step 2.4: Say that the keys come from shallow and mention that this has to
be done because keys can be removed from shallow in the following loop.
Step 2.5.3.1: The IRI compaction algorithm has no documentRelative flag; it
is also not needed, as vocabRelative can be used to determine whether the
value is being used in @type (true) or @id (false). All IRIs are
document-relative.
Step 2.5.3.2.2: Drop documentRelative.
Step 2.5.5.1: Is activeProperty the same as "active property" or not? It's
confusing to have two variables with the same name.
Step 2.5.5.1 (and in general): Passing parent to the IRI compaction
algorithm, which then passes it to term selection, just to eliminate *some*
of the potential property generators (because not all IRIs are in shallow)
doesn't make sense to me; it doesn't optimize much but makes the algorithms
much harder to understand. You can easily do that check in the Find and
Remove Property Generator Duplicates Subalgorithm, and I think that's the
place where it belongs.
Step 2.5.5 and substeps: There's a bug that's a bit difficult to explain;
please see compact-0046, which I just added.
Step 2.5.6.2: Active property could be an array of potential property
generators - how do you choose the container then? Actually, I went on and
saw that property generators are not handled correctly at all. I added
another test, compact-0047, which shows the problem. Given this bug, the
rest of the algorithm works under wrong assumptions, so it doesn't make
much sense to review the remaining steps.

5.12 Inverse Context Creation Subalgorithm
---------------------------
Good idea to sort the terms before processing them! Overall, though, the
algorithm is four times as long as the one in index.html. While I think we
should clarify the one in index.html, I don't think it's necessary to blow
it up that much.

Overall: I find it confusing to "initialize [variable] to the value
associated with the [key] key in [map]". What you actually do here is store
a reference to that member; you don't assign its value to [variable].
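For instance, in a typical implementation:

    # "initialize typeOrLanguageMap to the value associated with the
    # @language key in containerMap" actually stores a reference:
    container_map = {"@language": {}}
    type_or_language_map = container_map["@language"]  # a reference, not a copy
    type_or_language_map["en"] = "term"
    print(container_map)  # {'@language': {'en': 'term'}} - changed as well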

Step 3.3.3: It would be more efficient to do this outside of the loop.
Step 3.3.4: No need to initialize the defaultLanguage key here; this is
done again in 3.3.6.3.1.
Step 3.3.6: I would find it easier to understand if the checks were "If
there's a type mapping", "Otherwise, if there's a language mapping", and
finally just "Otherwise". This would also eliminate some nesting.

There's a bug: the algorithm doesn't distinguish between a term with no
language (and no type) mapping and a term with a language mapping set to
null. I've added compact-0048.
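A minimal context that shows the difference (the IRIs are made up):

    # term1 has no language mapping at all: strings use the default
    # language. term2 has a language mapping set to null: strings are used
    # without a language tag despite the default language. The inverse
    # context must keep these two cases apart.
    context = {
        "@language": "en",
        "term1": {"@id": "http://example.com/t1"},
        "term2": {"@id": "http://example.com/t2", "@language": None}  # null
    }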

5.11 IRI Compaction Algorithm
---------------------------
As already said under the Compaction algorithm, I would suggest dropping
the "parent" parameter. Instead, I would pass the inverse context
explicitly. Depending on the outcome of ISSUE-204, the base IRI is also
required.

Step 5.4: This is related to the bug described above: numbers, e.g., have
no language mapping, which is different from strings with a language that
equals null.
Step 5.4.4 and substeps: At least @id objects are not handled properly;
I've added compact-0049 for this bug. I won't review the rest of these
sub-steps but continue directly with compact IRIs.
Steps 7.4 and 7.5: Why introduce the variable isUsableCurie here?

I can't see how the algorithm works if there are potential property
generators that turn out not to match because the data doesn't match. I've
added compact-0050 as a minimal test. The point is that the fallback terms
are never returned; as I understand the algorithm, in this case only "gen"
would be returned.

Depending on the outcome of ISSUE-204, compaction to relative IRIs is
missing.


5.13 Term Selection Subalgorithm
---------------------------
Drop the parent parameter and the corresponding "optimization".

Sometimes @null is used, sometimes @none (also in the other algorithms). I
find this confusing, as it is very difficult to remember which is in use
where.

Step 3: Missing space: "underwhich".
Step 4: Typo in typeOrLangageValue. In step 3 typeOrLanguageValue is
modified; in step 4 a new variable "preferred" is then introduced. Why not
reuse typeOrLanguageValue? And why not do those steps directly in IRI
compaction? Same question for step 1.

Overall, I'm not sure the introduction of so many variables with very
similar names is helpful (e.g., typeOrLanguage, typeOrLanguage map,
typeOrLanguageValue map).

Only one property generator is ever selected, but it is not checked whether
the values really match, i.e., whether the property generator is actually
applicable. There's also no way to invoke the algorithm again and tell it
to ignore that specific property generator in the second run. So the
algorithm doesn't really work.


5.14 Value Compaction
---------------------------
Substeps of 1: The whole preserveIndex business is quite complicated. Why
not just create a copy of value and then modify that copy, as the algorithm
in index.html does?
Step 2: What if there's also @index? Why expand the value? Why IRI-expand
an object? It's already expanded; we are compacting here. I think it should
say "expand active property". Typo: "passing passing".

PROPOSAL: Keep the algorithm in index.html, which is half as long. Modify
it to work on a copy of value if necessary, or construct a copy
step-by-step (this should require just minimal changes).
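To illustrate the copy-based approach (a rough, incomplete sketch with
made-up names; it ignores @id objects and the default language):

    def compact_value(active_context, active_property, value):
        mapping = active_context.get(active_property, {})
        result = dict(value)  # work on a shallow copy, value stays untouched
        # @index is preserved by an @index container, so it can be dropped
        if mapping.get("@container") == "@index":
            result.pop("@index", None)
        # compact to a bare scalar if the remaining members are implied by
        # the term's type/language mapping
        if set(result) == {"@value"}:
            return result["@value"]
        if set(result) == {"@value", "@type"} and \
                result["@type"] == mapping.get("@type"):
            return result["@value"]
        if set(result) == {"@value", "@language"} and \
                result["@language"] == mapping.get("@language"):
            return result["@value"]
        return result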


5.15 Find and Remove Property Generator Duplicates Subalgorithm
---------------------------
In contrast to the algorithm in index.html, this algorithm accepts just a
single property generator, so the two do something different.

The algorithm in index.html does the following:
  - Check for *every potential* property generator whether all of its IRIs
are included as properties in element
  - If so, check whether they all contain the passed value
  - If so, remove those duplicate values (and potentially the whole property
if no other value is left)
  - Return the first matching property generator for which all duplicates
have been found
  - If no matching property generator was found, return the fallback term,
compact IRI, or IRI

The algorithm in alternate2.html does the following:
  - It checks, for a *single* property generator's IRIs, whether element
contains the duplicates
  - If a duplicate is found, it is deleted straight away (without first
verifying that all duplicates are really there) - the algorithm doesn't
seem to work on a copy of element, so data is potentially lost
  - It doesn't return anything

Since the algorithm is only invoked in the compaction algorithm, the
remaining work must be done there, but it isn't. Here's what's done there:

  If there is a term definition for activeProperty in active
  context that is a property generator, then invoke the Find
  and Remove Property Generator Duplicates algorithm, passing
  active context, shallow for element, expanded property,
  expanded item for value, and activeProperty.

So I cannot see how this could possibly work. I would thus propose keeping
the algorithm of index.html but including steps 2.3.1.X from alternate2.html
to describe in detail what is a duplicate and what is not.
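For reference, a sketch of the index.html behaviour described above (the
names are made up, and plain equality stands in for the more detailed
duplicate checks of steps 2.3.1.X):

    def find_and_remove_duplicates(element, property_generators, value, fallback):
        for candidate in property_generators:  # *every potential* generator
            iris = candidate["iris"]           # the generator's IRI mappings
            # all IRIs must be properties of element, all containing value
            if not all(iri in element and value in element[iri] for iri in iris):
                continue
            for iri in iris:                   # remove the duplicate values
                element[iri].remove(value)
                if not element[iri]:           # drop emptied properties
                    del element[iri]
            return candidate["term"]           # first matching generator
        return fallback                        # term, compact IRI, or IRI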

The remaining algorithms haven't been changed.



--
Markus Lanthaler
@markuslanthaler
