Re: Review of alternate2.html from Dave Longley on 2013-02-11 (public-linked-json@w3.org from February 2013)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Mon, 11 Feb 2013 16:49:47 -0500
To: public-linked-json@w3.org
Message-ID: <5119677B.4080606@digitalbazaar.com>
Thanks for the detailed review, Markus. I've got some feedback on your 
comments below:

On 02/11/2013 07:53 AM, Markus Lanthaler wrote:
> Hi folks,
>
> I'm not convinced that splitting the Context Processing algorithm into two
> (Context Processing and Create Term Definition Subalgorithm) is really such
> a good idea - especially considering that the Create Term Definition
> Subalgorithm is just as long as the complete Context Processing algorithm in
> index.html. The cost of having two algorithms is that, when you are reading
> it, you have to jump between the two (probably first reading the intro of
> the second and forgetting where you've been before in the meantime).

I don't know that we should be too concerned with the length of the 
algorithm if it better explains what's going on. I do sympathize with 
the idea that it is an annoyance to be jumping between two algorithms. 
However, I'm not sure that most people would do that; they may just put 
a placeholder in their head for "now create a term definition" whenever 
they see it, without having to jump down and look at all the details 
each time. Furthermore, it makes it more obvious where you might want to 
break out separate functions when implementing because of common behavior.

> The compaction algorithms look to be very similar at the first sight but the
> devil lies in the details. I've found a number of bugs for which I added
> tests to the test suite and I saw that Dave already fixed at least one of
> them in his implementation.

Hopefully they are all resolved now (I already updated the 
implementations and spec text, it was only a few of lines of text). I 
added a couple more test cases as well; there may be some more I need to 
add related to bugs I found when doing the algorithm rewrite (based off 
of the older algorithms), I'll need to double check. Perhaps we can get 
some more implementors to add in test cases that they have found or that 
seem likely to trip up implementations based on a review of the 
algorithms. There was also at least one new test case added that I'm not 
entirely sure I agree with what the output should be (related to term 
selection w/null language mappings).

> Alternatively, we could
> also try to get rid of the Inverse Context altogether as I proposed on the
> mailing list some time ago, see:
> http://lists.w3.org/Archives/Public/public-linked-json/2013Jan/0047.html

IMO, I don't think it's too difficult to understand with the new text 
and algorithm.

> 2. Features
> ===========================
>
> Considering the discussion on the mailing list, I would propose to keep
> using FOAF in these examples.

I agree that we should add FOAF back into some examples.

> 2.1 Expansion
> ---------------------------
> 1st paragraph: Emphasize removal of context, not switching between contexts

Maybe. I like how the main purpose of removing contexts is explained 
now. JSON-LD is often really, IMO, about being able to express the same 
data using different contexts, so it makes sense to talk about 
everything in that ... context.

> Under Example2: "The difference is in their context information" - that's
> wrong, the contexts are the same.
> I would thus propose to replace the paragraph with the intro from the
> corresponding paragraph in index.html

Are you sure? The two contexts look different to me, but we may be 
talking about something else. The context in Example 1 has the term 
"name", whilst Example 2's context does not.

> 2.3 Flattening
> ---------------------------
> I would like to mention that all blank nodes will be labeled. This is an
> integral part of flattening. It makes the examples slightly more difficult
> but certainly helps to understand the algorithms later on.

This was removed because blank nodes and blank node label is a fairly 
complex issue, IMO. I didn't think it was appropriate to be in the 
introduction. Perhaps it needs more highlighting later on?

> 5. Algorithms
> ===========================
>
> 5.2 Remote Context Resolution
> ---------------------------
> I really like the idea in general but would change it slightly. Instead of
> pre-processing the whole document I would call the algorithm as first step
> whenever a local context is being processed. So it would be a sub-algorithm
> of context processing. It should also be described more formally and include
> error handling, mention that other data in the remote context document is
> ignored, etc. Slightly adapting the steps under Context Processing 2.2 in
> index.html will do the job.

Well, one of the reasons for getting it all out of the way to begin 
with, is to avoid introducing a lot of asynchronous complexity with the 
other algorithms. If someone were implementing a new processor in an 
asynchronous environment, and they tried to follow the algorithms, they 
might find control flow to be a pain. Hopefully they'd notice that if 
they got remote context resolution out of the way first their work would 
be easier, but why not give them a headstart in solving that problem for 
them? I do think both approaches should be mentioned either way.

> 5.3 Context Processing
> ---------------------------
> Steps 3.4.4 and 3.5.4 are unclear. Instead of removing the keys, they are
> set to true. This is confusing till you read the Create Term Definition
> Subalgorithm.

I don't know that it's confusing to mark those keys as having been 
defined (their entries in the defined map are set to true). I will say 
that the reason why those keys aren't removed is because, in this 
rewrite, I tried to take the approach of not modifying inputs as well as 
not having to say "copy the input" everywhere. This means it takes an 
implementor less mental effort to have to tease out where they can 
optimize away copying and where they can't.

> 5.4 Create Term Definition Subalgorithm
> ---------------------------
> I'm not sure that it's a good idea to separate these steps from Context
> Processing. I also find the introduction quite confusing. Is it talking
> about the term (the left hand side) or its IRI mapping?

Is the confusion in the choice of the word "term"? I have found some of 
the terminology to be confusing myself; we create "term definitions" for 
keys that are CURIEs ... are these "terms" or not? The terminology in 
the spec indicates that a term is any "short word" defined in a context 
that may be expanded to an IRI. Perhaps we should be saying "keys" (in 
the local context) before terms are defined? I think this is a general 
problem we have in the specification; I know it was there when I was 
looking over the old spec during the rewrite. I think we should get some 
consensus on our terminology for this moving forward.

> Step 3 seems to suggest that it is the left hand side
> but it is not clear how to proceed after that step.

This may just arise from the general confusion w/"term". All you do is 
invoke the create term definition algorithm recursively to define the 
dependency (as step 3 states), and then you move onto step 4. There's 
nothing special going on there.

> Step 5 suggests that keyword aliases are managed separately. Why is then
> just the alias removed and not a previous IRI mapping?

Because the IRI mapping will be overwritten when you define the term. 
I'm not sure if this needs to be made more clear with a note?

> Step 8.1 Remove @preserve

Yes, this is for supporting the framing algorithm and it should be 
removed at this time.

> Step 8.2 vocabRelative must be set to true; the result might be a keyword,
> but it won't be put handled properly as keyword alias as in 8.1

I thought you are not allowed to use keyword aliases in the context. Has 
this changed?

> Step 8.4 is empty

Fixed.

> Step 11, really? Including type mapping etc. What's the point?

Inheritance from dependencies. You may add/override a container, type, 
or language mapping. For example, if you have a term that is a CURIE 
that defines only a container, language, or type mapping, you want to 
inherit the IRI mapping from the prefix's definition. It sounds like we 
should add several test cases to the test-suite for this.

> Step 12.2.2.2 Property generators should just expand to IRIs, not @type,
> this should also be checked.

This has been fixed ("@type" is now disallowed as a property generator 
IRI mapping).

> Step 12.3 Why special case for @type? vocabRelative flag missing again

The special case has been removed, it was an artifact from a previous 
algorithm. The vocabRelative flag is missing for the same reason as 
above, no keyword aliases are permitted in the context.

> Step 14 Where comes the prefix mapping from? Active ctx, local ctx?

 From the active context (the step text is correct). All dependencies 
are resolved at step 3.

> The order of the algorithms is a bit confusing IMHO. It clearly makes sense
> to put context processing first, but then the IRI expansion algorithm should
> directly follow it. Alternatively the order of index.html could be used,
> i.e., the order of their dependency: expansion -> context processing -> iri
> expansion

I definitely think the order should not be what is in index.html. 
However, if it makes more sense to juggle the order some more in 
alternate2.html I'm all for that. Context processing should definitely 
come first. We should get some consensus on the order afterwards.

> 5.6 IRI Expansion
> ---------------------------
> The defined map may be more efficient but I believe it's easier to
> understand how cycles are detected if you pass an array of the currently
> being processed terms, e.g., [ A, B, C ] if C then requires B you can simply
> detect the recursion if it's already in the algorithm. Don't know, maybe
> it's also just a personal preference. I don't have a strong opinion about
> this but wanted to mention it.

The map was also used because of the way the algorithm was rewritten -- 
it stores a tri-state not defined/defining/defined, not just 
defining/not defined. This is used to avoid infinite recursion and is 
simpler to understand, IMO. I thought showing that there were three 
states was more clear than just having an array. Clearly we differ here; 
I don't really mind which way it is explained in the algorithm so long 
as the consensus is that it is easiest to understand (or maybe it just 
doesn't matter that much).

> Problem desc: The algorithm doesn't care if it's already an absolute IRI, it
> tries nevertheless. The given value might be an absolute IRI, we don't know
> if it is at this point.
>
> General solution: I would shorten it, maybe just keeping the first
> paragraph. The rest is just as complicated to read as the algorithm itself.

I remember wanting to shorten that particular general solution as well.

> Algorithm
> Step 1: Also keywords should be returned as is

This is handled in step 5 to better consolidate, after keyword aliasing 
has been handled.

> Step 2 & 3: This should be made a bit clearer. It's quite difficult to
> understand what's going on. The point is, that terms/prefixes in the local
> ctx that haven't been processed yet, are processed first.

Agree. We could change the language a bit here to clarify.

> Step 4: no need to say "explicitly ignore value", this is confusing. Just
> return null.

I disagree. I wanted to know why null was being returned and I wanted to 
know it right there in the algorithm without having to go find out what 
was going on via another document or statement somewhere. I think it's 
important to have notes within these algorithms about what the special 
meaning of things, like a "null mapping", are.

> Step 6 onwards: isAbsoluteIri is confusing, could be a bNode ID as well.

There are probably some other temporary variable names that could be 
better, in addition to this one.

> Step 7: Why not just return the mapping (given vocabRelative is true)?

It may be a blank node, which needs renaming. This is consolidated into 
a single step below.

> The whole branching around isAbsoluteIri is really confusing.
> Maybe it's simpler to use the algorithm index.html and invoke the Create
> Term Def. algorithm there instead of recursing into itself (even if I find
> that simpler since the algorithm is self-contained).

I definitely prefer keeping the algorithm self-contained. Maybe we can 
clean up the branching in another way. I would prefer that we don't 
repeat ourselves too, if we can avoid it. I'm ok with the general idea: 
get the value associated with the active context, if it's a keyword or 
null return, if you got something else don't look for a CURIE, otherwise 
you got nothing, so you have to check for a CURIE. Once you've dealt 
with the possible CURIE, do blank node renaming or apply vocab or base. 
I think that flow is easy to follow and it's what we should go with, we 
might just need to clean up the algorithm text a bit.

> 5.5 Expansion
> ---------------------------
> Step 3.1: The @context member is not removed

Yeah, I see that now. Perhaps we want to instead skip that key further 
down (rather than removing it, to keep with the notion of not having to 
copy inputs).

> Step 3.2: This wouldn't be needed if @graph would be handled accordingly
> under 3.4.3 (see index.html 4.3.2.8) and would simplify 3.4.6, 3.9, and 4.1

We should look into this. From what I recall, the algorithm in 
index.html was buggy. One of the test cases I added might be related to 
this as well (it didn't work w/your playground). There may be another 
that is required to further expose the problem.

> Step 3.4.2: ... or a bnode ID

Yes, just another case related to bnode IDs not being IRIs. It would be 
nice to have a catch-all identifier name that includes bnode IDs. Either 
that or we can use "if the expanded property is a string that contains a 
colon", but I find that somewhat ugly and doesn't do as good of a job 
explaining what's going on.

> Steps 3.4.3.x: IMHO it would be much simpler to expand the values directly
> here (just as in steps 4.3.2.x in index.html) instead of having to expand
> the property again in value expansion to find out that it is @id and then
> expand using the IRI expansion algorithm. All keywords would then be
> completely processed after this step.

The split was made to separate validation from the general processing 
and to avoid repeating the same steps for array processing of @type, 
etc. In this way @type isn't handled differently from @list or @graph, 
both of which recurse. These array-handling steps are not present in the 
index.html version (so it doesn't work if you implement it according to 
the text because IRI expansion doesn't handle arrays). You can either 
recurse to handle arrays or you can repeat yourself; I tried to keep 
things consistent with recursing for @type ... which also meant it 
seemed cleaner to keep validation consistent and separate from general 
processing. I think this is all a matter of style. Of course, if we can 
simplify it all without repeating ourselves somewhere I'm all for that.

It may be that it's just always a tradeoff with where we repeat 
ourselves. In that case, I'd prefer to just recurse through the 
expansion algorithm to let it handle arrays vs. other values and let all 
values run through "value expansion". The difference in approach may 
again arise from the somewhat amorphous nature of our terminology (eg: 
"value").

> Step 3.4.4.2: "and value language value in value" is confusing

Agree. We should comb over the spec and read it without special markup 
like italics, etc. -- for similar issues.

> Step 3.4.6: might need some clarification. It might be easier to put it
> under 3.4.3

Agree, needs clarification.

> Step 3.4.11: also this step would be simpler if everything keyword-related
> would be done under 3.4.3
> Step 3.5.3: already verified in 3.4.3.5, but @type must be a string and not
> an array
> Step 3.9: could be simplified if 3.2 is changed (see index.html 4.6)

Again, these simpifications come with tradeoffs. As usual, we should do 
whatever the consensus is.

> insideList flag can be eliminated by checking looking instead into active
> property's container mapping or active property=@list

We need to check this. That may not be true -- unless a side-effect 
somewhere else handles a particular case, perhaps through recursion. I 
think we need more test cases in the test-suite for this.

> 5.7 Value Expansion
> ---------------------------
> As already said above, I think it would be simpler to handle @id and @type
> expansion directly in the expansion algo. Fewer jumps, fewer branches,
> shorter algorithm(s) - not a deal breaker though.
>
> Step 5: not needed, it's done again in step 9.
> Step 10: default language missing
> Step 12: missing, return result

The result is returned in step 11, but we can break that out into a 
separate step if necessary. The other two steps have been fixed.

> 5.8 Label Blank Nodes Subalgorithm
> ---------------------------
> The isId flag isn't needed, @type is relabeled already in IRI expansion
> When will step 4 ever be reached (and do something, not reassigning the same
> bnodeId it already is)?

This has been removed. It used to be used when blank node relabeling 
could be "triggered" to be turned on when encountering a property 
generator instead of having to always be on.

> 5.9 Generate Blank Node Identifier
> ---------------------------
> Proposal: Get rid of the configurable prefix, we never change it anyway.
> Just hardcode _:t or, even better because it's easier to remember, _:b.
>
> Briefly describe what the identifier map is used for, i.e., keeping track of
> blank node reassignments.

I agree with this. The reason why the prefix was configurable is because 
this algorithm is reused by RDF graph normalization (no longer part of 
this document).

> 5.10 Compaction Algorithm
> ---------------------------
> I didn't mention this for the other algorithms, but I think it would be
> simpler to reorder the branches so that the longest comes last. In this case
> e.g., first process scalars, then arrays, and finally objects (instead of
> array, object, scalar which results in a lonely one-liner at the end).

I think it's easier to think of "scalar" as "not the other things". I'm 
not sure if other developers will tend to agree with me or not. I have a 
vague feeling (no hard data), based on my experience with reading other 
people's code, that they prefer to check for arrays and objects first 
... and then fall back to primitives. This is cleaner to do in certain 
languages. However, I do prefer that shortest steps come first. I could 
go either way here.

> The algorithm needs to be passed the inverse context otherwise it can't pass
> it to the subalgorithms (or they have to re-create it constantly from the
> active context)

I may have been intentionally vague on how you get to the inverse 
context. Perhaps that should be changed.

> Step 1.3: The compactArrays option is not used

This needs to be added.

> Step 2.2: I don't think we need to explain a shallow copy is created and
> certainly we don't need to distinguish between array values and non-array
> values, the point is that each value is copied but the references contained
> in that copy remain the same references.

The reason for the distinction was to highlight that the shallow copy is 
actually two-levels deep. If we can clean this up without losing what's 
going on, that would be great.

> Step 2.4: Say keys from shallow and mention that this has to be done because
> keys can be removed from shallow in the following loop

I think it says from "element", not from "shallow".

> Step 2.5.3.1: The IRI compaction algorithm has no documentRelative flag; it
> is also not needed as vocabRelative can be used to determine if the value is
> being used in @type (true) or @id (false). All IRIs are document-relative.

This has been corrected.

> Step 2.5.5.1: Is activeProperty the same as "active property" or not? It's
> confusing to have two variables with the same name

Not the same. This has been fixed.

> Step 2.5.5.1 (and in general): Passing parent to the IRI compaction algo
> which then passes it to term selection just to eliminate *some* of the
> potential property generators because not all IRIs are in shallow doesn't
> make sense to me since it doesn't optimize much but makes the algorithms
> much harder to understand. You can easily do that check in the Find and
> Remove Property Generator Duplicates Subalgorithm and I think that's the
> place where it belongs.

I think it makes more sense to consolidate how term selection happens. 
It's easier, IMO, to understand that you pass in some parameters to a 
term selection algorithm -- which then gives you back the term you are 
supposed to use. I think that's better than it maybe returning you a 
list of terms to try out yourself.

> Step 2.5.5 and substeps: There's a bug that's a bit difficult to explain,
> please see compact-0046 which I just added.

This is fixed.

> Step 2.5.6.2: Active property could be an array of potential property
> generators, how do you choose the container then? Actually I went on and saw
> that property generators are not handled correctly at all. I added another
> test, compact-0047 which shows the problem. Given this bug, the rest of the
> algorithm works under the wrong assumptions. So it doesn't make much sense
> to review the remaining steps.

This has been fixed.

> 5.12 Inverse Context Creation Subalgorithm
> ---------------------------
> Good idea to sort the terms before processing them! Overall, the algorithm
> is 4 times as long as the one in index.html. While I think we should clarify
> the one in index.html I don't think it's necessary to blow it up that much.

I'm not sure if the algorithm is actually longer or if the other one did 
not really do enough of an explanation for what was going on. Also, the 
newer approach seems to simplify understanding the inverse context overall.

> Overall: I find it confusing to "initialize [variable] to the value
> associated with the [key] key in [map]. What you do here is to store a
> reference to that member, you don't assign its value to [variable].

I'm not sure what language we want to use to clarify what's going on 
here. We could easily be more verbose to hide what is really an 
implementation detail, but it is also just used to help people 
understand the steps by breaking them down a bit. I'm not sure what we 
want to do with these situations in general.

> Step 3.3.3: It would be more efficient to do this outside of the loop
> Step 3.3.4: No need to initialize the defaultLanguage key here, this is done
> again in 3.3.6.3.1
> Step 3.3.6: I would find it easier to understand if the checks would be "If
> there's a type mapping.", "Otherwise, if there's a language mapping", and
> finally just "Otherwise". This would also eliminate some nestings.

These sound like good suggestions.

> There's a bug since the algorithm doesn't distinguish between a term with no
> language (and no type) mapping and a term with a language mapping set to
> null. I've added compact-0048.

I think it does make that distinction, but I think it makes in an 
opposite direction from the test case. I'm not sure that test-case has 
the desired output. It seems like we'd prefer to select the null 
language term over the no-language term because it is more specific.

> 5.11 IRI Compaction Algorithm
> ---------------------------
> As already said in the Compaction algorithm, I would suggest dropping the
> "parent" parameter. Instead I would pass the inverse context explicitly.
> Depending of the outcome of ISSUE-204, also the base IRI is required.
>
> Step 5.4: This is related to the bug described above, numbers, e.g., have no
> language mapping. This is different than strings with a language that equals
> null.

I'm not sure I agree with this approach. I'll defer to the group, 
though. It's a very easy change if we decide to go the route you're 
proposing.

> Step 5.4.4 and substeps: At least @id objects are not handled properly; I've
> added compact-0049 for this bug. I won't review the rest of these sub-steps
> but continue directly with compact IRIs.
> Step 7.4 and 7.5: Why introduce the variable isUsableCurie here?

The bug has been fixed. That variable is meant to help make clear when a 
CURIE can be used and when it can't. We could do the same in prose and 
remove the variable.

> I can't see how the algorithm works if there are potential property
> generators that turn out to not match because the data doesn't match. I've
> added compact-0050 as a minimal test. The point is that the fallback terms
> are never returned. As I understand the algorithm, in this case only "gen"
> would be returned.

This has been fixed.

> Depending on the outcome of ISSUE-204, compaction to relative IRIs is
> missing.
>
>
> 5.13 Term Selection Subalgorithm
> ---------------------------
> Drop parent parameter and the corresponding "optimization"
>
> Sometimes @null is used, sometimes @none (also in the other algorithms). I
> find this confusing as it is very difficult to remember what's in use where.

I thought it was very important to distinguish between the two. Saying 
something is null in JSON-LD explicitly means you don't want to use it; 
whereas "@none" is used when you haven't said anything at all.

> Step 3: Missing space "underwhich"
> Step 4: Typo in typeOrLangageValue. In step 3 typeOrLanguageValue is
> modified, in step 4 then a new variable "preferred" is introduced. Why not
> reuse typeOrLanguageValue? Why not do those steps directly in IRI
> compaction? Same question for step 1

We could modify typeOrLanguageValue to be an array when the algorithm 
begins if people think that's clearer than putting it into an array of 
preferred values to iterate over. Perhaps renaming the variables, in 
general (you also suggest this), would help resolve this issue.

> 5.14 Value Compaction
> ---------------------------
> Substeps of 1: The whole preserveIndex stuff is quite complicated. Why not
> just create a copy of value and then modify that copy as the algorithm in
> index.html does?

I didn't think it was that complicated. I'll have to compare the two 
algorithms again.

> Step 2: What if there's also @index? Why expand the value? Why IRI expand an
> object? It's already expanded, we are compacting here. I think it should
> mean expand active property. Typo "passing passing"
>
> PROPOSAL: Keep the algorithm in index.html which is half as long. Modify it
> to work on a copy of value if necessary or construct a copy step-by-step
> (this should require just minimal changes).

I think the case where @index is in a subject reference may not be 
handled. Sounds like we need a test-case for that. Also, the algorithm 
in index.html doesn't seem to handle compacting the IRI in that case. 
The length of the algorithm may have more to do with 
understandability/clarify than anything, but I'll have to go back and 
look. Obviously, if the extra verbosity adds nothing we should shorten it.

> 5.15 Find and Remove Property Generator Duplicates Subalgorithm

This algorithm has been patched up a bit since you wrote your review. 
Hopefully it functions appropriately now. It is also reused (used in two 
places) now.

> ---------------------------
> In contrast to the algorithm in index.html, the algorithm accepts just a
> single property generator. So they do something different.
>
> The algorithm in index.html does the following:
>    - Check for *every potential* property generator if all of its IRIs are
> included as properties in element
>    - If so, check if they all contain the passed value
>    - If so remove those duplicate values (and potentially the whole property
> if no other value is left)
>    - Return the first matching property generator for which all duplicates
> have been found
>    - If no matching property generator was found, return the fallback term,
> compact IRI or IRI.
>
> The algorithm in alternate2.html does the following:
>    - It checks for a *single's* property generator IRIs if element contains
> the duplicates
>    - If a duplicate is found, it is deleted straight ahead (not checking
> first verifying whether all duplicates are really there) - this algorithm
> doesn't seem to work on a copy of element so data is potentially lost
>    - It doesn't return anything
>
> Since the algorithm is just invoked in the compaction algorithm the
> remaining work must be done there but it isn't. Here's what done there:
>
>    If there is a term definition for activeProperty in active
>    context that is a property generator, then invoke the Find
>    and Remove Property Generator Duplicates algorithm, passing
>    active context, shallow for element, expanded property,
>    expanded item for value, and activeProperty.
>
> So I cannot see how this would possibly work. I would thus propose to keep
> the algorithm of index.html but include steps 2.3.1.X from alternate2.html
> to describe in detail what's a duplicate and what's not.
>
> The remaining algorithms haven't been changed.
>
>
>
> --
> Markus Lanthaler
> @markuslanthaler
>
>


-- 
Dave Longley
CTO
Digital Bazaar, Inc.
http://digitalbazaar.com
Received on Monday, 11 February 2013 21:49:35 UTC