Comments on XPointer CR section 5 from Michael Dyck on 2002-03-04 (www-xml-linking-comments@w3.org from January to March 2002)

From: Michael Dyck <michaeldyck@shaw.ca>
Date: Mon, 04 Mar 2002 02:27:58 -0800
To: www-xml-linking-comments@w3.org
Message-id: <3C834C2E.44433A11@shaw.ca>
Comments on section 5 of
XML Pointer Language (XPointer) Version 1.0
W3C Candidate Recommendation 11 September 2001

------------------------------------------------------------------------
5 XPointer Extensions to XPath

bullet 2
    Delete "and corresponding".
    Insert comma after "location types".

bullet 7 ("Allowance...")
    Rather than putting this bullet in the midst of bullets about added
    functions, maybe put it before or after them (in 4th or 8th spot).

------------------------------------------------------------------------
5.2 Evaluation Context Initialization

para 1
"except for the generalization of nodes to locations"
    Append "and the addition of properties for here and origin".

"XPointer applications"
    Change to "XPointer processors".

bullet 1
"When the XPointer is a fragment identifier of a URI reference, the
document or external parsed entity is the one identified by the URI
portion."
    Note that this requirement must be enforced outside the XPointer
    processor.  As described in 3.3, the XPointer processor is simply
    handed a resource. It is presumably this resource's root node that
    is the initial setting for the context location.

bullet 4
"XPointer applications"
    Change to "XPointer processors".

------------------------------------------------------------------------
5.2.1 Namespace Initialization

para 1
"Any xmlns parts attempting to override the xml prefix must be ignored."
    What about the xmlns prefix?


------------------------------------------------------------------------
5.3 The point and range Location Types

para 1
"Locations of the point and range type"
    Change "type" to "types".

para 2
See XP121 in the Linking Issue List. The decision was:
"He is looking for clarification, but more properly this should come
from the DOM side, not ours."
    Yes, I want clarification, but first I want correctness. I believe
    you are misusing Unicode terminology, and I think you should either
    correct it or (since it's just a Note) drop it.

------------------------------------------------------------------------
5.3.1 Definition of Point Location

para 1
    Delete "] [Definition:" in the middle of the sentence.

"Two points are identical if"
    After "identical", insert "(equal)", since that term is sometimes
    used (e.g., in the definition of collapsed range).

    After "if", append "and only if".

para 2
    Change "applications" to "XPointer processors".

para 3
"a text node inside an element"
    Delete "inside an element". It's redundant. Every text node is
    inside an element.

para 4
"a non-zero index n indicates the point immediately after the nth child
node"
    Note that the point is (in general) *not* immediately after that
    node in document order, because the node contains descendant nodes
    or points that intervene.

------------------------------------------------------------------------
5.3.2 Definition of Range Location

para 1
    Put "[Definition:" "]" around the whole sentence.
    Put "range", "start point", and "end point" in bold.

para 3
"a range from the start of a processing instruction"
    This meaning of "start" (meaning the point to the immediate left of
    the PI) differs from that of "start-point" (meaning the leftmost
    point inside the PI). Instead of "the start of", you might say
    "immediately before".

para 5
"between the start point and end point"
    "Between" might rate another forward reference to document order.

para 6
"The axes of a range location are identical to the axes of its start
point."
    On the previous XPointer draft, I raised a comment on the weirdness
    of this definition, and suggested that you'd be better off saying
    that a range's self axis (and its *-or-self axes) contain the range
    itself, and all the other axes are empty. This went into the Linking
    Issue List as XP123(g), but it appears that the WG's response was
    misplaced under XP123(e):
        Discussion: after some discussion it appeared that keeping
        ranges 'terminal' w.r.t. axis computation wasn't a problem and
        keeping all axis of a range being empty is not a problem in
        practice.

        Decision: accepted all the axis for a a range are empty excepted
        *self which are the range itself. Add a note about start-point()
        or end-point() as intermediary step for doing axis computation
        from a range .
    Perhaps the misplacement explains why the decision was not carried
    out.

------------------------------------------------------------------------
5.3.3 Covering Ranges for All Location Types

bullet 5
"For any other kind of location"
    Append "(i.e., element, text, comment, or processing instruction)".

------------------------------------------------------------------------
5.3.4 Tests for point and range Locations

para 1
"production for NodeType... by adding"
    Delete "...".
    Put "NodeType" in a 'code' element.

production
    Here, when you modify XPath production [38], you label it [11].  In
    5.4.1, when you modify XPath production [4], you label it [4xptr].
    I think I prefer the latter technique. Thus, change "[11]" to
    "[38xptr]".

------------------------------------------------------------------------
5.3.5 Document Order

para 2
"node point" (3 times)
"character point" (2 times)
    Insert hyphen.

"Conceptually, node points label gaps between nodes,"
    I'm not sure people will understand what you mean by "gaps". You
    might say "positions" instead.

"while character points occur within a node, between the node points to
the right and left of the node."
    This is also true of node-points. Wouldn't it be more accurate (and
    more parallel) to say that character-points label the gaps/positions
    between characters?

para 3
"node point" (2 times)
    Insert hyphen.

para 4
"character point"
    Insert hyphen.

paras 2-4
    It seems to me that these three paragraphs aren't particularly
    pertinent to document order, and mostly just try to give people a
    conceptual grasp of point locations. So they might fit better back
    in 5.3.1.

para 5
"immediately preceding node"
    Put in bold.

"except that there is no point defined preceding or following the root"
    So? I don't think this affects the definition in any way. Delete it.

(In what follows, I abbreviate "immediately preceding node" as "IPN".)

"The following diagram..."
    It would be helpful to have the original XML text that this diagram
    represents. It appears to be:

    <p id='p1'>Everything_is_<em>deeply_</em>intertwingled.</p>

    (Although the underscores in the diagram are probably just
    placeholders for spaces.)

    I think it would be more common to put the space *after* the </em>
    tag rather than before it.

Diagram
    This diagram doesn't seem particularly related to document order.
    It too might fit better back in 5.3.1, unless you use it to give
    examples of IPN and document order.

    There should be an attribute node for id='p1', and markers for its
    three character-points.

    The text talks about the "gaps" between nodes: it would be nice if
    the diagram showed such gaps! (e.g., between the right side of the
    'text node 1' pentagon and the left side of the 'em' box)

    "postion 1 in p"
    "postion 0 in em"
        Change "postion" to "position".

    Note that "position" is not an XPointer term. The phrase "position
    1 in p" presumably means "point with index 1 and container-node p":
    you should either explain this or change the wording on the diagram
    ("point with index 1 in p" might be okay).

    All indications of "startpoint" and "endpoint" disagree with the
    definitions of the start-point and end-point functions. For example:
    --- The start-point of 'text node 1' is the character-point with
        container node = 'text node 1' and index = 0, i.e., the point
        labelled '0' just before the 'E', which is not what "text node 1
        startpoint" indicates.
    --- The start-point of 'p' is the node-point with container node =
        'p' and index = 0, which is presumably what is meant by
        "position 0 in p", but that label does not coincide with
        "p startpoint".

    Here is a rough ASCII-art version of the diagram that fixes the
    above problems. (You'll need to view it in a fixed-width font.)
    Please excuse the terse labels.

    +-------------------------------------------------------------------------------------------------------------------+
    |                                                                                                                   |
    |                                                         p                                                         |
    |                                                                                                                   |
    +-------------------------------------------------------------------------------------------------------------------+
    .     |                                                   |
    .     |                                                   |
    .     |                       +---------------------------+----+--------------------------------+
    .     |                       |                                |                                |
    .     |     !                 |                 !              |              !                 |                 !
    . +-------+ ! +-------------------------------+ ! +-------------------------+ ! +-------------------------------+ !
    . |       | ! |                               | ! |                         | ! |                               | !
    . |  id   | ! |          text node 1          | ! |           em            | ! |          text node 3          | !
    . |       | ! |                               | ! |                         | ! |                               | !
    . |       | ! |                               | ! +-------------------------+ ! |                               | !
    . |       | ! |                               | ! .            |              ! |                               | !
    . |       | ! |                               | ! . !          |          !   ! |                               | !
    . |       | ! |                               | ! . ! +-----------------+ !   ! |                               | !
    . |       | ! |                               | ! . ! |                 | !   ! |                               | !
    . |       | ! |                               | ! . ! |   text node 2   | !   ! |                               | !
    . |       | ! |                               | ! . ! |                 | !   ! |                               | !
    . +-------+ ! +-------------------------------+ ! . ! +-----------------+ !   ! +-------------------------------+ !
    . .  p 1    ! .  E v e r y t h i n g _ i s _    ! . ! .  d e e p l y _    !   ! .  i n t e r t w i n g l e d .    !
    . . ! ! !   ! . ! ! ! ! ! ! ! ! ! ! ! ! ! ! !   ! . ! . ! ! ! ! ! ! ! !   !   ! . ! ! ! ! ! ! ! ! ! ! ! ! ! ! !   !
    . . 0   2   ! . 0         5        10       !   ! . ! . 0         5   !   !   ! . 0         5        10       !   !
    . . !   !   ! . !                           !   ! . ! . !             !   !   ! . !                           !   !
    . . +sp !   ! . +------ TN1 start-point     !   ! . ! . +-TN2 start-p !   !   ! . +----- TN3 start-point      !   !
    . .   ep+   ! .         TN1 end-point ------+   ! . ! .   TN2 end-p --+   !   ! .        TN3 end-point -------+   !
    . .         ! .                                 ! . !                     !   ! .                                 !
                !                                   !   0 <- node-pts in em-> 1   !                                   !
                !                                   !   !                     !   !                                   !
                !                                   !   +--- em start-point   !   !                                   !
                !                                   !        em end-point ----+   !                                   !
                !                                   !                             !                                   !
                0 <-- node-points in p --->         1                             2                                   3
                !                                                                                                     !
                +-- p start-point                                                                       p end-point --+
  
    (Vertical lines made of exclamation marks denote points. Vertical
    lines made of dots indicate where the respective nodes occur in
    document order, relative to points, assuming a reasonable
    definition of document order.)

Node and point
"A node is before a point if the node is before or equal in document
order to the IPN of the point; otherwise, the node is after the point."
    Let point X be node-point 2 in the <p> node (between the <em> node
    and text node 3).  Its IPN is the <em> node. So every node before
    or equal to the <em> node is before that point. That's fine.
    However, every *other* node is defined to be after the point, and
    that includes text node 2. But I really don't think you want text
    node 2 to be after point X.

Point and point
"Two points P1 and P2 are equal if their IPNs are equal and the indexes
of the points are equal."
    This is incorrect. Consider:
    P1 = the character-point between 'E' and 'v'
         (container = text node 1, index = 1)
    P2 = the node-point between text node 1 and the <em> node
         (container = <p> node, index = 1)
    For both P1 and P2, the IPN is text node 1 and the index is 1. So by
    the above definition, P1 and P2 are equal. But obviously they are
    not.

"P1 is before P2 if P1's IPN is before P2's"
    Consider:
    P1 = node-point 2 in the <p> node ("point X" previously)
    P2 = any character-point in text node 2.
    P1's IPN is the <em> node and P2's is text node 2. The <em> node is
    before text node 2, so the above definition says that P1 is before
    P2. But I don't think you want that to be the case.

"[P1 is before P2] if their IPNs are equal and P1's index is less than
P2's."
    Consider:
    P1 = the node-point between text node 1 and the <em> node
         (container = <p> node, index = 1)
    P2 = the character-point between the 'v' and the 'e'
         (container = text node 1, index = 2)
    For both P1 and P2, the IPN is text node 1. P1's index is less than
    P2's, so P1 is supposedly before P2. But I don't think you want it
    to be.
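
    The counterexamples in the "Node and point" and "Point and point"
    comments above can be checked mechanically. What follows is only a
    minimal sketch: it assumes the IPN rule reads the way those examples
    use it (a character-point's IPN is its container; a node-point with
    non-zero index n has the nth child of its container as its IPN). The
    node names, tables, and function names are made up for the
    illustration and do not come from the CR.

```python
# Nodes of <p>Everything_is_<em>deeply_</em>intertwingled.</p> in document
# order. The id attribute is omitted to keep the sketch short.
ORDER = ["p", "text1", "em", "text2", "text3"]
CHILDREN = {"p": ["text1", "em", "text3"], "em": ["text2"]}

def ipn(point):
    """IPN as used in the examples above (an assumed reading of the CR)."""
    container, index = point
    if container in CHILDREN and index > 0:
        return CHILDREN[container][index - 1]  # node-point: its index-th child
    return container  # character-point (index-0 node-points are not exercised)

def node_before_node(a, b):
    return ORDER.index(a) < ORDER.index(b)

def node_before_point(node, point):
    """CR rule: the node is before or equal to the point's IPN."""
    return node == ipn(point) or node_before_node(node, ipn(point))

def points_equal(p1, p2):
    """CR rule: equal IPNs and equal indexes."""
    return ipn(p1) == ipn(p2) and p1[1] == p2[1]

def point_before_point(p1, p2):
    """CR rule: earlier IPN, or same IPN and smaller index."""
    a, b = ipn(p1), ipn(p2)
    return node_before_node(a, b) or (a == b and p1[1] < p2[1])

point_x = ("p", 2)  # node-point between the em node and text node 3
results = {
    "text node 2 is before point X":
        node_before_point("text2", point_x),
    "character-point (text1,1) equals node-point (p,1)":
        points_equal(("text1", 1), ("p", 1)),
    "point X is before character-point (text2,3)":
        point_before_point(point_x, ("text2", 3)),
    "node-point (p,1) is before character-point (text1,2)":
        point_before_point(("p", 1), ("text1", 2)),
}
for claim, value in results.items():
    print(f"{claim}: {value}")
```

    Under that reading, the first claim comes out False and the other
    three come out True, which is the opposite of what one would want in
    each case.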

document order in general
    The problem with these definitions stems from the definition and use
    of the IPN concept. It's very tempting to think that a point's
    "immediately preceding node" is the node that immediately precedes
    it in document order. If it *did* mean that, the definitions above
    would make a lot more sense (although some would still be wrong).

    So now you might want to redefine IPN so that it does mean that, but
    I don't think it would be worth the effort. I think you'd still have
    trouble defining the relative order of points with the same IPN.

    Instead, how about just giving a nice recursive definition of
    document order? Something like this:

        Let "point(C,I)" denote the point whose container node is C and
            whose index is I.
        Let "child(N,I)" denote the Ith child of node N.
        Let "doc_order(N)" denote the document order of the nodes and
        points under node N, defined as follows:

        doc_order(N):
            if N is an element node or root node:
                Let k be the number of children of N.
                N
                For each namespace node S of N, doc_order(S)
                For each attribute node A of N, doc_order(A)
                point(N,0)
                For each i such that 1 <= i <= k,
                    doc_order( child(N,i) )
                    point(N,i)

            if N is any other kind of node:
                Let k be the length of the string-value of N.
                N
                For each i such that 0 <= i <= k,
                    point(N,i)
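
    The recursive definition above can be sketched in Python, reading
    the loop bodies as child(N,i) and point(N,i). This is an
    illustration only: it uses xml.dom.minidom as a stand-in data model
    (which has no namespace nodes, so that step is skipped), and the
    helper names doc_order and label are hypothetical.

```python
from xml.dom import minidom

def doc_order(node):
    """Yield nodes and (container, index) points in the proposed order."""
    yield node
    if node.nodeType in (node.ELEMENT_NODE, node.DOCUMENT_NODE):
        # Attribute nodes come right after their element.
        attrs = node.attributes  # None for the document (root) node
        if attrs is not None:
            for i in range(attrs.length):
                yield from doc_order(attrs.item(i))
        yield (node, 0)
        for i, child in enumerate(node.childNodes, start=1):
            yield from doc_order(child)
            yield (node, i)
    else:
        # Text, attribute, comment, PI: points 0..k, k = length of the value.
        value = node.nodeValue or ""
        for i in range(len(value) + 1):
            yield (node, i)

def label(item):
    """Short printable name for a node or a point."""
    if isinstance(item, tuple):
        node, index = item
        return f"point({node.nodeName},{index})"
    return item.nodeName

document = minidom.parseString("<p id='p1'>ab<em>c</em></p>")
labels = [label(item) for item in doc_order(document)]
print("\n".join(labels))
```

    In the printed sequence, every character-point of the first text
    node appears before point(p,1), and point(em,1) appears before
    point(p,2), which is the nesting that the IPN-based rules fail to
    capture.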

last para
"Note that one consequence of these rules is that a point can be treated
the same as the equivalent collapsed range."
    Only for the purpose of determining document order.

------------------------------------------------------------------------
5.4 XPointer Functions

para 1
"XPointer applications"
    Change "applications" to "processors".

Throughout 5.4.x:
    For consistency with XPath, in every function prototype, remove the
    space before the closing parenthesis.

------------------------------------------------------------------------
5.4.1 range-to Function

para 1
"For each location in the context"
    This is still misleading. Yes, I made this comment on the previous
    draft of XPointer, and yes, the WG decided (XP126(b) in the Linking
    Issue List) to keep it as is. However, I was not satisfied with the
    rationale for the decision, as detailed in
    http://lists.w3.org/Archives/Public/www-xml-linking-comments/2001AprJun/0073.html
    under "xp126-b-dyck". I have had no response to that posting.

"the start point of the context location (as determined by the
start-point function)"
    It would be better to put the parenthetical remark after "start
    point". That's what you do for "end point" in the same sentence.

"the location"
    As I pointed out on the previous draft, and as Elliotte Rusty Harold
    has pointed out on this draft, you don't say what happens when the
    location-set argument contains other than a single location. Perhaps
    you should say that the function returns a location-set containing
    a range for each location in the argument location-set.

------------------------------------------------------------------------
5.4.2 string-range() Function
    Delete "()" from the section title. None of the other section titles
    for functions has parentheses.

para 1
"For each location in the location-set argument, string-range returns a
set of ranges..."
    This suggests that, for instance, if the location-set contains two
    locations, the function returns two sets of ranges, one for each
    location. Presumably, it really only returns one set of ranges, the
    union of those two. So I suggest rewording to something like:

        This function returns a location-set containing ranges
        determined as follows. For each location in the location-set
        argument, the function searches the string-value of the
        location for substrings that match the string argument.

"An empty string"
    Maybe italicize "string" to indicate that it refers to the argument
    string, not the string-value of a location.

"Each non-overlapping match"
    Consider searching "banana" for substrings that match "ana". One
    possible interpretation of the phrase "non-overlapping match" would
    say that there are two matches, but they overlap, therefore there
    are no non-overlapping matches. I suspect the intent is that there
    is one non-overlapping match, but this is not at all clear.
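
    To make the ambiguity concrete, here is a small Python sketch of
    three readings of "non-overlapping match". None of these is claimed
    to be what the CR intends; the function names are invented for the
    illustration, and the pattern is assumed to be non-empty.

```python
def all_matches(text, pattern):
    """Every offset at which pattern occurs, overlaps included."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]

def isolated_matches(text, pattern):
    """Only those matches that overlap no other match."""
    hits = all_matches(text, pattern)
    return [i for i in hits
            if all(abs(i - j) >= len(pattern) for j in hits if j != i)]

def left_to_right_matches(text, pattern):
    """Scan left to right, resuming after the end of each match."""
    if not pattern:
        raise ValueError("this sketch assumes a non-empty pattern")
    found = []
    i = text.find(pattern)
    while i != -1:
        found.append(i)
        i = text.find(pattern, i + len(pattern))
    return found

print(all_matches("banana", "ana"))            # [1, 3]
print(isolated_matches("banana", "ana"))       # []
print(left_to_right_matches("banana", "ana"))  # [1]
```

    The second reading yields no ranges at all and the third yields
    one, so the wording needs to say which is meant.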

para 2
"matched string" (2 times)
    Change "string" to "substring".

"The default value is 1, which makes the range start immediately before
the first character of the matched string."
    Are numbers less than 1 allowed? If so, it would be nice to give an
    example of such. If not, you should definitely say so.

    Are non-integral numbers allowed?

"The fourth argument gives the number of characters in the range"
    Presumably this must be greater than or equal to zero. What happens
    if a negative number is passed in?

    Are non-integral numbers allowed?

    This sentence doesn't completely define the resulting range.
    Consider the document:
        <doc>Thomas <em>Pyn</em>chon</doc>
    and the function call:
        string-range( /doc/em, "Py", 1, 3 )
    The resulting range starts at
        point( container = /doc/em/text(), index = 0 )
    but it could end at any of:
        point( container = /doc/em/text(), index = 3 )
        point( container = /doc/em,        index = 1 )
        point( container = /doc,           index = 2 )
        point( container = /doc/node()[3], index = 0 )
    and still satisfy the constraint that there be three characters in
    the range.
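
    The claim that all four candidate end points leave three characters
    in the range can be checked with a short sketch. It assumes that
    "number of characters in the range" means characters of text nodes
    lying between the two points, and it uses xml.dom.minidom purely as
    a convenient tree; the helper names are hypothetical.

```python
from xml.dom import minidom

def text_length(node):
    """Characters the node contributes to an ancestor's string-value."""
    if node.nodeType == node.TEXT_NODE:
        return len(node.data)
    return sum(text_length(child) for child in node.childNodes)

def offset_of_node(node, root):
    """Characters of root's string-value that precede the node."""
    total = 0
    while node is not root:
        sibling = node.previousSibling
        while sibling is not None:
            total += text_length(sibling)
            sibling = sibling.previousSibling
        node = node.parentNode
    return total

def offset_of_point(container, index, root):
    """Characters of root's string-value that precede the point."""
    base = offset_of_node(container, root)
    if container.nodeType == container.TEXT_NODE:
        return base + index  # character-point
    # node-point: skip the text of the first `index` children
    return base + sum(text_length(c) for c in container.childNodes[:index])

doc = minidom.parseString("<doc>Thomas <em>Pyn</em>chon</doc>").documentElement
em = doc.childNodes[1]
start = offset_of_point(em.firstChild, 0, doc)
candidates = {
    "/doc/em/text(), index 3": (em.firstChild, 3),
    "/doc/em, index 1": (em, 1),
    "/doc, index 2": (doc, 2),
    "/doc/node()[3], index 0": (doc.childNodes[2], 0),
}
lengths = {name: offset_of_point(c, i, doc) - start
           for name, (c, i) in candidates.items()}
for name, length in lengths.items():
    print(f"{name}: {length} characters in range")
```

    Each candidate reports 3, so the character count alone cannot pick
    out a unique end point.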

"Thus, both of the start point and end point of each range ... will be
character points."
    This statement does not logically follow from the previous. As my
    example shows, there can be node-points that satisfy the
    constraints.

"character points"
    Insert hyphen.

para 4
"For any particular match, if the string argument is not found in the
string-value of the location"
    This phrase doesn't make sense, because if the string argument isn't
    found, there *is* no match. I suggest rewording to something like:

        For any particular location, if no match is found, no range is
        added to the result for that location. For any particular match,
        if the third and fourth argument ...

"wholly beyond"
    So if they indicate a range that is only *partially* beyond the
    beginning or end of the document or entity, a range *is* added to
    the result? It would be good to give an example.

"beyond the beginning or end of the document or entity"
    On the previous draft, I asked:
    What happens if the third or fourth arguments indicate a position
    that is within the document, but outside the string-value of the
    location?  For example, with this as the document:
        <doc>Thomas <em>Pyn</em>chon</doc>
    and this as the xpointer:
        string-range(/doc/em, "P", 1, 7)
    Does it select "Pynchon", "Pyn", or nothing?

    XP127(g) in the Linking Issue List shows a recommendation:
        sounds clear that this will select "Pynchon", since "Element
        boundaries, as well as entire embedded nodes such as processing
        instructions and comments, are ignored"
    but no actual decision.

    The thing is, once you "leave" the string-value of the location being
    searched, where are you? In some nearby text node presumably, i.e.
    still in the string-value of some higher node.  In fact, it seems
    like the endpoints of the range are located with respect to the
    string-value of the whole document or external parsed entity.
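
    A rough sketch of the two readings for the "Pynchon" example,
    using Python's xml.etree.ElementTree. It is specific to this one
    document (the offset of the em element is taken from the length of
    the text before it), and neither reading is asserted to be the one
    the CR intends.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<doc>Thomas <em>Pyn</em>chon</doc>")
em = doc.find("em")

doc_value = "".join(doc.itertext())  # string-value of /doc
em_value = "".join(em.itertext())    # string-value of /doc/em

# Where the match for "P" begins, in each coordinate system.
match_in_em = em_value.find("P")
match_in_doc = len(doc.text or "") + match_in_em

length = 7
# Reading 1: the range is confined to the string-value that was searched.
confined = em_value[match_in_em:match_in_em + length]
# Reading 2: the range is measured against the whole document's string-value.
unconfined = doc_value[match_in_doc:match_in_doc + length]

print(repr(confined))
print(repr(unconfined))
```

    The first reading gives "Pyn" and the second gives "Pynchon", which
    is the choice the text needs to settle.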

    On a related note, consider the document:
        <doc>Pynchon<!-- Pyn-->chon</doc>
    and the function call:
        string-range(/doc/node(), "Pyn", 1, 7)
    Matches are found in the first text node and the comment node. The
    former will certainly add a range to the result, but what about the
    latter? You can imagine similar examples involving attribute,
    namespace, and processing instruction nodes.

para 5
"character points"
    Insert hyphen.
    This sentence repeats the last sentence of para 2.

para 9
"string content"
"retain the structural context"
    These phrases are not well-defined. Maybe this paragraph should just
    be a Note.

"For example, if the 17th occurrence of "Thomas Pynchon"..."
    Because this pertains to the first example, it would probably make
    more sense to put this paragraph after the first example.

"XPointer application"
    Change "application" to "processor".

------------------------------------------------------------------------
5.4.3.1 range Function

"representing the covering range"
    Append "(see 5.3.3)".

------------------------------------------------------------------------
5.4.3.2 range-inside Function

"If x is ... a point, then x is added to the result location-set."
    On the previous draft, I said:
    But if x is a point, then you'd be adding a point to the result, and
    you just said that the function returns ranges.  Instead, you
    presumably want to add the collapsed range at that point.

    XP128(a) in the Linking Issue List says "Decision: approved", but
    the decision has not been carried out.

"character point"
    Insert hyphen.

"If the end point is a character point then its index is the length of
the string-value of x; otherwise its index is the number of children of
x."
    This is somewhat circular, in that you're defining the end point
    based on a property of the end point. Of course, it works, because
    the property of it being a character-point or node-point is
    dependent only on its container node, which was specified in the
    previous sentence. Still, I think it would be clearer if you said
    something like:

        If x is an element node or root node, the index of the end point
        of the range is the number of children of x; otherwise its index
        is the length of the string-value of x.

------------------------------------------------------------------------
5.4.3.3 start-point Function
5.4.3.4 end-point Function

"If x is of type attribute or namespace, the XPointer part in which the
function appears fails."
    There is no reason for this. Attribute and namespace nodes are
    perfectly fine as containers for points and ranges.

    On the previous draft, I said:
    I'm mystified: why is it so wrong to ask for the start-point (or
    end-point) of an attribute or namespace location? Why can't these
    functions treat such locations just like text, comment, and
    processing instruction locations? That's what range-inside does.
    In fact, if someone really wanted to write
        start-point(@foo)
    they could get around start-point's bizarre dislike of attribute
    locations just by writing
        start-point(range-inside(@foo))
    If the latter expression isn't erroneous, why is the former?
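
    The workaround can be modelled in a few lines of Python. This is a
    toy model under stated assumptions, not the CR's definitions:
    range_inside covers only the non-element case, start_point of a
    range is taken to be the range's start point, and the class and
    function names (Node, Point, Range, PartFails) are invented here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    kind: str        # "element", "attribute", "text", ...
    value: str = ""  # string-value, for non-element nodes

@dataclass(frozen=True)
class Point:
    container: Node
    index: int

@dataclass(frozen=True)
class Range:
    start: Point
    end: Point

class PartFails(Exception):
    """Stands in for 'the XPointer part in which the function appears fails'."""

def range_inside(x):
    # Non-element case only: the range spans the node's string-value.
    return Range(Point(x, 0), Point(x, len(x.value)))

def start_point(x):
    if isinstance(x, Range):
        return x.start
    if x.kind in ("attribute", "namespace"):
        raise PartFails(f"start-point of {x.kind} location")
    return Point(x, 0)

foo = Node("attribute", "bar")
print(start_point(range_inside(foo)))  # succeeds
try:
    start_point(foo)
except PartFails as err:
    print("fails:", err)
```

    Both calls ask for the same point, yet only the indirect one
    succeeds.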

    XP129(d) in the Linking Issue List gives the decision: "keep as is
    we would prefer to not add complexity at this point".

    My response to that appears in
    http://lists.w3.org/Archives/Public/www-xml-linking-comments/2001AprJun/0073.html
    under "xp129-d-dyck":
    Complexity? The following change would satisfy me:
        In the description for each of start-point() and end-point(),
        delete the bullet regarding attribute or namespace,
        and in the previous bullet, change
           "text, comment, or processing instruction"
        to
           "text, attribute, namespace, comment, or processing instruction".
    Can you honestly say that this adds complexity? To my thinking, the
    result is simpler than the current definition. Moreover, I'd say
    it's easier to implement.

    I have received no reply to that submission.

5.4.3.4 para 1
"to the result location-set"
    Change "result" to "resulting".

------------------------------------------------------------------------
5.4.4 here Function

para 1
"the XPointer part in which the here function appears fails"
    Does a resource error occur?

Note
"The returned location for an XPointer appearing in element content does
not have a node type of element because the XPointer is in a text node
that is itself inside an element."
    Huh? This seems to ignore what the first bullet says.

------------------------------------------------------------------------
5.4.5 origin Function

para 1
"a link expressed in an XML document"
    Is it important that the link be expressed in an XML document?
    (Could it be expressed in any other kind of document? Would it make
    a difference if it was?)

    I think it would be a bit clearer if the last sentence of the
    paragraph were inserted after the first sentence.

para 2
"It is a resource error to use origin in the fragment identifier portion
of a URI reference where a URI is also provided and identifies a
resource different from the resource from which traversal was initiated,
or in a situation where traversal is not occurring."
    Why? It seems like it would be a useful thing to do.
    Imagine that document A has emphasized words:
        <em>frimmin</em> on the <em>jimjam</em>
    and document B is a glossary for these words:
        <entry><word>frimmin</word><defn>...</defn></entry>
    and you want to create third-party links such that from any <em>
    node in A, you can initiate traversal to the corresponding glossary
    entry in B.
    Wouldn't you need a URI reference something like this?:
        B.xml#xpointer(//entry[word = origin()])
    And wouldn't that be a resource error according to the quoted
    sentence?

------------------------------------------------------------------------
5.5 Root Node Children

"XPointer extends the XPath data model"
    Fine, but where is the data model of an external parsed entity
    defined?

------------------------------------------------------------------------

-Michael Dyck
Received on Monday, 4 March 2002 06:40:12 UTC