F&O WD from Jeni Tennison on 2002-05-13 (public-qt-comments@w3.org from May 2002)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Mon, 13 May 2002 12:01:57 +0100
To: Jonathan Robie <jonathan.robie@datadirect-technologies.com>
CC: public-qt-comments@w3.org
Message-ID: <1101822242185.20020513120157@jenitennison.com>

Hi Jonathan,

I promised some more detailed comments on the F&O WD, so here you are.
As usual, these come from my perspective as an XSLT user rather than
anything else. I've ignored the constructors and casting sections,
since I know they're under review anyway.

I guess my guiding principle is that if a function is just a shorthand
for something that can be implemented without a recursive function,
then it shouldn't be included in the core set of XPath 2.0 functions.
Both XQuery and XSLT have methods of defining extension functions, so
I think that it's more important to focus on the functions that are
impossible or difficult to implement in XQuery/XSLT rather than those
that are simply convenience functions.

Cheers,

Jeni

---

The new functions, added on to XPath 1.0, are the following. I've put
* by the ones that I think should stay, - by those that I think should
go, and + by those on which I'm equivocal:

  - node-kind() -- I've hardly ever seen a problem that's required
    this functionality. I think it would be more flexible to use the
    "instance of" operator to work out what kind of node you're
    looking at; it would be easy enough to define your own function to
    give you the name of the node type in the rare cases where that's
    required. In other words, you should be able to get at the type of
    a node and the type of an atomic value in the same way.
  
  + node-name() -- This used to be the name() function; I wonder
    whether it would be possible to merge this with the name()
    function. It would be great if that could be done so that the
    name() function works in the way that people think it works, such
    that "name() = 'pfx:name'" is equivalent to "self::pfx:name"; this
    would be backwards-incompatible with XPath 1.0, but would be more
    intuitive for users.

  * data() -- Certainly required now, but as with a lot of these
    functions, I wonder whether it would be helpful to have it follow
    the pattern of existing functions, like name() and string(), and
    have it return the typed value of the context node if it doesn't
    have an argument passed to it. I know that the F&O document
    purposefully tries to avoid overloaded functions, but for users,
    both those used to XPath 1.0 and those coming new to XPath 2.0, it
    will be confusing that different functions work in different ways
    depending on which version they were introduced in.

  * base-uri() -- Certainly very useful; we often get questions asking
    how to get the URL of the file that's being used as the source of
    the transformation.

  - unique-ID() -- I've never known anyone to have to get hold of the
    value of the ID attribute on a given element. If they do, they
    know the name of the attribute and can get its value through
    normal mechanisms. I'm also worried that this function will get
    confused with the generate-id() function.

  * compare() -- We do need this facility although not as much as
    you might think, in my opinion. I have to say that personally I
    find a return value of -1, 0 or 1 difficult to work with: I always
    get confused about which way round the arguments are related. It
    would be great if there was an alternative design, but I doubt
    that there is and since we'll rarely have to use different
    collations, I don't think that's too much of a problem.

  - normalize-unicode() -- As far as I understand the character
    model for the WWW, all text on the Internet should be normalized,
    and specifications should require Unicode normalized (NFC) text. I
    can't recall ever seeing someone need to do Unicode normalization;
    I suspect that such operations would be better done at a lower
    level in the application (normalize early) and that the data model
    should dictate that text is normalized.

  * upper-case() and lower-case() -- There's definitely a strong
    requirement for these, although allowing case-insensitive
    comparisons (which I think is supported with collations?) will go
    most of the way towards supporting the usual reason for
    case-changing. As I think I might have mentioned before, I believe
    that technically there should be a title-case() function as well,
    since the title case version of a letter is not always the same as
    the upper case version of a letter (ref.
    http://www.unicode.org/unicode/reports/tr21/)

  + string-pad() -- Repeating the same string is a fairly common
    operation, although it is one that's particularly easy to
    accomplish now with a user-defined function and a simple
    iteration. I therefore don't think that this function is vital,
    and if you want to save space, I think it should be dropped.

  * match() and replace() -- I think that you know that we need more
    regular expression support than this; I believe that you're
    working on that and that I've already commented on it.

  + duration/dateTime functions -- I've already commented on these in
    a separate thread. I think that this is the poorest section of the
    spec. The kinds of things that people want to do with dates are:

      - reformat them (which I believe is being supported separately
        in XSLT 2.0, though it's not there yet)

      - get a date from the common "seconds since
        1970-01-01T00:00:00Z" representation (for all its faults)

      - perform calculations between them

    Dates have a fixed format, so it's not hard to extract individual
    components from a date; I don't think that the set of functions to
    do so is necessary. It's harder to extract information from a
    duration because it doesn't have a fixed format, but not
    drastically so, and I think it's really very rare that you need to
    get that kind of information from a duration.

    One thing that *is* difficult, and is useful, is to get values
    like "the number of seconds represented by this duration" (i.e.
    the reverse of dayTimeDuration-from-seconds()) -- it's useful
    because that enables you to perform calculations with durations
    (adding them, dividing them) that you can't do otherwise.

  * get-local-name() and get-namespace-uri() -- Makes me wish that
    the structured data types such as QNames, dates, durations and so
    on could be treated as virtual elements, so you could do
    $qname/local-name or $date/year. These are certainly handy
    functions, though.

  * resolve-URI() -- I imagine this will be very handy.

    URI manipulation is, I think, the primary reason for the
    requirement for string manipulation functions like
    substring-after-last() or index-of-last(). Perhaps a
    get-file-name() method would be useful; I'm not sure.

  + deep-equal() -- I wouldn't personally say that this was a
    high-priority function. My guess would be that people would use it
    for the common task of moving through two documents to see where
    differences lie between them, and in that context I think it would
    be very expensive. But others might have use cases that I'm
    unaware of.

  - root() -- I think that root($node) does the same thing as
    $node/ancestor-or-self::node()[last()]. Given that the function is
    possible with very little effort, and that you rarely need to get
    from a node to the root node of that document, I don't really see
    the point of this function.

  - if-absent() and if-empty() are shorthands for:
    if (not($node)) then $default else $node and
    if (not($node) or not($node/node())) then $default else $node
    I don't find these expressions so burdensome that they require
    shorthand functions, especially not compared to some of the other
    functionality that's currently missing from the spec.

  * index-of() -- definitely required, though I have no doubt that
    people will use it like:

    $nodes[index-of(for $n in $nodes return string($n), 'foo')]

  - empty() -- empty($seq) seems to be equivalent to
    not(boolean($seq)); as with other shorthands for easy expressions,
    I don't think this one's necessary, although it's true that the
    casting of empty sequences to boolean false can be non-obvious for
    beginners.

  - exists() -- seems to be equivalent to not(empty($seq)) or exactly
    equivalent to boolean($seq). I don't think this is necessary;
    empty() is more useful if you didn't want to use boolean() in the
    way that it's been used in XPath 1.0.

  + distinct-nodes() -- This obviously doesn't arise in XSLT 1.0
    because it's impossible to create a node set that contains more
    than one of a particular node. Given that node sequences are (or
    should/can be) created with duplicates automatically removed, I
    doubt that this will come into play very often; there aren't any
    use cases for it in the XQuery use case document either. On the
    other hand, the equivalent expression (distinct-nodes($nodes) is
    the same as union(() | $nodes)) is a bit of a hack and might not
    get you precisely what you want (since it also reorders into
    document order), so it's probably best to be on the safe side.

  * distinct-values() -- This functionality is required (and
    lacking) in XSLT 1.0, but the grouping facilities in XSLT 2.0 mean
    that it wouldn't be nearly as important there. I can see places
    where it would be handy, though (for example to write things like
    "there are 4 groups...", and to allow me to apply templates to
    distinct nodes in order to get more flexibility in my stylesheet).
    Since this function is likely to be much more heavily used than
    distinct-nodes(), I think it should be shortened to distinct().

  - insert() -- I can't really see the point, given that there's a
    concat(), a subsequence() and an index-of() and I don't think that
    there will often be times when you need to insert items into the
    middle of a sequence.

  - remove() -- Again, I don't see why this is needed, given that you
    can use a predicate to do the same thing: $target[position() !=
    $position].

  * subsequence() -- I imagine this would be useful.

  + sequence-deep-equal() and sequence-node-equal() -- I'm not sure
    about sequence-deep-equal(), for the same reason I'm not sure
    about deep-equal(). The most useful, I would imagine, would be a
    plain sequence-equal() that compared the two sequences to see if
    they were the same on an item-by-item basis, with nodes being
    assessed based on identity, and values being assessed on their
    value.

  - avg() -- I'm not personally convinced (since the equivalent
    expression of sum() div count() really isn't difficult).

  * max() and min() -- Definitely. This is a requirement that's
    probably even greater than date formatting or regular expressions.
    It would be even more helpful if there was a quick way of getting
    to the node(s) that has the min/max value, rather than just
    getting the value itself. I imagine we're going to see rather a
    lot of $nodes[. = max($nodes)] otherwise, although I guess that
    could be optimised.

  - idref() -- As I've said elsewhere, id() turns out to be hardly
    used in XSLT because of the issues to do with requiring a DTD be
    present for the link to be any use. Where you need a reverse link,
    you can generally set up a key instead. I'd rather see keys from
    XML Schema supported than a specific idref() function introduced.

  - filter() -- I think this is potentially very useful, but, like
    copy() and shallow(), it has to do with creating nodes, which
    means that it shouldn't live at the XPath level.

  - collection() -- I don't really understand how this is different
    from the document() function.

  * input() -- Sounds reasonable.

  - context-item() -- I assume that this is not a real function, but
    actually just a backup for the shorthand '.'? It should say so.

  * current-dateTime() -- Definitely required; XForms calls this
    function now(), which has the advantage of being short and
    avoiding the mixed case convention difficulties.
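
Several of the "-" entries above rest on the claim that a short
expression does the same job as the proposed function. The following is
a rough Python sketch of four of those equivalences, with plain lists
standing in for XPath sequences and 1-based positions as in XPath. The
Python function names are hypothetical stand-ins, not the draft
signatures, and list truthiness only approximates the boolean value of a
node sequence.

```python
# Python lists model XPath sequences; positions are 1-based as in XPath.
# All names here are illustrative stand-ins, not the F&O draft signatures.

def remove(target, position):
    """Model of $target[position() != $position]."""
    return [item for pos, item in enumerate(target, start=1) if pos != position]


def avg(values):
    """Model of sum($values) div count($values).

    Returning None for an empty input is an assumption of this sketch.
    """
    return sum(values) / len(values) if values else None


def items_with_max(values):
    """Model of $nodes[. = max($nodes)]: every item equal to the maximum."""
    if not values:
        return []
    top = max(values)
    return [v for v in values if v == top]


def if_absent(seq, default):
    """Model of: if (not($node)) then $default else $node."""
    return seq if seq else default
```

For instance, remove(['a', 'b', 'c'], 2) drops the second item, and
items_with_max() keeps every item that ties for the maximum rather than
just the first one.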

Aside from those mentioned above, functions that are missing are:

  * tokenize(), which people ask for all the time, particularly for
    splitting strings into lines or words

  + possibly sqrt(), sin() and cos(), which are particularly useful
    when creating graphic formats such as SVG and aren't that easy to
    implement in XSLT

  * random() (create random numbers) and more usefully, I think,
    randomize() (randomly alter the order of items in a sequence),
    both with obvious side-effect issues; again these are impossible
    to implement using XSLT

  * function-available() to support the idea that XPath function
    libraries could be provided by particular implementations.

  * system-property() to support getting information about the XPath
    implementation version and so on.
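
To make the tokenize() and randomize() requests concrete, here is a
minimal Python sketch of the behaviour I have in mind. The whitespace
default, the dropping of empty tokens, and the explicit seed parameter
are all assumptions made for illustration; nothing here is taken from
the draft.

```python
import random
import re


def tokenize(text, pattern=r"\s+"):
    """Split text on a regular expression, dropping empty tokens.

    Defaulting to whitespace and discarding empty tokens are assumptions.
    """
    return [tok for tok in re.split(pattern, text) if tok]


def randomize(items, seed):
    """Return the items in a shuffled order, leaving the input untouched.

    Passing the seed explicitly is one way around the side-effect issue:
    the same seed always gives the same order.
    """
    items = list(items)
    return random.Random(seed).sample(items, len(items))
```

Splitting into words is then tokenize(text), and splitting into lines is
tokenize(text, r"\r?\n").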

FWIW, on the issues front:

  14: (operator-function-signatures) I agree, some of the
      signatures are confusing; I read the spec as indicating the
      required types for the functions, such that if you're using
      XPath the casting to those types is done automatically.

  20: (operator-codepoint-vs-character) I agree that the spec
      should be clear about whether it's talking about code points or
      characters, but I think that the character model spec recommends
      talking about character strings rather than code unit strings
      (ref. http://www.w3.org/TR/charmod/#sec-Strings)

  21: (operator-function-return-types) In my opinion, the return
      type of a function should be fixed, and not change based on the
      actual type passed as the argument of a function.

  37: (semantic-contains) I think that adding linguistic/semantic
      contains is a huge effort for very little benefit, at least for
      XPath 2.0. I can see that XQuery might want it, but I wouldn't
      want XSLT to be burdened, as the primary task of XSLT is
      transformation rather than querying.

  44: (operator-collation-specification) I think that XPath 2.0 should
      follow the pattern of XPath/XSLT 1.0 and use qualified names
      rather than URIs, for consistency and because it makes them
      easier to use.

  63: (operator-augment-index-of) I find the distinction between
      performing operations on nodes vs. performing operations on
      their values fiddly. In the case of index-of(), it strikes me
      that it wouldn't be difficult to perform index-of-value() if you
      had support for an index-of() that matched by node identity or
      simple type value (by creating a sequence of the node values and
      getting the index of the value you were after).

  66: (operator-docorder-function) Like distinct-nodes(), the
      requirement (or lack of it) for this function isn't yet apparent
      because it's not an issue in XSLT 1.0. Personally, I don't think
      that it will be used that often, but it may be best to be on the
      safe side as it wouldn't be particularly easy to replicate this
      functionality without removing duplicate nodes at the same time.

  67: (operator-remove-dupes) Since location paths do remove
      duplicates, and there thus isn't any backwards incompatibility
      with XPath 1.0, I don't think there's any reason for count() or
      sum() to remove duplicates.

  73: (operator-compare-between) I don't think that a
      compare-between() function is required.

  77: (operator-string-from-char) chars aren't data types in XML
      Schema -- are they in XPath? If not, then this issue isn't
      relevant.

  94: (operator-within-window) As with (semantic-contains), I don't
      think this is a high priority for XPath 2.0.

 108: (operators-always-normalize) I don't think that we should need
      to worry about unicode normalization within XPath 2.0.

 136: (function-datetime-timezone-conversion) In XML Schema, the
      timezone isn't part of the value space of a dateTime. Adding a
      timezone to a dateTime is essentially a formatting function.

 139: (need-fuller-definition-of-error-behavior-and-handling) Yes. We
      need to be able to test if an item is an error, and then be able
      to get information about that error, most importantly an error
      message that describes it and probably some information about
      the context in which the error occurred (e.g. what the context
      node was). I'm sure that you already have something on the cards
      here. Another point of confusion is that the empty sequence is
      sometimes used as a kind of error value, but at other times an
      error object is returned. I haven't yet worked out what the
      underlying heuristic is there, assuming that there is one.

 141: (does-string-equality-use-codepoint-or-default-collation) I
      think it should use the default collation, like the other string
      manipulation functions.

 142: (what-should-floor-ceiling-round-return) For compatibility, this
      should really return an xs:double (I believe). However, I think
      that returning an xs:integer, with an empty sequence used
      instead of NaN, would also be reasonable.

 143: (need-tokenize-function) As above, we definitely need a
      tokenize() function, preferably one that defaults to breaking on
      whitespace.

 144: (should-concat-accept-sequence-arguments) It would be useful,
      but highly incompatible. Perhaps a separate concat-sequence()
      function should be invented. (In XSLT 2.0, you can achieve
      the same effect with an xsl:value-of and an empty separator
      attribute, but since XSLT shouldn't be used for general sequence
      construction (apparently), this isn't ideal.)

 150: (should-comparison-that-return-indeterminate-results-be-supported)
      As I've said before, yes. This is far more important than
      supporting matching of 'nearby' strings and so on, in my
      opinion.

 151: (comparison-functions-for-other-date-and-time-types) Yes, there
      should be comparison functions for other date and time types,
      although a basic rule about how the comparisons are carried out
      would be better than listing every possible combination of
      comparisons.

 152: (parameterized-extraction-functions-for-date-and-times) I view
      the extraction functions as superfluous, in the face of
      substring() and the prospect of a format-date() function. If you
      have them here, then I do think that they should be
      parameterised.

 154: (second-order-distinct-function) Like the other second-order
      functions, it would be great, but I don't think it's worth
      entering that territory at this stage.

 157: (boolean-from-string-legal-literals) Absolutely.

 162: (can-the-node-parameter-to-root-be-omitted) As I mentioned
      above, I think that having single-argument functions default to
      using the context item is a very useful tactic, and one that
      XPath 1.0 users are used to exploiting. It would be good, for
      consistency, if the new functions supported this shorthand.

 164: (for-complex-types-what-should-data-return) I don't have a
      strong opinion either way, but it should be consistent with the
      description of the typed value accessor in the data model. Since
      the string value is readily accessible in other ways, I think
      data() should probably not return the string value of the
      element if it has a complex type with complex content.

 166: (current-dateTime-convenience-functions) On the principle of
      having as few functions as possible, I don't think these
      convenience functions are necessary. They are easy to define for
      people who want them.

 168: (should-id-take-a-list-of-strings) id() definitely should be
      compatible with id() in XPath 1.0, and therefore accept a list
      of IDs.
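
The point made under issue 63 can be sketched in Python: given one
index-of() that matches nodes by identity and atomic values by value, an
index-of-value() falls out by first mapping the nodes to their values.
The Node class, the function names, and the assumption that index-of()
returns every matching 1-based position are all hypothetical, chosen
only for illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: two Nodes are equal only if they are the same object
class Node:
    value: str


def index_of(seq, target):
    """Return every 1-based position in seq that matches target.

    Nodes match by identity; atomic values match by value.
    """
    positions = []
    for pos, item in enumerate(seq, start=1):
        if isinstance(item, Node) or isinstance(target, Node):
            same = item is target
        else:
            same = item == target
        if same:
            positions.append(pos)
    return positions


def index_of_value(nodes, value):
    """index-of-value() built from index_of() over the nodes' values."""
    return index_of([node.value for node in nodes], value)
```

With three nodes whose values are 'foo', 'bar' and 'foo', index_of()
finds the third node only at position 3, while index_of_value() with
'foo' finds positions 1 and 3.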

---
Jeni Tennison
http://www.jenitennison.com/
Received on Monday, 13 May 2002 07:02:00 UTC