XML Schema WG comments on RDF documents

Colleagues:

With apologies for the delay, I transmit to you herewith the comments
of the XML Schema Working Group on the various RDF documents published
in Last Call recently.  We congratulate you on the progress of your
work and hope our comments are useful to you.  An HTML version of our
comments may be found at

http://www.w3.org/XML/Group/2003/03/xml-schema-rdf-notes.html

and I append an ASCII-only version for those who find plain text more
convenient.

-C. M. Sperberg-McQueen
  Co-chair, W3C XML Schema Working Group

...........................................................

W3C XML Schema Working Group

Comments on RDF documents

ed. Charles Campbell, C. M. Sperberg-McQueen, Henry S. Thompson

10 March 2003

      _________________________________________________________________

      * 1. [1]Notes on RDF Primer
           + 1.1. [2]Design question, complexity (substantive)
           + 1.2. [3]Whitespace handling (schema-related)
      * 2. [4]Notes on RDF Concepts and Abstract Syntax
           + 2.1. [5]Mapping from lexical forms to values (schema-related,
             terminological)
           + 2.2. [6]Values without lexical forms (schema-related,
             important)
           + 2.3. [7]Lexical forms, strings, and character sequences
             (schema-related, editorial)
           + 2.4. [8]Strings for natural-language data (substantive)
           + 2.5. [9]Typos and minor editorial notes
      * 3. [10]Notes on RDF Semantics
           + 3.1. [11]The "meaning" of literals (editorial)
           + 3.2. [12]Types as lexical mappings (schema-related)
           + 3.3. [13]Miscellaneous editorial notes
      * 4. [14]Notes on RDF/XML Syntax Specification (Revised)
           + 4.1. [15]Manifest typing in the instance (policy)
           + 4.2. [16]QNames (Editorial, but important)
           + 4.3. [17]Miscellaneous editorial notes
           + 4.4. [18]Normative specification of XML grammar (policy,
             substantive)
           + 4.5. [19]On the relation between RDF and off-the-shelf XML
             tools (policy, substantive)

      _________________________________________________________________

    NOTE:
    [These notes have been considered and approved by the W3C XML Schema
    Working Group, and are transmitted to the RDF Core Working Group as
    comments on the last-call drafts of various RDF-related documents.]
    $Id: xml-schema-rdf-notes.html,v 1.11 2003/03/10 21:31:34 cmsmcq Exp $
    The XML Schema Working Group congratulates the RDF Core Working Group
    on progressing its several documents to Last Call; we apologize for
    the late submission of these comments, and hope that they prove
    helpful.
    Our comments include some which bear directly on the use of XML
    Schema's simple types by RDF, to which we believe you wished us to
    give particular attention. In the text which follows, these are
    labeled "schema-related". Some other comments, in contrast, relate to
    important and difficult technical and policy questions relating to
    language design and tool usage; these are labeled "policy". We hope
    that you will give these comments very serious consideration, but we
    do not pretend to any special standing in raising them, other than as
    representative members of the XML community at large. Finally, there
    are some other questions which are not directly related to XML Schema
    or to XML in general, and for which we therefore pretend to no
    particular expertise or standing, but which we happened to notice and
    which we call to your attention, as any technically minded reader
    might do, in the hopes that doing so may be useful to you; these are
    labeled "substantive" or "editorial" as the case might be.

1. Notes on RDF Primer

    RDF Primer, section 2.4 Typed literals
    [20]http://www.w3.org/TR/rdf-primer/#typedliterals

      [20] http://www.w3.org/TR/rdf-primer/#typedliterals

1.1. Design question, complexity (substantive)

    The introduction of pairs consisting of a lexical form and a type (or,
    strictly speaking, a lexical form and a type label) seems at first
    glance to complicate the RDF model somewhat. We have had the
    impression that in other parts of RDF, typing is handled by adding
    further arcs and nodes. If the type of a resource is identified by
    having an arc labeled rdf:type from it to (the URI of) its (RDF) type,
    and if the type of an arc is similarly identified by an arc, then
    surely a reason ought to be given for shifting to a different method
    for typing literal strings. It seems like a dramatic shift in the
    infrastructure of RDF, from "everything is a node, an arc, or a
    literal value" to "everything is a node, an arc, or a typed literal
    value". Perhaps not quite so dramatic, after all. But the question of
    design consistency remains: why not "everything is a typed node, a
    typed arc, or a typed literal"?

1.2. Whitespace handling (schema-related)

    Some members of the XML Schema WG have expressed concern that XML
    Schema's rules for whitespace handling may interfere with expected
    behavior in other contexts. This may be the appropriate place to bring
    this question up.
    In brief, XML Schema's simple types each define a whitespace facet,
    which governs the kind of whitespace pre-processing done by an XML
    Schema processor before the lexical form is checked for type validity.
    Since the point of whitespace normalization is to simplify subsequent
    processing, the lexical spaces of XML Schema's simple types are (like
    those in many programming languages) defined without reference to the
    preceding whitespace normalization. Integers, for example, are
    represented by sequences of decimal digits; sequences containing
    blanks are not legal lexical forms for integers. Indeed, strictly
    speaking it is only after the whitespace pre-processing is done that
    the XML Schema processor can be said to be working with a lexical form
    at all.
    For example, the integer type has a value of collapse for the
    whitespace facet, which means leading and trailing whitespace is
    stripped, and internal whitespace sequences are reduced to a single
    blank (x20) character. In an XML document in which the element
    exterms:age is defined as having type xs:integer, the following
    instances of exterms:age will all be type-valid:

      <exterms:age>27</exterms:age>
      <exterms:age>
        27
      </exterms:age>
      <exterms:age>   27  </exterms:age>
      <exterms:age>   2<!--* ha, ha, fooled your full-text indexer!
      *-->7  </exterms:age>

    The input information set, in each case, contains a character
    information item for "2" followed by a character information item for
    "7", with character information items for whitespace characters, and a
    comment information item, present in some of the examples. In all
    cases, the lexical form proper is the character sequence "27" (i.e.
    the sequence of characters after white space handling, and ignoring
    comments, processing instructions, entity boundaries, and other
    distractions). This is a legal lexical form for an integer, so all the
    examples are type valid.
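    The extraction step described above can be sketched roughly as
    follows. This is only an illustration, not a conforming XML Schema
    processor: the namespace URI bound to the exterms prefix, the
    shortened comment text, and the function names are assumptions made
    for the example.

```python
import re
import xml.etree.ElementTree as ET

# The namespace URI is an assumption made for this sketch.
NS = 'xmlns:exterms="http://www.example.org/terms/"'

SAMPLES = [
    f"<exterms:age {NS}>27</exterms:age>",
    f"<exterms:age {NS}>\n  27\n</exterms:age>",
    f"<exterms:age {NS}>   27  </exterms:age>",
    f"<exterms:age {NS}>   2<!--* ha, ha *-->7  </exterms:age>",
]


def collapse(text):
    """Apply whiteSpace=collapse: runs of XML whitespace (space, tab, LF,
    CR) become one space, then leading and trailing spaces are removed."""
    return re.sub(r"[ \t\n\r]+", " ", text).strip(" ")


def lexical_form(xml_source):
    """Gather the character data of the element (comments contribute
    nothing) and collapse whitespace to obtain the lexical form."""
    element = ET.fromstring(xml_source)
    return collapse("".join(element.itertext()))


forms = [lexical_form(sample) for sample in SAMPLES]
```

    Under these assumptions every sample yields the same lexical form,
    the two-character sequence "27".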
    Some members of the XML Schema WG have worried that it may not be
    obvious that the whitespace processing is not part of the process of
    checking lexical forms for type validity, but part of the process of
    extracting the lexical forms from the XML information set presented to
    the processor. If an RDF document contains

      <exterms:age>   27  </exterms:age>

    and a processor hands the contents of the element to a generic
    type-checker for XML Schema's simple types, saying in effect "this
    purports to be the lexical form of an integer; is that OK?", that type
    checker will be required (if it conforms to the XML Schema spec's
    definition of the simple types) to say "no, the character sequence
    `   27  ' is not a legal lexical form for an integer."
    It's not clear whether RDF, being type-system neutral, can directly
    address this concern (e.g. by specifying that an RDF processor should
    do the appropriate whitespace pre-processing, or by warning users that
    they should not include vagrant whitespace in typed literals), or
    whether it suffices for developers of RDF software with built-in
    support for XML Schema's simple types to deal with it, e.g. by
    performing the whitespace pre-processing themselves before handing
    the resulting lexical form to a type checker.
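    A minimal sketch of the second option, in which the RDF software
    performs the pre-processing itself, might look like this. The
    function names are hypothetical, and the checker covers only the
    integer type; the three facet values (preserve, replace, collapse)
    are those defined by XML Schema.

```python
import re

# Lexical forms of an integer: an optional sign followed by decimal digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def is_integer_lexical_form(candidate):
    """Strict check: no whitespace handling is done here."""
    return INTEGER_LEXICAL.fullmatch(candidate) is not None


def normalize(raw, facet):
    """Whitespace pre-processing according to a whiteSpace facet value."""
    if facet == "preserve":
        return raw
    replaced = re.sub(r"[\t\n\r]", " ", raw)  # tab, LF, CR become spaces
    if facet == "replace":
        return replaced
    if facet == "collapse":
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet value: {facet!r}")


raw = "   27  "
rejected_as_is = not is_integer_lexical_form(raw)
accepted_after = is_integer_lexical_form(normalize(raw, "collapse"))
```

    The strict checker rejects the raw element content and accepts the
    same content once the collapse step has been applied by the caller.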
    As noted, some members of our WG feel that you need to be alerted to
    this as a possible source of confusion and unexpected results. Other
    members of the WG feel that it verges on disrespect to assume that you
    need instruction on this point. We compromised by agreeing to point
    out the issue to you, and to leave you to draw your own conclusions.

2. Notes on RDF Concepts and Abstract Syntax

2.1. Mapping from lexical forms to values (schema-related, terminological)

    In [21]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:

      [21] http://www.w3.org/TR/rdf-concepts/#section-Datatypes

      A datatype mapping is a set of pairs whose first element belongs to
      the lexical space of the datatype, and the second element belongs
      to the value space of the datatype:

    We agree that it is useful to define a term to denote such mappings;
    in the interests of inter-specification consistency, we wonder whether
    you would be willing to consider using the term lexical mapping, which
    we are introducing in our forthcoming draft of XML Schema 1.1. The
    term datatype mapping seems unlikely to be usable in the XML Schema
    specification, where it would suggest to some readers a mapping from
    one datatype to another, rather than as here a mapping from lexical
    space to value space. (XML Schema 1.0 got by without a term for this
    concept.)
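
    As an illustration of the "set of pairs" reading, the mapping for
    the boolean type can be written out in full. The helper names below
    are hypothetical; only the four lexical forms and two values come
    from the definition of the boolean type in XML Schema.

```python
# The lexical mapping of the boolean type as (lexical form, value) pairs.
BOOLEAN_MAPPING = {
    ("true", True),
    ("1", True),
    ("false", False),
    ("0", False),
}


def lexical_space(mapping):
    """First elements of the pairs: the domain of the mapping."""
    return {lexical for lexical, _ in mapping}


def value_space(mapping):
    """Second elements of the pairs: the range of the mapping."""
    return {value for _, value in mapping}


def to_value(mapping, lexical_form):
    """Look up the value paired with a lexical form."""
    for lexical, value in mapping:
        if lexical == lexical_form:
            return value
    raise ValueError(f"not in the lexical space: {lexical_form!r}")
```

    Several lexical forms may be paired with one value, but each lexical
    form is paired with exactly one value.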

2.2. Values without lexical forms (schema-related, important)

    In [22]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:

      [22] http://www.w3.org/TR/rdf-concepts/#section-Datatypes

      * Each member of the value space may be paired with any number
        (including zero) of members of the lexical space (lexical
        representations for that value).

    The provision for values without corresponding lexical forms
    contradicts an assumption to which the XML Schema spec appeals from
    time to time. The lexical space of any simple datatype in XML Schema
    is the domain of the type's lexical mapping; the value space is its
    range. There are no meaningless lexical forms in the lexical space of
    the type, nor are there ineffable values in the value space. By
    eliminating values from the value space (e.g. by setting minimal and
    maximal values), the type definer may indirectly also eliminate
    lexical forms from the lexical space; conversely, by eliminating some
    items from the lexical space (e.g. by setting a pattern), the type
    definer may eliminate items from the value space.
    Are there crucial aspects of RDF which will break if the list item
    quoted above is changed to read "paired with one or more members of
    the lexical space"?
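
    The difference between the two formulations can be made concrete
    with a small check. This is a sketch only; the function name and the
    toy datatype are hypothetical and do not correspond to any type
    defined by XML Schema or RDF.

```python
def ineffable_values(value_space, lexical_mapping):
    """Return the values that no lexical form is paired with.

    lexical_mapping is a set of (lexical form, value) pairs.
    """
    mapped = {value for _, value in lexical_mapping}
    return set(value_space) - mapped


# A toy datatype permitted by the "zero or more" wording: the value 3
# belongs to the value space but has no lexical form.
TOY_VALUE_SPACE = {1, 2, 3}
TOY_MAPPING = {("1", 1), ("01", 1), ("2", 2)}

missing = ineffable_values(TOY_VALUE_SPACE, TOY_MAPPING)
```

    With "one or more" in place of "zero or more", the value space is
    exactly the range of the mapping and the check returns an empty set
    for every datatype.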

2.3. Lexical forms, strings, and character sequences (schema-related,
editorial)

    In [23]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:

      [23] http://www.w3.org/TR/rdf-concepts/#section-Datatypes

      With one exception, the datatypes used in RDF have a lexical space
      consisting of a set of strings.

    Since "string" is used as the local name for a particular simple type
    in the XML Schema namespace, we believe it will be less confusing for
    users, in the long run, if the lexical representations of
    simple-datatype values are described not as "strings" but as
    "character sequences".
    This comment also applies to other uses of the term string to denote
    the members of a lexical space.

2.4. Strings for natural-language data (substantive)

    In [24]http://www.w3.org/TR/rdf-concepts/#section-Datatypes:

      [24] http://www.w3.org/TR/rdf-concepts/#section-Datatypes

      * A plain literal is a string combined with an optional language
        identifier. This should be used for plain text in a natural
        language. As recommended in the RDF formal semantics
        [RDF-SEMANTICS], these plain literals are self-denoting.

    We do not believe that simple strings are likely to be adequate for
    the representation of arbitrary natural-language text. Even in
    English, natural-language utterances (such as this document) may need
    some degree of inline markup for clarity and adequate presentation; in
    natural-language utterances requiring bidirectional display or ruby,
    the best authorities (including the W3C I18n Working Group) recommend
    the use of markup within the natural-language utterance. We thus
    suggest that you may wish to moderate this recommendation that
    natural-language material be represented by literals.
    This is not an area in which we claim particular technical expertise;
    we merely call it to your attention in the hopes that doing so may be
    useful to you.

2.5. Typos and minor editorial notes

    In [25]http://www.w3.org/TR/rdf-concepts/#section-Literal-Value, for
    "the datatype mapping is applied to the pair form by the lexical form
    and the language identifier" read "the datatype mapping is applied to
    the pair formed by the lexical form and the language identifier".
    In the same section, for "Such a case, while in error, is not
    syntacticly ill-formed " read "Such a case, while in error, is not
    syntactically ill-formed" (et passim).
    In section [26]http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral,
    for "root element tag" read "root element".
    In the same section, for "XML element content" read "XML data" (the
    term element content is used in some markup-related specs as a
    complement of mixed content to denote the content of elements which
    can contain other elements but cannot contain parsed character data).

      [25] http://www.w3.org/TR/rdf-concepts/#section-Literal-Value
      [26] http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral

3. Notes on RDF Semantics

3.1. The "meaning" of literals (editorial)

      The meaning of a literal is principally determined by its character
      string: it either refers to the value mapped from the string by the
      associated datatype, or if no datatype is provided then it refers
      to the literal itself, which is either a unicode character string
      or a pair of a string with a language tag.

    Some members of the XML Schema WG are made nervous by the appeal to
    the notion of "meaning" here. [N.B. our task force read this section
    out of context, and were not aware of any foregoing elucidation. So
    this comment may be out of place.] There is also some concern about
    the apparent conflation here of the notions of meaning and reference.
    We wonder whether this discussion would be weakened by replacing
    references to meaning and reference by references to denotation; we
    are inclined to think it would be an improvement, but recognize that
    the RDF Core WG's views may differ.

3.2. Types as lexical mappings (schema-related)

      A datatype is an entity characterized by a set of character strings
      called lexical forms and a mapping from that set to a set of
      values.

    We have a couple of reservations concerning this characterization.
      * Elsewhere (e.g. in Concepts and Abstract Syntax, section 3.3,
        [27]http://www.w3.org/TR/rdf-concepts/#section-Datatypes), the RDF
        specs say that there may be values in a value space which are not
        in the range of the lexical mapping; we have suggested that if
        possible those statements should be changed, but if they are
        retained, then a datatype cannot be characterized solely by the
        lexical space and the lexical mapping, because such ineffable
        values appear in neither of these.
      * The statement describes (with the exception of the problem just
        noted) simple datatypes, but not the class of complex datatypes
        which can be defined by XML Schema, nor all the types (or
        type-like constructs) definable in various other schema languages
        for XML.

      [27] http://www.w3.org/TR/rdf-concepts/#section-Datatypes

3.3. Miscellaneous editorial notes

    In [28]http://www.w3.org/TR/rdf-mt/#dtype_interp, for "which we will
    refer to as XSD and use the Qname prefix xsd:" read "which we will
    refer to as XSD and denote using the Qname prefix xsd" (or something
    similar).
    In [29]http://www.w3.org/TR/rdf-mt/#dtype_interp:

      [28] http://www.w3.org/TR/rdf-mt/#dtype_interp
      [29] http://www.w3.org/TR/rdf-mt/#dtype_interp

      For example, XML Schema requires that the value spaces of
      xsd:string and xsd:decimal to be disjoint ...

    This sentence is not exactly wrong, but it seems slightly unusual to
    use the verb require here, instead of define or something similar. We
    suggest recasting this as "For example, XML Schema defines the value
    spaces of xsd:string and xsd:decimal as disjoint ..." (Note, for the
    record, that the value spaces of all the primitive simple datatypes of
    XML Schema 1.0 are pairwise disjoint.)
    In ,

      any literal of the form "sss"@ttt^^ddd, where ddd is not
      rdf:XMLLiteral, treated as identical to the same literal without
      the language tag, "sss"@ddd

    is "sss"@ddd a typo for "sss"^^ddd?
    In [30]http://www.w3.org/TR/rdf-mt/#dtype_entail, for "it is valid to
    add any number of leading zeros to any numeral and still be a correct
    lexical form for xsd:integer", perhaps read "it is possible to add any
    number of leading zeros to any lexical form for xsd:integer without it
    ceasing to be a correct lexical form for xsd:integer".

      [30] http://www.w3.org/TR/rdf-mt/#dtype_entail
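
    The point about leading zeros can be checked with a short sketch.
    The function names are hypothetical, and the pattern is a simplified
    rendering of the integer lexical space (an optional sign followed by
    decimal digits).

```python
import re

INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def integer_value(lexical_form):
    """Map an integer lexical form to its value, rejecting anything else."""
    if INTEGER_LEXICAL.fullmatch(lexical_form) is None:
        raise ValueError(f"not an integer lexical form: {lexical_form!r}")
    return int(lexical_form)


def with_leading_zeros(lexical_form, count):
    """Insert `count` zeros after the optional sign."""
    sign = lexical_form[0] if lexical_form[0] in "+-" else ""
    digits = lexical_form[len(sign):]
    return sign + "0" * count + digits


padded = with_leading_zeros("-42", 3)
same_value = integer_value(padded) == integer_value("-42")
```

    The padded form remains in the lexical space and is mapped to the
    same value as the original form.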

4. Notes on RDF/XML Syntax Specification (Revised)

    RDF/XML Syntax, [31]http://www.w3.org/TR/rdf-syntax-grammar/

      [31] http://www.w3.org/TR/rdf-syntax-grammar/

4.1. Manifest typing in the instance (policy)

      RDF allows Typed Literals to be given as the object node of arcs.
      These consist of a literal string (with optional language) and a
      datatype RDF URI Reference. This is handled ... with an additional
      rdf:datatype="datatypeURI" attribute on the property element.

    We believe there are probably good reasons for using an rdf:datatype
    attribute, instead of re-using the existing xsi:type attribute which
    has (when the type is defined in a schema defined by XML Schema 1.0)
    the same semantics. In particular, rdf:datatype does not assume or
    assert the existence of the type named as a type in a schema defined
    by XML Schema, so it would be problematic to use xsi:type.
    We do fear, however, that users are likely to find this
    near-duplication of the meaning and function of xsi:type confusing. It
    is not clear to us what, if anything, can or should be done to
    minimize this danger.
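
    For concreteness, the following sketch reads rdf:datatype from a
    small RDF/XML fragment. The fragment, its subject URI, and the
    function name are assumptions made for the example. The attribute
    value is taken as a bare URI reference; nothing in the code looks
    the type up in a schema, which is the difference from xsi:type noted
    above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Example document; the subject URI and exterms namespace are illustrative.
DOC = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/staffid/85740">
    <exterms:age rdf:datatype="{XSD_INTEGER}">27</exterms:age>
  </rdf:Description>
</rdf:RDF>
"""


def typed_literals(xml_source):
    """Return (element content, datatype URI reference) pairs for every
    element that carries an rdf:datatype attribute."""
    root = ET.fromstring(xml_source)
    found = []
    for element in root.iter():
        datatype = element.get(f"{{{RDF_NS}}}datatype")
        if datatype is not None:
            found.append((element.text or "", datatype))
    return found


literals = typed_literals(DOC)
```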

4.2. QNames (Editorial, but important)

    We were unable, on a first reading, to determine whether the default
    namespace declaration, and thus unprefixed names, were or were not
    allowed in order to encode 'RDF URI References'. Indeed the
    introductory prose about QNames (2nd para of
    [32]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro)
    does not seem to connect up with the relevant (?) production in
    [33]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar,
    which we take to be
    [34]http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference.
    This can and should be cleared up.

      [32] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro
      [33] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar
      [34] http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference

4.3. Miscellaneous editorial notes

    In
    [35]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements,
    the sentence

      [35] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements

      When an arc in an RDF Graph points to an object node which has no
      further arcs, which appears in RDF/XML as an empty node element
      sequence such as the pair <rdf:Description rdf:about="...">
      </rdf:Description>, this form can be shortened.

    seems less clear than it might be. Different readers prove to have
    different views on what is meant by "the pair <rdf:Description
    rdf:about="..."> </rdf:Description>"; perhaps it can be replaced by
    something like "the empty element <rdf:Description rdf:about="..."/>"
    without loss of precision? Perhaps the sentence could read

      When an arc in an RDF Graph points to an object node which has no
      further arcs, which appears in RDF/XML as an empty node element
      such as <rdf:Description rdf:about="..."/>, this form can be
      shortened.

4.4. Normative specification of XML grammar (policy, substantive)

    We note with admiration the excellent tutorial introduction to the
    striped syntax in Section 2
    [36]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax. We are
    less happy with the nature of the syntax, and with the approach taken
    to its normative statement
    [37]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar.
    As regards the syntax itself, we would much prefer to have seen a move
    to a single canonical syntax with much less variability. With respect,
    the current design suggests that the value of XML has been
    misunderstood. The range of alternative forms of expression provided
    for in the current design make it very difficult to use the broad
    range of generic XML tools (e.g. syntax-directed editors, XSLT) which
    could give so much benefit to RDF users. (More on this below.) At the
    very least we would encourage you to specify a single canonical form,
    probably strictly striped, which could be defined by an XML Schema or
    DTD. We would be happy to work with you to develop a schema for such a
    subset.
    As regards the approach taken to defining the syntax, in our view,
    layering of specs has very high value, and so defining an XML document
    type by way of what is very nearly a character-level BNF is at best a
    missed opportunity and at worst a serious mistake. It obscures the
    important aspects of the document type behind a welter of irrelevant
    detail about e.g. whitespace and start-tag/end-tag matching. It makes
    it very difficult for the reader to understand what is and is not
    allowed -- what an RDF/XML document actually looks like.
    Not only does this confuse levels and thus readers, it also runs the
    risk of inadvertently defining an XML subset. It also appears, on a
    strict reading, to rule out XML documents not derived from the parsing
    of character streams as possible RDF/XML (so that it would be
    illegitimate to regard a data structure created using a DOM interface,
    for example, as RDF/XML).
    The use of event-triggered data-model construction actions to specify
    the relationship between XML representation and corresponding data
    objects is innovative and compelling, but surely it would be
    straightforward to associate these events with a pre-order traversal
    of an infoset independently constrained by a DTD, XML Schema schema or
    other appropriate definition of the canonical document type. If
    continued support for alternative forms is considered essential, then
    a two-step approach where the semantics of any non-canonical form is
    defined in terms of a canonical form to which it corresponds would
    still be far simpler than the current approach.

      [36] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax
      [37] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar
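
    The first step of such a two-step approach might be sketched as a
    rewrite of one abbreviated form into the striped form. This is a
    partial illustration under stated assumptions, not a proposed
    canonicalization: it handles only property attributes on
    rdf:Description, leaves every attribute in the rdf and xml
    namespaces untouched, and uses a function name and a test fragment
    invented for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"
DC_NS = "http://purl.org/dc/elements/1.1/"


def expand_property_attributes(root):
    """Rewrite property attributes on rdf:Description elements as child
    property elements, modifying the tree in place."""
    # Take a snapshot first: the tree is modified inside the loop.
    for node in list(root.iter(f"{{{RDF_NS}}}Description")):
        for name, value in list(node.attrib.items()):
            if not name.startswith("{"):
                continue  # unqualified attribute: not handled in this sketch
            if name.startswith(f"{{{RDF_NS}}}") or name.startswith(f"{{{XML_NS}}}"):
                continue  # rdf:about, rdf:ID, xml:lang and similar stay put
            del node.attrib[name]
            ET.SubElement(node, name).text = value
    return root


ABBREVIATED = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.org/doc" dc:title="A title"/>
</rdf:RDF>
"""

tree = expand_property_attributes(ET.fromstring(ABBREVIATED))
description = tree.find(f"{{{RDF_NS}}}Description")
```

    After the rewrite the title is carried by a child element, and the
    semantics need be stated only for that one form.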

4.5. On the relation between RDF and off-the-shelf XML tools (policy,
substantive)

    With some diffidence, we conclude by raising what may be a sensitive
    issue.
    It does not seem to us that the XML serialization of RDF shows RDF to
    advantage. At the level of the underlying graph model, RDF information
    has a simple and regular structure, which appears in the XML
    serialization to be anything but simple and so irregular as to bring
    the words "capricious" and "arbitrary" to the lips of unprejudiced
    observers. Tastes in markup style differ, but we believe that the root
    of the problem is the high degree of variability with which the same
    underlying graph structures may be serialized, according to the rules
    given in this document.
    Owing in part to the variability itself, and in part to the specific
    forms taken by that variability, it is not feasible to write an XML
    Schema schema, or (if the comments in Appendix A.1 are accurate) a
    Relax NG schema, or an XML 1.0 DTD, which defines the set of correct
    serializations of correct RDF graphs. It is not convenient to run XSLT
    processes over arbitrary RDF serializations, nor to query or process
    arbitrary RDF data using XQuery. Arbitrary RDF data is similarly
    inconvenient for other standard XML tools to process.
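    A small sketch may make the inconvenience concrete. The two
    fragments below are intended to serialize the same statement, once
    with a property element and once with a property attribute; the
    subject URI, the title text, and the function name are invented for
    the example. A path expression written against the first form finds
    nothing in the second.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

DECLS = " ".join(f'xmlns:{p}="{uri}"' for p, uri in NAMESPACES.items())

# The same statement, serialized in two of the permitted ways.
AS_ELEMENT = f"""
<rdf:RDF {DECLS}>
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>A title</dc:title>
  </rdf:Description>
</rdf:RDF>
"""

AS_ATTRIBUTE = f"""
<rdf:RDF {DECLS}>
  <rdf:Description rdf:about="http://example.org/doc" dc:title="A title"/>
</rdf:RDF>
"""


def titles_by_path(xml_source):
    """Collect titles with a path that assumes the property-element form."""
    root = ET.fromstring(xml_source)
    return [e.text for e in root.findall(".//dc:title", NAMESPACES)]


found_in_element_form = titles_by_path(AS_ELEMENT)
found_in_attribute_form = titles_by_path(AS_ATTRIBUTE)
```

    A stylesheet or query meant to work on arbitrary RDF/XML would have
    to anticipate each such variant separately.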
    There is, as a result, something of a cleft between the RDF community
    and the set of RDF tools on the one hand, and the community of users
    and tools employing what some have called colloquial XML on the other.
    The parallel development of query languages, schema languages, object
    models, APIs,
    editors, display tools, and so on does offer relatively harmless ways
    for a large number of people to employ their time, but it does not
    seem to us to serve the larger Web community well.
    The cleft between RDF and colloquial XML does not seem to us to be
    required by the RDF data model. A graph in which nodes have certain
    properties and arcs have certain properties is not, in itself, a
    peculiarly difficult structure to render in XML or to process with
    off-the-shelf XML tools. An XML vocabulary in which nodes may appear
    as elements, or as attributes, or as attribute values, or as the
    PCDATA content of elements, and in which property names may appear as
    three of the same four constructs, on the other hand, seems a rather
    less straightforward XML representation of the underlying graph
    structure than most XML vocabularies for graphs have chosen.
    The result is that not just arbitrary RDF data, but data encoded using
    vocabularies defined in RDF terms (for which current W3C work provides
    a number of examples), will be hard to process using off-the-shelf
    tools. We believe this difficulty represents a lost opportunity, and
    we believe the opportunity could readily be seized if the XML
    serialization were modified to capture more of the regularity of the
    RDF data model.
    We are ready to work together with the Working Groups in the Semantic
    Web Activity and with other interested parties to formulate an XML
    serialization which captures the information in the RDF model and
    which is more readily amenable to processing with off-the-shelf XML
    tools.

Received on Monday, 10 March 2003 16:35:17 UTC