- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Mon, 10 Mar 2003 14:33:35 -0700
- To: www-rdf-comments@w3.org
- Cc: W3C XML Schema IG <w3c-xml-schema-ig@w3.org>
Colleagues:

With apologies for the delay, I transmit to you herewith the comments of the XML Schema Working Group on the various RDF documents published in Last Call recently. We congratulate you on the progress of your work and hope our comments are useful to you. An HTML version of our comments may be found at http://www.w3.org/XML/Group/2003/03/xml-schema-rdf-notes.html and I append an ASCII-only version for those who find it more convenient.

-C. M. Sperberg-McQueen
Co-chair, W3C XML Schema Working Group

...........................................................

W3C XML Schema Working Group Comments on RDF documents

ed. Charles Campbell, C. M. Sperberg-McQueen, Henry S. Thompson
10 March 2003
_________________________________________________________________

* 1. [1]Notes on RDF Primer
  + 1.1. [2]Design question, complexity (substantive)
  + 1.2. [3]Whitespace handling (schema-related)
* 2. [4]Notes on RDF Concepts and Abstract Syntax
  + 2.1. [5]Mapping from lexical forms to values (schema-related, terminological)
  + 2.2. [6]Values without lexical forms (schema-related, important)
  + 2.3. [7]Lexical forms, strings, and character sequences (schema-related, editorial)
  + 2.4. [8]Strings for natural-language data (substantive)
  + 2.5. [9]Typos and minor editorial notes
* 3. [10]Notes on RDF Semantics
  + 3.1. [11]The "meaning" of literals (editorial)
  + 3.2. [12]Types as lexical mappings (schema-related)
  + 3.3. [13]Miscellaneous editorial notes
* 4. [14]Notes on RDF/XML Syntax Specification (Revised)
  + 4.1. [15]Manifest typing in the instance (policy)
  + 4.2. [16]QNames (Editorial, but important)
  + 4.3. [17]Miscellaneous editorial notes
  + 4.4. [18]Normative specification of XML grammar (policy, substantive)
  + 4.5.
[19]On the relation between RDF and off-the-shelf XML tools (policy, substantive) _________________________________________________________________ NOTE: [These notes have been considered and approved by the W3C XML Schema Working Group, and are transmitted to the RDF Core Working Group as comments on the last-call drafts of various RDF-related documents.] $Id: xml-schema-rdf-notes.html,v 1.11 2003/03/10 21:31:34 cmsmcq Exp $ The XML Schema Working Group congratulates the RDF Core Working Group on progressing its several documents to Last Call; we apologize for the late submission of these comments, and hope that they prove helpful. Our comments include some which bear directly on the use of XML Schema's simple types by RDF, to which we believe you wished us to give particular attention. In the text which follows, these are labeled "schema-related". Some other comments, in contrast, relate to important and difficult technical and policy questions relating to language design and tool usage; these are labeled "policy". We hope that you will give these comments very serious consideration, but we do not pretend to any special standing in raising them, other than as representative members of the XML community at large. Finally, there are some other questions which are not directly related to XML Schema or to XML in general, and for which we therefore pretend to no particular expertise or standing, but which we happened to notice and which we call to your attention, as any technically minded reader might do, in the hopes that doing so may be useful to you; these are labeled "substantive" or "editorial" as the case might be. 1. Notes on RDF Primer RDF Primer, section 2.4 Typed literals [20]http://www.w3.org/TR/rdf-primer/#typedliterals [20] http://www.w3.org/TR/rdf-primer/#typedliterals 1.1. 
Design question, complexity (substantive) The introduction of pairs consisting of a lexical form and a type (or, strictly speaking, a lexical form and a type label) seems at first glance to complicate the RDF model somewhat. We have had the impression that in other parts of RDF, typing is handled by adding further arcs and nodes. If the type of a resource is identified by having an arc labeled rdf:type from it to (the URI of) its (RDF) type, and if the type of an arc is similarly identified by an arc, then surely a reason ought to be given for shifting to a different method for typing literal strings. It seems like a dramatic shift in the infrastructure of RDF, from "everything is a node, an arc, or a literal value" to "everything is a node, an arc, or a typed literal value". Perhaps not quite so dramatic, after all. But the question of design consistency remains: why not "everything is a typed node, a typed arc, or a typed literal"? 1.2. Whitespace handling (schema-related) Some members of the XML Schema WG have expressed concern that XML Schema's rules for whitespace handling may interfere with expected behavior in other contexts. This may be the appropriate place to bring this question up. In brief, XML Schema's simple types each define a whitespace facet, which governs the kind of whitespace pre-processing done by an XML Schema processor before the lexical form is checked for type validity. Since the point of whitespace normalization is to simplify subsequent processing, the lexical spaces of XML Schema's simple types are (like those in many programming languages) defined without reference to the preceding whitespace normalization. Integers, for example, are represented by sequences of decimal digits; sequences containing blanks are not legal lexical forms for integers. Indeed, strictly speaking it is only after the whitespace pre-processing is done that the XML Schema processor can be said to be working with a lexical form at all. 
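The separation just described, between whitespace pre-processing and the check against the lexical space, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of XML Schema or of any RDF specification: the names collapse and is_integer_lexical_form are hypothetical, the regular expression is our own approximation of the integer lexical space (an optional sign followed by decimal digits), and the pre-processing shown (tabs, line feeds and carriage returns become blanks, runs of blanks become one blank, leading and trailing blanks are dropped) is assumed to be the behaviour that applies to integers.

```python
import re

# Assumption: approximation of the integer lexical space
# (optional sign, then one or more decimal digits, nothing else).
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def collapse(raw):
    """Hypothetical helper: whitespace pre-processing in the 'collapse' style.

    Tabs, line feeds and carriage returns are replaced by blanks, runs of
    blanks are reduced to a single blank, and leading and trailing blanks
    are removed.
    """
    replaced = re.sub(r"[\t\n\r]", " ", raw)
    return re.sub(r" +", " ", replaced).strip(" ")


def is_integer_lexical_form(candidate):
    """Hypothetical checker: knows nothing about whitespace handling."""
    return INTEGER_LEXICAL.fullmatch(candidate) is not None


raw_content = " 27 "
print(is_integer_lexical_form(raw_content))            # checked as received
print(is_integer_lexical_form(collapse(raw_content)))  # checked after pre-processing
```

In this sketch the checker rejects the raw element content and accepts the same content once the pre-processing step has been applied, which is the distinction at issue.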
For example, the integer type has a value of collapse for the whitespace facet, which means leading and trailing whitespace is stripped, and internal whitespace sequences are reduced to a single blank (x20) character. In an XML document in which the element exterms:age is defined as having type xs:integer, the following instances of exterms:age will all be type-valid: <exterms:age>27</exterms:age> <exterms:age> 27 </exterms:age> <exterms:age> 27 </exterms:age> <exterms:age> 2<!--* ha, ha, fooled your full-text indexer! *-->7 </exterms:age> The input information set, in each case, contains a character information item for "2" followed by a character information item for "7", with character information items for whitespace characters, and a comment information item, present in some of the examples. In all cases, the lexical form proper is the character sequence "27" (i.e. the sequence of characters after white space handling, and ignoring comments, processing instructions, entity boundaries, and other distractions). This is a legal lexical form for an integer, so all the examples are type valid. Some members of the XML Schema WG have worried that it may not be obvious that the whitespace processing is not part of the process of checking lexical forms for type validity, but part of the process of extracting the lexical forms from the XML information set presented to the processor. If an RDF document contains <exterms:age> 27 </exterms:age> and a processor hands the contents of the element to a generic type-checker for XML Schema's simple types, saying in effect "this purports to be the lexical form of an integer; is that OK?", that type checker will be required (if it conforms to the XML Schema spec's definition of the simple types) to say "no, the character sequence ` 27 ' is not a legal lexical form for an integer." It's not clear whether RDF, being type-system neutral, can directly address this concern (e.g. 
by specifying that an RDF processor should do the appropriate whitespace pre-processing, or by warning users that they should not include vagrant whitespace in typed literals), or whether it suffices for developers of RDF software with built-in support for XML Schema's simple types to deal with it, e.g. by performing it themselves before handing the resulting lexical form to a type checker. As noted, some members of our WG feel that you need to be alerted to this as a possible source of confusion and unexpected results. Other members of the WG feel that it verges on disrespect to assume that you need instruction on this point. We compromised by agreeing to point out the issue to you, and to leave you to draw your own conclusions. 2. Notes on RDF Concepts and Abstract Syntax 2.1. Mapping from lexical forms to values (schema-related, terminological) In [21]http://www.w3.org/TR/rdf-concepts/#section-Datatypes: [21] http://www.w3.org/TR/rdf-concepts/#section-Datatypes A datatype mapping is a set of pairs whose first element belongs to the lexical space of the datatype, and the second element belongs to the value space of the datatype: We agree that it is useful to define a term to denote such mappings; in the interests of inter-specification consistency, we wonder whether you would be willing to consider using the term lexical mapping, which we are introducing in our forthcoming draft of XML Schema 1.1. The term datatype mapping seems unlikely to be usable in the XML Schema specification, where it would suggest to some readers a mapping from one datatype to another, rather than as here a mapping from lexical space to value space. (XML Schema 1.0 got by without a term for this concept.) 2.2. 
Values without lexical forms (schema-related, important) In [22]http://www.w3.org/TR/rdf-concepts/#section-Datatypes: [22] http://www.w3.org/TR/rdf-concepts/#section-Datatypes * Each member of the value space may be paired with any number (including zero) of members of the lexical space (lexical representations for that value). The provision for values without corresponding lexical forms contradicts an assumption to which the XML Schema spec appeals from time to time. The lexical space of any simple datatype in XML Schema is the domain of the type's lexical mapping; the value space is its range. There are no meaningless lexical forms in the lexical space of the type, nor are there ineffable values in the value space. By eliminating values from the value space (e.g. by setting minimal and maximal values), the type definer may indirectly also eliminate lexical forms from the lexical space; conversely, by eliminating some items from the lexical space (e.g. by setting a pattern), the type definer may eliminate items from the value space. Are there crucial aspects of RDF which will break if the list item quoted above is changed to read "paired with one or more members of the lexical space"? 2.3. Lexical forms, strings, and character sequences (schema-related, editorial) In [23]http://www.w3.org/TR/rdf-concepts/#section-Datatypes: [23] http://www.w3.org/TR/rdf-concepts/#section-Datatypes With one exception, the datatypes used in RDF have a lexical space consisting of a set of strings. Since "string" is used as the local name for a particular simple type in the XML Schema namespace, we believe it will be less confusing for users, in the long run, if the lexical representations of simple-datatype values are described not as "strings" but as "character sequences". This comment also applies to other uses of the term string to denote the members of a lexical space. 2.4. 
Strings for natural-language data (substantive) In [24]http://www.w3.org/TR/rdf-concepts/#section-Datatypes: [24] http://www.w3.org/TR/rdf-concepts/#section-Datatypes * A plain literal is a string combined with an optional language identifier. This should be used for plain text in a natural language. As recommended in the RDF formal semantics [RDF-SEMANTICS], these plain literals are self-denoting. We do not believe that simple strings are likely to be adequate for the representation of arbitrary natural-language text. Even in English, natural-language utterances (such as this document) may need some degree of inline markup for clarity and adequate presentation; in natural-language utterances requiring bidirectional display or ruby, the best authorities (including the W3C I18n Working Group) recommend the use of markup within the natural-language utterance. We thus suggest that you may wish to moderate this recommendation that natural-language material be represented by literals. This is not an area in which we claim particular technical expertise; we merely call it to your attention in the hopes that doing so may be useful to you. 2.5. Typos and minor editorial notes In [25]http://www.w3.org/TR/rdf-concepts/#section-Literal-Value, for "the datatype mapping is applied to the pair form by the lexical form and the language identifier" read "the datatype mapping is applied to the pair formed by the lexical form and the language identifier". In the same section, for "Such a case, while in error, is not syntacticly ill-formed " read "Such a case, while in error, is not syntactically ill-formed" (et passim). In section [26]http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral, for "root element tag" read "root element". 
In the same section, for "XML element content" read "XML data" (the term element content is used in some markup-related specs as a complement of mixed content to denote the content of elements which can contain other elements but cannot contain parsed character data). [25] http://www.w3.org/TR/rdf-concepts/#section-Literal-Value [26] http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral 3. Notes on RDF Semantics 3.1. The "meaning" of literals (editorial) The meaning of a literal is principally determined by its character string: it either refers to the value mapped from the string by the associated datatype, or if no datatype is provided then it refers to the literal itself, which is either a unicode character string or a pair of a string with a language tag. Some members of the XML Schema WG are made nervous by the appeal to the notion of "meaning" here. [N.B. our task force read this section out of context, and were not aware of any foregoing elucidation. So this comment may be out of place.] There is also some concern about the apparent conflation here of the notions of meaning and reference. We wonder whether this discussion would be weakened by replacing references to meaning and reference by references to denotation; we are inclined to think it would be an improvement, but recognize that the RDF Core WG's views may differ. 3.2. Types as lexical mappings (schema-related) A datatype is an entity characterized by a set of character strings called lexical forms and a mapping from that set to a set of values. We have a couple of reservations concerning this characterization. * Elsewhere (e.g. 
in Concepts and Abstract Syntax, section 3.3, [27]http://www.w3.org/TR/rdf-concepts/#section-Datatypes), the RDF specs say that there may be values in a value space which are not in the range of the lexical mapping; we have suggested that if possible those statements should be changed, but if they are retained, then a datatype cannot be characterized solely by the lexical space and the lexical mapping, because such ineffable values appear in neither of these. * The statement describes (with the exception of the problem just noted) simple datatypes, but not the class of complex datatypes which can be defined by XML Schema, nor all the types (or type-like constructs) definable in various other schema languages for XML. [27] http://www.w3.org/TR/rdf-concepts/#section-Datatypes 3.3. Miscellaneous editorial notes In [28]http://www.w3.org/TR/rdf-mt/#dtype_interp, for "which we will refer to as XSD and use the Qname prefix xsd:" read "which we will refer to as XSD and denote using the Qname prefix xsd" (or something similar). In [29]http://www.w3.org/TR/rdf-mt/#dtype_interp: [28] http://www.w3.org/TR/rdf-mt/#dtype_interp [29] http://www.w3.org/TR/rdf-mt/#dtype_interp For example, XML Schema requires that the value spaces of xsd:string and xsd:decimal to be disjoint ... This sentence is not exactly wrong, but it seems slightly unusual to use the verb require here, instead of define or something similar. We suggest recasting this as "For example, XML Schema defines the value spaces of xsd:string and xsd:decimal as disjoint ..." (Note, for the record, that the value spaces of all the primitive simple datatypes of XML Schema 1.0 are pairwise disjoint.)

In ,

  any literal of the form "sss"@ttt^^ddd, where ddd is not rdf:XMLLiteral, treated as identical to the same literal without the language tag, "sss"@ddd

is "sss"@ddd a typo for "sss"^^ddd?
In [30]http://www.w3.org/TR/rdf-mt/#dtype_entail, for "it is valid to add any number of leading zeros to any numeral and still be a correct lexical form for xsd:integer", perhaps read "it is possible to add any number of leading zeros to any lexical form for xsd:integer without it ceasing to be a correct lexical form for xsd:integer" [30] http://www.w3.org/TR/rdf-mt/#dtype_entail 4. Notes on RDF/XML Syntax Specification (Revised) RDF/XML Syntax, [31]http://www.w3.org/TR/rdf-syntax-grammar/ [31] http://www.w3.org/TR/rdf-syntax-grammar/ 4.1. Manifest typing in the instance (policy) RDF allows Typed Literals to be given as the object node of arcs. These consist of a literal string (with optional language) and a datatype RDF URI Reference. This is handled ... with an additional rdf:datatype="datatypeURI" attribute on the property element. We believe there are probably good reasons for using an rdf:datatype attribute, instead of re-using the existing xsi:type attribute which has (when the type is defined in a schema defined by XML Schema 1.0) the same semantics. In particular, rdf:datatype does not assume or assert the existence of the type named as a type in a schema defined by XML Schema, so it would be problematic to use xsi:type. We do fear, however, that users are likely to find this near-duplication of the meaning and function of xsi:type confusing. It is not clear to us what, if anything, can or should be done to minimize this danger. 4.2. QNames (Editorial, but important) We were unable, on a first reading, to determine whether the default namespace declaration, and thus unprefixed names, were or were not allowed in order to encode 'RDF URI References'. Indeed the introductory prose about QNames (2nd para of [32]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro) does not seem to connect up with the relevant (?) 
production in [33]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar, which we take to be [34]http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference. This can and should be cleared up. [32] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-intro [33] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar [34] http://www.w3.org/TR/rdf-syntax-grammar/#URI-reference 4.3. Miscellaneous editorial notes In [35]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements, the sentence [35] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax-empty-property-elements When an arc in an RDF Graph points to an object node which has no further arcs, which appears in RDF/XML as an empty node element sequence such as the pair <rdf:Description rdf:about="..."> </rdf:Description>, this form can be shortened. seems less clear than it might be. Different readers prove to have different views on what is meant by "the pair <rdf:Description rdf:about="..."> </rdf:Description>"; perhaps it can be replaced by something like "the empty element <rdf:Description rdf:about="..."/>" without loss of precision? Perhaps the sentence could read When an arc in an RDF Graph points to an object node which has no further arcs, which appears in RDF/XML as an empty node element such as <rdf:Description rdf:about="..."/>, this form can be shortened. 4.4. Normative specification of XML grammar (policy, substantive) We note with admiration the excellent tutorial introduction to the striped syntax in Section 2 [36]http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax. We are less happy with the nature of the syntax, and with the approach taken to its normative statement [37]http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar. As regards the syntax itself, we would much prefer to have seen a move to a single canonical syntax with much less variability. 
With respect, the current design suggests that the value of XML has been misunderstood. The range of alternative forms of expression provided for in the current design makes it very difficult to use the broad range of generic XML tools (e.g. syntax-directed editors, XSLT) which could give so much benefit to RDF users. (More on this below.) At the very least we would encourage you to specify a single canonical form, probably strictly striped, which could be defined by an XML Schema or DTD. We would be happy to work with you to develop a schema for such a subset. As regards the approach taken to defining the syntax, in our view, layering of specs has very high value, and so defining an XML document type by way of what is very nearly a character-level BNF is at best a missed opportunity and at worst a serious mistake. It obscures the important aspects of the document type behind a welter of irrelevant detail about e.g. whitespace and start-tag/end-tag matching. It makes it very difficult for the reader to understand what is and isn't actually allowed -- what an RDF/XML document actually looks like. Not only does this confuse levels and thus readers, it also runs the risk of inadvertently defining an XML subset. It also appears, on a strict reading, to rule out XML documents not derived from the parsing of character streams as possible RDF/XML (so that it would be illegitimate to regard a data structure created using a DOM interface, for example, as RDF/XML). The use of event-triggered data-model construction actions to specify the relationship between XML representation and corresponding data objects is innovative and compelling, but surely it would be straightforward to associate these events with a pre-order traversal of an infoset independently constrained by a DTD, XML Schema schema or other appropriate definition of the canonical document type. 
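To make the last suggestion concrete, here is a minimal Python sketch of a pre-order traversal over a tree that was built through an API rather than parsed from a character stream. It is an illustration only, not a proposal for the normative text: the function name preorder, the example namespace http://example.org/terms# and the resource URI are all hypothetical, and the standard xml.etree.ElementTree module merely stands in for whatever infoset or DOM representation an implementation might use.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/terms#"  # hypothetical vocabulary namespace


def preorder(element):
    """Hypothetical helper: yield an element, then its descendants, in pre-order."""
    yield element
    for child in element:
        yield from preorder(child)


# Build the tree through the API; no character stream is ever parsed.
root = ET.Element("{%s}RDF" % RDF)
node = ET.SubElement(
    root,
    "{%s}Description" % RDF,
    {"{%s}about" % RDF: "http://example.org/a"},
)
prop = ET.SubElement(node, "{%s}age" % EX)
prop.text = "27"

# Each visited element is a point at which a data-model construction
# action could be triggered.
visited = [element.tag for element in preorder(root)]
for tag in visited:
    print(tag)
```

Nothing in the traversal depends on tags, whitespace between tags, or any other character-level detail, which is the kind of layering we have in mind.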
If continued support for alternative forms is considered essential, then a two-step approach where the semantics of any non-canonical form is defined in terms of a canonical form to which it corresponds would still be far simpler than the current approach. [36] http://www.w3.org/TR/rdf-syntax-grammar/#section-Syntax [37] http://www.w3.org/TR/rdf-syntax-grammar/#section-Infoset-Grammar 4.5. On the relation between RDF and off-the-shelf XML tools (policy, substantive) With some diffidence, we conclude by raising what may be a sensitive issue. It does not seem to us that the XML serialization of RDF shows RDF to advantage. At the level of the underlying graph model, RDF information has a simple and regular structure, which appears in the XML serialization to be anything but simple and so irregular as to bring the words "capricious" and "arbitrary" to the lips of unprejudiced observers. Tastes in markup style differ, but we believe that the root of the problem is the high degree of variability with which the same underlying graph structures may be serialized, according to the rules given in this document. Owing in part to the variability itself, and in part to the specific forms taken by that variability, it is not feasible to write an XML Schema schema, or (if the comments in Appendix A.1 are accurate) a Relax NG schema, or an XML 1.0 DTD, which defines the set of correct serializations of correct RDF graphs. It is not convenient to run XSLT processes over arbitrary RDF serializations, nor to query or process arbitrary RDF data using XQuery. Arbitrary RDF data is similarly inconvenient for other standard XML tools to process. There is, as a result, something of a cleft between the RDF community and the set of RDF tools on the one hand, and the community of users and tools employing what some have called colloquial XML. 
The parallel development of query languages, schema languages, object models, APIs, editors, display tools, and so on does offer relatively harmless ways for a large number of people to employ their time, but it does not seem to us to serve the larger Web community well. The cleft between RDF and colloquial XML does not seem to us to be required by the RDF data model. A graph in which nodes have certain properties and arcs have certain properties is not, in itself, a peculiarly difficult structure to render in XML or to process with off-the-shelf XML tools. An XML vocabulary in which nodes may appear as elements, or as attributes, or as attribute values, or as the PCDATA content of elements, and in which property names may appear as three of the same four constructs, on the other hand, seems a rather less straightforward XML representation of the underlying graph structure than most XML vocabularies for graphs have chosen. The result is that not just arbitrary RDF data, but data encoded using vocabularies defined in RDF terms (for which current W3C work provides a number of examples), will be hard to process using off-the-shelf tools. We believe this difficulty represents a lost opportunity, and we believe the opportunity could readily be seized if the XML serialization were modified to capture more of the regularity of the RDF data model. We are ready to work together with the Working Groups in the Semantic Web Activity and with other interested parties to formulate an XML serialization which captures the information in the RDF model and which is more readily amenable to processing with off-the-shelf XML tools.
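By way of illustration of the variability discussed in this section, the following Python sketch shows two serializations that we understand to describe the same statement, once with the property as a child element and once as an attribute, together with a simple path query of the kind an off-the-shelf tool might run. The vocabulary namespace http://example.org/terms#, the resource URI, the literal "Alice" and the function name names_via_child_elements are all hypothetical, and xml.etree.ElementTree stands in here for XSLT or XQuery.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/terms#",  # hypothetical vocabulary namespace
}

# The property written as a child element of the node element.
AS_ELEMENT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/a">
    <ex:name>Alice</ex:name>
  </rdf:Description>
</rdf:RDF>"""

# The same property written as an attribute of the node element.
AS_ATTRIBUTE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/a" ex:name="Alice"/>
</rdf:RDF>"""


def names_via_child_elements(document):
    """Hypothetical query: collect ex:name values written as child elements."""
    root = ET.fromstring(document)
    return [e.text for e in root.findall("rdf:Description/ex:name", NS)]


print(names_via_child_elements(AS_ELEMENT))
print(names_via_child_elements(AS_ATTRIBUTE))
```

A query written against the first form finds nothing in the second, so anyone processing arbitrary RDF/XML with generic tools would have to anticipate every permitted form in every query or stylesheet.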
Received on Monday, 10 March 2003 16:35:17 UTC