- From: by way of <jjc@hplb.hpl.hp.com>
- Date: Wed, 26 Sep 2001 09:40:27 -0400
- To: w3c-rdfcore-wg@w3.org
[I have no idea why this got caught in the spam filter -rrs]

Date: Wed, 26 Sep 2001 08:34:00 -0400 (EDT)
Message-ID: <3BB1CC22.B2AB2455@hplb.hpl.hp.com>
From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
To: w3c-rdfcore-wg@w3.org

I attach a new draft of the literals text.

I am hoping that this text may be agreed upon on Friday. If there are more comments before then I will have another attempt at consolidation.

Notes:
Paragraph numbering is unchanged, so that some new paragraphs have numbers like [1a].
Accepted changes from Graham and myself.
Para [1a], [9] made normalization of Literals and RDF/XML mandatory.
Para [8] clarified context of phone number. (cf. Aaron's msg).
Para [13] merged into [12] (in brackets).
Para [14] added ref.
Para [43] and [44] changed significantly, as in (this allows greater freedom in which substitutions can be made).
Para [44a] added as clarification of [43] and [44].

An extra comment from me ....

I have agreed with Graham's suggestion of MUST-ing W3C normalization of RDF/XML documents, but this leaves us in the strange position of having to specify the behaviour of RDF/XML processors when they meet illegal input as well as legal input. In general, I think we should only specify the semantics of legal input.

The obligation to specify behaviour on illegal input arises from CHARMOD's concept of early uniform normalization, and CHARMOD's requirement on spec writers to mandate early uniform normalization. Here the responsibility to normalize is explicitly at the beginning of a process and not elsewhere. Thus while an RDF processor can reject a non-normalized literal, it shouldn't normalize it.

The scenario is this: an illegal document that has not been normalized is processed both by a lazy system that doesn't notice that it hasn't been normalized, and by a better system. The second system must either get an error or the same result as the lazy system.
This is partly motivated by a belief that it is not possible to require all software to do internationalization properly.

Jeremy

[1] An RDF Literal is a Unicode string, optionally paired with a language tag (as defined in RFC3066).

[1a] The Unicode string in an RDF Literal is normalized according to Unicode Normalization Form C [NFC, NFC-Corrigendum], using a framework of early uniform normalization.

[2] Future versions of RDF may migrate to a more general mechanism for literal representation in which the current representation would be embedded. One candidate is that an RDF literal would be a pair of a Unicode string and a URI reference. The current literals would be embedded within this new representation using a well-known URI as a base for all language tag URIs.

[3] NOTE: The RDF Core Working Group has yet to consider whether such an approach would be useful for integrating XML schema datatyping with RDF.

[4] When comparing two RDF Literals, their Unicode strings MUST be equal for the RDF Literals to compare as equal. If both Literals have language tags, these tags MUST be equal for the Literals to be considered equal. If two Literals are found with equal Unicode strings but only one has a language tag, the Literals SHOULD NOT be considered equal.

[5] NOTE: The purpose of 'SHOULD NOT' is to allow applications some flexibility in dealing with language tags. That is, when a literal is equal to another but only one has a language tag, they can be considered equivalent, which might be sufficient for some applications to make a match.

[6] The truth table for equality is as follows. Each pair (s,t) consists of the Unicode string and the RFC3066 tag. '_' means no tag is given. 'f*' means 'SHOULD NOT' be true. 'f' means 'MUST' be false. 't' means 'MUST' be true. s1!=s and t1!=t according to the specifications in question.
[7]
          (s,_)   (s,t)   (s1,_)  (s1,t1)
------------------------------------------
(s,_)      t       f*      f       f
(s,t)      f*      t       f       f
(s1,_)     f       f       t       f*
(s1,t1)    f       f       f*      t

          (s,_)   (s,t)   (s1,_)  (s1,t1)  (s,t1)  (s1,t)
----------------------------------------------------------
(s,t1)     f*      f       f       f        t       f
(s1,t)     f       f       f*      f        f       t

[8] RDF makes a distinction between equality and equivalence for Literals. RDF Literals are equal in accordance with [4]. Equivalence represents a broader notion of how Literals might be considered to match each other or be interchangeable in some way. Applications MAY determine that RDF Literals are equivalent while not being equal. For example the literals '+353(0)12968607' and '0035312968607' are not equal, but they might reasonably be interpreted as equivalent when the context is that of an Irish telephone number being considered elsewhere in Europe. Inferring such equivalences typically requires extra metadata or assumptions about the literals in question (such as might be available about the predicate of which the literal is the value, or knowing that the literal is of a well-known type). RDF processors SHOULD deal with equality and normalisation of Literals only, and SHOULD NOT be expected to make or find such equivalences. Future work in RDF may define ways in which extra information about RDF Literals can be modelled in the light of implementation experience.

[9] RDF/XML documents MUST be W3C-normalized. An early uniform normalization framework is used. See [CHARMOD] for definition.

[10] When parsing RDF/XML the XML processor, if necessary, converts the XML document to the UCS character domain. When doing this from any encoding that is not UCS-based this conversion SHOULD use Unicode Normalization Form C [NFC, NFC-Corrigendum].

[11] RDF/XML processors MUST NOT normalize input from an XML document that is encoded in a UCS-based encoding. cf. [CHARMOD] for rationale.

[12] RDF/XML documents SHOULD be W3C-normalized as specified in [CHARMOD].
Moreover, after the stripping of comments and processing instructions an RDF/XML document SHOULD still be W3C-normalized. It is the responsibility of the document creator to fulfil this requirement. RDF/XML processors MUST NOT correct input that is not W3C-normalized. (RDF/XML processors may detect lack of W3C-normalization in an input document, and issue a diagnostic).

[14] Summary of text normalization for RDF/XML processors. RDF/XML processors MUST use a normalizing transcoder from non-UCS-based encodings. RDF/XML processors MUST NOT do any other text normalization. (cf. http://www.w3.org/TR/charmod/#def-normalizing-transcoder )

[15] Unicode string equality within Literals is given by binary comparison as given by http://www.w3.org/TR/charmod/#sec-IdentityMatching .

[16] Language tag equality is defined by RFC3066 and is case insensitive.

[17] RDF Literals arising from general attribute values (using the production [6.10] propAttr ::= propName '="' string '"', i.e. Production 4.13 propAttr in [Refactoring] ):
[18] + have their language component defined by the value of xml:lang (if any) in the enclosing element, as specified in [XML].
[19] + the Unicode string is given by the attribute value after XML attribute value normalization.

[20] NOTE: document authors are warned that XML attribute value normalization differs slightly if an attribute is declared in a DTD with an attribute type other than CDATA. In such cases, validating and non-validating XML parsers will normalize some values differently, and hence RDF/XML processors will produce different RDF Literals.

[21] RDF Literals arising from the propertyElt production with a non-empty string value as element content, and no rdf:parseType attribute (using the first production of 6.12):
[22] + have their language component defined by the value of xml:lang (if any) in the property element, as specified in [XML].
[23] + the RDF Literal Unicode string is formed as the concatenation of the element content subject to the usual XML processing rules:
  [24] - character references are expanded.
  [25] - entity references are illegal; it MAY be possible that a different RDF production is matched in this case.
  [26] - CDATA sections are expanded.
  [27] - XML comments are discarded.
  [28] - processing instructions are discarded.

[29] NOTE: When converting the document from any encoding that is not UCS-based, Unicode Normalization Form C is produced before concatenation of the various parts of the literal. Hence, it is possible (if unwise) to use various XML escaping mechanisms to produce non-normalized RDF Literals. Such a document is not W3C-normalized.

[30] RDF Literals arising from the propertyElt production with an empty string value as element content, and no rdf:parseType attribute (using the first production of 6.12):
[31] + have their language component defined by the value of xml:lang (if any) in the property element, as specified in [XML].
[32] + the RDF Literal Unicode string is the string of length 0.

[33] RDF Literals arising from the propertyElt production with rdf:parseType="Literal" attribute (using the [n]th production of 6.12):
[34] + have their language component defined by the value of xml:lang (if any) in the property element, as specified in [XML].
[35] + MAY have their Unicode string component as the string of the element content as given in the input document, after converting the character encoding.
[36] + MAY have their Unicode string component as given by the Unicode string of the XML Canonicalization of the document subset consisting of the element content. See XML Canonicalization section 2.4. http://www.w3.org/TR/2001/REC-xml-c14n-20010315#DocSubsets The XML Canonicalization specifies a UTF-8 string; the RDF Literal is the encoded Unicode string. Such a canonicalization MAY or MAY NOT include comments.
[37] + MAY have their Unicode string component as the string of the element content as given in the input document, after converting the character encoding, and then processed in any of the following ways, in any order:
  [38] - all comments MAY be stripped.
  [39] - all processing instructions MAY be stripped.
  [40] - all character references MAY be expanded.
  [41] - all entity references MAY be expanded.
  [42] - all CDATA sections MAY be replaced by their content.
  [43] - all expanded attribute values can be further processed by replacing any character with an appropriate numeric character reference or an XML predefined entity reference (i.e. &lt;, &gt;, &amp;, &apos; or &quot;). All identical characters MUST be processed identically. If such processing applies, similar processing MUST be applied to text nodes.
  [44] - all expanded text nodes can be further processed by replacing any character with an appropriate numeric character reference or an XML predefined entity reference. All identical characters MUST be processed identically. If such processing applies, similar processing MUST be applied to attribute values.
  [44a] NOTE: "similar" processing does not imply that identical substitutions must apply to both text nodes and attribute values.
  [45] - any additional XML namespace declaration which is in scope in the surrounding property element MAY be added to any start element tag in the XML literal.
  [46] - any XML namespace declaration MAY be deleted from any start element tag.
  [47] - any change that does not change the document information set MAY be made. A non-exhaustive list can be found in Infoset Appendix D.

[48] NOTE: The meaning of 'all' in the above paragraphs is that an RDF processing environment that makes such a change in one instance in one literal MUST make the corresponding change in every instance in every literal.

[49] NOTE: namespace prefix rewriting is prohibited.
Paragraphs [45] and [46] permit changing the namespace declaration but not the changing of the prefixes in the QNames. This is to allow rdf:parseType="Literal" values to include namespace prefixes in attribute values and elsewhere.

[50] For maximum interoperability RDF processors are RECOMMENDED to use XML canonicalization without comments as the string in the RDF Literal formed by the rdf:parseType="Literal" property element production.

[51] NOTE: The Working Group will review this recommendation in light of implementation experience.
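
For illustration only, the equality rules of paragraphs [4] to [7], [15] and [16], together with the NFC requirement of [1a], might be sketched in Python roughly as follows. The names Literal, Eq, compare and is_nfc are hypothetical and do not come from the draft. The sketch assumes that Python's built-in string comparison (code point by code point) is an acceptable stand-in for the binary comparison of [15], and that lower-casing is sufficient for the case-insensitive tag comparison of [16], since RFC3066 tags are ASCII.

```python
import unicodedata
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Eq(Enum):
    """The three outcomes used in the truth table of [6] and [7]."""
    TRUE = "t"          # MUST be true
    FALSE = "f"         # MUST be false
    SHOULD_NOT = "f*"   # SHOULD NOT be true


@dataclass(frozen=True)
class Literal:
    """Hypothetical model of [1]: a Unicode string plus an optional language tag."""
    string: str
    lang: Optional[str] = None


def is_nfc(s: str) -> bool:
    """Report whether s is already in Normalization Form C (cf. [1a]).

    This only detects a non-normalized string; it never repairs it.
    """
    return unicodedata.normalize("NFC", s) == s


def compare(a: Literal, b: Literal) -> Eq:
    """Classify a pair of literals according to the truth table in [7]."""
    # [15]: plain comparison of the strings, with no normalization applied.
    if a.string != b.string:
        return Eq.FALSE
    # Equal strings and neither literal has a tag.
    if a.lang is None and b.lang is None:
        return Eq.TRUE
    # Equal strings but only one literal has a tag: the 'f*' case of [4].
    if a.lang is None or b.lang is None:
        return Eq.SHOULD_NOT
    # [16]: language tags compare case-insensitively (assumed ASCII-only).
    return Eq.TRUE if a.lang.lower() == b.lang.lower() else Eq.FALSE


if __name__ == "__main__":
    print(compare(Literal("chat", "fr"), Literal("chat", "FR")).value)
    print(compare(Literal("chat"), Literal("chat", "fr")).value)
    print(compare(Literal("+353(0)12968607"), Literal("0035312968607")).value)
```

The sketch deliberately never normalizes its input, in keeping with [11] and [14]: is_nfc could be used to issue a diagnostic for a non-normalized literal, while compare treats the strings exactly as given. The telephone numbers from [8] come out as not equal; any notion of equivalence is left to the application.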
Received on Wednesday, 26 September 2001 09:40:45 UTC