[1] An RDF Literal is a Unicode string, optionally paired with a language tag (as defined in RFC3066). [2] Future versions of RDF may migrate to a more general mechanism for literal representation in which the current representation would be embedded. One candidate is that an RDF literal would be a pair of a unicode string and a URI reference. The current literals would be embedded within this new representation using a well-known URI as a base for all language tag URIs. [3] NOTE: The RDF Core Working Group has yet to consider whether such an approach would be useful for integrating XML schema datatyping with RDF. [4] When comparing two RDF Literals, their Unicode strings MUST be equal for the RDF Literals to compare as equal. If both Literals have language tags, these tags MUST be equal for the Literals to be considered equal. If two Literals are found with equal Unicode strings but only one has a language tag, the Literals SHOULD NOT be considered equal. [5] NOTE: The purpose of 'SHOULD NOT' is to allow applications some flexibility in dealing with language tags. That is, when a literal is equal to another but only one has a language tag, they can be considered equivalent, which might be sufficient for some applications to make a match. [6] The truth table for equality is as follows. Pairs (s,t) are the unicode string, and the RFC3066 tag. '_' means no tag is given. 'f*' means 'SHOULD NOT' be true. 'f' means 'MUST' be false. 't' means 'MUST' be true. s1!=s and t1!=t according to the specifications in question. [7] (s,_) (s,t) (s1,_) (s1,t1) -------------------------------------- (s,_) t f* f f (s,t) f* t f f (s1,_) f f t f* (s1,t1) f f f* t (s,t1) (s1,t) --------------- (s,t1) f* f f f t f (s1,t) f f f* f f t [8] RDF makes a distinction between equality and equivalence for Literals. RDF Literals are equal in accordance with [2]. Equivalence represents a broader notion of how Literals might be considered to match each other or be interchangeable in some way. Applications MAY determine that RDF Literals are equivalent while not being equal. For example the literals '+353(0)12968607' and '0035312968607' are not equal, but they might reasonably be interpreted as equivalent in some context. Inferring such equivalences typically requires extra metadata or assumptions about the literals in question (such as might be available about the predicate whom the literal is the value of, or being aware that the literal is of a well known type). RDF processors SHOULD deal with equality and normalisation of Literals only, and SHOULD NOT be expected to make or find such equivalences. Future work in RDF may define ways in which extra information about RDF Literals can be modelled in the light of implementation experience. ASIDE to rdfcore-wg: Does the example work for US & Asian readers? I note that "some context" is approx. on a European telephone. (I'm guessing that + does not expand as 00 in Asia!) [9] RDF assumes that the Unicode strings in Literals are normalized according to Unicode Normalization Form C [NFC, NFC-Corrigendum]. An early uniform normalization framework is used. [10] When parsing RDF/XML the XML processor, if necessary, converts the XML document to the UCS character domain. When doing this from any encoding that is not UCS-based this conversion SHOULD use Unicode Normalization Form C [NFC, NFC-Corrigendum]. [11] RDF/XML processors MUST NOT normalize input an XML document that is encoded in a UCS-based encoding. c.f. [CHARMOD] for rationale. [12] RDF/XML documents SHOULD be W3C-normalized as specified in [CHARMOD]. Moreover, after the stripping of comments and processing instructions an RDF/XML document SHOULD still be W3C-normalized. It is the responsibility of the document creator to fulfil this requirement. RDF/XML processors MUST NOT correct input that is not W3C-normalized. [13] RDF/XML processors MAY detect lack of W3C-normalization in an input document, and issue a diagnostic. [14] Summary of text normalization for RDF/XML processors. RDF/XML processors MUST use a normalizing transcoder from non-UCS-based encodings. RDF/XML processors MUST NOT do any other text normalization. [15] Unicode string equality within Literals is given by binary equality. (cf. http://www.w3.org/TR/charmod/#sec-IdentityMatching ) [16] Language tag equality is defined by RFC3066 and is case insensitive. [17] RDF Literals arising from general attribute values (using the production [6.10] propAttr ::= propName '="' string '"', i.e. Production 4.13 propAttr in [Refactoring] ): [18] + have their language component defined by the value of xml:lang (if any) in the enclosing element, as specified in [XML]. [19] + the Unicode string is given by the attribute value after XML attribute value normalization. [20] NOTE: document authors are warned that XML attribute value normalization differs slightly if an attribute is declared in a DTD with an attribute type other than CDATA. In such cases, validating and non-validating XML parsers will normalize some values differently, and hence RDF/XML processors will produce different RDF Literals. [21] RDF Literals arising from the propertyElt production with a non-empty string value as element content, and no rdf:parseType attribute (using the first production of 6.12): [22] + have their language component defined by the value of xml:lang (if any) in the property element, as specified in [XML]. [23] + the RDF Literal Unicode string is formed as the concatenation of the element content subject to the usual XML processing rules: [24] - character references are expanded. [25] - entity references are illegal, it MAY be possible that a different RDF production is matched in this case. [26] - CDATA sections are expanded. [27] - XML comments are discarded. [28] - processing instructions are discarded. [29] NOTE: When converting the document from any encoding that is not UCS-based, Unicode Normalization Form C is produced before concatenation of the various parts of the literal. Hence, it is possible (if unwise) to use various XML escaping mechanism to produce non-normalized RDF Literals. Such a document is not W3C-normalized. [30] RDF Literals arising from the propertyElt production with an empty string value as element content, and no rdf:parseType attribute (using the first production of 6.12): [31] + have their language component defined by the value of xml:lang (if any) in the property element, as specified in [XML]. [32] + the RDF Literal Unicode string is the string of length 0. [33] RDF Literals arising from the propertyElt production with rdf:parseType="Literal" attribute (using the [n]th production of 6.12): [34] + have their language component defined by the value of xml:lang (if any) in the property element, as specified in [XML]. [35] + MAY have their Unicode string component as the string of the element content as given in the input document, after converting the character encoding. [36] + MAY have their Unicode string component as given by the Unicode string of the XML Canonicalization of the document subset consisting of the element content. See XML Canonicalization section 2.4. http://www.w3.org/TR/2001/REC-xml-c14n-20010315#DocSubsets The XML Canonicalization specifies a UTF-8 string, the RDF Literal is the encoded Unicode string. Such a canonicalization MAY or MAY NOT include comments. [37] + MAY have their Unicode string component as the as the string of the element content as given in the input document, after converting the character encoding, and then processed in any of the following ways, in any order: [38] - all comments MAY be stripped. [39] - all processing instructions MAY be stripped. [40] - all character references MAY be expanded. [41] - all entity references MAY be expanded. [42] - all CDATA sections MAY be replaced by their content. [43] - all attribute values can be normalized as in XML canonicalization viz, replacing :- . all ampersands (&) with & . all open angle brackets (<) with < . all quotation mark characters with " . all whitespace characters #x9, #xA, and #xD, with character references. [44] - all text nodes can be normalized as in XML canonicalization viz., replacing :- . all ampersands are replaced by & . all open angle brackets (<) are replaced by < . all closing angle brackets (>) are replaced by > . all #xD characters are replaced by . [45] - any additional XML namespace declarations which is in scope in the surrounding property element MAY be added to any start element tag in the XML literal. [46] - any XML namespace declaration MAY be deleted from any start element tag. [47] - any change that does not change the document information set MAY be made. A non-exhaustive list can be found in Infoset Appendix D. [48] NOTE: The meaning of 'all' in the above paragraphs is that an RDF processing environment that makes such a change in one instance in one literal MUST make the corresponding change in every instance in every literal. [49] NOTE: namespace prefix rewriting is prohibited. Paragraphs [45] and [46] permit changing the namespace declaration but not the changing of the prefixes in the QNames. This is to allow rdf:parseType="Literal" values to include namespace prefixes in attribute values and elsewhere. [50] For maximum interoperability RDF processors are RECOMMENDED to use XML canonicalization without comments as the string in the RDF Literal formed by the rdf:parseType="Literal" property element production. [51] NOTE: The Working Group will review this recommendation in light of implementation experience.