Re: 2001-09-07#5 Literals

[I have no idea why this got caught in the spam filter -rrs]

Date: Wed, 26 Sep 2001 08:34:00 -0400 (EDT)
Message-ID: <3BB1CC22.B2AB2455@hplb.hpl.hp.com>
From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
To: w3c-rdfcore-wg@w3.org

I attach a new draft of the literals text.

I am hoping that this text may be agreed upon on Friday.

If there are more comments before then I will have another attempt at
consolidation.

Notes:

Paragraph numbering is unchanged, so that some new paragraphs have
numbers like [1a].

Accepted changes from Graham and myself.

   Para [1a], [9] made normalization of Literals and RDF/XML mandatory.

   Para [8] clarified context of phone number. (cf. Aaron's msg).

   Para [13] merged into [12] (in brackets).

   Para [14] added ref.

   Para [43] and [44] changed significantly, as in

   (this allows greater freedom in which substitutions can be made).

   Para [44a] added as clarification of [43] and [44].


An extra comment from me ....

I have agreed with Graham's suggestion of MUST-ing W3C normalization of
RDF/XML documents, but this leaves us in the strange position of having
to specify the behaviour of RDF/XML processors when they meet illegal
input as well as legal input.

In general, I think we should only specify the semantics of legal input.

The obligation to specify behaviour on illegal input arises from
CHARMOD's concept of early uniform normalization, and charmod's
requirement on spec writers to mandate early uniform normalization. Here
the responsibility to normalize is explicitly at the beginning of a
process and not elsewhere. Thus while an RDF processor can reject a
non-normalized literal, it shouldn't normalize it. The scenario is that:
an illegal document that has not been normalized, is processed both by a
lazy system that doesn't notice that it hasn't been normalized, and by a
better system. The second system must either get an error or the same
result as the lazy system. This is partly motivated by a belief that it
is not possible to require all software to do internationalization
properly.

Jeremy
[1]
An RDF Literal is a Unicode string, optionally paired with a
language tag (as defined in RFC3066).

[1a]
The Unicode String in an RDF Literal is normalized according
to Unicode Normalization Form C [NFC, NFC-Corrigendum], using
a framework of early uniform normalization.

[2]
Future versions of RDF may migrate to a more general mechanism for
literal representation in which the current representation would be
embedded. One candidate is that an RDF literal would be a pair 
of a unicode string and a URI reference. The current literals would 
be embedded within this new representation using a well-known URI 
as a base for all language tag URIs. 

[3]
   NOTE: The RDF Core Working Group has yet to consider whether
   such an approach would be useful for integrating XML schema
   datatyping with RDF.

[4]
When comparing two RDF Literals, their Unicode strings MUST be
equal for the RDF Literals to compare as equal. If both Literals
have language tags, these tags MUST be equal for the Literals to
be considered equal. If two Literals are found with equal Unicode
strings but only one has a language tag, the Literals SHOULD NOT 
be considered equal.

[5]
   NOTE: The purpose of 'SHOULD NOT' is to allow 
   applications some flexibility in dealing with 
   language tags. That is, when a literal is equal to
   another but only one has a language tag, they can be 
   considered equivalent, which might be sufficient 
   for some applications to make a match.

[6]
The truth table for equality is as follows.
Pairs (s,t) are the unicode string, and the RFC3066 tag.
'_' means no tag is given. 'f*' means 'SHOULD NOT'
be true. 'f' means 'MUST' be false. 't' means 'MUST' be true.
s1!=s and t1!=t according to the specifications in question.

[7]

        (s,_)  (s,t)  (s1,_)  (s1,t1)
--------------------------------------
(s,_)     t      f*      f       f
(s,t)     f*     t       f       f
(s1,_)    f      f       t       f*
(s1,t1)   f      f       f*      t   (s,t1)  (s1,t)
                                     ---------------
(s,t1)    f*     f       f       f      t      f
(s1,t)    f      f       f*      f      f      t


[8]
RDF makes a distinction between equality and equivalence for
Literals. RDF Literals are equal in accordance with [2]. 
Equivalence represents a broader notion of how Literals might 
be considered to match each other or be interchangeable in some 
way. Applications MAY determine that RDF Literals 
are equivalent while not being equal. For example the literals
'+353(0)12968607' and '0035312968607' are not equal, but they 
might reasonably be interpreted as equivalent when the context
is that of an Irish telephone number being considered elsewhere
in Europe.
Inferring such equivalences typically requires extra metadata 
or assumptions about the literals in question (such as might be 
available about the predicate whom the literal is the value of, or 
being aware that the literal is of a well known type). RDF 
processors SHOULD deal with equality and normalisation
of Literals only, and SHOULD NOT be expected to make or find such
equivalences. Future work in RDF may define ways in which extra
information about RDF Literals can be modelled in the light of
implementation experience.

[9]
RDF/XML documents MUST be W3C-normalized. 
An early uniform normalization framework is used.
See [CHARMOD] for definition.

[10]
When parsing RDF/XML the XML processor, if necessary, converts
the XML document to the UCS character domain. When doing this
from any encoding that is not UCS-based this conversion SHOULD
use Unicode Normalization Form C [NFC, NFC-Corrigendum].

[11]
RDF/XML processors MUST NOT normalize input from an XML document 
that is encoded in a UCS-based encoding. c.f. [CHARMOD] for 
rationale.

[12]
RDF/XML documents SHOULD be W3C-normalized as specified in
[CHARMOD]. Moreover, after the stripping of comments and
processing instructions an RDF/XML document SHOULD still be
W3C-normalized. It is the responsibility of the document
creator to fulfil this requirement. RDF/XML processors MUST NOT
correct input that is not W3C-normalized. (RDF/XML processors 
may detect lack of W3C-normalization in an input document, and 
issue a diagnostic).

[14]
Summary of text normalization for RDF/XML processors.
RDF/XML processors MUST use a normalizing transcoder
from non-UCS-based encodings.
RDF/XML processors MUST NOT do any other text normalization.
(cf. http://www.w3.org/TR/charmod/#def-normalizing-transcoder )

[15]
Unicode string equality within Literals is given by 
binary comparison as given by 
http://www.w3.org/TR/charmod/#sec-IdentityMatching . 

[16]
Language tag equality is defined by RFC3066 and is case 
insensitive.

[17]
RDF Literals arising from general attribute values (using the 
production 
      [6.10] propAttr  ::=  propName '="' string '"', 
i.e.  Production 4.13  propAttr in [Refactoring]   
):

[18]
+ have their language component defined by the value of xml:lang 
(if any) in the enclosing element, as specified in [XML].

[19]
+ the Unicode string is given by the attribute value after XML
  attribute value normalization.

[20]
    NOTE: document authors are warned that XML attribute
    value normalization differs slightly if an attribute is
    declared in a DTD with an attribute type other than CDATA.
    In such cases, validating and non-validating XML parsers
    will normalize some values differently, and hence RDF/XML
    processors will produce different RDF Literals.

[21]
RDF Literals arising from the propertyElt production with
a non-empty string value as element content, and no rdf:parseType 
attribute (using the first production of 6.12):

[22]
+ have their language component defined by the value of xml:lang 
(if any) in the property element, as specified in [XML].

[23]
+ the RDF Literal Unicode string is formed as the concatenation 
  of the element content subject to the usual XML processing
  rules:

[24]
     - character references are expanded.

[25]
     - entity references are illegal, it MAY be possible
       that a different RDF production is matched in this case.

[26]
     - CDATA sections are expanded.

[27]
      - XML comments are discarded.

[28]
      - processing instructions are discarded.

[29]
  NOTE: When converting the document from any encoding 
  that is not UCS-based, Unicode Normalization Form C
  is produced before concatenation of the various parts 
  of the literal. Hence, it is possible (if unwise) to 
  use various XML escaping mechanism to produce non-normalized 
  RDF Literals. Such a document is not W3C-normalized.

[30]
RDF Literals arising from the propertyElt production with
an empty string value as element content, and no rdf:parseType 
attribute (using the first production of 6.12):

[31]
+ have their language component defined by the value of xml:lang 
(if any) in the property element, as specified in [XML].

[32]
+ the RDF Literal Unicode string is the string of length 0.

[33]
RDF Literals arising from the propertyElt production with 
rdf:parseType="Literal" attribute (using the [n]th production 
of 6.12):

[34]
+ have their language component defined by the value of xml:lang 
(if any) in the property element, as specified in [XML].

[35]
+ MAY have their Unicode string component as
  the string of the element content as given
  in the input document, after converting the character encoding.

[36]
+ MAY have their Unicode string component as given by the 
  Unicode string of the XML Canonicalization of the document 
  subset consisting of the element content. See XML 
  Canonicalization section 2.4. 
  http://www.w3.org/TR/2001/REC-xml-c14n-20010315#DocSubsets
  The XML Canonicalization specifies a UTF-8 string, the
  RDF Literal is the encoded Unicode string. 
  Such a canonicalization MAY or MAY NOT include comments.

[37]
+ MAY have their Unicode string component as the as
  the string of the element content as given
  in the input document,  after converting the character encoding,
  and then processed in any of the following ways, in any order:

[38]
  - all comments MAY be stripped.

[39]
  - all processing instructions MAY be stripped.

[40]
  - all character references MAY be expanded.

[41]
  - all entity references MAY be expanded.

[42]
  - all CDATA sections MAY be replaced by their content.

[43] 
  - all expanded attribute values can be further processed by replacing 
    any character with an appropriate numeric characeter reference or 
    an XML predefined entity reference (i.e. &lt;, &gt;, &amp;, &apos; 
    or &quot;). All identical characters MUST be processed identically. 
    If such processing applies, similar processing MUST be applied to 
    text nodes. 

[44] 
  - all expanded text nodes can be further processed by replacing any 
    character with an appropriate numeric characeter reference or an 
    XML predefined entity reference. All identical characters MUST 
    be processed identically. If such processing applies, similar 
    processing MUST be applied to attribute values. 

[44a]
      NOTE: "similar" processing does not imply that identical
      substitutions must apply to both text nodes and attribute 
      values.

[45]
  - any additional XML namespace declarations which is 
    in scope in the surrounding property element MAY 
    be added to any start element tag in the XML literal.

[46]
  - any XML namespace declaration MAY be deleted from
    any start element tag.

[47]
  - any change that does not change the document information
    set MAY be made. A non-exhaustive list can be found
    in Infoset Appendix D.

[48]
   NOTE: The meaning of 'all' in the above paragraphs is that
   an RDF processing environment that makes such a change
   in one instance in one literal MUST make the corresponding
   change in every instance in every literal.

[49]
   NOTE: namespace prefix rewriting is prohibited. 
   Paragraphs [45] and [46] permit changing the namespace
   declaration but not the changing of the prefixes in 
   the QNames. This is to allow rdf:parseType="Literal"
   values to include namespace prefixes in attribute values 
   and elsewhere.

[50]
For maximum interoperability RDF processors are RECOMMENDED
to use XML canonicalization without comments as the string
in the RDF Literal formed by the rdf:parseType="Literal"
property element production.

[51]
   NOTE: The Working Group will review this recommendation
   in light of implementation experience.

Received on Wednesday, 26 September 2001 09:40:45 UTC