the madness of xsd:string

There has been some discussion of the status of xsd:string literals and
related literals in RDF documents and graphs, particularly with respect to
ASCII control characters.

The situation in RDF 2004 was that plain literals could include all Unicode
code points.  It was assumed by some people that this meant that plain
literals without language tags were the same as xsd:string.  However, not
all Unicode code points are allowed in XSD strings.  In particular, #x0
is not allowed.

The current Concepts says, in Section 3.3, that simple literals are sugar
for typed literals with type xsd:string.  In the changes section it says:
    The xsd:string datatype does not permit the #x0 character, and
    implementations may not permit control codes in the #x1-#x1F
    range. Earlier versions of RDF allowed these characters in simple
    literals, although they could never be serialized in a
    W3C-recommended concrete syntax.
This last claim is not correct.

In addition, xsd:string has recently changed to allow more control
characters.

As I see it, the situation is as follows, using Turtle syntax.  All
examples are syntactically correct and produce valid RDF literals.

Syntax:     "\u0000"
2004:        plain literal
  Value:        the Unicode string containing a single NULL
Current:     ill-typed xsd:string literal

Syntax:     "\u0001"
2004:        plain literal
  Value:        the Unicode string containing a single SOH
Current:     well-typed xsd:string literal
  Value:        the Unicode string containing a single SOH

Syntax:     "\u0001"^^xsd:string
2004:        ill-typed xsd:string literal
Current:     well-typed xsd:string literal
  Value:        the Unicode string containing a single SOH
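
The three cases above can be checked mechanically.  What follows is only a
rough Python sketch, not anything taken from the specs verbatim.  It assumes
that the older xsd:string is limited to the XML 1.0 Char production and that
the current one may use the XML 1.1 Char production (my reading of the
recent change mentioned above).  The names XML10_CHARS, XML11_CHARS and
classify are made up for illustration.

```python
import re

# ASSUMPTION: older xsd:string follows the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_CHARS = re.compile(
    r"[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)

# ASSUMPTION: current xsd:string may follow the XML 1.1 Char production:
#   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML11_CHARS = re.compile(
    r"[\u0001-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def classify(lexical_form: str) -> dict:
    """Report whether a lexical form is well-typed as xsd:string
    under each of the two assumed Char productions."""
    return {
        "xml10": XML10_CHARS.fullmatch(lexical_form) is not None,
        "xml11": XML11_CHARS.fullmatch(lexical_form) is not None,
    }


if __name__ == "__main__":
    # The lexical forms of the Turtle literals "\u0000" and "\u0001".
    for form in ("\u0000", "\u0001"):
        print(repr(form), classify(form))
```

Under these assumptions, #x0 is rejected by both productions, which matches
the ill-typed "Current" entry for "\u0000", while #x1 is rejected only by the
XML 1.0 production, which matches the 2004 versus Current difference for
"\u0001"^^xsd:string.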


I think that the following changes are required in the core documents.

Concepts:
  Changes section:
    The xsd:string datatype does not permit the #x0 character, and
    implementations may not permit control codes in the #x1-#x1F
    range.  Earlier versions of RDF allowed these characters as values
    in simple literals, although they could never be serialized in a
    W3C-recommended concrete syntax.  Currently a literal with type
    xsd:string containing the #x0 character is an ill-typed literal, but is
    syntactically permissible.
Semantics:
  Section 4
    However, IL is total on language-tagged strings (but not on literals
    of type xsd:string).

Received on Wednesday, 17 April 2013 17:41:34 UTC