RDF-ISSUE-75 (#x0): Valid plain literals containing #x0 are no longer valid in RDF 1.1 from RDF Working Group Issue Tracker on 2011-08-19 (public-rdf-wg@w3.org from August 2011)

From: RDF Working Group Issue Tracker <sysbot+tracker@w3.org>
Date: Fri, 19 Aug 2011 18:44:12 +0000
To: public-rdf-wg@w3.org
Message-Id: <E1QuU3A-0005Sv-Fi@lowblow.w3.org>

RDF-ISSUE-75 (#x0): Valid plain literals containing #x0 are no longer valid in RDF 1.1

http://www.w3.org/2011/rdf-wg/track/issues/75

Raised by: Richard Cyganiak
On product: 

The lexical space of xsd:string doesn't cover all Unicode strings.

I assume we will end up referring to XSD 1.1 for the definition of xsd:string [1]. That document leaves it up to implementations whether they support XML 1.0 or XML 1.1; accordingly, the set of characters allowed in an xsd:string is defined by either [2] or [3].

The more permissive one from XML 1.1:

    Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

This excludes #x0, Unicode codepoint U+0000. XML 1.0 additionally excludes the control codes in the #x1-#x1F range, except for #x9, #xA and #xD.
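
As a rough illustration, the two Char productions can be written as character classes and used to test whether a string falls inside the xsd:string lexical space. This is only a sketch: the function name and the `xml_version` parameter are made up for the example, and the XML 1.0 ranges are taken from the Char production in [2].

```python
import re

# XML 1.1 Char production, as quoted above.
XML11_CHARS = re.compile(
    r"[\u0001-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)

# XML 1.0 Char production from [2]: tab, LF, CR, then #x20 and up.
XML10_CHARS = re.compile(
    r"[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def in_xsd_string_lexical_space(s, xml_version="1.1"):
    """Hypothetical helper: True if every character of s matches Char."""
    pattern = XML11_CHARS if xml_version == "1.1" else XML10_CHARS
    return pattern.fullmatch(s) is not None


if __name__ == "__main__":
    for sample in ["hello", "a\x00b", "a\x01b"]:
        print(repr(sample),
              in_xsd_string_lexical_space(sample, "1.1"),
              in_xsd_string_lexical_space(sample, "1.0"))
```

Under either production a string containing U+0000 is rejected; a string containing U+0001 passes only the XML 1.1 check.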

The definition of “lexical form” in RDF 2004 [4] says “Unicode string”, which according to [5] covers *all* codepoints, control codes included.

So, any string that includes #x0 was a valid untagged plain literal in RDF 2004. In RDF 1.1, it will be typed as an xsd:string, and thus will be an ill-typed literal.

(On the other hand, such strings could never be serialized in RDF/XML or XHTML+RDFa; they were serializable only in N-Triples and Turtle.)
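
The serialization asymmetry can be sketched as follows. `ntriples_escape` and `xml_can_carry` are hypothetical helpers written for this example, not part of any RDF library. The escaping follows the \uXXXX / \UXXXXXXXX convention of N-Triples, and the XML check relies on Python's bundled parser, which implements XML 1.0 and so is stricter than XML 1.1.

```python
import xml.etree.ElementTree as ET


def ntriples_escape(s):
    """Hypothetical helper: escape a string for an N-Triples literal."""
    simple = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    out = []
    for ch in s:
        cp = ord(ch)
        if ch in simple:
            out.append(simple[ch])
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)               # printable ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)   # includes U+0000
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def xml_can_carry(s):
    """Hypothetical helper: can s appear as XML 1.0 element content?

    Every character is written as a numeric character reference, so a
    parse failure means the character itself is not allowed.
    """
    refs = "".join("&#x%X;" % ord(ch) for ch in s)
    try:
        ET.fromstring("<v>%s</v>" % refs)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    value = "a\x00b"
    print('"%s"' % ntriples_escape(value))  # literal as it would appear in N-Triples
    print(xml_can_carry(value))             # whether an XML-based syntax could hold it
```

The escaped form gives N-Triples a way to write U+0000, whereas XML rejects the character even as a numeric character reference.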

Is this a problem? Can we go ahead with the new literal design despite this restriction? Should we acknowledge it in the RDF Concepts spec?

[1] http://www.w3.org/TR/2005/WD-xmlschema11-2-20050224/datatypes.html#string
[2] http://www.w3.org/TR/REC-xml/#dt-character
[3] http://www.w3.org/TR/xml11/#NT-Char
[4] http://www.w3.org/TR/rdf-concepts/#dfn-lexical-form
[5] http://www.unicode.org/versions/Unicode6.0.0/UnicodeStandard-6.0.pdf

Received on Friday, 19 August 2011 18:44:17 UTC