Re: Turtle tests claiming to be xsd:string

Hold the presses!  XML and XSD 1.1 come to the rescue.  :-)


http://www.w3.org/TR/xmlschema11-2/#string

3.3.1.1 Value Space

The ·value space· of string is the set of finite-length sequences of
zero or more characters (as defined in [XML]) that ·match· the Char
production from [XML]. A character is an atomic unit of communication;
it is not further specified except to note that every character has a
corresponding Universal Character Set (UCS) code point, which is an
integer.

It is ·implementation-defined· whether an implementation of this
specification supports the Char production from [XML], or that from
[XML 1.0], or both. See Dependencies on Other Specifications (§1.3).


From http://www.w3.org/TR/xml11/#charsets

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters,
which may represent markup or character data.] [Definition: A
character is an atomic unit of text as specified by ISO/IEC 10646
[ISO/IEC 10646]. Legal characters are tab, carriage return, line feed,
and the legal characters of Unicode and ISO/IEC 10646. The versions of
these standards cited in A.1 Normative References were current at the
time this document was prepared. New characters may be added to these
standards by amendments or new editions. Consequently, XML processors
MUST accept any character in the range specified for Char.]

Character Range

[2]   Char   ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
      /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a]   RestrictedChar   ::=   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] |
                              [#x7F-#x84] | [#x86-#x9F]


Of course, there still is a mismatch in XML 1.1 between the text,
comment, and the grammar.

The text allows #x0 and surrogates.  (I think that it allows
surrogates, as there are other non-character Unicode code points.)
The comment allows #x0 but not surrogates.   The grammar disallows
both #x0 and surrogates.
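
To make the grammar side of this concrete, here is a minimal sketch in
Python that transcribes the two Char productions quoted in this thread
(XML 1.0 and XML 1.1) and tests code points against them.  The names
XML10_CHAR, XML11_CHAR, matches_char, and in_string_value_space are my
own, made up for illustration; they come from neither spec.  It checks
only the grammar ranges, not the prose or the comment.

```python
# Char ranges transcribed from the productions quoted in this thread.
# Each entry is an inclusive (low, high) pair of code points.
XML10_CHAR = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]


def matches_char(code_point, ranges):
    """Return True if the code point falls in one of the given ranges."""
    return any(low <= code_point <= high for low, high in ranges)


def in_string_value_space(text, ranges):
    """Return True if every character of text matches the Char ranges.

    This mirrors the xsd:string value-space wording: a finite sequence
    of zero or more characters that each match Char.
    """
    return all(matches_char(ord(ch), ranges) for ch in text)


if __name__ == "__main__":
    # Print, for a few interesting code points, whether each grammar
    # accepts them.
    for cp in (0x0, 0x1, 0x9, 0xD800, 0xFFFE):
        print(f"U+{cp:04X}",
              "XML 1.0:", matches_char(cp, XML10_CHAR),
              "XML 1.1:", matches_char(cp, XML11_CHAR))
```

Under these ranges a literal containing #x1 is in the value space only
if the implementation picks the XML 1.1 Char production, and #x0 is
out either way.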

I have no idea why #x0 is disallowed.  It is, after all, a perfectly
good Unicode code point.

It would be much better if XML 1.1 were rewritten.

peter

On Wed, Apr 10, 2013 at 1:46 PM, Eric Prud'hommeaux <eric@w3.org> wrote:
> * Peter Patel-Schneider <pfpschneider@gmail.com> [2013-04-10 13:05-0700]
>> The situation is rather murky, at best.
>>
>> http://www.w3.org/TR/REC-xml/#charsets votes 2 (text definition and comment)
>> to 1 (grammar) for allowing control characters:
>>
>> 2.2 Characters
>>
>> [Definition: A parsed entity contains text, a sequence of characters, which
>> may represent markup or character data.]
>> [Definition: A character is an atomic unit of text as specified by ISO/IEC
>> 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line
>> feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of
>> these standards cited in A.1 Normative References were current at the time
>> this document was prepared. New characters may be added to these standards
>> by amendments or new editions. Consequently, XML processors MUST accept any
>> character in the range specified for Char. ]
>> Character Range
>>
>> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>> [#x10000-#x10FFFF]
>> /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
>>
>>
>> http://www.unicode.org/charts/PDF/U0000.pdf shows that ASCII
>> control characters (including 0x0) are Unicode characters.
>
> I think the above makes it pretty clear that our literals are
> sequences of Unicode characters, not sequences of XML Chars.
> Yet we call them xs:strings.
>
>
>> I wonder how many XML processors forbid control characters.
>
> at least LibXML and Expat:
>
>     DB<1> use XML::LibXML
>     DB<2> p $dom = XML::LibXML->load_xml(string => '<a> </a>');
>   :1: parser error : PCDATA invalid Char value 1
>   <a> </a>
>      ^
>     DB<3> use XML::Parser
>     DB<4> $p1 = new XML::Parser(Style => 'Debug')
>     DB<5> p $p1->parse('<a> </a>')
>   not well-formed (invalid token) at line 1, column 3, byte 3 at /usr/lib/perl5/XML/Parser.pm line 187
>
>
>> Unfortunately, the public version of ISO 10646 does not appear to be
>> currently accessible.  It *is* annoying for a W3C standard to point to
>> another standard whose master version is not freely available.
>
> yeah, that's why I always look at the unicode code charts (as you
> apparently did above).
>
>
>> peter
>>
>>
>>
>>
>>
>>
>> On Wed, Apr 10, 2013 at 12:38 PM, Eric Prud'hommeaux <eric@w3.org> wrote:
>>
>> > Tests like LITERAL1_all_controls include control codes not allowed in
>> > xsd:string. XSD says that xsd:strings are XML character data:
>> >
>> > [[
>> > The ·value space· of string is the set of finite-length sequences of
>> > characters (as defined in [XML 1.0 (Second Edition)]) that ·match·
>> > the Char production from [XML 1.0 (Second Edition)].
>> > ]] — http://www.w3.org/TR/xmlschema-2/#string
>> >
>> > XML character data excludes non-whitespace control characters:
>> >
>> > [[
>> > A parsed entity contains text…Legal characters are tab, carriage
>> > return, line feed, and the legal characters of Unicode and ISO/IEC
>> > 10646.
>> > ]] — http://www.w3.org/TR/REC-xml/#dt-character
>> >
>> > Point 4 below explains why this calls into question whether any
>> > string can contain (so called "C0") control codes and be typed as
>> > an xsd:string.
>> >
>> > I have to say, I've always appreciated that RDF doesn't make me
>> > uu-encode or invent escaping mechanisms all the time like XML does;
>> > this control code issue is tied to a behavior which makes RDF
>> > (e.g. Turtle) considerably more flexible and easy to deal with.
>> >
>> >
>> > * Eric Prud'hommeaux <eric@w3.org> [2013-04-07 17:55-0400]
>> > > I've had these niggling doubts for a while, and finally succumbed to
>> > > that morbid desire to explore some problems that I'd rather not know
>> > > about. We've all known for a while that we can create graphs with APIs
>> > > (now even serializable in Turtle) which can't be written in RDF/XML.
>> > > Here's a list of issues I think we need to clarify:
>> > >
>> > >
>> > >
>> > > 1 Namespaces are OK syntactically[nssyn], though our notion of namespace
>> > >   IRIs is of course outside the Namespaces definition as URIs [nsURI].
>> > >   [nssyn] http://www.w3.org/TR/REC-xml-names/#NT-Attribute
>> > >   [nsURI] http://www.w3.org/TR/REC-xml-names/#dt-namespace
>> > >
>> > >   ------------------------------------------------------------
>> > >
>> > >
>> > > 2 QNames forbid a raft of [first] and [nth] characters which are
>> > >   permissible in [IRIs].
>> > >
>> > >   first: [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
>> > >          [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
>> > >          [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
>> > >          [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
>> > >          [#x10000-#xEFFFF]
>> > >
>> > >   nth: first | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] |
>> > >        [#x203F-#x2040]
>> > >   http://www.w3.org/TR/REC-xml-names/#NT-NCName
>> > >
>> > >   IRIs: ipchar = [A-Z] | "_" | [a-z] | [0-9] | "-" | "." | "~" |
>> > >                  "%" HEX HEX | "!" | "$" | "&" | "'" | "(" | ")" |
>> > >                  "*" | "+" | "," | ";" | "=" | ":" | "@" |
>> > >                  [#xA0-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFEF] |
>> > >                  [#x10000-#x1FFFD] | [#x20000-#x2FFFD] |
>> > >                  [#x30000-#x3FFFD] | [#x40000-#x4FFFD] |
>> > >                  [#x50000-#x5FFFD] | [#x60000-#x6FFFD] |
>> > >                  [#x70000-#x7FFFD] | [#x80000-#x8FFFD] |
>> > >                  [#x90000-#x9FFFD] | [#xA0000-#xAFFFD] |
>> > >                  [#xB0000-#xBFFFD] | [#xC0000-#xCFFFD] |
>> > >                  [#xD0000-#xDFFFD] | [#xE1000-#xEFFFD]
>> > >   http://tools.ietf.org/html/rfc3987#section-2.2
>> > >
>> > >   ------------------------------------------------------------
>> > >
>> > >
>> > > 3 XML content excludes [#x00-#x08] [#x0B-#x0C] [#x0E-#x1F], all of
>> > >   which are permitted in "Unicode strings" and thus RDF literals
>> > >   [Rlit]. This applies regardless of CDATA enclosure or entity
>> > >   substitution.
>> > >   [Rlit] https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#dfn-lexical-form
>> > >   [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>> > >                [#x10000-#x10FFFF]
>> > >     http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Char
>> > >
>> > >   ------------------------------------------------------------
>> > >
>> > >
>> > > 4 XML Schema also prohibits the above control characters from
>> > >   appearing in something typed as xsd:string [string].
>> > >   [string] http://www.w3.org/TR/xmlschema-2/#dt-string
>> > >
>> > >   ------------------------------------------------------------
>> > >
>> > >
>> > > For 4, I propose notes in RDF Concepts and the serialization syntaxes
>> > > (e.g. Turtle). For the others, I wonder if we're forced into some
>> > > miserable escaping mechanism applied on top of XML.
>> > >
>> > > --
>> > > -ericP
>> >
>> > --
>> > -ericP
>> >
>> >
>
> --
> -ericP

Received on Wednesday, 10 April 2013 21:24:16 UTC