- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Wed, 10 Apr 2013 16:46:31 -0400
- To: Peter Patel-Schneider <pfpschneider@gmail.com>
- Cc: W3C RDF WG <public-rdf-wg@w3.org>
* Peter Patel-Schneider <pfpschneider@gmail.com> [2013-04-10 13:05-0700]
> The situation is rather murky, at best.
>
> http://www.w3.org/TR/REC-xml/#charsets votes 2 (text definition and comment)
> to 1 (grammar) for allowing control characters:
>
> 2.2 Characters
>
> [Definition: A parsed entity contains text, a sequence of characters, which
> may represent markup or character data.]
> [Definition: A character is an atomic unit of text as specified by ISO/IEC
> 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line
> feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of
> these standards cited in A.1 Normative References were current at the time
> this document was prepared. New characters may be added to these standards
> by amendments or new editions. Consequently, XML processors MUST accept any
> character in the range specified for Char. ]
> Character Range
>
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]
> /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
>
>
> http://www.unicode.org/charts/PDF/U0000.pdf shows that ASCII
> control characters (including 0x0) are Unicode characters.
I think the above makes it pretty clear that our literals are
sequences of Unicode characters, not sequences of XML Chars.
Yet we call them xs:strings.
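
To make the gap concrete, here is a minimal Python sketch (not from any spec or library; the names NON_XML_CHAR and non_xml_chars are made up for illustration) that reports which characters of a lexical form fall outside the Char production quoted above:

```python
import re

# Complement of the XML 1.0 Char production ([2] in the quoted text):
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# NON_XML_CHAR is a hypothetical helper name, not part of any library.
NON_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def non_xml_chars(lexical_form):
    """Return the code points (as 'U+XXXX') that do not match Char."""
    return [f"U+{ord(c):04X}" for c in NON_XML_CHAR.findall(lexical_form)]
```

An empty result means the string is also a sequence of XML Chars; anything else (say, non_xml_chars("a\x01b")) names the code points that a Unicode-string literal can carry but XML character data cannot.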
> I wonder how many XML processors forbid control characters.
at least LibXML and Expat:
DB<1> use XML::LibXML
DB<2> p $dom = XML::LibXML->load_xml(string => "<a>\x01</a>");
:1: parser error : PCDATA invalid Char value 1
<a></a>
^
DB<3> use XML::Parser
DB<4> $p1 = new XML::Parser(Style => 'Debug')
DB<5> p $p1->parse("<a>\x01</a>")
not well-formed (invalid token) at line 1, column 3, byte 3 at /usr/lib/perl5/XML/Parser.pm line 187
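
The same probe can be sketched in Python, whose xml.etree.ElementTree sits on top of Expat. This is only an illustration under that assumption; parses_as_xml is a made-up helper name:

```python
import xml.etree.ElementTree as ET

def parses_as_xml(document):
    """Return True if the Expat-backed parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        # Expat reports characters outside Char as a well-formedness error.
        return False

# Raw control character, character reference, and CDATA variants of U+0001.
probes = ["<a>\x01</a>", "<a>&#1;</a>", "<a><![CDATA[\x01]]></a>"]
rejected = [p for p in probes if not parses_as_xml(p)]
```

I would expect all three probes to land in rejected, since neither a character reference nor a CDATA section makes U+0001 acceptable, while something like "<a>\t</a>" parses fine.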
> Unfortunately, the public version of ISO 10646 does not appear to be
> currently accessible. It *is* annoying for a W3C standard to point to
> another standard whose master version is not freely available.
yeah, that's why I always look at the Unicode code charts (as you
apparently did above).
> peter
>
> On Wed, Apr 10, 2013 at 12:38 PM, Eric Prud'hommeaux <eric@w3.org> wrote:
>
> > Tests like LITERAL1_all_controls include control codes not allowed in
> > xsd:string. XSD says that xsd:strings are XML character data:
> >
> > [[
> > The ·value space· of string is the set of finite-length sequences of
> > characters (as defined in [XML 1.0 (Second Edition)]) that ·match·
> > the Char production from [XML 1.0 (Second Edition)].
> > ]] — http://www.w3.org/TR/xmlschema-2/#string
> >
> > XML character data excludes non-whitespace control characters:
> >
> > [[
> > A parsed entity contains text…Legal characters are tab, carriage
> > return, line feed, and the legal characters of Unicode and ISO/IEC
> > 10646.
> > ]] — http://www.w3.org/TR/REC-xml/#dt-character
> >
> > Points 3 and 4 below explain why this calls into question whether any
> > string can contain (so called "C0") control codes and be typed as
> > an xsd:string.
> >
> > I have to say, I've always appreciated that RDF doesn't make me
> > uu-encode or invent escaping mechanisms all the time like XML does;
> > this control code issue is tied to a behavior which makes RDF
> > (e.g. Turtle) considerably more flexible and easy to deal with.
> >
> >
> > * Eric Prud'hommeaux <eric@w3.org> [2013-04-07 17:55-0400]
> > > I've had these niggling doubts for a while, and finally succumbed to
> > > that morbid desire to explore some problems that I'd rather not know
> > > about. We've all known for a while that we can create graphs with APIs
> > > (now even serializable in Turtle) which can't be written in RDF/XML.
> > > Here's a list of issues I think we need to clarify:
> > >
> > >
> > >
> > > 1 Namespaces are OK syntactically[nssyn], though our notion of namespace
> > > IRIs is of course outside the Namespaces definition as URIs [nsURI].
> > > [nssyn] http://www.w3.org/TR/REC-xml-names/#NT-Attribute
> > > [nsURI] http://www.w3.org/TR/REC-xml-names/#dt-namespace
> > >
> > > ------------------------------------------------------------
> > >
> > >
> > > 2 QNames forbid a raft of [first] and [nth] characters which are
> > > permissible in [IRIs].
> > >
> > > first: [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
> > > [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> > > [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
> > > [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
> > > [#x10000-#xEFFFF]
> > >
> > > nth: first | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] |
> > > [#x203F-#x2040]
> > > http://www.w3.org/TR/REC-xml-names/#NT-NCName
> > >
> > > IRIs: ipchar = [A-Z] | "_" | [a-z] | [0-9] | "-" | "." | "~" |
> > > "%" HEX HEX | "!" | "$" | "&" | "'" | "(" | ")" |
> > > "*" | "+" | "," | ";" | "=" | ":" | "@" |
> > > [#xA0-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFEF] |
> > > [#x10000-#x1FFFD] | [#x20000-#x2FFFD] |
> > > [#x30000-#x3FFFD] | [#x40000-#x4FFFD] |
> > > [#x50000-#x5FFFD] | [#x60000-#x6FFFD] |
> > > [#x70000-#x7FFFD] | [#x80000-#x8FFFD] |
> > > [#x90000-#x9FFFD] | [#xA0000-#xAFFFD] |
> > > [#xB0000-#xBFFFD] | [#xC0000-#xCFFFD] |
> > > [#xD0000-#xDFFFD] | [#xE1000-#xEFFFD]
> > > http://tools.ietf.org/html/rfc3987#section-2.2
> > >
> > > ------------------------------------------------------------
> > >
> > >
> > > 3 XML content excludes [#x00-#x08] [#x0B-#x0C] [#x0E-#x1F], all of
> > > which are permitted in "Unicode strings" and thus RDF literals
> > > [Rlit]. This applies regardless of CDATA enclosure or entity
> > > substitution.
> > > [Rlit] https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#dfn-lexical-form
> > > [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> > > [#x10000-#x10FFFF]
> > > http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Char
> > >
> > > ------------------------------------------------------------
> > >
> > >
> > > 4 XML Schema also prohibits the above control characters from
> > > appearing in something typed as xsd:string [string].
> > > [string] http://www.w3.org/TR/xmlschema-2/#dt-string
> > >
> > > ------------------------------------------------------------
> > >
> > >
> > > For 4, I propose notes in RDF Concepts and the serialization syntaxes
> > > (e.g. Turtle). For the others, I wonder if we're forced into some
> > > miserable escaping mechanism applied on top of XML.
> > >
> > > --
> > > -ericP
> >
> > --
> > -ericP
> >
> >
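
For point 2 in the quoted list, a rough Python sketch of the "first" and "nth" character classes as a regular expression, to check whether the tail of an IRI could serve as the local part of a QName. The names NAME_START, NAME_REST, NCNAME and is_ncname are mine, chosen for illustration only:

```python
import re

# "first" characters, transcribed from the NCName ranges quoted in point 2.
NAME_START = (
    "A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# "nth" characters: first | "-" | "." | [0-9] | #xB7 | two combining ranges.
NAME_REST = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"
NCNAME = re.compile(f"[{NAME_START}][{NAME_REST}]*")

def is_ncname(local_part):
    """Return True if local_part matches first followed by zero or more nth."""
    return NCNAME.fullmatch(local_part) is not None
```

So is_ncname("label") holds, but an IRI ending in something like "2013-report" or "a~b" fails, even though every one of those characters is a legal ipchar.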
--
-ericP
Received on Wednesday, 10 April 2013 20:47:00 UTC