- From: Peter Patel-Schneider <pfpschneider@gmail.com>
- Date: Wed, 10 Apr 2013 13:05:00 -0700
- To: "Eric Prud'hommeaux" <eric@w3.org>
- Cc: W3C RDF WG <public-rdf-wg@w3.org>
- Message-ID: <CAMpDgVw9XVZ2KS_OZDVTe2-FjTsrMxgn3j47Bs+ruMP8sW1+Jw@mail.gmail.com>
The situation is rather murky, at best.

http://www.w3.org/TR/REC-xml/#charsets votes 2 (text definition and
comment) to 1 (grammar) for allowing control characters:

  2.2 Characters

  [Definition: A parsed entity contains text, a sequence of characters,
  which may represent markup or character data.] [Definition: A
  character is an atomic unit of text as specified by ISO/IEC 10646:2000
  [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed,
  and the legal characters of Unicode and ISO/IEC 10646. The versions of
  these standards cited in A.1 Normative References were current at the
  time this document was prepared. New characters may be added to these
  standards by amendments or new editions. Consequently, XML processors
  MUST accept any character in the range specified for Char. ]

  Character Range

  [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
               [#x10000-#x10FFFF]
      /* any Unicode character, excluding the surrogate blocks, FFFE,
         and FFFF. */

http://www.unicode.org/charts/PDF/U0000.pdf shows that ASCII control
characters (including 0x0) are Unicode characters.

I wonder how many XML processors forbid control characters.

Unfortunately, the public version of ISO 10646 does not appear to be
currently accessible. It *is* annoying for a W3C standard to point to
another standard whose master version is not freely available.

peter

On Wed, Apr 10, 2013 at 12:38 PM, Eric Prud'hommeaux <eric@w3.org> wrote:

> Tests like LITERAL1_all_controls include control codes not allowed in
> xsd:string. XSD says that xsd:strings are XML character data:
>
> [[
> The ·value space· of string is the set of finite-length sequences of
> characters (as defined in [XML 1.0 (Second Edition)]) that ·match·
> the Char production from [XML 1.0 (Second Edition)].
> ]] -- http://www.w3.org/TR/xmlschema-2/#string
>
> XML character data excludes non-whitespace control characters:
>
> [[
> A parsed entity contains text…Legal characters are tab, carriage
> return, line feed, and the legal characters of Unicode and ISO/IEC
> 10646.
> ]] -- http://www.w3.org/TR/REC-xml/#dt-character
>
> Point 4 below explains why this calls into question whether any
> string can contain (so called "C0") control codes and be typed as
> an xsd:string.
>
> I have to say, I've always appreciated that RDF doesn't make me
> uu-encode or invent escaping mechanisms all the time like XML does;
> this control code issue is tied to a behavior which makes RDF
> (e.g. Turtle) considerably more flexible and easy to deal with.
>
>
> * Eric Prud'hommeaux <eric@w3.org> [2013-04-07 17:55-0400]
> > I've had these niggling doubts for a while, and finally succumbed to
> > that morbid desire to explore some problems that I'd rather not know
> > about. We've all known for a while that we can create graphs with APIs
> > (now even serializable in Turtle) which can't be written in RDF/XML.
> > Here's a list of issues I think we need to clarify:
> >
> >
> >
> > 1 Namespaces are OK syntactically[nssyn], though our notion of namespace
> >   IRIs is of course outside the Namespaces definition as URIs [nsURI].
> >   [nssyn] http://www.w3.org/TR/REC-xml-names/#NT-Attribute
> >   [nsURI] http://www.w3.org/TR/REC-xml-names/#dt-namespace
> >
> > ------------------------------------------------------------
> >
> >
> > 2 QNames forbid a raft of [first] and [nth] characters which are
> >   permissible in [IRIs].
> >
> >   first: [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
> >          [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
> >          [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
> >          [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
> >          [#x10000-#xEFFFF]
> >
> >   nth:   first | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] |
> >          [#x203F-#x2040]
> >   http://www.w3.org/TR/REC-xml-names/#NT-NCName
> >
> >   IRIs:  ipchar = [A-Z] | "_" | [a-z] | [0-9] | "-" | "." | "~" |
> >          "%" HEX HEX | "!" | "$" | "&" | "'" | "(" | ")" |
> >          "*" | "+" | "," | ";" | "=" | ":" | "@" |
> >          [#xA0-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFEF] |
> >          [#x10000-#x1FFFD] | [#x20000-#x2FFFD] |
> >          [#x30000-#x3FFFD] | [#x40000-#x4FFFD] |
> >          [#x50000-#x5FFFD] | [#x60000-#x6FFFD] |
> >          [#x70000-#x7FFFD] | [#x80000-#x8FFFD] |
> >          [#x90000-#x9FFFD] | [#xA0000-#xAFFFD] |
> >          [#xB0000-#xBFFFD] | [#xC0000-#xCFFFD] |
> >          [#xD0000-#xDFFFD] | [#xE1000-#xEFFFD]
> >   http://tools.ietf.org/html/rfc3987#section-2.2
> >
> > ------------------------------------------------------------
> >
> >
> > 3 XML content excludes [#x00-#x08] [#x0B-#x0C] [#x0E-#x1F], all of
> >   which are permitted in "Unicode strings" and thus RDF literals
> >   [Rlit]. This applies regardless of CDATA enclosure or entity
> >   substitution.
> >   [Rlit] https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-concepts/index.html#dfn-lexical-form
> >   [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> >                [#x10000-#x10FFFF]
> >   http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Char
> >
> > ------------------------------------------------------------
> >
> >
> > 4 XML Schema also prohibits the above control characters from
> >   appearing in something typed as xsd:string [string].
> >   [string] http://www.w3.org/TR/xmlschema-2/#dt-string
> >
> > ------------------------------------------------------------
> >
> >
> > For 4, I propose notes in RDF Concepts and the serialization syntaxes
> > (e.g. Turtle). For the others, I wonder if we're forced into some
> > miserable escaping mechanism applied on top of XML.
> >
> > --
> > -ericP
>
> --
> -ericP
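
The grammar-only reading discussed in this thread can be checked mechanically. The following is a minimal sketch, assuming Python 3 and its standard `re` module. The names `XML_CHAR`, `XML_CHAR_RUN`, `is_xml_char_data`, and `excluded_code_points` are hypothetical helpers invented for illustration, and the character class is transcribed from production [2] Char as quoted above. It tests only the grammar, not the looser prose definition, so it does not settle which reading of the XML specification is correct.

```python
import re
from typing import List

# Character class transcribed from XML 1.0 production [2] Char:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_CHAR_CLASS = (
    r"[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
XML_CHAR = re.compile(_CHAR_CLASS)            # exactly one Char
XML_CHAR_RUN = re.compile(_CHAR_CLASS + "*")  # zero or more Chars


def is_xml_char_data(text: str) -> bool:
    """Return True if every code point in text matches the Char production."""
    return XML_CHAR_RUN.fullmatch(text) is not None


def excluded_code_points(text: str) -> List[int]:
    """List the code points in text that fall outside the Char production."""
    return [ord(ch) for ch in text if XML_CHAR.fullmatch(ch) is None]


if __name__ == "__main__":
    # A literal that an RDF API would accept as a Unicode string but that
    # the Char grammar, read strictly, would reject.
    literal = "bell\x07 and null\x00"
    print(is_xml_char_data(literal))
    print([hex(cp) for cp in excluded_code_points(literal)])
```

Under the grammar-only reading, a serializer could run a check like this before emitting a literal as RDF/XML or typing it as xsd:string, and report or escape the offending code points rather than produce ill-formed XML.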
Received on Wednesday, 10 April 2013 20:05:28 UTC