XML characters and literals from Graham Klyne on 2003-07-25 (w3c-rdfcore-wg@w3.org from July 2003)

From: Graham Klyne <gk@ninebynine.org>
Date: Fri, 25 Jul 2003 15:56:30 +0100
To: rdf core <w3c-rdfcore-wg@w3.org>
Cc: pat hayes <phayes@ihmc.us>
Message-Id: <5.1.0.14.2.20030725153840.02a5ce40@127.0.0.1>
Special characters in XML...

Short answer:  the excluded characters *cannot* be included in RDF/XML 
literal values... for reason, skip to end, and text following production [66].

Details:

XML 1.1:

[[
2.2 Characters

Change production [2]:

  [2]     Char    ::=    #x9 | #xA | #xD | [#x20-#x7E] | #x85 | [#xA0-#xD7FF]
                       | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Change the associated comment to read:

any Unicode character, excluding most ISO controls, the surrogate blocks, 
FFFE, and FFFF
]]
-- http://www.w3.org/TR/xml11/#sec2.2

Otherwise, no changes from the sections in XML 1.0 that I'll cite.

XML 1.0:

[[
[2]    Char    ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate 
blocks, FFFE, and FFFF. */
]]
-- http://www.w3.org/TR/REC-xml#NT-Char

These are the allowed charaters.

[[
2.4 Character Data and Markup

Text consists of intermingled character data and markup. [Definition: 
Markup takes the form of start-tags, end-tags, empty-element tags, entity 
references, character references, comments, CDATA section delimiters, 
document type declarations, processing instructions, XML declarations, text 
declarations, and any white space that is at the top level of the document 
entity (that is, outside the document element and not inside any other 
markup).]

[Definition: All text that is not markup constitutes the character data of 
the document.]

The ampersand character (&) and the left angle bracket (<) may appear in 
their literal form only when used as markup delimiters, or within a 
comment, a processing instruction, or a CDATA section. If they are needed 
elsewhere, they must be escaped using either numeric character references 
or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) 
may be represented using the string "&gt;", and must, for compatibility, be 
escaped using "&gt;" or a character reference when it appears in the string 
"]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters 
which does not contain the start-delimiter of any markup. In a CDATA 
section, character data is any string of characters not including the 
CDATA-section-close delimiter, "]]>".

To allow attribute values to contain both single and double quotes, the 
apostrophe or single-quote character (') may be represented as "&apos;", 
and the double-quote character (") as "&quot;".
Character Data

[14]    CharData    ::=    [^<&]* - ([^<&]* ']]>' [^<&]*)
]]
-- http://www.w3.org/TR/REC-xml#syntax

This says informally that character data includes character references.

As for the content of attributes and elements, see:
   http://www.w3.org/TR/REC-xml#sec-logical-struct

[[
Content of Elements
[43]    content    ::=    CharData? ((element | Reference | CDSect | PI | 
Comment) CharData?)* /* */
]]
-- http://www.w3.org/TR/REC-xml#NT-content

So, again, element content includes 'Reference'...

[[
[10]    AttValue    ::=    '"' ([^<&"] | Reference)* '"'
|  "'" ([^<&'] | Reference)* "'"
]]
-- http://www.w3.org/TR/REC-xml#NT-AttValue

Attribute values include 'Reference'...

[[
[67]    Reference    ::=    EntityRef | CharRef
[68]    EntityRef    ::=    '&' Name ';' [WFC: Entity Declared]
[VC: Entity Declared]
[WFC: Parsed Entity]
[WFC: No Recursion]
]]
-- http://www.w3.org/TR/REC-xml#NT-Reference


[[
[66]    CharRef    ::=    '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';' [WFC: Legal Character]

Well-formedness constraint: Legal Character

Characters referred to using character references must match the production 
for Char.
]]
-- http://www.w3.org/TR/REC-xml#NT-CharRef

OOPS!  That well-formedness constraint means the awkward characters are not 
allowed.  I was wrong, the awkward characters cannot be encoded in XMl.

#g


-------------------
Graham Klyne
<GK@NineByNine.org>
PGP: 0FAA 69FF C083 000B A2E9  A131 01B9 1C7A DBCA CB5E
Received on Friday, 25 July 2003 11:11:42 UTC