- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Fri, 1 Mar 2002 18:00:15 -0000
- To: <w3c-rdfcore-wg@w3.org>, <w3c-i18n-ig@w3.org>
Misha and I drilled down on the RDF M&S spec to try to understand what it says. There is a very significant erratum to the XML spec that impacts RDF M&S.

My understanding at the end of this was that:

When it was written, RDF M&S said that IRIs (the term is not used) are %-escaped to make US-ASCII URIs.

Now, despite not having changed, M&S says that IRIs are not %-escaped.

[ Aside: If I understand Misha's position, it is that the XML erratum is merely a clarification, so that while RDF M&S may have appeared to say that IRIs are %-escaped, it never actually did. The erratum now shows that it in fact said all along that IRIs are not %-escaped. ]

=== More detail ===========

RDF M&S para 204
http://lists.w3.org/Archives/Public/www-archive/2001Jun/att-0021/00-part#204
[[[
Note: Although non-ASCII characters in URIs are not allowed by [URI], [XML] specifies a convention to avoid unnecessary incompatibilities in extended URI syntax. Implementors of RDF are encouraged to avoid further incompatibility and use the XML convention for system identifiers. Namely, that a non-ASCII character in a URI be represented in UTF-8 as one or more bytes, and then these bytes be escaped with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
]]]

This introduces a dependency between RDF M&S and the (changing) document
http://www.w3.org/TR/REC-xml
which, when M&S was written, identified a logical document based on
http://www.w3.org/TR/1998/REC-xml-19980210
with some errata
http://www.w3.org/XML/xml-19980210-errata

The particular section referred to was
http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent
[[[
An XML processor should handle a non-ASCII character in a URI by representing the character in UTF-8 as one or more bytes, and then escaping these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
]]]

I read that combination as suggesting that the identifier for RDF purposes was (at the time of writing) the US-ASCII URI.

By the time of the second edition of the XML spec, we find that
http://www.w3.org/TR/2000/REC-xml-20001006#sec-external-ent
clarifies the algorithm:
[[[
URI references require encoding and escaping of certain characters. The disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the number sign (#) and percent sign (%) characters and the square bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters must be escaped as follows: Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or more bytes. Any octets corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value). The original character is replaced by the resulting character sequence.
]]]

More recently we find a further erratum, E26 (a clarification), at
http://www.w3.org/XML/xml-V10-2e-errata#E26
This specifies when the %-escaping happens. (Note: the linked version has helpful color highlighting.)
[[[
[ Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs.
A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.

System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F.

Since escaping is not always a fully reversible process, it must be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it should trigger escaping. When escaping does occur, it must be performed as follows:

1. Each disallowed character to be escaped is represented in UTF-8 [IETF RFC 2279] as one or more bytes.
2. The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
3. The original character is replaced by the resulting character sequence.
]]]

On my reading, this clarity about when %-escaping happens for XML System Literals also applies to M&S para 204 (although that is certainly arguable). If so, this reverses the meaning in terms of the identity of the IRI.

It is not "absolutely necessary" to perform the escaping during tidying of an RDF graph. Graph tidying is distinct from dereferencing.
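
To make the three escaping steps and the identity question concrete, here is a minimal Python sketch. The function name escape_e26 and the example.org IRI are made up for illustration; the set of characters to escape is taken from the list in the erratum text quoted above, and this is only one possible reading of it.

```python
# ASCII characters the quoted erratum lists for escaping:
# controls #x0-#x1F, space #x20, #x7F, delimiters < > ", and the "unwise" characters.
_E26_ASCII = set(range(0x00, 0x21)) | {0x7F} | {ord(c) for c in '<>"{}|\\^`'}


def escape_e26(s: str) -> str:
    """Hypothetical helper: apply steps 1-3 of the quoted algorithm to a string."""
    out = []
    for ch in s:
        cp = ord(ch)
        if cp > 0x7F or cp in _E26_ASCII:
            # Step 1: represent the character in UTF-8.
            # Step 2: turn each byte into %HH.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            # '%' and '#' are deliberately left alone, as in the quoted text.
            out.append(ch)
    # Step 3: the escaped sequences have replaced the original characters.
    return "".join(out)


iri = "http://example.org/caf\u00e9"      # made-up IRI containing U+00E9
escaped = escape_e26(iri)                 # "http://example.org/caf%C3%A9"

# Compared as plain strings, the two forms are different identifiers.
print(iri == escaped)                     # False

# Escaping loses information: both forms map to the same escaped string.
print(escape_e26(escaped) == escaped)     # True
```

If graph tidying compares identifiers before any escaping, iri and escaped label two distinct nodes; if escaping is applied first, they collapse into one. That is the difference between the two readings of para 204.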
Thus the phrases "as part of the process of dereferencing" and "only when absolutely necessary and as late as possible in a processing chain" both indicate that RDF does not do %-escaping of URIs, except when doing something like locating a schema, as in Mike's system.

=== Summary of my position.

The past is unclear; let's try to make the right decision for the future.
Received on Friday, 1 March 2002 13:00:33 UTC