URI escaping in RDF M&S

Misha and I drilled down on the RDF M&S spec to try and understand what it
says.

There is a very significant erratum to the XML spec that impacts RDF M&S.

My understanding at the end of this was that:

When it was written, RDF M&S said that IRIs (the term is not used) are
%-escaped to make US-ASCII URIs.

Now, despite not having changed, M&S says that IRIs are not % escaped.

[
Aside:

If I understand Misha's position, it is that the XML erratum is merely a
clarification so that while RDF M&S may have appeared to say that IRIs are %
escaped, it never actually did. The erratum now shows that it in fact all
along said that IRIs are not % escaped.
]

===

More detail
===========

RDF M&S para 204

http://lists.w3.org/Archives/Public/www-archive/2001Jun/att-0021/00-part#204

[[[

Note: Although non-ASCII characters in URIs are not allowed by [URI], [XML]
specifies a convention to avoid unnecessary incompatibilities in extended
URI syntax. Implementors of RDF are encouraged to avoid further
incompatibility and use the XML convention for system identifiers. Namely,
that a non-ASCII character in a URI be represented in UTF-8 as one or more
bytes, and then these bytes be escaped with the URI escaping mechanism
(i.e., by converting each byte to %HH, where HH is the hexadecimal notation
of the byte value).
]]]
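
As a rough sketch of the convention quoted above (my own illustration, not
text from either spec; the function name escape_non_ascii and the example.org
URI are made up), each non-ASCII character is turned into its UTF-8 bytes and
each byte is written as %HH:

```python
def escape_non_ascii(uri: str) -> str:
    """%-escape every non-ASCII character via its UTF-8 bytes.

    Hypothetical helper: ASCII characters, including '%' and '#',
    are passed through unchanged.
    """
    out = []
    for ch in uri:
        if ord(ch) > 0x7F:
            # One %HH triplet per UTF-8 byte of the character.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


# U+00E9 is the bytes C3 A9 in UTF-8.
print(escape_non_ascii("http://example.org/r\u00e9sum\u00e9"))
# prints http://example.org/r%C3%A9sum%C3%A9
```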

This introduces a dependency between RDF M&S and the (changing) document

http://www.w3.org/TR/REC-xml

which when M&S was written identified a logical document based on

http://www.w3.org/TR/1998/REC-xml-19980210

with some errata

http://www.w3.org/XML/xml-19980210-errata

The particular section referred to was

http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent

[[[
An XML processor should handle a non-ASCII character in a URI by
representing the character in UTF-8 as one or more bytes, and then escaping
these bytes with the URI escaping mechanism (i.e., by converting each byte
to %HH, where HH is the hexadecimal notation of the byte value).
]]]

I read that combination as suggesting that the identifier for RDF purposes
was (at the time of writing) the US-ASCII URI.


By the time of the second edition of the XML spec, we find that
http://www.w3.org/TR/2000/REC-xml-20001006#sec-external-ent
clarifies the algorithm:
[[[
URI references require encoding and escaping of certain characters. The
disallowed characters include all non-ASCII characters, plus the excluded
characters listed in Section 2.4 of [IETF RFC 2396], except for the number
sign (#) and percent sign (%) characters and the square bracket characters
re-allowed in [IETF RFC 2732]. Disallowed characters must be escaped as
follows:

Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or
more bytes.

Any octets corresponding to a disallowed character are escaped with the URI
escaping mechanism (that is, converted to %HH, where HH is the hexadecimal
notation of the byte value).

The original character is replaced by the resulting character sequence.
]]]

More recently we find another erratum, E26 (a clarification),

at http://www.w3.org/XML/xml-V10-2e-errata#E26

This specifies when the % escaping happens.
(note the linked version has color highlighting that is helpful)
[[[
[ Definition: The SystemLiteral is called the entity's system identifier. It
is meant to be converted to a URI reference (as defined in [IETF RFC 2396],
updated by [IETF RFC 2732]), as part of the process of dereferencing it to
obtain input for the XML processor to construct the entity's replacement
text.] It is an error for a fragment identifier (beginning with a #
character) to be part of a system identifier. Unless otherwise provided by
information outside the scope of this specification (e.g. a special XML
element type defined by a particular DTD, or a processing instruction
defined by a particular application specification), relative URIs are
relative to the location of the resource within which the entity declaration
occurs. A URI might thus be relative to the document entity, to the entity
containing the external DTD subset, or to some other external parameter
entity.

System identifiers (and other XML strings meant to be used as URI
references) may contain characters that, according to [IETF RFC 2396] and
[IETF RFC 2732], must be escaped before a URI can be used to retrieve the
referenced resource. The characters to be escaped are the control characters
#x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the
delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B,
'}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all
characters above #x7F. Since escaping is not always a fully reversible
process, it must be performed only when absolutely necessary and as late as
possible in a processing chain. In particular, neither the process of
converting a relative URI to an absolute one nor the process of passing a
URI reference to a process or software component responsible for
dereferencing it should trigger escaping. When escaping does occur, it must
be performed as follows:

1. Each disallowed character to be escaped is represented in UTF-8 [IETF RFC
2279] as one or more bytes.

2. The resulting bytes are escaped with the URI escaping mechanism (that is,
converted to %HH, where HH is the hexadecimal notation of the byte value).

3. The original character is replaced by the resulting character sequence.

]]]
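
A minimal sketch of the three steps in E26, assuming I have transcribed the
character set correctly from the text above (must_escape and escape_e26 are
names I have invented for illustration):

```python
def must_escape(ch: str) -> bool:
    """True for the characters E26 lists as needing escaping."""
    cp = ord(ch)
    return (
        cp <= 0x20               # controls #x0-#x1F and space #x20
        or cp >= 0x7F            # #x7F and all characters above it
        or ch in '<>"{}|\\^`'    # delimiters and unwise characters
    )


def escape_e26(ref: str) -> str:
    """Apply steps 1-3: UTF-8 encode, %HH each byte, substitute."""
    return "".join(
        "".join(f"%{b:02X}" for b in ch.encode("utf-8")) if must_escape(ch) else ch
        for ch in ref
    )


print(escape_e26("a b"))    # prints a%20b
print(escape_e26("a%20b"))  # prints a%20b
```

Since '%' is not in the set, "a b" and "a%20b" both come out as "a%20b". That
is one way to see why the erratum calls escaping "not always a fully
reversible process".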


This clarity about when %-escaping happens for XML System Literals also
applies to M&S para 204 according to my reading (although it is certainly
arguable).

If so, this reverses the meaning in terms of the identity of the IRI. It is
not "absolutely necessary" to perform the escaping during tidying of an RDF
graph. Graph tidying is distinct from dereferencing. Thus the phrases "as
part of the process of dereferencing" and "only when absolutely necessary
and as late as possible in a processing chain" both indicate that RDF does
not %-escape URIs except when doing something like locating a schema, as
in Mike's system.
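
To make the difference concrete, here is a small sketch of the two readings.
It is my own illustration under simplifying assumptions: node labels are
plain strings, graph tidying is modelled as putting them in a set, and
escape_for_retrieval is a made-up helper that only handles non-ASCII
characters.

```python
def escape_for_retrieval(ref: str) -> str:
    """Hypothetical late-stage escaping, applied only when dereferencing."""
    return "".join(
        "".join(f"%{b:02X}" for b in ch.encode("utf-8")) if ord(ch) > 0x7F else ch
        for ch in ref
    )


unescaped = "http://example.org/r\u00e9sum\u00e9"
escaped = "http://example.org/r%C3%A9sum%C3%A9"

# Reading after E26: tidying compares the labels as written,
# so these are two distinct nodes.
nodes_late = {unescaped, escaped}

# Earlier reading: labels are escaped before comparison,
# so the two collapse into one node.
nodes_early = {escape_for_retrieval(n) for n in (unescaped, escaped)}

print(len(nodes_late), len(nodes_early))  # prints 2 1
```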

===


Summary of my position: the past is unclear; let's try to make the right
decision for the future.

Received on Friday, 1 March 2002 13:00:33 UTC