Re: escaping % in RDF URI references

Hello Jeremy,

Sorry for not responding earlier. I was traveling for the last
two weeks, and will be on vacation up to and including next
Monday.

Control characters should not be allowed in URIs/IRIs.

I think the potential for control characters got into RDF because
the XML text was carefully crafted with the restrictions of XML
in mind (restrictions now removed to some extent in XML 1.1).

In addition, there has been a prolonged discussion about whether
to allow spaces and other ASCII characters not allowed in URIs
(such as '>' and '<'). After feedback from XML Schema (regarding
space), rather strong mails to the TAG, and clear opinions at
the IETF in San Francisco, we have had to remove these
characters from the definition again. For details,
please see http://www.w3.org/International/iri-edit.
The current text still allows for specifications to allow such
characters before a transform to IRIs. But in particular for
RDF, this is not really advisable, because either this complicates
the parser (defining that the parser changes " " to "%20", and
so on), or it will create differences for comparison of IRIs
across specs (because with character-by-character comparison,
" " and "%20" do not compare as equal).
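
To make the comparison problem concrete, here is a minimal Python
sketch. It is only an illustration, not text from any spec: the
helper name escape_for_uri and the SAFE character set are my own
assumptions, loosely modeled on the escaping rule in the RDF text
quoted below (UTF-8 encode, then %-escape octets that are not
permitted, leaving "#", "%", "[" and "]" alone).

```python
from urllib.parse import quote

# Characters left unescaped in this sketch (an assumption): URI reserved
# characters and marks, plus "#", "%", "[" and "]". quote() itself never
# escapes ASCII letters, digits, or "_.-~".
SAFE = ";/?:@&=+$,-_.!~*'()#%[]"


def escape_for_uri(ref: str) -> str:
    """Hypothetical helper: UTF-8 encode ref, then %-escape octets not in SAFE."""
    return quote(ref, safe=SAFE, encoding="utf-8")


with_space = "http://example.org/a b"
pre_escaped = "http://example.org/a%20b"

# Character-by-character comparison treats these as different strings...
print(with_space == pre_escaped)
# ...even though both map to the same URI once escaped.
print(escape_for_uri(with_space) == escape_for_uri(pre_escaped))
```

The first comparison is what a spec using plain string equality
would see; the second is what a parser that escapes on input
would see. Either choice is workable, but mixing them across
specs is where the trouble starts.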

As for the texts, I think both of them have advantages and
disadvantages. The best thing is to hurry up with the IRI
spec and remove these problems.


Regards,    Martin.


At 16:29 03/09/11 +0200, Jeremy Carroll wrote:

>Jeremy:
> > Personally my preference would be to follow Martin Durst's advice ...
> > [here at least :) ].
>
>Brian:
> > Are you suggesting soliciting further advice?
>
>Yes - Martin any comments,
>
>would it be better to go with our current text
>[[
>6.4 RDF URI References
>A URI reference within an RDF graph (an RDF URI reference) is a Unicode
>string [UNICODE] that would produce a valid URI character sequence (per
>RFC2396 [URI], sections 2.1) representing an absolute URI with optional
>fragment identifier when subjected to the encoding described below.
>
>The encoding consists of:
>
>1. encoding the Unicode string as UTF-8 [RFC-2279], giving a sequence of
>octet values.
>2. %-escaping octets that do not correspond to permitted US-ASCII characters.
>
>The disallowed octets that must be %-escaped include all those that do
>not correspond to US-ASCII characters, and the excluded characters listed in
>Section 2.4 of [URI], except for the number sign (#), percent sign (%), and
>the square bracket characters re-allowed in [RFC-2732].
>
>Disallowed octets must be escaped with the URI escaping mechanism (that is,
>converted to %HH, where HH is the 2-digit hexadecimal numeral corresponding
>to the octet value).
>
>Two RDF URI references are equal if and only if they compare as equal,
>character by character, as Unicode strings.
>
>Note: RDF URI references are compatible with the anyURI datatype as defined
>by XML schema datatypes [XML-SCHEMA2], constrained to be an absolute rather
>than a relative URI reference.
>
>Note: RDF URI references are compatible with International Resource
>Identifiers as defined by [XML Namespaces 1.1].
>
>Note: The restriction to absolute URI references is found in this abstract
>syntax. When there is a well-defined base URI, concrete syntaxes, such as
>RDF/XML, may permit relative URIs as a shorthand for such absolute URI
>references.
>]]
>
>or text based on
>http://www.w3.org/TR/xml-names11/#IRIs
>
>[[
>Work is currently in progress to produce an RFC defining Internationalized
>Resource Identifiers (IRIs). Since this work is not yet complete, in this
>section we give a syntactic definition of IRIs for the purposes of this
>specification. We expect to issue an erratum replacing this section with a
>reference to the RFC when it is published. Users defining namespaces are
>advised to restrict namespace names to URIs until software supporting IRIs
>is in common use.
>
>For a more general definition and discussion of IRIs see [IRI draft] (work
>in progress).
>
>URI references are restricted to a subset of the ASCII characters; IRI
>references allow some of the disallowed ASCII characters as well as most
>Unicode characters from #xA0 onwards.
>
>[Definition: The additional characters allowed in IRIs are: ]
>
>+ space #x20
>
>+ the delimiters < #x3C, > #x3E and " #x22
>
>+ the unwise characters { #x7B, } #x7D, | #x7C, \ #x5C, ^ #x5E and ` #x60
>
>+ the Unicode plane 0 characters #xA0 - #xD7FF, #xF900-#xFDCF, #xFDF0-#xFFEF
>
>+ the Unicode plane 1-14 characters #x10000-#x1FFFD ... #xE000-#xEFFD
>
>[Definition: An IRI reference is a string that can be converted to a URI
>reference by escaping all additional characters as follows: ]
>
>1. Each additional character is converted to UTF-8 [Unicode 3.2] as one or
>more bytes.
>
>2. The resulting bytes are escaped with the URI escaping mechanism (that is,
>converted to %HH, where HH is the hexadecimal notation of the byte value).
>
>The original character is replaced by the resulting character sequence.
>
>
>]]
>
>Noting that RDF Core WG has declined a comment suggesting using the term IRI
>throughout, so that the definition would remain a definition of "RDF URI
>references".
>
>A specific question is ctrl characters - should they be allowed or not?
>
>Jeremy

Received on Friday, 12 September 2003 08:31:16 UTC