- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Mon, 16 Apr 2007 12:48:59 +0100
- To: Michael Kifer <kifer@cs.sunysb.edu>
- CC: Dave Reynolds <der@hplb.hpl.hp.com>, Christian de Sainte Marie <csma@ilog.fr>, RIF WG <public-rif-wg@w3.org>
Yes: IRIs are a superset of URIs.
Supporting text below.
On the question of character sets the difference is as follows:
[[
A Uniform Resource Identifier (URI) is a compact sequence of
characters
]] [1] and
[[
A URI is a sequence of characters from a
very limited set: the letters of the basic Latin alphabet, digits,
and a few special characters.
]] [1]
versus
[[
An IRI is a sequence of characters from the
Universal Character Set (Unicode/ISO 10646).
]] [2]
That is, both are simply sequences of characters (i.e. abstract letters).
The definition of 'character' is given in BCP 19:
[[
A member of a set of elements used for the organization, control, or
representation of data.
]] [3]
The set of letters used for URIs is a subset of that used for IRIs (and
a small subset!)
Neither specification (RFC 3986 for URIs, RFC 3987 for IRIs) requires any
specific encoding of such characters. As it is, any sequence of characters
from the URI set, when encoded in US-ASCII, comes to a sequence of bytes.
When the same sequence is encoded as UTF-8, it comes to the same sequence
of bytes. So even at the binary level, the typical use of both
specifications is compatible.
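To make the byte-level point concrete, here is a small Python sketch. The
URI and IRI strings are made-up examples (example.org is a placeholder),
not taken from either RFC.

```python
# Hypothetical example URI: every character is in the RFC 3986 set.
uri = "http://example.org/path?query=a%20b#frag"

# Encoding the same character sequence two ways.
ascii_bytes = uri.encode("ascii")
utf8_bytes = uri.encode("utf-8")

# For characters from the URI set the two encodings coincide.
same_bytes = ascii_bytes == utf8_bytes

# Hypothetical example IRI: contains U+00E9, which is beyond U+007F.
iri = "http://example.org/caf\u00e9"
iri_utf8 = iri.encode("utf-8")

# US-ASCII cannot represent this character, so only the IRI side
# of the comparison has an encoding for it.
try:
    iri.encode("ascii")
    iri_is_ascii = True
except UnicodeEncodeError:
    iri_is_ascii = False

print(same_bytes, iri_is_ascii)
```

The comparison only illustrates the encoding argument; it says nothing
about whether a string is syntactically valid under either grammar.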
On the more general question of the relationship between the two:
Supporting text:
=================
1.1
[[
This document defines a new protocol element called Internationalized
Resource Identifier (IRI) by extending the syntax of URIs to a much
wider repertoire of characters.
]] [2]
[[
2.1. Summary of IRI Syntax
IRIs are defined similarly to URIs in [RFC 3986], but the class of
unreserved characters is extended by adding the characters of the UCS
(Universal Character Set, [ISO10646]) beyond U+007F, subject to the
limitations given in the syntax rules below and in section 6.1.
Otherwise, the syntax and use of components and reserved characters
is the same as that in [RFC 3986].
]] [2]
A detailed study of the rules in section 2.2 shows that this goal is
achieved, and the "limitations" do not contradict the fact that all URIs
are IRIs.
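As a rough illustration of the character-level part of that claim, the
following Python sketch checks only which characters a string uses. It is
a simplification under stated assumptions: it does not implement the full
grammar of either RFC, it leaves out the iprivate ranges (which RFC 3987
allows only in the query component), and the function names are my own.

```python
import string

# Characters RFC 3986 allows in a URI: unreserved, gen-delims,
# sub-delims, plus "%" for percent-encoded octets.
URI_CHARS = set(
    string.ascii_letters + string.digits
    + "-._~"          # remaining unreserved characters
    + ":/?#[]@"       # gen-delims
    + "!$&'()*+,;="   # sub-delims
    + "%"
)

# ucschar ranges from RFC 3987 section 2.2 (iprivate omitted here).
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 14)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(c):
    """True if c falls in one of the ucschar ranges."""
    cp = ord(c)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


def uses_only_uri_chars(s):
    """Character-repertoire check only, not a URI parser."""
    return all(c in URI_CHARS for c in s)


def uses_only_iri_chars(s):
    """Same check with the repertoire extended by ucschar."""
    return all(c in URI_CHARS or is_ucschar(c) for c in s)
```

Because the IRI repertoire here is the URI repertoire plus extra ranges,
any string that passes uses_only_uri_chars also passes
uses_only_iri_chars, while a string such as "http://example.org/caf\u00e9"
passes only the second check.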
Jeremy
[1] http://rfc.net/rfc3986.txt
[2] http://rfc.net/rfc3987.txt
[3] http://rfc.net/bcp19.html
Note: I understand that the chairs are minded to not yet table this
issue for discussion. If it is contentious then that is understandable.
I expect Dave will prod me again when they do. I strongly support the
use of IRIs.
Michael Kifer wrote:
>> 1. They are a superset of URIs and specifying the superset seems like
>> the safe default course. If someone especially wanted a dialect with
>> syntactic restriction to URIs then they could add that restriction in
>> the dialect.
>
> Can somebody give a synopsis of URI vs. IRI?
> On the surface, it seems that IRIs are a superset, but
> in the last telecon I asked if this is true and somebody (forgot who) said
> that they aren't because IRIs use unicode and uris ascii.
>
> In any case, I made some small changes along the lines of what was
> discussed, which states that rif:uri can be a uri or a iri. Also, I
> proposed to the chairs (I think somebody also mentioned this at the
> telecon) to call this thing rif:resource. The issue whether it will be a
> uri or an iri can be decided later. If uris are a subset of iris then
> deciding either way for now (provided that we call it rif:resource) will be
> acceptable and can be changed later. If one is not a subset of the other
> then still the decision can be changed later without major consequences.
>
>
> --michael
>
--
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England
Received on Monday, 16 April 2007 11:49:46 UTC