Re: [Admin] Agenda for RIF telecon 17 April from Jeremy Carroll on 2007-04-16 (public-rif-wg@w3.org from April 2007)

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Mon, 16 Apr 2007 12:48:59 +0100
To: Michael Kifer <kifer@cs.sunysb.edu>
CC: Dave Reynolds <der@hplb.hpl.hp.com>, Christian de Sainte Marie <csma@ilog.fr>, RIF WG <public-rif-wg@w3.org>
Message-ID: <462362AB.7080905@hpl.hp.com>

Yes: IRIs are a superset of URIs.

Supporting text below.

On the question of character sets the difference is as follows:

[[
  A Uniform Resource Identifier (URI) is a compact sequence of
    characters
]] [1] and
[[
  A URI is a sequence of characters from a
    very limited set: the letters of the basic Latin alphabet, digits,
    and a few special characters.
]] [1]

versus

[[
An IRI is a sequence of characters from the
    Universal Character Set (Unicode/ISO 10646).
]] [2]

i.e. both are simply sequences of characters (i.e. abstract letters).
The definition of 'character' is given in BCP 19:
[[
    A member of a set of elements used for the organization, control, or
    representation of data.
]] [3]

The set of letters used for URIs is a subset of that used for IRIs (and 
a small subset!)

Neither specification (RFC 3986 for URIs, RFC 3987 for IRIs) requires any 
specific encoding of such characters. As it happens, any sequence of 
characters from the URI set, when encoded in US-ASCII, comes to a 
sequence of bytes, and when the same sequence is encoded as UTF-8 it 
comes to the same sequence of bytes. So even at the binary level, the 
typical uses of the two specifications are compatible.
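
As a rough illustration of the byte-level point, here is a minimal Python
sketch. The character classes are my transcription of the RFC 3986
grammar (unreserved, gen-delims, sub-delims, plus "%"); the names
URI_CHARS and same_bytes and the example URI are invented for the
illustration and do not come from either specification.

```python
import string

# Character repertoire of RFC 3986 URIs, transcribed from the grammar:
# unreserved, gen-delims, sub-delims, and "%" for percent-encoding.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
URI_CHARS = UNRESERVED + GEN_DELIMS + SUB_DELIMS + "%"


def same_bytes(text: str) -> bool:
    """True if US-ASCII and UTF-8 encode `text` to identical bytes."""
    return text.encode("ascii") == text.encode("utf-8")


# Hypothetical example URI, used only for illustration.
uri = "http://example.org/rules?x=1#frag"

assert set(uri) <= set(URI_CHARS)               # drawn from the URI set
assert same_bytes(uri)                          # same bytes either way
assert all(same_bytes(c) for c in URI_CHARS)    # holds for the whole set
```

This only checks the character repertoire, not whether a string is a
syntactically valid URI.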

On the more general question of the relationship between the two:


Supporting text:
=================

Section 1.1:
[[
   This document defines a new protocol element called Internationalized
    Resource Identifier (IRI) by extending the syntax of URIs to a much
    wider repertoire of characters.
]] [2]

[[
2.1.  Summary of IRI Syntax

    IRIs are defined similarly to URIs in [RFC 3986], but the class of
    unreserved characters is extended by adding the characters of the UCS
    (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
    limitations given in the syntax rules below and in section 6.1.

    Otherwise, the syntax and use of components and reserved characters
    is the same as that in [RFC 3986].
]] [2]


A detailed study of the rules in section 2.2 shows that this goal is 
achieved, and the "limitations" do not contradict the fact that all URIs 
are IRIs.
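
To make the "extended unreserved class" concrete, here is a small Python
sketch of that one production. The ranges are my transcription of the
ucschar rule in RFC 3987 section 2.2; the function names are invented
for the illustration. It covers only the unreserved class, not the full
grammar or the limitations of section 6.1, so treat it as a sketch
rather than a validator.

```python
import string

# "unreserved" from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
URI_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# "ucschar" from RFC 3987 section 2.2, as inclusive code point ranges.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 to 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR_RANGES += [(p << 16, (p << 16) + 0xFFFD) for p in range(1, 0xE)]
# Plane 14 starts later: %xE1000-EFFFD.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """True if `ch` falls in one of the ucschar ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


def is_iunreserved(ch: str) -> bool:
    """iunreserved = the URI unreserved class, extended by ucschar."""
    return ch in URI_UNRESERVED or is_ucschar(ch)


# Every URI unreserved character is also an IRI unreserved character.
assert all(is_iunreserved(c) for c in URI_UNRESERVED)
# The extension admits characters beyond U+007F, e.g. U+00E9.
assert is_iunreserved("\u00e9") and "\u00e9" not in URI_UNRESERVED
```

The extension only ever adds characters; nothing permitted in a URI is
taken away, which is the sense in which all URIs are IRIs.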

Jeremy

[1]
http://rfc.net/rfc3986.txt
[2]
http://rfc.net/rfc3987.txt
[3]
http://rfc.net/bcp19.html

Note: I understand that the chairs are minded to not yet table this 
issue for discussion. If it is contentious then that is understandable. 
I expect Dave will prod me again when they do. I strongly support the 
use of IRIs.



Michael Kifer wrote:
>> 1. They are a superset of URIs and specifying the superset seems like 
>> the safe default course. If someone especially wanted a dialect with 
>> syntactic restriction to URIs then they could add that restriction in 
>> the dialect.
> 
> Can somebody give a synopsis of URI vs. IRI?
> On the surface, it seems that IRIs are a superset, but
> in the last telecon I asked if this is true and somebody (forgot who) said
> that they aren't because IRIs use unicode and uris ascii.
> 
> In any case, I made some small changes along the lines of what was
> discussed, which states that rif:uri can be a uri or a iri.  Also, I
> proposed to the chairs (I think somebody also mentioned this at the
> telecon) to call this thing rif:resource. The issue whether it will be a
> uri or an iri can be decided later. If uris are a subset of iris then
> deciding either way for now (provided that we call it rif:resource) will be
> acceptable and can be changed later. If one is not a subset of the other
> then still the decision can be changed later without major consequences.
> 
> 
> 	--michael  
> 

-- 
Hewlett-Packard Limited
registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England

Received on Monday, 16 April 2007 11:49:46 UTC