Re: A new RFC for Web Addresses/Hypertext References: Background wrt LEIRIs

Hello Henry,

Many thanks for this very good overview. I'm cross-posting this to the 
IRI list, because Lisa at one point proposed to have 
this kind of discussion there, as well as to the Apps Discuss list, 
to reach out to the relevant people in the IETF. 
I have also copied Lisa and Alex directly. I guess this is overall a bit 
too aggressive a cross-posting (but please tell me if you think I have 
missed somebody important). However, I hope we can converge quickly on 
where to move forward with what bits of the discussion/work.

On 2009/04/29 0:31, Henry S. Thompson wrote:
> There are currently five documents in this space (that I am aware of):
>   [URI] The current RFC governing URIs:
>   [IRI] The current RFC governing IRIs:
>   [IRI-BIS] The most recent draft of a planned update for the RFC
>             governing IRIs:
>   [LEIRI] A W3C Note defining Legacy Extended IRIs (extracted from [IRI-BIS]):
>   [WEBADDR] A preliminary draft of a possible RFC for Web Addresses
>             (extracted from HTML5 [1]):
>             [not yet in RFC format, converted version expected RSN]
> On the TAG telcon of 2009-04-17, there was some sense that this is too
> many specs in the same space. . .


> In order to contextualize and perhaps stimulate a possible effort to
> seek a rationalization here, here's _my_ understanding of how we got
> here.
> [URI] is the mature stage of a spec. which has been revised a number
>    of times.  It carries a certain amount of historical baggage with
>    it, particularly its restriction to 7-bit characters, but that also
>    ensures wide interoperability and preserves access to legacy
>    applications.
> [IRI] was intended to address the needs of the expanding Internet
>    and Web community, allowing most of Unicode into most parts of IRIs.
>    Rather than require upgrades in a wide range of applications and
>    uses, it did not set up IRIs as a _replacement_ for URIs across the
>    board, but as a _complement_ to URIs.  It therefore included an
>    explicit transcoding algorithm, for converting IRIs to URIs.
> [IRI-BIS] was initiated by the editors of [IRI] to correct several
>    errata to [IRI] and to address the exclusion from [IRI] of certain
>    characters and character ranges.

Yes. In that sense, it is not a separate document from [IRI], just an 
update. That's very usual for IETF work, a bit less for W3C work if one 
looks only at Recommendations.

So the effective number of documents is down from five to four.

> [LEIRI] had its origins in the XML family of W3C specifications.
>    The XML specification itself [2], as well as a number of other
>    XML-related specifications (including XML Base, XML Schema, XPointer
>    Framework, XML Signature) all involve appeal to a process for
>    converting arbitrary strings which are intended to identify web
>    resources into URIs.  They all incorporate more-or-less identical
>    prose excerpted from the XLink specification [3] which specifies how
>    this is to be done.
>    The XML Core WG has long been unhappy with this state of affairs,
>    and the impending release of new editions of several of these specs
>    encouraged the WG to try to establish a single normative reference
>    for the concept of a string for identifying web resources in XML
>    documents and a process for converting them to URIs, which
>    acknowledged and built on the IRI specification.
>    After drafting a document to serve this purpose, discussion with the
>    editors of [IRI-BIS] convinced all concerned that since a new
>    version of the IRI spec was already in progress, the best thing to
>    do, to respect precedent and to avoid unnecessary proliferation, was
>    to include the relevant definitions in [IRI-BIS], and in fact that
>    has been done [4].  Once it became apparent, however, that the
>    progress of [IRI-BIS] to Draft Standard status was likely to be
>    considerably delayed for reasons outside its editors' control, the
>    Core WG, with the agreement and co-operation of the editors of
>    [IRI-BIS], published [LEIRI] as a Working Group Note, so that the
>    re-issue of new editions of the relevant XML-family specs could go
>    ahead.  The intention is to issue a revision of [LEIRI] replacing
>    its contents with a reference to [IRI-BIS] as soon as [IRI-BIS]
>    becomes a Draft Standard.

Yes. If that works out as described above (which I very much hope it 
will), then [LEIRI] will silently disappear.

This would reduce the number of documents from four down to three.

For the IETF side, I'd like to give a bit more background on LEIRIs.

The main thing it does is to allow ASCII characters not allowed in URIs 
(and therefore not allowed in IRIs) back into what's otherwise 
essentially IRIs.

The reason why a number of XML specs (as listed above) differ from 
the IRI spec is that these specs adopted a very early and simple 
definition of IRIs (before the name IRI ever existed). Later, when the 
IRI spec got tightened, these specs didn't want to follow this 
tightening because, while XML is very strict about what it accepts and 
what it rejects, it doesn't want to retract promises once given. Another 
reason is that in XML, context and escaping conventions make it possible 
to include essentially any character, whereas for URIs and IRIs in 
general, this is not the case.
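
To illustrate the ASCII part of this, here is a minimal Python sketch of 
escaping the extra ASCII characters so that a LEIRI-like string becomes 
acceptable as an IRI. The function name and the character set are my 
assumptions for illustration (space, the delimiters < > ", and the 
"unwise" characters { } | \ ^ `); control characters and the additional 
non-ASCII ranges are left out, and the normative list is the one in 
[LEIRI] / [IRI-BIS].

```python
# Hypothetical helper for illustration only; not taken from any spec text.
# ASCII characters assumed here to be allowed in LEIRIs but not in IRIs.
LEIRI_EXTRA_ASCII = ' <>"{}|\\^`'


def escape_leiri_ascii(leiri: str) -> str:
    """Percent-encode only the extra ASCII characters.

    Existing %XX escapes and non-ASCII characters are left untouched,
    so applying the function twice gives the same result as applying it once.
    """
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in LEIRI_EXTRA_ASCII else ch
        for ch in leiri
    )


print(escape_leiri_ascii("http://example.org/my file{1}.xml"))
# prints: http://example.org/my%20file%7B1%7D.xml
```

Note that '%' itself is not in the set, so an already escaped string 
passes through unchanged.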

> [WEBADDR] had in some ways a similar origin to [LEIRI], starting out
>    as a section of the HTML5 spec which addressed the process by which
>    existing browsers process strings to produce URIs which can be
>    dereferenced.

Yes indeed. It changes a space to %20, the same as for LEIRIs.

>    It differs from [LEIRI] in the exact set of
>    characters which it escapes,

Has anybody done an analysis?

It seems to provide more detail about '[' and ']', escaping them 
depending on context. It could be that that's also necessary for LEIRIs.

But "any occurrences of percent-encoding in the Web address will be 
double-encoded at this step." looks extremely scary.
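
To make the concern concrete, here is a small Python sketch of what 
double-encoding does to an already escaped string. It uses 
urllib.parse.quote purely as a stand-in; it is not the algorithm from 
[WEBADDR], and the sample string is made up.

```python
from urllib.parse import quote, unquote

# First pass: the space is escaped as expected.
once = quote("a b", safe="")
# Second pass: the '%' of the existing escape is itself escaped as %25.
twice = quote(once, safe="")

print(once)            # a%20b
print(twice)           # a%2520b
# A receiver that decodes once gets the literal text "a%20b", not "a b".
print(unquote(twice))  # a%20b
```

So a reference that was already correctly escaped would end up pointing 
at a different resource after such a step.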

>    and in the special handling it mandates
>    for the encoding of characters in the 'query' part of a URI.

See more about that in my reply to Dan.

> I am sure that the above summaries can be improved.  In particular it
> would be helpful to have clear statements from their respective
> authors/owners as to what the _requirements_ for the three new
> documents ([IRI-BIS], [LEIRI] and [WEBADDR]) are.  Only after we have
> those would it make sense to turn to the question of whether we can
> merge some or all of them.

Okay, I'm usually not good at requirements, but for [IRI-BIS], they 
might look roughly as follows:

- Be usable in general, not just in a specific context
  (such as XML, HTML,...)
- Move to Draft Standard (or, if that turns out not to be possible,
   make sure we can do so on the next round).
- Try to avoid fragmentation (terms such as "Human Readable Resource
   Identifiers" or "Web Addresses" or so can lead to quite a bit of
   confusion when the main goal is to deal with occasional legacy data
   that nobody should have produced anyway)
- Include the (currently ongoing) update of IDNA (this affects in
   particular section 4, Bidi, and the references); that's what's
   currently holding back progress.

Are these the things that you looked for when you said 'Requirements'?
Or something else?

Regards,    Martin.

> ht
> [1]
> [2]
> [3]
> [4]
> - --
>         Henry S. Thompson, School of Informatics, University of Edinburgh
>                           Half-time member of W3C Team
>        10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
>                  Fax: (44) 131 651-1426, e-mail:
>                         URL:
> [mail really from me _always_ has this .sig -- mail without it is forged spam]

#-# Martin J. Dürst, Professor, Aoyama Gakuin University

Received on Friday, 1 May 2009 07:20:18 UTC