Re: Confusing use of "URI" to refer to IRIs, and IRI handling in the DOM from Henri Sivonen on 2008-06-29 (public-html@w3.org from June 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 29 Jun 2008 13:49:50 +0300
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Justin James <j_james@mindspring.com>, "'Smylers'" <Smylers@stripey.com>, "'HTML WG'" <public-html@w3.org>
Message-Id: <0FD1D1CF-F82D-4939-B686-7C117882FD16@iki.fi>
On Jun 29, 2008, at 12:03, Julian Reschke wrote:

> Justin James wrote:
>> I posit that this use case is irrelevantly small; it only seems to apply to
>> people attempting to write applications that implement a particular spec, or
>> maybe people writing an "URIBuilder" type library component or something.
>
> It affects anybody who consumes HTML. The fact that HTML5-URLs are  
> something different means that you can't use out of the box URI/IRI  
> libraries and reminding readers of this spec by *not* using the term  
> URL would be helpful.

That's missing the point. The point is that the URI/IRI specs don't give
full reality-based Web-compatible details, so if you use an
out-of-the-box pure URI/IRI library, your software isn't compatible
with existing Web content.
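
To make this concrete, here is a minimal Java sketch using the JDK's own java.net.URI class as the "out-of-the-box" library. The example address and the class and method names (StrictUriDemo, strictParserAccepts) are made up for illustration; the underlying assumption is that browsers commonly dereference an href containing an unescaped space, while a strict parser rejects it outright.

```java
import java.net.URI;
import java.net.URISyntaxException;

class StrictUriDemo {
    /** Returns true if the strict java.net.URI parser accepts the address as written. */
    static boolean strictParserAccepts(String address) {
        try {
            new URI(address);
            return true;
        } catch (URISyntaxException e) {
            // The strict parser gives up instead of recovering from the error.
            return false;
        }
    }

    public static void main(String[] args) {
        // Unescaped space in the path, as often seen in existing content.
        System.out.println(strictParserAccepts("http://example.com/my page.html"));   // false
        // The same address with the space percent-encoded.
        System.out.println(strictParserAccepts("http://example.com/my%20page.html")); // true
    }
}
```

A consumer of HTML built on such a parser alone would fail to follow the first link at all.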

Also, comprehensive libraries don't just implement the RFCs and call it
done. In Validator.nu, I use the most comprehensive IRI library for
Java that I could find: the Jena IRI library. The Jena IRI library
already acknowledges the existence of a multitude of URLish specs: it
already supports conformance modes for six (6!) specs: IRI, RDF, URI,
XLink, XML Schema and XML System ID. Unfortunately, none of those
specs is fully Web-compatible. I'd like to see a seventh,
Web-compatible mode implementing Web URLs in a future version.

>> To "real world" people, this is Yet Another Spec That Shall Be Ignored. By
>> trying to find some way to have all of these slightly different items play
>> nicely with each other, we're dancing around the elephant in the room (I
>> know, Managerial Speak) which is that there should only be one *RI/L spec.
>> PERIOD.

Indeed, there should be one reality-based Web-compatible spec with
full error recovery details for Web addresses, a.k.a. URLs.

>> So let's stop this silly dance, get with the *RI/L group, and tell them,
>> "this is broken, please provide us with 1 unified spec that makes sense."
>> But for us to keep trying to Band-Aid the broken *RI/L situation within the
>> HTML spec itself is pretty pointless. *RI/L is meta to HTML, and not within
>> our purview.

The *RI/L group seems to be unwilling to make their specs
comprehensive in a way that is compatible with existing content, and
this stuff needs to be specced *somewhere*, so for the time being it's
in the HTML5 spec. It would indeed be better to spin URLs off into a
self-contained spec, but we don't have enough *competent* and *willing*
editors to go around.

What to call these addresses is a total bikeshed, but I think they  
should be called URLs, because that's the name people use for the kind  
of addresses that work in browsers (i.e. on the Web). Where  
disambiguation is *actually* needed, the two kinds of URLs can be  
referred to as Web URLs and IETF URLs.

> The URI/IRI specs aren't broken.

They don't define error handling in such a way that implementing  
software to spec results in software that works with existing content.  
In my opinion, that counts as broken.

> Lots of software implements URI/IRI processing, and browsers are  
> only one part of it.

As far as Web content goes, non-browser software that is meant to  
dereference Web addresses needs to do it in a browser-compatible way  
in order to be compatible with the Web. As a developer of such  
software, I want a reality-based description of what my software needs  
to do to be compatible with Web content. So far, I have written  
software to the IRI spec and I'm rather unhappy to find that I've  
written software to a polite fiction of what people wish the Web to be  
like. (I *wish* it were UTF-8-only, too!)
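
One way to picture the UTF-8 problem is the following Java sketch. It assumes, as an illustration rather than a statement of any spec's exact rules, that a browser percent-encodes non-ASCII form data using the encoding of the page the address came from, whereas IRI-style processing always goes through UTF-8. The names QueryEncodingDemo and encodeQueryValue are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryEncodingDemo {
    /** Percent-encodes a query value using the given character encoding. */
    static String encodeQueryValue(String value, String pageEncoding) {
        try {
            return URLEncoder.encode(value, pageEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + pageEncoding, e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "cafe" with an acute accent on the e

        // What UTF-8-only processing produces.
        System.out.println(encodeQueryValue(value, "UTF-8"));      // caf%C3%A9
        // What a legacy-encoded page would lead to under the assumption above.
        System.out.println(encodeQueryValue(value, "ISO-8859-1")); // caf%E9
    }
}
```

The two results are different byte sequences, so a server expecting one form will not necessarily understand the other.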

> You simply can't break all the other software by making incompatible  
> changes to these specs.

The software is already broken from the point of view of its users if  
it isn't compatible with existing Web content.

> Browsers do not treat URLs as specified, so the best thing is to  
> write down what they do, and try to discourage the incompatible  
> processing.

I think the best thing to do is:
  1) Specify in detail what needs to be done in order to dereference  
addresses in existing content.
  2) Implement what needs to be done in multiple programming languages  
and give away libraries under an extremely liberal license so that no  
one has an excuse to avoid the libraries for licensing reasons.
  3) Tell authors to encode their pages in UTF-8 (which they won't all
do, citing excuses such as byte count inefficiencies that are either
imagined or measured but trivial once gzipped).
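
As a rough idea of what a library from steps 1 and 2 might contain, here is a deliberately tiny Java sketch of error recovery in front of a strict parser. The two repair rules (trim surrounding whitespace, percent-encode spaces) are assumptions chosen for illustration and are nowhere near a complete or specified set; LenientAddressDemo and parseLeniently are hypothetical names.

```java
import java.net.URI;

class LenientAddressDemo {
    /**
     * Applies two illustrative repairs and then hands the result to the
     * strict java.net.URI parser. Throws IllegalArgumentException if the
     * address is still not acceptable after the repairs.
     */
    static URI parseLeniently(String raw) {
        // Repair 1: drop leading and trailing whitespace and control characters.
        String trimmed = raw.trim();
        // Repair 2: percent-encode any remaining literal spaces.
        String repaired = trimmed.replace(" ", "%20");
        return URI.create(repaired);
    }

    public static void main(String[] args) {
        URI uri = parseLeniently("  http://example.com/my page.html\n");
        System.out.println(uri); // http://example.com/my%20page.html
    }
}
```

A real Web-compatible library would need many more rules than this, which is why they should be written down in one place first.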

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Sunday, 29 June 2008 10:50:33 UTC