Re: IRIs in href from Martin Duerst on 2007-11-06 (www-validator@w3.org from November 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Tue, 06 Nov 2007 18:33:21 +0900
To: www-validator@w3.org
Message-Id: <6.0.0.20.2.20071106173641.05c1ea90@localhost>

Frank Ellermann wrote:

>olivier Thereaux wrote:
>
>> Now, if today the HTML 4.01 and XHTML 1.0 specs and above were  
>> updated to say "IRIs" instead of "URIs", what would you do?
>
>Maybe ditch the W3C and post the reasons in an Internet Draft.
>I'd certainly consider it as unethical.

I can understand your feelings, but they'd only clarify what
they meant when they recommended, in
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1,
to convert URIs with non-ASCII characters to UTF-8 and
then to use percent-encoding, or what they meant when
they used %URI; but declared it as being CDATA in the DTD.


>RFC 3987 does not "update" 3986.

Very correct. It was never intended as that, it was intended
to serve as a stable specification for all those specs
(including HTML4) that wanted to use the concept but had
to use circumscriptive language.


>The spec.s should be updated
>with s/2396/3986/g, s/3066/4646/g, and similar clerical tasks,
>e.g. explaining why xml:lang is forced to be still an NMTOKEN
>wrt these document types.
>
>But for incompatible modifications we need new document types.

The new document type would not at all differ in functionality
from the old one. The only changes might be comments and
the names of parameter entities, but as with programs, that
doesn't change the functionality at all.


>Not worldwide "upgrade your browser" campaigns, some users 
>can't, and besides it's completely unnecessary, all IRIs by
>definition have an equivalent URI working with "any browser".

Yes, but for people who can actually read and understand
"испытание" better than "testing", испытание
may be very helpful, whereas
%D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5
would be just garbage to them.
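
The conversion recommended in the HTML 4 appendix (encode as UTF-8, then
percent-encode) can be sketched as follows. This is only a minimal
illustration: the helper name iri_to_uri and the set of characters left
untouched are assumptions made for this example, and the handling of
internationalized host names is ignored entirely.

```python
from urllib.parse import quote, unquote

# Assumption for this sketch: ASCII characters that already have a role in
# URI syntax (and "%" of existing escapes) are left untouched.
KEEP = "/:?#[]@!$&'()*+,;=%-._~"


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character."""
    return quote(iri, safe=KEEP, encoding="utf-8")


word = "испытание"
encoded = iri_to_uri(word)
print(encoded)           # the percent-encoded form quoted in this mail
print(unquote(encoded))  # decoding as UTF-8 gives the readable form back
```

Both strings identify the same resource; only the first one is readable
for somebody who knows Russian.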


>> Saying that IRIs should not be used because they break in
>> legacy software, is an argument I have sympathy for, but
>> have trouble accepting.
>
>I'm not surprised if folks active in the W3C don't care much
>about "backwards compatibility".  But admittedly I was very
>surprised when you introduced "let's not care about formally
>valid" as new concept.

Formally valid means valid according to the DTD, I guess.
In this respect, IRIs have always been valid.

>A user armed with an old text mode
>browser could take out the ICANN IDN test, AFAIK "formally
>invalid" is a FAIL in any accessibility test, isn't it?

I'm not sure what you are after here, but if you want to claim
that IRIs somehow are anti-accessibility, then I think you
should consider the following two points:
a) There are temporary accessibility issues and long-term
   accessibility issues. Temporary accessibility issues are
   issues of the kind "The current screen readers/audio
   browsers/... only support foo, so in order to be accessible,
   use foo, not bar". Once the technology has caught up
   (and accessibility technology improves in the same way
   other technology improves), such a requirement may no
   longer apply.
b) A Russian screen reader will definitely do a better job
   with испытание than with
   %D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5.
   Other screen readers may have problems, but then,
   испытание will mostly appear in Russian
   documents.


>> This reminds me of the situation whereby, in Japan, one  
>> still can't safely use unicode in mails, because so many
>> MUAs or webmails just don't support it.
>
>Maybe they have plausible reasons why they don't need or
>don't like it.  BTW, the (formally valid) IDN test page
>I've created last week was the first XHTML page where I
>actually needed UTF-8.  Now I'm curious what browsers do
>with an IRI in a legacy charset.  RFC 3987 allows this.

For some tests, please see
http://www.sw.it.aoyama.ac.jp/2005/iritest/,
in particular the "Legacy Human" section at
http://www.sw.it.aoyama.ac.jp/2005/iritest/HTML/index.html. 
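
To see why the legacy-charset case deserves testing, here is a small
sketch of what can go wrong. It assumes a page written in KOI8-R (chosen
only as an example of a legacy Cyrillic encoding) and compares escaping
the raw document bytes with escaping the UTF-8 bytes. As far as I read
RFC 3987, the conversion is supposed to go through Unicode and UTF-8
whatever the encoding of the document is.

```python
from urllib.parse import quote, unquote

word = "испытание"

# Escaping the UTF-8 bytes of the characters.
as_utf8 = quote(word, encoding="utf-8")

# Escaping the raw bytes of a KOI8-R document instead (what a naive
# implementation might do).
as_koi8 = quote(word, encoding="koi8-r")

print(as_utf8)
print(as_koi8)

# A server expecting UTF-8 cannot make sense of the KOI8-R based escapes.
misread = unquote(as_koi8, encoding="utf-8", errors="replace")
print(misread)
```

The two escaped forms differ, so a browser that escapes the document
bytes directly ends up requesting a different resource than one that
converts to UTF-8 first.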


>>> Sooner or later validators will be fixed to validate
>>> URIs, what with all those "URI exploits" we've seen in
>>> the last weeks for XP after the installation of IE7.
> 
>> This is irrelevant to the discussion about IRIs. Please
>> don't use internationalization as a scapegoat for bad
>> coding.
>
>It's relevant for the discussion of bug 4916 submitted 
>by you 2007-08-07.  If that bug is fixed it might also
>detect IRIs where only URIs are allowed.

In the original mail, Olivier actually wrote:

>>>>
1) a parser to check that a given string is a proper URI/IRI
This surely already exists, hopefully as open source code, or even  
better, as a perl module. Does anyone want to investigate this?
>>>>

That then got shortened in the bug report.

But there is a rather fundamental reason why this actually
may be a bad idea: URIs/IRIs are supposed to be very
flexible. If somebody came along tomorrow with a very
great idea for an extension to the URI syntax, and the
community agreed with that extension, even if it wouldn't
fit the current syntax definition, then this would lead
to an update of the URI spec.

Let me illustrate the idea behind the above with a somewhat
more concrete example:
If you want to create some software that tries to spot
potential mistakes in an HTML document, I'd guess you'd
surely flag something like <a href='htpp://www.w3.org'...
But even after reading the URI spec in detail, you'll find
nothing there that says it's illegal. Somebody could
register the "htpp:" scheme at any time. As another
example, consider the following:
   <img src='http://example.org/top.html'>
Again, this clearly looks like a mistake, one wouldn't
use a link to a Web page in an src attribute. But you
never know when some browsers might actually implement
something like a thumbnail view of a web page in such
a case (apart from the fact that you also don't know
that top.html is a Web page and not some image).
Again, <img src='mailto:abc@example.com'> looks like
nonsense, but again, it may make sense in the future.

The point is that URIs and IRIs are intended to be a very
general mechanism to connect resources on the Web, and
that any restriction has to be considered very carefully.
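
A checker along these lines could be sketched as below. It is only an
illustration, not a proposal for the validator: the function name
lint_reference and the list of "familiar" schemes are made up for this
example. The regular expression follows the generic scheme syntax of
RFC 3986 (a letter followed by letters, digits, "+", "-" or "."), and
anything that fits it gets at most a warning, never an error.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")

# Hypothetical list, for illustration only; a real tool would have to
# decide where such a list comes from and how it is kept up to date.
FAMILIAR_SCHEMES = {"http", "https", "ftp", "mailto"}


def lint_reference(ref: str) -> str:
    """Return "ok" or a warning; syntactically fine schemes never fail."""
    match = SCHEME_RE.match(ref)
    if match is None:
        return "ok"  # relative reference, there is no scheme to look at
    scheme = match.group(0)[:-1].lower()
    if scheme in FAMILIAR_SCHEMES:
        return "ok"
    return f"warning: unfamiliar scheme '{scheme}', possibly a typo"


for ref in ("htpp://www.w3.org",
            "http://example.org/top.html",
            "mailto:abc@example.com"):
    print(ref, "->", lint_reference(ref))
```

Note that the sketch says nothing about <img src='mailto:...'>; deciding
which schemes or media types make sense in which attribute is exactly
the kind of restriction that has to be considered very carefully.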


>Admittedly almost impossible for a validator based on
>DTDs, maybe you end up with a clumsy hack working only
>for a few very important document types.  

Yes, trying to restrict the syntax in a field declared
as CDATA (for attributes) or PCDATA (for elements) based
only on the name of the field, the name of a parameter
entity, or a comment found nearby would be difficult.


>>> I can still tell you the day when the W3C validator
>>> started to flag &#128; as invalid on a windows-1252
>>> page. I was working on this page, it was stunning.
> 
>> There once was a bug, and IIRC it was fixed in a few
>> hours.
>
>Two days after 911, it's good if it only took you a few
>hours.  But it took me several months to figure out why
>I need octet 128 instead of NCR &#128;.

Sorry to be a bit direct here, but if it took you several
months to figure out why you need octet 128 rather than
NCR &#128;, then at least at that point in time, you
didn't really know much about the fundamentals of Web
internationalization. If you look at the SGML declaration
and at the DTD, it's very clear that &#128; is illegal
and non-valid.
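
The difference can be checked quickly. In the sketch below, the octet
128 is decoded as windows-1252, whereas a numeric character reference
always refers to a code point of the document character set (Unicode),
independent of the encoding of the page. The variable names are made up
for this example.

```python
import unicodedata

# Octet 128 in a windows-1252 page is the euro sign.
from_octet = bytes([128]).decode("cp1252")
print(repr(from_octet), hex(ord(from_octet)))

# &#128; refers to code point 128 (U+0080), which is a control character
# in Unicode and not the euro sign.
from_ncr = chr(128)
print(repr(from_ncr), unicodedata.category(from_ncr))

# The numeric character reference that does mean the euro sign uses the
# Unicode code point of that character.
euro_ncr = f"&#{ord(from_octet)};"
print(euro_ncr)
```

So on a windows-1252 page, either the octet 128 or &#8364; gives the
euro sign, but &#128; does not.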


>> Now, how is that relevant to the discussion at hand?
>
>If everybody and his dog start to use IRIs in document
>types where it's not permitted, and some time later an
>improved validator informs them that this was invalid,
>the disturbed users will be annoyed.

Well, this is a circular argument. "Let's annoy users
now so that we don't need to annoy them later." doesn't
make sense if "Let's not annoy them at all." is the
best option anyway.


>> The current XHTML DTD says that URIs are CDATA
>
>Sure, the details are specified in the prose, the DTD
>only uses an entity name %URI;  It could also use %FOO;
>or %IRI; as name.  Likewise the RFC 2396 in the DTD is 
>only a comment.  

Exactly. Validation means validation according to the DTD,
and parameter entity names and comments don't affect
this process.

Regards,    Martin.


#-#-#  Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 6 November 2007 09:34:54 UTC