Re: IRIs in href from Frank Ellermann on 2007-11-06 (www-validator@w3.org from November 2007)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Tue, 6 Nov 2007 13:52:42 +0100
To: www-validator@w3.org
Message-ID: <fgpp01$j9d$1@ger.gmane.org>

Martin Duerst wrote:

> they'd only clarify what they meant when they recommended,
> in http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1,
> to convert URIs with non-ASCII characters to UTF-8 and
> then to use percent-encoding

The given example states that it's <strong>illegal</strong>,
after that it explains a best guess implementation clearly
written before you published RFC 3987.  It doesn't address
IDNs; IDNs didn't exist in 1999.  When browsers try to guess
what broken URIs mean they could run into the recent flood
of "XP with IE7" security issues. 
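
To make the B.2.1 procedure concrete, here is a minimal sketch in
Python of "convert to UTF-8, then percent-encode".  The function name
and the "/bücher.html" sample path are only assumptions for this
illustration; nothing here claims to be what any browser actually does.

```python
from urllib.parse import quote


def percent_encode_utf8(text: str) -> str:
    """Encode non-ASCII characters as UTF-8, then percent-encode each octet."""
    # safe="/" keeps path separators as they are.
    return quote(text, safe="/", encoding="utf-8")


print(percent_encode_utf8("/bücher.html"))        # /b%C3%BCcher.html

# For contrast: the same characters percent-encoded from Latin-1 octets.
print(quote("/bücher.html", encoding="latin-1"))  # /b%FCcher.html
```

The two outputs differ, which is why a receiver has to know (or guess)
which charset was used before the octets were percent-encoded.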

>> for incompatible modifications we need new document types.
 
> The new document type would not at all differ in functionality
> from the old one. The only changes might be comments and
> the names of parameter entities, but as with programs, that
> doesn't change the functionality at all.

It's a "formally valid" experiment with unencoded IRIs in links,
that can be (legally) relevant for accessibility.  It might also
help for an RFC 3987 implementation and interoperability report,
"it's illegal but it works" wouldn't be convincing (of course
there would still be atom and xmpp if all else fails).  Maybe
somebody will create a corresponding schema that allows checking
IRI syntax.
 
> %D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5
> would be just garbage for them.

The W3C validator apparently hates this in a system identifier:
http://hmdmhdfmhdjmzdtjmzdtzktdkztdjz.googlepages.com/IDN-XML-test.htm
(sorry, I can't read your unencoded ISO-2022-JP examples at the
 moment)
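
For reference, the percent-encoded string quoted above can be turned
back into characters with a few lines of Python.  This is only a
sketch, and it assumes the octets are UTF-8:

```python
from urllib.parse import unquote

ENCODED = "%D0%B8%D1%81%D0%BF%D1%8B%D1%82%D0%B0%D0%BD%D0%B8%D0%B5"

# errors="strict" raises instead of silently substituting U+FFFD
# if the octets turn out not to be valid UTF-8.
decoded = unquote(ENCODED, encoding="utf-8", errors="strict")
print(decoded)  # испытание
```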

> Formally valid means valid according to the DTD, I guess.

No, I meant the prose RFC 2396 specification of URIs, not CDATA.

> Temporary accessibility issues are issues of the kind "The
> current screen readers/audio browsers/... only support foo,
> so in order to be accessible, use foo, not bar". Once the
> technology has caught up (and accessibility technology 
> improves in the same way other technology improves), such
> a requirement may no longer apply.

"Temporary" can be a rather long time, RFC 2277 talks about
50 years wrt UTF-8.  Worldwide upgrades take some time.  The
"real" IDN TLD test started less than four weeks ago, and on
another list you argued that not much will happen before real
IDN TLDs are introduced.

> For some tests, please see
> http://www.sw.it.aoyama.ac.jp/2005/iritest/,

Thanks.  Firefox 2.0.0.9 already fails the Latin-1 "Bücher"
test; of course it works for UTF-8.

What I had in mind would be minimally harder, using "Bücher"
in an unencoded IDN on a Web page using a legacy charset.
Obviously I can forget this for now, if it doesn't work in 
an <ipath> it also won't work in an <ihost>.
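
A rough sketch of what such a test would have to do, in Python.  The
host "bücher.example", the function name, and the hardcoded "http"
scheme are assumptions for illustration only.  The key step is to
decode the page's legacy octets into characters first; after that the
<ihost> goes through IDNA ToASCII and the <ipath> through UTF-8
percent-encoding.  Python's "idna" codec implements IDNA 2003.

```python
from urllib.parse import quote


def iri_parts_to_uri(raw_host: bytes, raw_path: bytes, page_charset: str) -> str:
    """Map an IRI's host and path, given as page octets, to a URI."""
    # Step 1: legacy octets -> characters, using the page's charset.
    host = raw_host.decode(page_charset)
    path = raw_path.decode(page_charset)
    # Step 2: <ihost> -> ASCII labels via IDNA ToASCII.
    ascii_host = host.encode("idna").decode("ascii")
    # Step 3: <ipath> -> UTF-8 octets -> percent-encoding.
    ascii_path = quote(path, safe="/", encoding="utf-8")
    return "http://" + ascii_host + ascii_path


uri = iri_parts_to_uri(
    "bücher.example".encode("latin-1"),
    "/bücher.html".encode("latin-1"),
    "latin-1",
)
print(uri)  # http://xn--bcher-kva.example/b%C3%BCcher.html
```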

> URIs/IRIs are supposed to be very flexible.

Actually I'm lost with LEIRIs, HRRIs, options that allow
unencoded ASCII characters in IRIs that are not permitted in
URIs, and the recent discussion about allowing unencoded
square brackets outside of <IP-literal>.

With URIs it's clear, if they're valid they must match the
generic STD 66 syntax.  No unencoded spaces etc.
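
As a sketch of the character-level part of that rule, in Python: an
absolute URI needs a scheme, and after that only unreserved
characters, reserved characters, and percent-encoded triplets may
appear.  This is a necessary condition, not the full STD 66 grammar
(it does not check where delimiters such as square brackets may
occur), and the function name is made up for this example.

```python
import re

# scheme ":" then any mix of unreserved / gen-delims / sub-delims / pct-encoded
_URI_CHARS = re.compile(
    r"[A-Za-z][A-Za-z0-9+.-]*:"
    r"(?:[A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=-]|%[0-9A-Fa-f]{2})*"
)


def has_only_uri_characters(candidate: str) -> bool:
    """True if the string has a scheme and uses only characters STD 66 allows."""
    return _URI_CHARS.fullmatch(candidate) is not None


print(has_only_uri_characters("http://example.org/top.html"))        # True
print(has_only_uri_characters("http://example.org/b%C3%BCcher.html"))  # True
print(has_only_uri_characters("http://example.org/bücher.html"))     # False
print(has_only_uri_characters("http://example.org/a b.html"))        # False
```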

> If somebody came along tomorrow with a very great idea
> for an extension to the URI syntax, and the community
> agreed with that extension, even if it wouldn't fit the
> current syntax definition, then this would lead to an 
> update of the URI spec.

There's no "updates RFC 3986" in the URI template draft.

> If you want to create some software that tries to spot
> potential mistakes in an HTML document, I'd guess you'd
> surely flag something like <a href='htpp://www.w3.org'...

Flag and warn yes, but it's no STD 66 syntax error.  The
tool could restrict schemes to registered schemes and allow
additional unregistered schemes to be configured.
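
A minimal sketch of such a tool in Python.  The REGISTERED_SCHEMES set
is only a small illustrative subset, not the IANA registry, and the
function and parameter names are assumptions made for this example.

```python
import re

# Illustrative subset only; a real tool would load the registry.
REGISTERED_SCHEMES = {"http", "https", "ftp", "mailto", "news", "urn", "xmpp"}

_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")


def scheme_warning(uri, extra_schemes=frozenset()):
    """Return a warning string for an unknown scheme, or None if it is fine."""
    match = _SCHEME.match(uri)
    if match is None:
        return "no scheme found"
    scheme = match.group(1).lower()  # schemes are case-insensitive
    if scheme in REGISTERED_SCHEMES or scheme in extra_schemes:
        return None
    return "unknown scheme '%s'" % scheme


print(scheme_warning("htpp://www.w3.org"))            # unknown scheme 'htpp'
print(scheme_warning("http://www.w3.org"))            # None
print(scheme_warning("htpp://www.w3.org", {"htpp"}))  # None
```

The result is a warning rather than an error, because "htpp:" is
syntactically a perfectly good scheme.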

> example, consider the following:
>   <img src='http://example.org/top.html'>
> Again, this clearly looks like a mistake

The "top.html" is just a name, admittedly a bad name if
the resource is something that can be displayed as image.
A legal URI.  OTOH "bücher.html" isn't a legal URI.

> Again, <img src='mailto:abc@example.com'> looks like
> nonsense, but again, it may make sense in the future.

"Syntactically valid" isn't the same as "makes sense",
I think we don't disagree about this.  Where we might
disagree is about "syntactically invalid".  Browsers
are forced to make sense out of (some kinds of) garbage,
but a syntax check is supposed to report syntax errors.

> if it took you several months to figure out why you
> need octet 128 rather than NCR &#128;, then at least
> at that point in time, you didn't really know much
> about the fundamentals of Web internationalization.

In 2001 I knew _nothing_ about it, I was armed with a
Netscape 2.02 not supporting UTF-8 and treating &#128;
as Euro, an O'Reilly book with "XHTML" in its title
published in 2000, the W3C validator for online syntax
checks, and a box with local codepages 850 + 437.

> Well, this is a circular argument. "Let's annoy users
> now so that we don't need to annoy them later." doesn't
> make sense if "Let's not annoy them at all." is the
> best option anyway.

"Let's not annoy them at all" won't fly if the STD 66
syntax is checked later.  It was good when the validator
finally (2001-09-13) informed me that &#128; is crap, it
would have been better if it had done that a few months
earlier.  
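
For the record, a short Python sketch of why &#128; is not the Euro
sign.  A numeric character reference names a Unicode code point, and
code point 128 is a C1 control character.  The octet 128 is the Euro
sign only in a charset that puts it there; windows-1252 is assumed
here as that charset.

```python
import unicodedata

# What &#128; refers to: code point U+0080, general category "Cc" (control).
print(unicodedata.category(chr(128)))   # Cc

# Octet 128 read as windows-1252 (an assumption for this example).
euro = bytes([0x80]).decode("cp1252")
print(euro == "\u20ac")                 # True

# The NCR that really means the Euro sign.
euro_ncr = "&#%d;" % ord(euro)
print(euro_ncr)                         # &#8364;
```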

 Frank
Received on Tuesday, 6 November 2007 13:08:50 UTC