checklink: convert URIs to UTF-8 from Bjoern Hoehrmann on 2001-07-25 (www-validator@w3.org from July 2001)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 25 Jul 2001 19:03:55 +0200
To: www-validator@w3.org
Message-ID: <l7utlt867cv8l5so1t91oin7qcdtv9eg0s@4ax.com>

Hi,

   Several Technical Reports define how non-ASCII characters in URIs
should be handled: convert the non-ASCII characters to UTF-8 and then
apply URI escaping to the resulting bytes. Additionally, HTML 4 suggests:

[...]
  Note. Some older user agents trivially process URIs in HTML using the
  bytes of the character encoding in which the document was received.
  Some older HTML documents rely on this practice and break when
  transcoded. User agents that want to handle these older documents
  should, on receiving a URI containing characters outside the legal
  set, first use the conversion based on UTF-8. Only if the resulting
  URI does not resolve should they try constructing a URI based on the
  bytes of the character encoding in which the document was received.
[...]

While the Validator already does this [1] (provided that the charset
parameter, charset=utf-8, is added to the HTTP header), I can't see this
issue addressed in the checklink script. I suggest implementing what
HTML 4 recommends. I'd provide a patch, but I'm currently not that
familiar with it...
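
To make the suggestion concrete, here is a rough sketch of the order in
which candidate URIs could be built. It is written in Python purely for
illustration and is not taken from checklink or the Validator; the name
candidate_uris and the SAFE character set are my own assumptions.

```python
from urllib.parse import quote

# Assumption: characters left untouched are the RFC 2396 reserved and
# unreserved sets, plus "%" (existing escapes), "#" and "[" "]" (RFC 2732).
SAFE = ";/?:@&=+$,-_.!~*'()%#[]"


def candidate_uris(raw_uri, document_encoding):
    """Return the URIs to try, in the order HTML 4 suggests.

    The first candidate escapes the UTF-8 bytes of the URI. The second,
    if it differs, escapes the bytes of the encoding in which the
    document was received.
    """
    candidates = [quote(raw_uri, safe=SAFE, encoding="utf-8")]
    try:
        legacy = quote(raw_uri, safe=SAFE, encoding=document_encoding,
                       errors="strict")
    except UnicodeEncodeError:
        # The character cannot be represented in the document encoding,
        # so there is no legacy form to fall back to.
        legacy = None
    if legacy is not None and legacy not in candidates:
        candidates.append(legacy)
    return candidates
```

A link checker would request the first candidate and only move on to the
second if the first does not resolve.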

Both the checklink script and the Validator should warn the user when
they encounter improperly escaped URIs.
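
Such a warning could be driven by a check along the following lines.
Again this is only a sketch in Python under my own assumptions: the
function name escaping_problems is hypothetical, and the set of allowed
characters is my reading of RFC 2396 plus the brackets from RFC 2732.

```python
import re

# Assumption: anything outside this set must be escaped before the URI
# is used (RFC 2396 characters, "%" for escapes, "#", and "[" "]").
ILLEGAL = re.compile(r"[^A-Za-z0-9;/?:@&=+$,\-_.!~*'()%#\[\]]")

# A "%" that is not followed by exactly two hexadecimal digits.
BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escaping_problems(uri):
    """Return a list of human-readable warnings for the given URI."""
    problems = []
    illegal = sorted(set(ILLEGAL.findall(uri)))
    if illegal:
        problems.append("unescaped characters: " + " ".join(map(repr, illegal)))
    if BAD_ESCAPE.search(uri):
        problems.append("malformed percent-escape")
    return problems
```

An empty list means that nothing needs to be reported for that URI.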

[1] I strongly recommend that the URI package 1.15 be installed on the
    production server. It conforms to RFC 2732 (see my request and
    discussion in mid-May on the libwww@perl.org mailing list), and
    current reports require compliance with it.
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/

Received on Wednesday, 25 July 2001 13:04:38 UTC