checklink: convert URIs to UTF-8

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Wed, Jul 25 2001

    From: Bjoern Hoehrmann <derhoermi@gmx.net>
    To: www-validator@w3.org
    Date: Wed, 25 Jul 2001 19:03:55 +0200
    Message-ID: <l7utlt867cv8l5so1t91oin7qcdtv9eg0s@4ax.com>
    Subject: checklink: convert URIs to UTF-8
    
    Hi,
    
       Several Technical Reports define how non-ASCII characters in URIs
    should be handled: convert the non-ASCII characters to UTF-8 and then
    apply URI escaping to the resulting bytes. Additionally, HTML 4 suggests:
    
    [...]
      Note. Some older user agents trivially process URIs in HTML using the
      bytes of the character encoding in which the document was received.
      Some older HTML documents rely on this practice and break when
      transcoded. User agents that want to handle these older documents
      should, on receiving a URI containing characters outside the legal
      set, first use the conversion based on UTF-8. Only if the resulting
      URI does not resolve should they try constructing a URI based on the
      bytes of the character encoding in which the document was received.
    [...]
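
    The conversion and the fallback order described in the note above can be
    sketched roughly as follows. This is only an illustration in Python, not
    code from checklink (which is written in Perl) or from the Validator; the
    names escape_uri and candidate_uris and the example.org URI are
    hypothetical, and the set of characters left unescaped is an assumption.

```python
from typing import List
from urllib.parse import quote

# Assumption: leave reserved and unreserved URI characters alone, and also
# "%", so that escapes already present are not escaped a second time.
SAFE = "-._~:/?#[]@!$&'()*+,;=%"


def escape_uri(uri: str, encoding: str = "utf-8") -> str:
    """Encode the URI with `encoding` and percent-escape every byte that is
    not in SAFE (which includes all bytes of non-ASCII characters)."""
    return quote(uri, safe=SAFE, encoding=encoding)


def candidate_uris(uri: str, document_encoding: str) -> List[str]:
    """Return the URIs to try, in the order the HTML 4 note suggests:
    the UTF-8 based form first, then the form based on the bytes of the
    document's own character encoding."""
    candidates = [escape_uri(uri, "utf-8")]
    try:
        legacy = escape_uri(uri, document_encoding)
    except UnicodeEncodeError:
        # The document encoding cannot represent one of the characters,
        # so there is no legacy form to fall back to.
        return candidates
    if legacy not in candidates:
        candidates.append(legacy)
    return candidates


if __name__ == "__main__":
    for candidate in candidate_uris("http://example.org/Höhrmann", "iso-8859-1"):
        print(candidate)
```

    A link checker following the note would request the first candidate and
    only try the second one if the first does not resolve.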
    
    While the Validator already does this [1] (if the HTTP header carries
    a charset parameter with charset=utf-8), I can't see this issue
    addressed in the checklink script. I suggest implementing what HTML 4
    recommends. I'd provide a patch, but I'm currently not that familiar
    with the script...
    
    Both the checklink script and the Validator should warn the user if
    they encounter improperly escaped URIs.
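
    What such a warning could look for is sketched below. Again this is a
    hypothetical Python illustration, not existing checklink or Validator
    code; the function name escaping_warnings and the three checks are
    assumptions about what "improperly escaped" should cover.

```python
import re
from typing import List

# A "%" that is not followed by two hexadecimal digits is a malformed escape.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escaping_warnings(uri: str) -> List[str]:
    """Return human-readable warnings for a URI as it appears in a document."""
    warnings = []
    if any(ord(ch) > 0x7F for ch in uri):
        warnings.append("contains unescaped non-ASCII characters")
    if any(ch in " \t\r\n" for ch in uri):
        warnings.append("contains unescaped whitespace")
    if _BAD_PERCENT.search(uri):
        warnings.append("contains '%' not followed by two hex digits")
    return warnings


if __name__ == "__main__":
    for uri in ("http://example.org/Höhrmann", "http://example.org/H%C3%B6hrmann"):
        print(uri, escaping_warnings(uri))
```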
    
    [1] I strongly recommend that the URI package 1.15 be installed on the
        production server. It conforms to RFC 2732 (see my request and
        the discussion in mid-May on the libwww@perl.org mailing list),
        and current reports require compliance with it.
    -- 
    Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
    am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
    25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/