Re: html, http, urls and internationalisation from Francois Yergeau on 1996-01-28 (ietf-http-wg@w3.org from January to March 1996)

From: Francois Yergeau <yergeau@alis.ca>
Date: Sun, 28 Jan 1996 15:32:27 -0500
To: html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, maits@dkuug.dk
Message-Id: <199601282032.PAA26457@genstar.alis.ca>

on my business card (there's a c-cedilla in there, for those who lost
the 8th bit), how do you know what charset my business card uses?
What bit pattern do you send my server to fetch that document?

There are not that many ways out of this problem: either URLs contain
an indication of their charset, or the whole world agrees on a single
implicit charset.  Today, the world *has* agreed on the charset for
octet values < 128, and this must be taken into account for a wider
solution.
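
To make the ambiguity concrete, here is a minimal Python sketch (an
illustration only, not something from the original discussion) showing
that the same c-cedilla becomes different octets depending on the
charset assumed. The choice of ISO 8859-1 and UTF-8 as the two
candidates is an assumption made for the example.

```python
# The same character, c-cedilla (U+00E7), encoded under two charsets.
ch = "\u00e7"

latin1_octets = ch.encode("iso-8859-1")  # a single octet
utf8_octets = ch.encode("utf-8")         # two octets

print(latin1_octets.hex())  # e7
print(utf8_octets.hex())    # c3a7

# Without a tag or an agreed implicit charset, a server cannot tell
# which of these bit patterns a client meant to send.
```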

Personally, I like the implicit UTF-8 idea: any non-ASCII character
must be sent to a server as its UTF-8 encoding, either URL-encoded
(the %XX hack) or not.  A server receiving a non-ASCII octet (or its
URL-encoding) must interpret it as part of the UTF-8 encoding of a
character.  No ambiguity, no need to tag the charset, and good
compatibility with today's situation (ASCII only, in practice).
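
A rough sketch of that implicit UTF-8 rule, using Python's standard
urllib.parse module. The path and the helper name server_decode are
hypothetical, chosen only for illustration; this is one possible
reading of the idea, not a specification.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical path containing a c-cedilla.
path = "/fran\u00e7ois/card.html"

# Client side: non-ASCII characters become URL-encoded UTF-8 octets.
encoded = quote(path, encoding="utf-8")
print(encoded)  # /fran%C3%A7ois/card.html


def server_decode(raw: bytes) -> str:
    """Interpret request octets, %XX-encoded or raw, as UTF-8."""
    return unquote_to_bytes(raw).decode("utf-8")


# Both forms of the request resolve to the same character string.
from_escaped = server_decode(encoded.encode("ascii"))
from_raw = server_decode(path.encode("utf-8"))
```

Pure ASCII paths pass through unchanged, which is the compatibility
with current practice mentioned above.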

>I think that just using some kind of UCS would make it hard
>when we have an environment where the html is in 8859-1 - that
>would be mixing apples and oranges and thus very hard to maintain.

For one thing, there is plenty of HTML *not* in 8859-1.  Apart from
that, most software (including HTML parsers) recognizes only ASCII as
syntax-significant, either passing 8-bit characters untouched and
uninterpreted or chopping off the 8th bit, damaging 8859-1 just as
surely as UTF-8.
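
The damage from chopping off the 8th bit can be sketched as follows.
The helper strip_8th_bit is a hypothetical stand-in for software that
is not 8-bit clean; it is not taken from any real parser.

```python
def strip_8th_bit(data: bytes) -> bytes:
    """Clear the high bit of every octet, as 7-bit-only software does."""
    return bytes(b & 0x7F for b in data)


word = "Fran\u00e7ois"

latin1_damaged = strip_8th_bit(word.encode("iso-8859-1"))
utf8_damaged = strip_8th_bit(word.encode("utf-8"))

print(latin1_damaged)  # b'Frangois'
print(utf8_damaged)    # b"FranC'ois"

# The ASCII letters survive in both cases; only the non-ASCII
# character is lost, whichever encoding carried it.
```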

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies inc., Montréal
Tél: +1 (514) 747-2547
Fax: +1 (514) 747-2561

Received on Sunday, 28 January 1996 12:37:30 UTC