- From: Francois Yergeau <yergeau@alis.ca>
- Date: Sun, 28 Jan 1996 15:32:27 -0500
- To: html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, maits@dkuug.dk
on my business card (there's a c-cedilla in there, for those who lost the 8th bit), how do you know what charset my business card uses? What bit pattern do you send my server to fetch that document?

There are not that many ways out of this problem: either URLs contain an indication of their charset, or the whole world agrees on a single implicit charset. Today, the world *has* agreed on the charset for octet values below 128, and this must be taken into account for a wider solution.

Personally, I like the implicit UTF-8 idea: any non-ASCII character must be sent to a server as its UTF-8 encoding, either URL-encoded (the %XX hack) or not. A server receiving a non-ASCII octet (or its URL-encoding) must interpret it as part of the UTF-8 encoding of a character. No ambiguity, no need to tag the charset, and good compatibility with today's situation (ASCII only, in practice).

>I think that just using some kind of UCS would make it hard
>when we have an environment where the html is in 8859-1 - that
>would be mixing apples and oranges and thus very hard to maintain.

For one thing, there is plenty of HTML *not* in 8859-1. Apart from that, most software (including HTML parsers) recognizes only ASCII as syntax-significant, either passing 8-bit characters through untouched and uninterpreted or chopping off the 8th bit, which damages 8859-1 just as surely as UTF-8.

-- 
François Yergeau <yergeau@alis.com>
Alis Technologies inc., Montréal
Tél: +1 (514) 747-2547  Fax: +1 (514) 747-2561
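
The implicit UTF-8 rule proposed in the message can be illustrated with a minimal Python sketch. It is not part of the original post. The path `/François.html` and the helper names `to_wire` and `from_wire` are hypothetical, chosen only to show what octets a client would send for a c-cedilla and how a server would read them back.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical path containing a c-cedilla (U+00E7); not taken from the post.
PATH = "/François.html"


def to_wire(path: str) -> str:
    """Client side: encode as UTF-8, then %XX-escape every non-ASCII octet."""
    return quote(path, safe="/", encoding="utf-8")


def from_wire(raw: str) -> str:
    """Server side: undo the %XX escapes, then interpret the octets as UTF-8."""
    return unquote_to_bytes(raw).decode("utf-8")


wire = to_wire(PATH)            # c-cedilla becomes the two octets C3 A7
round_trip = from_wire(wire)    # the server recovers the original characters

# The same character escaped from ISO 8859-1 is the single octet E7.
latin1_wire = quote(PATH, safe="/", encoding="latin-1")

try:
    from_wire(latin1_wire)
    latin1_accepted = True
except UnicodeDecodeError:
    # E7 followed by ASCII is not a well-formed UTF-8 sequence.
    latin1_accepted = False
```

Under this rule the ASCII part of the URL is unchanged, which is the compatibility point made above. A URL escaped from another charset fails to decode in this example rather than being silently misread, although that is not guaranteed for every possible octet sequence.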
Received on Sunday, 28 January 1996 12:37:30 UTC