- From: Misha Wolf <misha.wolf@reuters.com>
- Date: Wed, 11 Feb 1998 19:08:39 +0000 (GMT)
- To: oren <oren@capella.co.il>
- Cc: www html editor <www-html-editor@w3.org>
Oren Ben-Kiki wrote:

> There is some unclarity and what seems to be a real problem with the
> mechanism described in Section 5.2.2 of the current HTML 4.0
> specification, with regard to using the META tag to specify a character
> set. I couldn't resolve the following from the text as it currently
> stands:

[lots of questions]

I think that most of your questions are answered by the following two
facts:

1. An HTML document can use only one charset.

2. ASCII is a subset of almost all charsets. The only exceptions I
   can think of just now are EBCDIC and UTF-16.

As you have interpreted the HTML specification differently, we need to
review our wording.

I think that the only question this leaves unanswered is how one handles
ISO 10646/Unicode encoded using UTF-16. I don't think EBCDIC can be
handled by information associated directly with the document (as opposed
to information supplied separately, e.g. as part of an HTTP header).

Section 5.2.1 includes the following text:

   Notes on specific encodings

   When HTML text is transmitted in UTF-16 (charset=UTF-16), text data
   should be transmitted in network byte order ("big-endian", high-order
   byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE],
   clause C3, page 3-1.

   Furthermore, to maximize chances of proper interpretation, it is
   recommended that documents transmitted as UTF-16 always begin with a
   ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called
   Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal
   FFFE, a character guaranteed never to be assigned. Thus, a user-agent
   receiving a hexadecimal FFFE as the first bytes of a text would know
   that bytes have to be reversed for the remainder of the text.

Hence, the algorithm goes like this:

 - Does the document start with a BOM?

    - If yes, the charset is UTF-16.

    - If no, look for a
      <META http-equiv="Content-Type" content="text/html; charset=...">
      element (encoded using ASCII).

       - If you find it, obey it.
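The steps above can be sketched in Python roughly as follows. This is only
an illustration under stated assumptions, not how any particular user agent
works: the function name sniff_charset, the 1024-byte scan limit, and the
regular expression standing in for a real HTML parser are all hypothetical.
The algorithm as given does not say what to do when neither a BOM nor a
META element is found, so the sketch simply reports that nothing was found.

```python
import re

# U+FEFF serialized as UTF-16: big-endian (network order) and byte-reversed.
BOM_BE = b"\xfe\xff"
BOM_LE = b"\xff\xfe"

# Hypothetical, simplified pattern for
#   <META http-equiv="Content-Type" content="text/html; charset=...">
# It works on raw bytes because the element is assumed to be ASCII-encoded.
META_CHARSET = re.compile(
    rb"<meta\b[^>]*\bhttp-equiv\s*=\s*[\"']?content-type\b[^>]*"
    rb"\bcharset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_charset(data: bytes, scan_limit: int = 1024):
    """Return (charset, byte_order) guessed from the document bytes alone.

    byte_order is "big" or "little" for UTF-16 and None otherwise.
    (None, None) means neither a BOM nor a META charset was found.
    """
    # Step 1: does the document start with a BOM?
    if data.startswith(BOM_BE):
        return "UTF-16", "big"
    if data.startswith(BOM_LE):
        # FFFE first: the bytes are reversed relative to network order.
        return "UTF-16", "little"

    # Step 2: no BOM, so look for an ASCII-encoded META element.
    # scan_limit is an assumed cut-off, not something the spec prescribes.
    match = META_CHARSET.search(data[:scan_limit])
    if match:
        # Step 3: found it, obey it.
        return match.group(1).decode("ascii"), None

    return None, None
```

For example, calling sniff_charset on the raw bytes of a document that
begins with FE FF yields UTF-16 in big-endian order without the META
element ever being consulted, which is why a UTF-16 document does not
need one.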
If the above does not deal with your questions, please reply.

----------------------------------------------------------------------------
Misha Wolf            Email: misha.wolf@reuters.com     85 Fleet Street
Standards Manager     Voice: +44 171 542 6722           London EC4P 4AJ
Reuters Limited       Fax  : +44 171 542 8314           UK
----------------------------------------------------------------------------
12th International Unicode Conference, 8-10 Apr 1998, Tokyo, www.unicode.org
7th World Wide Web Conference, 14-18 Apr 1998, Brisbane, www7.conf.au
------------------------------------------------------------------------
Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
Reuters Ltd.
Received on Wednesday, 11 February 1998 14:09:33 UTC