Re: META character set specifications

Oren Ben-Kiki wrote:

> There is some unclarity and what seems to be a real problem with the
> mechanism described in Section 5.2.2 of the current HTML 4.0
> specification, with regard to using the META tag to specify a character
> set. I couldn't resolve the following from the text as it currently
> stands:

[lots of questions]

I think that most of your questions are answered by the following two facts:

1.  An HTML document can use only one charset.

2.  ASCII is a subset of almost all charsets.  The only exceptions I can 
    think of just now are EBCDIC and UTF-16.
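
To illustrate fact 2, here is a small sketch (not part of the specification) that
encodes an ASCII-only string under several charsets and compares the bytes.
The codec names are Python's, the helper name same_as_ascii is hypothetical,
and cp037 is used as one representative EBCDIC code page.

```python
def same_as_ascii(text: str, codec: str) -> bool:
    """Return True if `text` encodes to the same bytes in `codec` as in ASCII."""
    return text.encode(codec) == text.encode("ascii")


# Sample markup containing only ASCII characters (the charset value is arbitrary).
SAMPLE = '<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'

# An illustrative, non-exhaustive selection of ASCII-compatible charsets.
ASCII_COMPATIBLE = ["latin-1", "iso8859_7", "koi8_r", "euc_jp", "utf-8"]

# The two exceptions named above: EBCDIC (here cp037) and UTF-16.
EXCEPTIONS = ["cp037", "utf-16-be"]

for codec in ASCII_COMPATIBLE + EXCEPTIONS:
    print(codec, same_as_ascii(SAMPLE, codec))
```

This is why a META element written with ASCII characters can be read
before the charset is known, except under EBCDIC or UTF-16.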

As you have interpreted the HTML specification differently, we need to 
review our wording.

I think the only question this leaves unanswered is how to handle 
ISO 10646/Unicode encoded as UTF-16.  I don't think EBCDIC can be handled 
by information embedded in the document itself (as opposed to 
information supplied separately, e.g. as part of an HTTP header).

Section 5.2.1 includes the following text:

   Notes on specific encodings 

   When HTML text is transmitted in UTF-16 (charset=UTF-16), text data 
   should be transmitted in network byte order ("big-endian", high-order 
   byte first) in accordance with [ISO10646], Section 6.3 and [UNICODE], 
   clause C3, page 3-1. 

   Furthermore, to maximize chances of proper interpretation, it is 
   recommended that documents transmitted as UTF-16 always begin with a 
   ZERO-WIDTH NON-BREAKING SPACE character (hexadecimal FEFF, also called 
   Byte Order Mark (BOM)) which, when byte-reversed, becomes hexadecimal 
   FFFE, a character guaranteed never to be assigned. Thus, a user-agent 
   receiving a hexadecimal FFFE as the first bytes of a text would know that 
   bytes have to be reversed for the remainder of the text. 

Hence, the algorithm goes like this:

-  Does the document start with a BOM?

-  If yes, the charset is UTF-16.

-  If no, look for an 

      <META http-equiv="Content-Type" content="text/html; charset=...">

   element (encoded using ASCII).

-  If you find it, obey it.
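
The steps above can be sketched roughly as follows. This is a minimal
illustration, not a conforming implementation: the function name
sniff_charset, the 1024-byte scan window, and the regular expression are all
assumptions, and the pattern expects http-equiv to appear before content
inside the element.

```python
import re

BOM_BE = b"\xfe\xff"  # U+FEFF in network (big-endian) byte order
BOM_LE = b"\xff\xfe"  # the same character with its bytes reversed

# Matches a META element written in ASCII-compatible bytes (assumed pattern).
META_CHARSET = re.compile(
    rb"""<meta[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(data: bytes, scan_limit: int = 1024):
    """Guess a document's charset from its leading bytes; None if undetermined."""
    # Step 1: a BOM, in either byte order, means the charset is UTF-16.
    if data.startswith(BOM_BE) or data.startswith(BOM_LE):
        return "UTF-16"
    # Step 2: otherwise look for the META element near the start of the data.
    match = META_CHARSET.search(data[:scan_limit])
    if match:
        # Step 3: obey the charset it declares.
        return match.group(1).decode("ascii")
    return None


doc = b'<HTML><HEAD><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5"></HEAD>'
print(sniff_charset(doc))
```

When the result is UTF-16, Python's own "utf-16" codec reads the BOM to
choose the byte order and strips it from the decoded text.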

If the above does not deal with your questions, please reply.

----------------------------------------------------------------------------
  Misha Wolf            Email: misha.wolf@reuters.com      85 Fleet Street
  Standards Manager     Voice: +44 171 542 6722            London EC4P 4AJ
  Reuters Limited       Fax  : +44 171 542 8314            UK
----------------------------------------------------------------------------




Received on Wednesday, 11 February 1998 14:09:33 UTC