Re: several messages about Unicode details in HTML

On Fri, 29 Feb 2008, Brian Smith wrote:
> Ian Hickson wrote:
> > >
> > > However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to 
> > > be signatureless), do we really want to drop the BOM silently? 
> > > Shouldn't it count as a character that is in error?
> > 
> > Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?
> > 
> > If yes, then we don't have to say anything, it's already an error.
> > 
> > If not, what's the advantage of complaining about the BOM in this 
> > case?
> 
> See http://unicode.org/faq/utf_bom.html#28:
> 
> "In particular, whenever a data stream is declared to be UTF-16BE, 
> UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used."

Right, so it's already an error, so we don't have to do anything.


> > > Likewise, if an encoding signature BOM has been discarded and the 
> > > first logical character of the stream is another BOM, shouldn't that 
> > > also count as a character that is in error?
> >
> > The spec says: "Given an encoding, the bytes in the input stream must 
> > be converted to Unicode characters for the tokeniser, as described by 
> > the rules for that encoding, except that leading U+FEFF BYTE ORDER 
> > MARK characters must not be stripped by the encoding layer."
> 
> That is wrong. See http://unicode.org/faq/utf_bom.html#38. Only the 
> first character in a stream may be a byte order mark. Otherwise, they 
> are to be treated as a ZWNBSP for backwards compatibility.

Oops, I didn't mean that to imply that there could be multiple BOMs. 
Fixed to be in the singular.


On Fri, 29 Feb 2008, Brian Smith wrote:
> 
> "Where the byte order is explicitly specified, such as in UTF-16BE or 
> UTF-16LE, then all U+FEFF characters -- even at the very beginning of the 
> text -- are to be interpreted as zero width no-break spaces."
> 
> So, an initial U+FEFF is never an error, even for the -BE and -LE 
> variants.

Oh. This is confusing. So we _do_ need to say it's an error?

In HTML, you can't ever start with a U+FEFF character if it's not a BOM 
(it's invalid -- you can't have character data before the DOCTYPE). In 
addition, the spec allows a leading U+FEFF character -- regardless of the 
encoding -- to allow streams to be converted to other encodings without 
having to worry about whether the leading BOM will change meaning or not.


> But, in -BE and -LE, it isn't a BOM, but a ZWNBSP. And, also, producers 
> of documents should never use U+FEFF anywhere in the document unless it 
> is used as a BOM, which by definition can't exist in a -BE/-LE document.

Ok so here's the status right now as I see it:

   We allow a leading BOM even if the encoding layer doesn't.
   We require that the encoding layer not strip the leading BOM.
   We strip one leading BOM if present.

That seems to take care of UTF-16*, though admittedly not in a way that 
necessarily agrees with the spirit (or letter) of those specs.
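For concreteness, those three points could be sketched like this in Python. This is just a rough illustration of the rule as I understand it, not normative text, and the helper name is made up; note that Python's plain 'utf-8', 'utf-16-le', and 'utf-16-be' codecs happen to behave like the required encoding layer here, i.e. they leave a leading U+FEFF in the decoded output rather than stripping it:

```python
def decode_html_input_stream(data: bytes, encoding: str) -> str:
    # Hypothetical sketch of the rule discussed above.
    # Step 1+2: decode with a codec that does NOT itself strip the BOM
    # (so we, not the encoding layer, control what happens to it).
    text = data.decode(encoding, errors='replace')
    # Step 3: strip at most one leading U+FEFF, regardless of encoding --
    # even for UTF-16LE/UTF-16BE, where the encoding spec says it is not
    # a signature.
    if text.startswith('\ufeff'):
        text = text[1:]
    return text

# A UTF-16LE stream with a leading FF FE: the codec decodes it as
# U+FEFF, and we then drop that one character.
print(decode_html_input_stream(b'\xff\xfeh\x00i\x00', 'utf-16-le'))
```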


On Fri, 29 Feb 2008, Brian Smith wrote:
> > On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
> > > >> 
> > > >>> Bytes or sequences of bytes in the original byte stream that 
> > > >>> could not be converted to Unicode characters must be converted 
> > > >>> to U+FFFD REPLACEMENT CHARACTER code points.
> > >
> > > Unicode 5.0 remains vague on this point. (E.g., definition D92 
> > > defines well-formed and ill-formed UTF-8 byte sequences, but 
> > > conformance requirement C10 only requires ill-formed sequences to be 
> > > treated as an error condition and suggests that a one-byte 
> > > ill-formed sequence may be either filtered out or replaced by a 
> > > U+FFFD replacement character.) More generally, character encoding 
> > > specifications can hardly be expected to define proper error 
> > > handling, since they are usually not terribly preoccupied with 
> > > mislabelled data.
> > 
> > They should define error handling, and are defective if they don't. 
> > However, I agree that many specs are defective. This is certainly not 
> > limited to character encoding specifications.
> 
> Unicode does define the error handling explicitly. An implementation 
> must handle an ill-formed sequence by "signaling an error, filtering the 
> code unit out, or representing the code unit with a marker such as 
> U+FFFD replacement character." The second option (silently discarding 
> bad data) is bad, but requiring all implementations to do any U+FFFD 
> substitution is too much of a burden. A lot of deployed UTF-8 decoders 
> do not do substitution, and on some platforms it is not possible to 
> implement a new UTF-8 decoder efficiently (as efficiently as the 
> built-in one, at least).

The point is that Unicode _doesn't_ define exactly how many bytes form one 
ill-formed sequence. Unicode doesn't define the error handling in enough 
detail to get interoperable handling of arbitrary non-conforming byte 
streams.
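To illustrate the ambiguity: Python's built-in UTF-8 codec happens to follow Unicode's *recommended* "maximal subpart" practice, but since the standard does not require that exact segmentation, another conforming decoder could legitimately emit a different number of U+FFFD characters for the same bytes:

```python
# Two ill-formed UTF-8 streams, decoded with substitution.
truncated = b'A\xc3B'    # C3 opens a 2-byte sequence that never completes
strays = b'A\x80\x80B'   # two bare continuation bytes

# Maximal-subpart segmentation: C3 is one ill-formed unit -> one U+FFFD;
# each stray continuation byte is its own unit -> two U+FFFDs.
print(truncated.decode('utf-8', errors='replace'))  # one replacement char
print(strays.decode('utf-8', errors='replace'))     # two replacement chars
```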


On Fri, 29 Feb 2008, Brian Smith wrote:
> 
> > Section "8.2 Parsing HTML documents" is indeed exclusively for user 
> > agent implementors and conformance checker implementors. For authors 
> > and authoring tool implementors, you want section "8.1 Writing HTML 
> > documents" and section "3.7.5.4. Specifying the document's character 
> > encoding" (which is linked to from 8.1). These give the flipside of 
> > these requirements, the authoring side.
> 
> * Section 8.1 says that any document may start with a BOM. However, some 
> encodings do not allow a BOM at the beginning (UTF-16BE/UTF-16LE).

Right, this allows it even for those encodings, so that authors don't have 
to worry about removing the BOM as they change encoding.


> And, obviously, some encodings cannot encode the BOM.

Sure. Obviously if the encoding can't encode a BOM, you can't include one. :-)
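For instance, in Python, attempting to encode U+FEFF to US-ASCII simply raises an error, since there is no byte sequence for it at all:

```python
# US-ASCII has no representation for U+FEFF, so a BOM can't exist there.
try:
    '\ufeffHello'.encode('ascii')
except UnicodeEncodeError:
    print("U+FEFF is not encodable in US-ASCII")
```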


> The statement should be changed to say that the BOM is only allowed if 
> the encoding allows it.

Why?


> * 3.7.5.4 (The META element) is the not correct place to define encoding 
> requirements for authors. It is counter-intuitive to have to look in the 
> definition of the META element to find out that you can use the BOM or 
> the Content-Type header to specify the encoding. The encoding 
> requirements should be in section 8, and it should be emphasized that 
> the encoding should be given in the Content-Type ("transport layer") 
> whenever possible. The fact that the encoding is determined based on 
> Content-Type, then the BOM, the XML declaration, then <META> is relevant 
> for content authors as well as parser implementers.

Yeah, there's a comment in the source to that effect. The problem is that 
if we move it into section 8, we have to duplicate it, once for authors 
and once for conformance checkers, because of the way that section 8 is 
written. I haven't yet quite worked out how to do it.


On Fri, 29 Feb 2008, Brian Smith wrote:
> 
> Geoffrey Sneddon wrote:
> >
> > I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE, in 
> > fact, the only mention I find of it with regards to either in Unicode 
> > 5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is 
> > interpreted as U+FEFF zero width no-break space."
> 
> Right, a BOM cannot appear in a -BE/-LE document. The Unicode 5.0 
> specification has separate recommendations for when to produce a -BE/-LE 
> document with a leading U+FEFF (don't do it), and how to process 
> documents that disregard that recommendation (treat it as a ZWNBSP).

Sure. HTML5 layers on top of that, and makes a leading U+FEFF ok, and says 
how to handle it (drop it).

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 29 February 2008 20:48:06 UTC