Re: Character encoding errors (detailed review of parsing algorithm) from Ian Hickson on 2007-08-02 (public-html@w3.org from August 2007)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 2 Aug 2007 07:11:06 +0000 (UTC)
To: Henri Sivonen <hsivonen@iki.fi>
Cc: "public-html@w3.org WG" <public-html@w3.org>
Message-ID: <Pine.LNX.4.64.0708020703280.9342@dhalsim.dreamhost.com>

On Wed, 1 Aug 2007, Henri Sivonen wrote:
> > 
> > They're not parse errors, they're errors at the character encoding 
> > layer. IMHO that's out of scope for this spec.
> 
> OK. (The writers of the XML spec felt differently about the scope of 
> their spec, though.)

This wouldn't be the first time the editors of the XML spec and I have 
disagreed on something. ;-)

> > In particular I don't think any of the text for parse errors need 
> > apply to encoding errors, the encoding specs should be the ones that 
> > make such errors non-conforming. No?
> 
> I agree in principle. I guess this is one of the cases where the spec is 
> already logically sufficient but having a one-sentence note hinting at 
> the consequences of other specs would go a long way disambiguating 
> things for the kinds of reading scenarios mentioned in 
> http://diveintomark.org/archives/2004/08/16/specs . (Compare with 
> http://www.w3.org/mid/01575703-3A06-4F51-BE27-86A9EBB44C54@iki.fi )

Would a non-normative note help here? Something like:

   Note: Bytes or sequences of bytes in the original byte stream that did 
   not conform to the encoding specification (e.g. invalid UTF-8 byte 
   sequences in a UTF-8 input stream) are errors that conformance 
   checkers are expected to report.

...to be put after the paragraph that reads "Bytes or sequences of bytes 
in the original byte stream that could not be converted to Unicode 
characters must be converted to U+FFFD REPLACEMENT CHARACTER code points".
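
To make the split between the two layers concrete, here is a minimal sketch in Python, not taken from the spec. It treats the decoder as the encoding layer: undecodable bytes become U+FFFD in the resulting text, and the same bytes are collected separately so that a conformance checker could report them. The function name decode_and_report and the (offset, bytes) report format are hypothetical, chosen only for illustration.

```python
def decode_and_report(data: bytes, encoding: str = "utf-8"):
    """Decode data, replacing undecodable bytes with U+FFFD.

    Returns (text, errors), where errors is a list of
    (byte_offset, offending_bytes) pairs a checker could report.
    Hypothetical helper for illustration only.
    """
    parts = []
    errors = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding: raises at the first invalid sequence.
            parts.append(data[pos:].decode(encoding))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to data[pos:].
            bad_start = pos + exc.start
            bad_end = pos + exc.end
            # Everything before the bad sequence is known to be valid.
            parts.append(data[pos:bad_start].decode(encoding))
            errors.append((bad_start, data[bad_start:bad_end]))
            parts.append("\ufffd")  # U+FFFD REPLACEMENT CHARACTER
            pos = bad_end
    return "".join(parts), errors


text, errors = decode_and_report(b"ab\xffcd")
print(repr(text))  # 'ab\ufffdcd'
print(errors)      # [(2, b'\xff')]
```

The parser downstream only ever sees the text; the errors list is what the proposed note would expect a conformance checker to surface.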

(Note that not all bytes or sequences of bytes in the original byte stream 
that could not be converted to Unicode characters are necessarily errors. 
It could just be that the encoding has a character set that isn't a subset 
of Unicode, e.g. the Apple logo found in most Apple character sets doesn't 
have a non-PUA analogue in Unicode. Its presence in an HTML document isn't 
an error as far as I'm concerned.)
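
As a rough illustration of the Apple logo case, the sketch below assumes Python's built-in mac_roman codec, which follows Apple's published mapping table and sends byte 0xF0 to the corporate-use code point U+F8FF. That is one converter's choice, not something the HTML spec mandates. The point is only that the byte is valid in its encoding (the decoder raises no error) even though the result lands in the Private Use Area.

```python
import unicodedata

# Byte 0xF0 is the Apple logo in Mac OS Roman. Decoding succeeds,
# so nothing here is an encoding-layer error.
apple_logo = b"\xf0".decode("mac_roman")

# General category "Co" means Private Use: no non-PUA analogue exists.
print(f"U+{ord(apple_logo):04X}", unicodedata.category(apple_logo))  # U+F8FF Co
```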

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 2 August 2007 07:11:19 UTC