Re: Character encoding errors (detailed review of parsing algorithm)

On Aug 2, 2007, at 10:11, Ian Hickson wrote:

> Would a non-normative note help here? Something like:
>
>    Note: Bytes or sequences of bytes in the original byte stream that
>    did not conform to the encoding specification (e.g. invalid UTF-8
>    byte sequences in a UTF-8 input stream) are errors that conformance
>    checkers are expected to report.
>
> ...to be put after the paragraph that reads "Bytes or sequences of
> bytes in the original byte stream that could not be converted to
> Unicode characters must be converted to U+FFFD REPLACEMENT CHARACTER
> code points".

Yes, this is what I meant by "a note hinting at the consequences".
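
To illustrate what such a note would imply for a conformance checker,
here is a minimal Python sketch: it decodes a byte stream, substitutes
U+FFFD for each undecodable sequence, and also records where the errors
were so they can be reported. The function name decode_and_report and
the (start, end) offset format are assumptions made for this example,
not anything taken from the spec text.

```python
def decode_and_report(data: bytes, encoding: str = "utf-8"):
    """Decode data, replacing undecodable bytes with U+FFFD.

    Returns (text, errors), where errors is a list of (start, end)
    byte offsets into data, one per replaced sequence.
    Hypothetical helper, for illustration only.
    """
    pieces = []
    errors = []
    pos = 0
    while pos < len(data):
        chunk = data[pos:]
        try:
            # Strict decoding raises on the first invalid sequence.
            pieces.append(chunk.decode(encoding))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to chunk.
            pieces.append(chunk[:exc.start].decode(encoding))
            errors.append((pos + exc.start, pos + exc.end))
            pieces.append("\ufffd")  # U+FFFD REPLACEMENT CHARACTER
            pos += exc.end
    return "".join(pieces), errors


if __name__ == "__main__":
    text, errors = decode_and_report(b"ab\xffcd")
    print(repr(text), errors)
```

A browser would only need the text; a conformance checker would also
surface the recorded offsets as errors.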

> (Note that not all bytes or sequences of bytes in the original byte
> stream that could not be converted to Unicode characters are
> necessarily errors. It could just be that the encoding has a
> character set that isn't a subset of Unicode, e.g. the Apple logo
> found in most Apple character sets doesn't have a non-PUA analogue in
> Unicode. Its presence in an HTML document isn't an error as far as
> I'm concerned.)

Since XML and HTML5 are defined in terms of Unicode characters, there
are only two places to go for characters that aren't in Unicode: an
error plus REPLACEMENT CHARACTER, or the PUA. I'd steer clear of this
in the spec and let decoders choose between de facto PUA assignments
(like U+F8FF for the Apple logo) and errors.
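
The two decoder choices can be sketched in Python roughly as follows.
This assumes Python's built-in mac_roman codec, which maps byte 0xF0 to
U+F8FF; the function name decode_mac_roman and the allow_pua flag are
hypothetical and exist only for this example.

```python
import unicodedata


def decode_mac_roman(data: bytes, allow_pua: bool):
    """Decode MacRoman bytes under one of two policies.

    allow_pua=True:  keep de facto PUA assignments such as U+F8FF.
    allow_pua=False: treat them as errors and substitute U+FFFD.
    Returns (text, error_offsets). Hypothetical helper.
    """
    text = data.decode("mac_roman")
    if allow_pua:
        return text, []
    # General category "Co" marks private-use code points.
    # MacRoman is single-byte, so character index == byte offset.
    offsets = [i for i, ch in enumerate(text)
               if unicodedata.category(ch) == "Co"]
    cleaned = "".join(
        "\ufffd" if unicodedata.category(ch) == "Co" else ch
        for ch in text
    )
    return cleaned, offsets


if __name__ == "__main__":
    print(decode_mac_roman(b"\xf0 Inc", allow_pua=True))
    print(decode_mac_roman(b"\xf0 Inc", allow_pua=False))
```

Either policy leaves the result expressible in Unicode, which is the
point: the choice is made in the decoder, not in the spec.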

Luckily, the encodings with the Apple logo and Armenian eternity sign  
aren't really Web-relevant. De facto PUA assignments used together  
with UTF-8 are much more Web-relevant. (As far as I'm concerned,  
using a Mac* encoding or ARMSCII on the Web for any characters is  
worse than using U+F8FF in UTF-8 to mean the Apple logo. :-)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 2 August 2007 09:02:42 UTC