- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 2 Aug 2007 12:02:20 +0300
- To: Ian Hickson <ian@hixie.ch>
- Cc: "public-html@w3.org WG" <public-html@w3.org>
On Aug 2, 2007, at 10:11, Ian Hickson wrote:

> Would a non-normative note help here? Something like:
>
>    Note: Bytes or sequences of bytes in the original byte stream that did
>    not conform to the encoding specification (e.g. invalid UTF-8 byte
>    sequences in a UTF-8 input stream) are errors that conformance
>    checkers are expected to report.
>
> ...to be put after the paragraph that reads "Bytes or sequences of bytes
> in the original byte stream that could not be converted to Unicode
> characters must be converted to U+FFFD REPLACEMENT CHARACTER code
> points".

Yes, this is what I meant by "a note hinting the consequences".

> (Note that not all bytes or sequences of bytes in the original byte stream
> that could not be converted to Unicode characters are necessarily errors.
> It could just be that the encoding has a character set that isn't a subset
> of Unicode, e.g. the Apple logo found in most Apple character sets doesn't
> have a non-PUA analogue in Unicode. Its presence in an HTML document isn't
> an error as far as I'm concerned.)

Since XML and HTML5 are defined in terms of Unicode characters, there's nowhere to go except error and REPLACEMENT CHARACTER or the PUA for characters that aren't in Unicode. I'd steer clear of this in the spec and let decoders choose between de facto PUA assignments (like U+F8FF for the Apple logo) and errors.

Luckily, the encodings with the Apple logo and Armenian eternity sign aren't really Web-relevant. De facto PUA assignments used together with UTF-8 are much more Web-relevant.

(As far as I'm concerned, using a Mac* encoding or ARMSCII on the Web for any characters is worse than using U+F8FF in UTF-8 to mean the Apple logo. :-)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
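
For illustration only, the two behaviours discussed above (replacing undecodable bytes with U+FFFD while still reporting them as errors, and a decoder choosing a de facto PUA assignment) can be sketched in Python. The function name `decode_with_report` is hypothetical and not taken from any spec or validator. The sketch reports only the first malformed sequence, and the Mac example relies on Python's own `mac_roman` codec, which happens to map byte 0xF0 to U+F8FF.

```python
def decode_with_report(data: bytes, encoding: str = "utf-8"):
    """Hypothetical helper: decode leniently, and separately note malformed input.

    Returns (text, errors). `text` has U+FFFD wherever bytes could not be
    converted; `errors` holds (start, end, reason) for the first malformed
    sequence, which is roughly what a conformance checker might report.
    """
    # Lenient pass: undecodable bytes become U+FFFD REPLACEMENT CHARACTER.
    text = data.decode(encoding, errors="replace")

    # Strict pass: used only to find out whether the stream was malformed.
    errors = []
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        errors.append((exc.start, exc.end, exc.reason))
    return text, errors


# 0xFF can never appear in well-formed UTF-8.
text, errors = decode_with_report(b"abc\xffdef")

# A decoder choosing the de facto PUA assignment: Python's mac_roman codec
# maps byte 0xF0 (the Apple logo) to U+F8FF and raises no error.
apple_text, apple_errors = decode_with_report(b"\xf0", "mac_roman")
```

Here the UTF-8 case yields both a replacement character and a reportable error, while the Mac Roman case yields a PUA code point and no error, which is the decoder-level choice left open above.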
Received on Thursday, 2 August 2007 09:02:42 UTC