[whatwg] Internal character encoding declaration from Henri Sivonen on 2006-03-11 (public-whatwg-archive@w3.org from March 2006)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 12 Mar 2006 00:49:03 +0200
Message-ID: <10ED0AA5-1B5D-4B07-9CA5-D777C8E334A6@iki.fi>

On Mar 11, 2006, at 17:10, Henri Sivonen wrote:

> Initialize a character decoder that the bytes 0x20–0x7E (inclusive)  
> as well as 0x09, 0x0A and 0x0D decode to the Unicode code points of  
> the same (zero-extended) value and maps all other bytes to U+FFFD  
> and raises a REWIND flag

On further reflection, it occurred to me that emitting the  
Windows-1252 characters instead of U+FFFD would be a good  
optimization for the common case where the encoding later turns out  
to be Windows-1252 or ISO-8859-1. This would require more than one  
bookkeeping flag, though.
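
A rough sketch of what that bookkeeping might look like, in Python. The  
three flag names, the ProvisionalResult type and the output_is_reusable()  
helper are made up for illustration and are not part of the proposal. The  
sketch also assumes that any other declared encoding is ASCII-compatible,  
which is a simplification.

```python
from dataclasses import dataclass

REPLACEMENT = "\ufffd"


@dataclass
class ProvisionalResult:
    text: str
    saw_high: bool = False      # hypothetical flag: byte in 0xA0-0xFF seen
    saw_c1_range: bool = False  # hypothetical flag: byte in 0x80-0x9F seen
    must_rewind: bool = False   # U+FFFD was emitted, output is never reusable


def provisional_decode(data: bytes) -> ProvisionalResult:
    """Decode optimistically as Windows-1252 while recording what was seen."""
    out = []
    result = ProvisionalResult(text="")
    for b in data:
        if 0x20 <= b <= 0x7E or b in (0x09, 0x0A, 0x0D):
            # Same (zero-extended) value as the byte.
            out.append(chr(b))
        elif b >= 0x80:
            try:
                out.append(bytes([b]).decode("cp1252"))
                if b >= 0xA0:
                    result.saw_high = True
                else:
                    result.saw_c1_range = True
            except UnicodeDecodeError:
                # Byte is unassigned in Python's cp1252 codec.
                out.append(REPLACEMENT)
                result.must_rewind = True
        else:
            # Remaining C0 controls and 0x7F.
            out.append(REPLACEMENT)
            result.must_rewind = True
    result.text = "".join(out)
    return result


def output_is_reusable(result: ProvisionalResult, declared: str) -> bool:
    """Can the provisional output be kept once the real encoding is known?"""
    if result.must_rewind:
        return False
    name = declared.lower()
    if name == "windows-1252":
        return True
    if name == "iso-8859-1":
        # 0xA0-0xFF agree with Windows-1252; 0x80-0x9F do not.
        return not result.saw_c1_range
    # Assumption: every other encoding is ASCII-compatible.
    return not (result.saw_high or result.saw_c1_range)
```

With this split, provisional_decode(b"caf\xe9") sets only saw_high, so the  
output could be kept for either Windows-1252 or ISO-8859-1 but would need a  
rewind for UTF-8.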

> If a start tag other than html or head is seen, emit an easy parse  
> error.

Same with character data.

> Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Except that for the ISO-8859-* family, the easy error recovery should be  
emitting the characters according to the corresponding Windows-*  
family superset.
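
To make the recovery rule concrete, here is a minimal Python sketch. It  
assumes that an "encoding error" in an ISO-8859-* document means a byte in  
the 0x80-0x9F range, and it lists only the two pairings I am sure are  
supersets outside that range. The WINDOWS_SUPERSET table and the  
decode_with_recovery() name are illustrative, not from the proposal.

```python
REPLACEMENT = "\ufffd"

# Illustrative table: ISO-8859-* label -> Python codec for the Windows superset.
WINDOWS_SUPERSET = {
    "iso-8859-1": "cp1252",
    "iso-8859-9": "cp1254",
}


def decode_with_recovery(data: bytes, declared: str) -> tuple[str, bool]:
    """Return (text, had_easy_parse_error) for the declared encoding."""
    name = declared.lower()
    fallback = WINDOWS_SUPERSET.get(name)

    if fallback is None:
        # Generic case: emit U+FFFD on bogus data.
        try:
            return data.decode(name), False
        except UnicodeDecodeError:
            return data.decode(name, errors="replace"), True

    out = []
    had_error = False
    for b in data:
        if 0x80 <= b <= 0x9F:
            # Treated as an encoding error, recovered via the Windows mapping.
            had_error = True
            try:
                out.append(bytes([b]).decode(fallback))
            except UnicodeDecodeError:
                # Unassigned in the Windows code page as well.
                out.append(REPLACEMENT)
        else:
            out.append(bytes([b]).decode(name))
    return "".join(out), had_error
```

For example, decode_with_recovery(b"\x93hi\x94", "ISO-8859-1") yields the  
text with curly quotes (U+201C and U+201D) and reports an error, instead of  
producing two U+FFFD characters.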

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Saturday, 11 March 2006 14:49:03 UTC