Re: Fallback to UTF-8

Scripsit Frank Ellermann:

> I'd prefer a completely unlikely "SBCS" with proper subset ASCII
> permitting all octets from 0x80 up to 0xFF.

That sounds like the best option, and it's a simple one, except for the 
explanations. It's not any particular encoding but rather an open class 
of encodings. But it will do fine, since it's a correct guess in the vast 
majority of cases (including all the pages that are really windows-1252 
or plain ASCII, as well as pages in various national 8-bit encodings), 
and it's irrelevant what the octets 0x80 to 0xFF mean in such encodings. 
Some of them might be undefined in a particular encoding, but we don't 
know the real encoding anyway.
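
A rough sketch of that assumption in Python, purely for illustration: 
octets below 0x80 are read as ASCII, and every octet from 0x80 to 0xFF is 
accepted but mapped to a placeholder, since its meaning is unknown. The 
function name and the choice of U+FFFD as placeholder are my own 
assumptions, not anything the validator is known to do.

```python
# Placeholder for octets whose meaning is unknown (an assumed choice).
PLACEHOLDER = "\ufffd"

def decode_unknown_sbcs(data: bytes) -> str:
    """Decode bytes as 'some 8-bit encoding that is a superset of ASCII'.

    Octets 0x00-0x7F are taken as ASCII; octets 0x80-0xFF are all
    permitted, but each becomes one placeholder character.
    """
    return "".join(chr(b) if b < 0x80 else PLACEHOLDER for b in data)

# One character per octet, so error positions still line up with the source.
text = decode_unknown_sbcs(b"<p>caf\xe9</p>")
```

Markup, being ASCII, survives intact, so tentative validation of the 
element structure remains possible whatever the real encoding is.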

> And at the end, after
> all other errors based on this assumption are reported, one final
> "you lose - unknown charset" (optional as gimmick:  "whatever it
> is, it's certainly not UTF-8", if that is known in your scenario).


Well, hopefully nothing like that. I think the report should be 
_preceded_ by a clear note, and might end with a note too (since people 
may miss the initial note). It could, directly or indirectly (via a 
link), say something like the following:

The document cannot be validated, since the character encoding has not 
been specified. However, tentative validation was carried out based on 
the assumption that the encoding is some 8-bit encoding where the first 
128 code positions are as in US-ASCII. You should specify the encoding 
as described in HTML specifications and resubmit the document for 
validation.
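
To make the placement concrete, here is a minimal sketch in Python of a 
report that is preceded by the note and, when there are messages that 
could push it out of sight, ends with it too. The names NOTE and 
build_report, and the form of the error messages, are hypothetical.

```python
NOTE = (
    "The document cannot be validated, since the character encoding has "
    "not been specified. However, tentative validation was carried out "
    "based on the assumption that the encoding is some 8-bit encoding "
    "where the first 128 code positions are as in US-ASCII. You should "
    "specify the encoding as described in HTML specifications and "
    "resubmit the document for validation."
)

def build_report(errors: list[str]) -> str:
    """Return the report text with the note first and, if any errors
    are listed, repeated at the end."""
    lines = [NOTE]
    if errors:
        lines += ["", *errors, "", NOTE]
    return "\n".join(lines)

report = build_report(["Line 3: end tag for element P which is not open"])
```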


Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/ 

Received on Thursday, 29 November 2007 22:22:31 UTC