Re: Auto-detect and encodings in HTML5

Leif Halvard Silli wrote:
> John Cowan On 09-05-28 23.08:
>> Leif Halvard Silli scripsit:
>>
>>> <meta name="Title" charset="Beagle Kennel van der Liniehoeve">
>>
>> Well, this does say "charset" rather than "content".
> 
> Yes, currently HTML doesn't have any @charset attribute. @charset is 
> only a new invention of the HTML 5 draft.

(It's newly specified in HTML 5, but it's been supported by the major 
web browsers for practically forever.)

> if I read the data correctly, then the HTML 5 draft algorithm that 
> Philip used, was unable to decode the correct charset info in the 
> _first_ meta element.

I looked for the first charset in a <meta content>, and independently 
looked for the first <meta charset>, so that particular page was counted 
in both of those columns of the table. The "sniffer" column is the one 
that matched the algorithm in HTML 5, which stops after finding the 
first thing that looks like a charset specification. For this page 
it reported windows-1252.
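
To make the three columns concrete, here is a rough Python sketch. It is 
not the code behind the survey and not the real HTML 5 prescan (which 
works on raw bytes); MetaScanner, scan and is_known are made-up names, 
and I'm assuming "looks like a charset specification" means "names an 
encoding the decoder knows", with Python's codec registry standing in 
for a browser's list.

```python
import codecs
import re
from html.parser import HTMLParser

# Pulls a label out of a content value such as "text/html; charset=windows-1252".
CONTENT_CHARSET = re.compile(r"charset\s*=\s*[\"']?([^\"'\s;]+)", re.IGNORECASE)


def is_known(label):
    # Assumption: Python's codec registry stands in for the browser's list.
    try:
        codecs.lookup(label)
    except LookupError:
        return False
    return True


class MetaScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta_charset = None  # first <meta charset=...>, valid or not
        self.meta_content = None  # first charset=... inside <meta content=...>
        self.sniffer = None       # first known label from either form

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        labels = []
        charset = attrs.get("charset")
        if charset:
            labels.append(charset)
            if self.meta_charset is None:
                self.meta_charset = charset
        match = CONTENT_CHARSET.search(attrs.get("content") or "")
        if match:
            labels.append(match.group(1))
            if self.meta_content is None:
                self.meta_content = match.group(1)
        # The sniffer keeps only the first label that names a known encoding.
        if self.sniffer is None:
            self.sniffer = next((l for l in labels if is_known(l)), None)


def scan(html):
    scanner = MetaScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.meta_charset, scanner.meta_content, scanner.sniffer
```

Feeding scan() the quoted <meta name="Title" charset="..."> element 
followed by a hypothetical <meta http-equiv="Content-Type" 
content="text/html; charset=windows-1252"> gives the Beagle string for 
the first value and windows-1252 for the other two, which is how one 
page can land in both the <meta charset> and <meta content> columns.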

> Measured against HTML 4, there seem to be _several_ errors in the 
> analysis/findings that are presented on that page. For instance, roughly 
> all the pages mentioned under the following fragment seem to have OK 
> charset info in their meta elements (and there are many other examples 
> of the same) - despite Philip's page saying there were errors:
> 
> http://philip.html5.org/data/charsets.html#charset-en

Most of those pages are sending HTTP headers like "Content-Type: 
text/html; charset=en" - the HTML has nothing to do with it. They're 
marked as 'invalid' because "en" is not a known character encoding. 
('invalid' in that data just means the page's bytes couldn't be decoded 
with the specified encoding.)
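
As an illustration of what 'invalid' means here, a minimal Python 
sketch (again not the survey's actual code). header_charset and classify 
are hypothetical helpers, the "ok" and "unspecified" results are my own 
labels, and Python's codecs stand in for whatever decoder list the 
survey used.

```python
from email.message import Message


def header_charset(content_type):
    # Let the stdlib parse the charset parameter out of a Content-Type value.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def classify(body, content_type):
    label = header_charset(content_type)
    if label is None:
        return "unspecified"
    try:
        body.decode(label)
    except (LookupError, UnicodeDecodeError):
        # Unknown label (e.g. "en"), or bytes that are not valid in it.
        return "invalid"
    return "ok"
```

With this, classify(b"hello", "text/html; charset=en") comes back 
'invalid' whatever the <meta> elements in the body say, because the 
label from the HTTP header is tried and "en" does not name an encoding.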

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Friday, 29 May 2009 00:14:16 UTC