
Re: faq suggestions

From: Jungshik Shin <jshin@i18nl10n.com>
Date: Mon, 23 Aug 2004 12:40:18 +0900
Message-ID: <41296722.3060500@i18nl10n.com>
To: Tex Texin <tex@i18nguy.com>
CC: www-international@w3.org

Tex Texin wrote:

Hi Tex,

> With respect to user agents reparsing documents from the beginning, can you say
> which ones do this?

   Mozilla does, and apparently MS IE does, too. Otherwise, it wouldn't
be able to handle some HTML documents I came across that have the 'meta'
rather deep inside the document, with non-ASCII characters (in the CSS
font specification and the title) before it.

> They are not obligated to and the wording of the standards implies that the
> encoding "switch" from the initial value to the value specified in the charset
> statement, occurs at the point the statement is parsed.

  I have yet to check the spec about this. Even though they're not
obligated to, in practice they have to, because I've seen quite a lot of
documents with 'meta charset' buried deep inside and non-ASCII
characters before it. Needless to say, I frown upon those documents, but
I couldn't track down every one of them.
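
To make the difference concrete, here is a small Python sketch (an
illustration only, not what Mozilla or MS IE actually do internally).
The page, the EUC-KR encoding and the Windows-1252 default are all
assumed for the example. It compares switching encodings at the point
the declaration is seen with decoding again from the beginning.

```python
# Hypothetical page: a Korean title comes before the 'meta charset' declaration.
title = "한글"
page = (
    b"<html><head><title>" + title.encode("euc-kr") + b"</title>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=EUC-KR">'
    b"</head><body></body></html>"
)

meta_at = page.index(b"<meta")

# Strategy 1: switch at the point the declaration is parsed. Everything
# before it has already been decoded with the assumed default encoding.
switched = (
    page[:meta_at].decode("windows-1252", errors="replace")
    + page[meta_at:].decode("euc-kr")
)

# Strategy 2: on seeing the declaration, start over and decode the whole
# document with the declared encoding.
reparsed = page.decode("euc-kr")

print(title in switched)   # the title is garbled under strategy 1
print(title in reparsed)   # the title survives under strategy 2
```

Only the second strategy recovers the title, which is why documents
like these force a reparse in practice.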

> On a separate point I wonder if you meant ASCII-compatible or simply ASCII.

I meant 'ASCII-compatible' (not pure ASCII). A couple of months ago, I
submitted a patch to Nutch (an open source crawler/search engine) that
scans the first 4(?) kB of an HTML document for a 'meta charset'
declaration, assuming the bytes are Windows-1252 (nothing special about
Windows-1252 other than that octets 0x80 through 0xaf are valid as
well as 0xb0 through 0xff). Mozilla does something similar.
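
A rough sketch of that kind of prescan, in Python rather than Nutch's
Java, and not the actual patch: the function name, the regular
expression and the exact 4096-byte limit are assumptions made for
illustration.

```python
import re

PRESCAN_BYTES = 4096  # assumed value for "the first 4(?) kB"

# Loose pattern (an assumption): a 'meta' tag carrying charset=..., which
# covers the http-equiv/content form as well as a bare charset attribute.
META_CHARSET = re.compile(
    r"""<meta[^>]*?charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE
)

def sniff_meta_charset(raw: bytes, limit: int = PRESCAN_BYTES):
    """Return the declared charset in lower case, or None if none is found."""
    # Decode as Windows-1252 only to get at the ASCII markup. Python's codec
    # rejects a handful of undefined octets, so replace those instead of failing.
    head = raw[:limit].decode("windows-1252", errors="replace")
    match = META_CHARSET.search(head)
    return match.group(1).lower() if match else None

sample = (
    b"<html><head><title>\xc7\xd1\xb1\xdb</title>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=EUC-KR">'
    b"</head></html>"
)
print(sniff_meta_charset(sample))
```

Because the markup itself is ASCII, any ASCII-compatible guess would
find the declaration; the guess only needs to get past the non-ASCII
octets in front of it without stopping.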

Jungshik
Received on Monday, 23 August 2004 03:44:02 GMT
