- From: Brian Wilson <bloo@blooberry.com>
- Date: Sat, 08 Mar 2008 01:10:33 +0100
- To: www-validator@w3.org
Frank Ellermann wrote:

> Unrelated, I like your "encoding divination flowchart",
> <http://nikitathespider.com/articles/EncodingDivination.html>
> the HTML5 folks could use it to prove that they cannot
> really make it worse, because it already is FUBAR... ;-)

This was a great chart, and I immediately bookmarked it because of what I'm working on today...

I'm trying to mine interesting things from the validation research I did and I've run into a bit of a wall. I don't really understand the validator's heuristic in choosing what charset to use when validating (especially when the charset is set to 'automatically detect').

To determine the charset to use, there are generally 3 sources:

- HTTP header charset value of the content-type field
- META content-type, charset component value
- XML encoding attribute

In the validation response, it tells you what charset was chosen. The validator also produces W04 (no charset detected), and W18-W20 (combinations of mismatches between the above 3 sources). W24 also exists, but I don't know what exactly triggers that.

I previously did an analysis of these same three factors independently of the validation phase for these URLs, but the analysis and validation were separated by about 2 months, so there could definitely have been some drift between the results because of the elapsed time.

Biggest difference I noticed:

- Cases where the validator reported a charset but my analysis could not find any of the three reported types. Is there a fallback that the validator uses that might be causing the validator to achieve traction where my tool didn't?

Olivier wrote:

> * stats on the documents themselves. Doctype, mime type, charset.
> Ideally, whether charset is in HTTP, XML decl, meta.

I have statistics on all the above, and I'm trying to develop something interesting to present about charsets. Towards that end, I've compiled the following statistics, hoping that some of it might be interesting.
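As a rough illustration of the three sources listed above, here is a minimal Python sketch of a "is there a value there or not" check and of sorting a document into a source combination. It is not MAMA's code and not the validator's heuristic: all function names (`http_charset`, `meta_charset`, `xml_charset`, `detect`, `classify`) are hypothetical, the matching is regex-based rather than a real parser, and no fallback detection of any kind is attempted.

```python
import re
from typing import Dict, Optional, Tuple


def http_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of an HTTP Content-Type value, if present."""
    if not content_type:
        return None
    m = re.search(r'charset\s*=\s*"?([^\s;"]+)', content_type, re.I)
    return m.group(1).lower() if m else None


def meta_charset(markup: str) -> Optional[str]:
    """Return the charset given in a META http-equiv="Content-Type" element."""
    for tag in re.findall(r'<meta\b[^>]*>', markup, re.I):
        if re.search(r'http-equiv\s*=\s*["\']?content-type', tag, re.I):
            m = re.search(r'charset\s*=\s*["\']?([^\s;"\'>]+)', tag, re.I)
            if m:
                return m.group(1).lower()
    return None


def xml_charset(markup: str) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, if present."""
    m = re.match(r'<\?xml\b[^>]*?\bencoding\s*=\s*["\']([^"\']+)["\']', markup, re.I)
    return m.group(1).lower() if m else None


def detect(content_type: Optional[str], markup: str) -> Dict[str, Optional[str]]:
    """Collect the declared charset (or None) from each of the three sources."""
    return {
        "HTTP": http_charset(content_type),
        "META": meta_charset(markup),
        "XML": xml_charset(markup),
    }


def classify(sources: Dict[str, Optional[str]]) -> Tuple[Tuple[str, ...], Optional[bool]]:
    """Return (names of sources present, whether they all agree).

    The agreement flag is None when fewer than two sources are present,
    since there is nothing to compare.
    """
    present = {name: value for name, value in sources.items() if value}
    names = tuple(sorted(present))
    if len(present) < 2:
        return names, None
    return names, len(set(present.values())) == 1


if __name__ == "__main__":
    doc = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=ISO-8859-1"></head></html>'
    )
    found = detect("text/html; charset=UTF-8", doc)
    print(found, classify(found))
```

Comparing charset names case-insensitively, as done here by lowercasing, is a simplification; aliases of the same encoding would still be counted as a disagreement.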
With 3 variables in my analysis portion, and at *least* 5 variables on the validator side, presenting all of this in a way that would be non-dull is a..."challenge". 8-}

Olivier mentioned that there is previous research on this area - any pointers to that research? I haven't seen it...maybe that research has found a way to tame the multi-headed charset beast.

Leading back to the original quote, does the validator employ any large part of that "Encoding divination flowchart" to determine encoding? My analysis was much simpler, as in: "is there a value there or not".

You can stop reading now if you don't want to be bored. 8-}

So far, I've aggregated the statistics into 36 different views on all these different character set sources and reported values and warnings. I'm *still* not exhausting all the possible variants here either. This feels like WAY TOO MUCH DETAIL, but maybe some wiser heads have thoughts on how to make better sense of this.

Here's the current little mini-list of stats I'm gathering.
(MAMA is the name of the analysis tool I wrote/used):

MAMA Charset combinations:
------------------------
...1 charset source: META only
...1 charset source: XML only
...1 charset source: HTTP only
...2 charset sources only: HTTP and META
.........all agree
...2 charset sources only: HTTP and XML
.........all agree
...2 charset sources only: META and XML
.........all agree
...All charset sources present
.........all agree

Validator charset versus MAMA charsets:
------------------------
...Neither Validator nor MAMA can detect charset
...No Validator charset but MAMA detected charset
...Yes Validator charset but no MAMA charset detected
...Yes Validator charset but found at least 1 Conflicting charsets Warning
...Yes Validator charset but got No Charset Warnings

Warning 04: No character encoding found
------------------------
...No W04 warning issued, but no Validator charset detected
...No W04 warning issued, and Validator charset detected
...No W04 warning issued, and MAMA charset detected
...No W04 warning issued, but no MAMA charset detected
...W04 warning issued, and no Validator charset detected
...W04 warning issued, but Validator charset detected
...W04 warning issued, but MAMA charset detected
...W04 warning issued, and no MAMA charset detected

Warning 18: Character encoding mismatch (HTTP header/XML encoding)
------------------------
...W18 issued. MAMA detected HTTP header encoding but no XML encoding
...W18 issued. MAMA detected XML encoding but no HTTP header encoding
...W18 issued. MAMA detected both HTTP header encoding and XML encoding
      - Both same
      - Both different

Warning 19: Character encoding mismatch (HTTP header/META element encoding)
------------------------
...W19 issued. MAMA detected HTTP header encoding but no META encoding
...W19 issued. MAMA detected META encoding but no HTTP header encoding
...W19 issued. MAMA detected both HTTP header encoding, META encoding
      - Both same
      - Both different

Warning 20: Character encoding mismatch (XML encoding/META element encoding)
------------------------
...W20 issued. MAMA detected XML encoding but no META encoding
...W20 issued. MAMA detected META encoding but no XML encoding
...W20 issued. MAMA detected both META encoding and XML encoding
      - Both same
      - Both different

Would presenting URLs satisfying any of these comparison criteria types be interesting? Any and all advice appreciated!

Thanks,

-Brian
Received on Saturday, 8 March 2008 00:10:59 UTC