Validator charset (was: Re: Bug 85/4494 (keeping track of validation statistics...)

Frank Ellermann wrote:
> Unrelated, I like your "encoding divination flowchart",
> <http://nikitathespider.com/articles/EncodingDivination.html>
> the HTML5 folks could use it to prove that they cannot
> really make it worse, because it already is FUBAR... ;-)

This is a great chart, and I immediately bookmarked it because of what 
I'm working on today...

I'm trying to mine interesting things from the validation research I did, 
and I've run into a bit of a wall. I don't really understand the 
validator's heuristic for choosing which charset to use when validating 
(especially when the charset is set to 'automatically detect').

To determine the charset to use, there are generally 3 sources:
   - HTTP header: charset parameter of the Content-Type field
   - META content-type: charset component of the content value
   - XML declaration: encoding attribute
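
A minimal Python sketch of pulling a declared charset out of each of 
these three sources. The function name `declared_charsets` and the 
regular expressions are assumptions made up for illustration; this is 
not the validator's or MAMA's actual logic, and a regex scan is much 
cruder than a real HTML/XML parser.

```python
import re

# Hypothetical patterns, for illustration only.
_HTTP_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.I)
_XML_RE = re.compile(r'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']', re.I)
_META_RE = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([^\s"\'/>;]+)', re.I)


def declared_charsets(content_type, markup):
    """Return the charset declared by each source, or None if absent.

    content_type: value of the HTTP Content-Type header (may be None).
    markup: the document text, already decoded to a str.
    """
    def first(regex, text):
        match = regex.search(text or "")
        return match.group(1).lower() if match else None

    return {
        "http": first(_HTTP_RE, content_type),
        "meta": first(_META_RE, markup),
        "xml": first(_XML_RE, markup),
    }
```

This only answers "is a value declared, and what is it"; it does not 
attempt any fallback or byte-level sniffing.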

The validation response tells you which charset was chosen.
The validator also produces W04 (no charset detected), and W18-W20 
(combinations of mismatches between the above 3 sources). W24 also 
exists, but I don't know exactly what triggers it.

I previously did an analysis of these same three factors independently 
of the validation phase for these URLs, but the analysis and validation 
were separated by about 2 months, so some of the documents could 
definitely have changed between the two runs.

Biggest difference I noticed:
- Cases where the validator reported a charset but my analysis could not 
find a charset in any of the three sources.
Is there a fallback in the validator that might let it determine a 
charset where my tool couldn't?

Olivier wrote:
 > * stats on the documents themselves. Doctype, mime type, charset.
 > Ideally, whether charset is in HTTP, XML decl, meta.

I have statistics on all the above, and I'm trying to develop something 
interesting to present about charsets. Towards that end, I've compiled 
the following statistics, hoping that some of it might be interesting. 
With 3 variables in my analysis portion, and at *least* 5 variables on 
the validator side, presenting all of this in a way that would be 
non-dull is a..."challenge". 8-}

Olivier mentioned that there is previous research in this area - any 
pointers to that research? I haven't seen it...maybe that research has 
found a way to tame the multi-headed charset beast.

Leading back to the original quote, does the validator employ any large 
part of that "Encoding divination flowchart" to determine encoding? My 
analysis was much simpler, as in: "is there a value there or not".

You can stop reading now if you don't want to be bored. 8-} So far, I've 
aggregated the statistics into 36 different views on all these different 
character set sources and reported values and warnings. I'm *still* not 
exhausting all the possible variants here either. This feels like WAY 
TOO MUCH DETAIL, but maybe some wiser heads have thoughts on how to make 
better sense of this.

Here's the current little mini-list of stats I'm gathering. (MAMA is the 
name of the analysis tool I wrote/used):

MAMA Charset combinations:
------------------------
...1 charset source: META only
...1 charset source: XML only
...1 charset source: HTTP only
...2 charset sources only: HTTP and META
.........all agree
...2 charset sources only: HTTP and XML
.........all agree
...2 charset sources only: META and XML
.........all agree
...All charset sources present
.........all agree
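
For illustration, the combinations above could be tallied with something 
like the following Python sketch. The names `classify_combination` and 
`tally`, the label strings, and the dict-per-URL input shape are all 
assumptions; MAMA's real implementation is not shown in this message. 
Charset names are compared case-insensitively, which is also an 
assumption about what "agree" should mean.

```python
from collections import Counter


def classify_combination(charsets):
    """Label which sources declare a charset and whether they all agree.

    `charsets` maps a source name ("http", "meta", "xml") to the declared
    charset string, or None when that source declares nothing.
    """
    present = {src: cs.strip().lower() for src, cs in charsets.items() if cs}
    if not present:
        return "no charset source"
    label = " and ".join(sorted(src.upper() for src in present))
    if len(present) == 1:
        return label + " only"
    agreement = "all agree" if len(set(present.values())) == 1 else "disagree"
    return f"{label}: {agreement}"


def tally(records):
    """Count how many records fall into each combination label."""
    return Counter(classify_combination(rec) for rec in records)
```

Usage, with made-up records:

```python
counts = tally([
    {"http": "UTF-8", "meta": "utf-8", "xml": None},
    {"http": None, "meta": "iso-8859-1", "xml": None},
])
```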

Validator charset versus MAMA charsets:
------------------------
...Neither Validator nor MAMA can detect charset
...No Validator charset but MAMA detected charset
...Yes Validator charset but no MAMA charset detected
...Yes Validator charset but found at least 1 Conflicting charsets Warning
...Yes Validator charset but got No Charset Warnings

Warning 04: No character encoding found
------------------------
...No W04 warning issued, but no Validator charset detected
...No W04 warning issued, and Validator charset detected
...No W04 warning issued, and MAMA charset detected
...No W04 warning issued, but no MAMA charset detected
...W04 warning issued, and no Validator charset detected
...W04 warning issued, but Validator charset detected
...W04 warning issued, but MAMA charset detected
...W04 warning issued, and no MAMA charset detected
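
One possible way to cut down the number of views: the eight W04 lines 
above are all pairwise slices of a single three-way table (W04 issued or 
not, Validator charset or not, MAMA charset or not). A minimal Python 
sketch follows; the record fields `warnings`, `validator_charset`, and 
`mama_charset` are hypothetical names, not the actual data layout.

```python
from collections import Counter


def w04_crosstab(records):
    """Count records by (W04 issued?, validator charset?, MAMA charset?).

    Each record is a dict with hypothetical keys:
      "warnings"          - collection of warning codes, e.g. {"W04", "W19"}
      "validator_charset" - charset reported by the validator, or None
      "mama_charset"      - any charset found by MAMA, or None
    """
    counts = Counter()
    for rec in records:
        key = (
            "W04" in rec["warnings"],
            bool(rec["validator_charset"]),
            bool(rec["mama_charset"]),
        )
        counts[key] += 1
    return counts
```

For example, "W04 warning issued, but Validator charset detected" would 
be the sum of the counts for the keys (True, True, True) and 
(True, True, False).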

Warning 18: Character encoding mismatch
(HTTP header/XML encoding)
------------------------
...W18 issued. MAMA detected HTTP header encoding but no XML encoding
...W18 issued. MAMA detected XML encoding but no HTTP header encoding
...W18 issued. MAMA detected both HTTP header encoding and XML encoding
       - Both same
       - Both different

Warning 19: Character encoding mismatch
(HTTP header/META element encoding)
------------------------
...W19 issued. MAMA detected HTTP header encoding but no META encoding
...W19 issued. MAMA detected META encoding but no HTTP header encoding
...W19 issued. MAMA detected both HTTP header encoding and META encoding
       - Both same
       - Both different

Warning 20: Character encoding mismatch
(XML encoding/META element encoding)
------------------------
...W20 issued. MAMA detected XML encoding but no META encoding
...W20 issued. MAMA detected META encoding but no XML encoding
...W20 issued. MAMA detected both META encoding and XML encoding
       - Both same
       - Both different
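
The three mismatch breakdowns above share one shape, so a single helper 
could produce all of them. This is a Python sketch under assumptions: 
the name `mismatch_bucket` and the bucket labels are invented for 
illustration, the source pair for each warning is taken from the 
headings above, and the "neither detected" bucket is an extra case that 
is not in the list but can occur.

```python
# Source pairs compared by each warning, as given in the headings above.
MISMATCH_PAIRS = {
    "W18": ("http", "xml"),
    "W19": ("http", "meta"),
    "W20": ("xml", "meta"),
}


def mismatch_bucket(warning, charsets):
    """Say what MAMA saw for the two sources a mismatch warning compares.

    `charsets` maps "http" / "meta" / "xml" to a charset string or None.
    Comparison is case-insensitive (an assumption).
    """
    first, second = MISMATCH_PAIRS[warning]
    a = (charsets.get(first) or "").strip().lower()
    b = (charsets.get(second) or "").strip().lower()
    if a and b:
        return "both same" if a == b else "both different"
    if a:
        return f"{first.upper()} only"
    if b:
        return f"{second.upper()} only"
    return "neither detected"
```

Any URL that lands in a bucket other than "both different" while the 
warning was issued is a candidate for drift between the two runs or for 
a difference in detection logic.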

Would presenting URLs satisfying any of these comparison criteria types 
be interesting?

Any and all advice appreciated!

Thanks,
-Brian

Received on Saturday, 8 March 2008 00:10:59 UTC