Re: Validator charset (was: Re: Bug 85/4494 (keeping track of validation statistics...)

Hi Brian,

On Mar 7, 2008, at 19:10 , Brian Wilson wrote:
>> <http://nikitathespider.com/articles/EncodingDivination.html>
>>
> This was a great chart and immediately bookmarked it because of what  
> I'm working on today...

It's indeed very good. So much so that it got me writing, and rambling,
at length, about character encodings, and of course linking back to the
chart:
http://www.w3.org/QA/2008/03/html-charset.html

That entry could certainly use a bit of reworking, but it's there.

> I'm trying to mine interesting things from the validation research I  
> did  and I've run into a bit of a wall. I don't really understand  
> the validator's heuristic in choosing what charset to use when  
> validating

Quick answer: we use the perl HTML::Encoding module.

Longer answer: the validator looks (in order) at
* the HTTP Content-Type header
* then (if applicable) the XML declaration
* then a meta element
* then falls back to utf-8
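
To make the order above concrete, here is a rough Python sketch. It is
not the validator's actual code (that goes through HTML::Encoding in
Perl); the function names and the regular expressions are my own
simplifications, and it checks for an XML declaration unconditionally
rather than only "if applicable".

```python
import re


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    m = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', value or "", re.I)
    return m.group(1).lower() if m else None


def charset_from_xml_decl(head):
    """Return the encoding named in a leading XML declaration, or None."""
    m = re.match(rb'\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', head)
    return m.group(1).decode("ascii").lower() if m else None


def charset_from_meta(head):
    """Return the charset named in the first matching meta element, or None."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', head, re.I)
    return m.group(1).decode("ascii").lower() if m else None


def pick_charset(content_type, body):
    """Apply the sources in priority order; return (charset, source)."""
    head = body[:1024]  # hypothetical cut-off: only sniff the start of the document
    candidates = (
        ("http", charset_from_content_type(content_type)),
        ("xml", charset_from_xml_decl(head)),
        ("meta", charset_from_meta(head)),
    )
    for source, found in candidates:
        if found:
            return found, source
    return "utf-8", "fallback"
```

Returning the source along with the charset makes it easy to see
which rule fired for a given document.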

AFAIK the W3C Markup Validator does not follow the recommendation of
falling back to iso-8859-1 (as is the rule, kinda, for text/*)
because… it's just a bad fallback. However, I think in the future it
would be nice for the validator to try utf-8, then windows-1252, then
iso-latin-1 (iso-8859-1), as fallbacks.
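
A minimal sketch of what such a fallback chain could look like, in
Python. This is only an illustration of the idea, not something the
validator does today; the function name is made up, and it relies on
Python's own codecs, whose windows-1252 rejects the few bytes that
encoding leaves undefined.

```python
FALLBACKS = ("utf-8", "windows-1252", "iso-8859-1")


def decode_with_fallbacks(data, candidates=FALLBACKS):
    """Try each candidate encoding in turn; return (text, encoding used)."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this guess does not fit the bytes, try the next one
    raise ValueError("no candidate encoding could decode the input")
```

Since iso-8859-1 maps every byte to a character, putting it last means
the chain always produces something, which is also why a successful
decode there says very little about whether the guess was right.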

> In the validation response, it tells you what charset was chosen.
> The validator also produces W04 (no charset detected), and W18-W20  
> (combinations of mismatches between the above 3 sources). W24 also  
> exists, but I don't know what exactly triggers that.

W24 is for encoding aliases that happen to "work" but are not  
recommended. I am doubtful it is very frequent, or very reliable for  
that matter.


> Olivier mentioned that there is previous research on this area - any  
> pointers to that research? I haven't seen it...maybe that research  
> has found a way to tame the multi-headed charset beast.

I don't think the previous research covered the many facets, just one.

> You can stop reading now if you don't want to be bored.

Sorry, I'm not bored yet :).

> Here's the current little mini-list of stats I'm gathering. (MAMA is  
> the name of the analysis tool I wrote/used):
>
> MAMA Charset combinations:
> ------------------------
> ...1 charset source: META only
> ...1 charset source: XML only
> ...1 charset source: HTTP only
> ...2 charset sources only: HTTP and META
> .........all agree
> ...2 charset sources only: HTTP and XML
> .........all agree
> ...2 charset sources only: META and XML
> .........all agree
> ...All charset sources present
> .........all agree

Numbers for all of the above would be interesting, in particular as a
way to rethink, revise or confirm the detection rules.
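
For what it's worth, here is roughly how I picture such a tally, as a
Python sketch. It is not MAMA's code (which I have not seen); the
function name and bucket labels are invented, and it compares charset
names as plain lowercased strings, so aliases such as latin1 and
iso-8859-1 would count as a conflict.

```python
from collections import Counter


def classify(http=None, xml=None, meta=None):
    """Name the bucket a document falls into, given its declared charsets."""
    declared = {
        label: value.strip().lower()
        for label, value in (("HTTP", http), ("XML", xml), ("META", meta))
        if value
    }
    if not declared:
        return "no charset source"
    names = " and ".join(declared)
    if len(declared) == 1:
        return f"1 charset source: {names} only"
    verdict = "all agree" if len(set(declared.values())) == 1 else "conflict"
    return f"{len(declared)} charset sources: {names} ({verdict})"


def tally(documents):
    """Count buckets over an iterable of (http, xml, meta) tuples."""
    return Counter(classify(*doc) for doc in documents)
```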

> Validator charset versus MAMA charsets:
> ------------------------
> ...Neither Validator nor MAMA can detect charset
> ...No Validator charset but MAMA detected charset
> ...Yes Validator charset but no MAMA charset detected
> ...Yes Validator charset but found at least 1 Conflicting charsets  
> Warning
> ...Yes Validator charset but got No Charset Warnings

For these, I don't know if anyone other than me would be interested :)  
but I'd love to get a few URIs for each of these cases.


For the rest... That might be overkill :) But who knows. If it doesn't  
cost you too much time to come up with one or two examples for each,  
I'd happily look at them and try to find bugs in either MAMA or the  
validator, or both, or just differences in implementations.

> Warning 04: No character encoding found
> ------------------------
> ...No W04 warning issued, but no Validator charset detected
> ...No W04 warning issued, and Validator charset detected
> ...No W04 warning issued, and MAMA charset detected
> ...No W04 warning issued, but no MAMA charset detected
> ...W04 warning issued, and no Validator charset detected
> ...W04 warning issued, but Validator charset detected
> ...W04 warning issued, but MAMA charset detected
> ...W04 warning issued, and no MAMA charset detected
>
> Warning 18: Character encoding mismatch
> (HTTP header/XML encoding)
> ------------------------
> ...W18 issued. MAMA detected HTTP header encoding but no XML encoding
> ...W18 issued. MAMA detected XML encoding but no HTTP header encoding
> ...W18 issued. MAMA detected both HTTP header encoding and XML  
> encoding
>      - Both same
>      - Both different
>
> Warning 19: Character encoding mismatch
> (HTTP header/META element encoding)
> ------------------------
> ...W19 issued. MAMA detected HTTP header encoding but no META encoding
> ...W19 issued. MAMA detected META encoding but no HTTP header encoding
> ...W19 issued. MAMA detected both HTTP header encoding, META encoding
>      - Both same
>      - Both different
>
> Warning 20: Character encoding mismatch
> (XML encoding/META element encoding)
> ------------------------
> ...W20 issued. MAMA detected XML encoding but no META encoding
> ...W20 issued. MAMA detected META encoding but no XML encoding
> ...W20 issued. MAMA detected both META encoding and XML encoding
>      - Both same
>      - Both different
>
> Would presenting URLs satisfying any of these comparison criteria  
> types be interesting?
>
> Any and all advice appreciated!
>
> Thanks,
> -Brian
>
>


Thanks!
-- 
olivier Thereaux - W3C - http://www.w3.org/People/olivier/
W3C Open Source Software: http://www.w3.org/Status

Received on Monday, 10 March 2008 20:37:57 UTC