mime & web and charsets

For the Internet draft on "Internet Media Types and the Web",
I was thinking we might also look at the charset registry
and the information that might belong there.  To this end,
I'm planning to go back to calling this "MIME and the Web"
and not just "Internet Media Types and the Web".


For example, one might consider extending the MIME charset 
registry itself to include information such as:

(a) For any registered charset, any alternative charsets
for which there is significant content that is labeled with the
registered charset but where the alternative charset is actually
meant/used, with information about:

   * BAD CONTENT: what percentage of content is mislabeled
     (at the time of submission of the alternative), with a
     sample of, say, half a dozen different web pages which
     exhibit this mislabeling

   * BAD COMPOSERS: which deployed content producers (email
     composers or web content authoring or creation tools) mislabel
     content intended to be in the alternative charset.
     (Content creators should be identified by version, and
      market share)

These should identify which MIME types the charset mislabeling
is used for: only text/html, also text/plain, or other types?

and
   * FORGIVING CONSUMERS: A survey (at the time of submission)
     of which widely deployed consumers use the alternative
     charset instead of the labeled one.
     A "consumer" for HTML would be a web browser or a search
     engine.

With this information in the MIME charset registry or some other
registry which is open for qualified submission, it will be
possible to reference the MIME registry of charsets when
giving advice to "forgiving consumers".
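
To make (a) a bit more concrete, here is a rough sketch of what one
such registry record might look like and how a "forgiving consumer"
could consult it. Everything in it is an assumption for illustration:
the names Alternative, RegistryEntry and advice_for are made up, the
charset names are placeholders, and the numbers are not measurements.
The real registry has no such fields today.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alternative:
    """One alternative charset recorded against a registered charset."""
    charset: str             # charset actually meant/used
    mislabeled_share: float  # BAD CONTENT: fraction mislabeled at submission time
    sample_pages: List[str] = field(default_factory=list)         # sample URLs
    bad_composers: List[str] = field(default_factory=list)        # tool and version
    forgiving_consumers: List[str] = field(default_factory=list)  # consumers using the alternative
    media_types: List[str] = field(default_factory=list)          # e.g. text/html


@dataclass
class RegistryEntry:
    charset: str
    alternatives: List[Alternative] = field(default_factory=list)


def advice_for(entry: RegistryEntry, media_type: str) -> List[str]:
    """Alternatives a forgiving consumer might try for this media type,
    most frequently mislabeled first."""
    hits = [a for a in entry.alternatives if media_type in a.media_types]
    hits.sort(key=lambda a: a.mislabeled_share, reverse=True)
    return [a.charset for a in hits]


# Placeholder data only: names and numbers are invented, not measurements.
entry = RegistryEntry(
    charset="x-labeled",
    alternatives=[
        Alternative("x-actual-a", 0.10, media_types=["text/html"]),
        Alternative("x-actual-b", 0.30, media_types=["text/html", "text/plain"]),
    ],
)

print(advice_for(entry, "text/html"))
print(advice_for(entry, "text/plain"))
```

Ordering by mislabeled share is only one possible policy; a consumer
could just as well weigh the survey of other consumers' behavior.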

(b) "Sniffing info":  I'm a little less certain about this, but
given there are tables for charset info, is there a way to put
this into the MIME charset registry, e.g., for any registered charset,
allow including info on how that charset might be guessed, or
some kind of general confidence level for deciding between
alternatives? Compared to using "magic numbers" in content,
do the heuristics for charset sniffing have a wider distribution of
confidence levels?

If there are other charsets for which "sniffing" might
be ambiguous, a pointer to those...
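
As a toy illustration of the difference between a "magic number" and
a sniffing heuristic, consider the sketch below. It is an assumption
for discussion, not a proposal for the actual algorithm: the function
name sniff is made up, the only magic number checked is the UTF-8
byte order mark, and "confidence" is crudely taken as one over the
number of candidates that decode without error.

```python
import codecs
from typing import List, Tuple


def sniff(data: bytes, candidates: List[str]) -> List[Tuple[str, float]]:
    """Return (charset, confidence) guesses for the given bytes."""
    # Magic number: a UTF-8 byte order mark is treated as decisive.
    if data.startswith(codecs.BOM_UTF8):
        return [("utf-8", 1.0)]

    # Heuristic: keep every candidate that decodes without error.
    plausible = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue
        plausible.append(name)

    if not plausible:
        return []
    # Crude assumption: split confidence evenly among the survivors.
    share = 1.0 / len(plausible)
    return [(name, share) for name in plausible]


print(sniff(codecs.BOM_UTF8 + b"hi", ["utf-8", "iso-8859-1"]))
print(sniff(b"caf\xe9", ["utf-8", "iso-8859-1", "windows-1252"]))
```

In the second call the bytes are not valid UTF-8, but both
iso-8859-1 and windows-1252 decode them, so neither guess is
certain. That is the kind of ambiguity a registry pointer could
record.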

Just trying to see if there's some way of aligning the
charset registry not only with formal definitions but also
with deployed infrastructure & agents.

Larry
--
http://larry.masinter.net

Received on Thursday, 7 October 2010 19:31:54 UTC