- From: Larry Masinter <masinter@adobe.com>
- Date: Thu, 7 Oct 2010 12:31:15 -0700
- To: "julian.reschke@gmx.de" <julian.reschke@gmx.de>, Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: "www-tag@w3.org" <www-tag@w3.org>
For the Internet draft on "Internet Media Types and the Web", I was thinking we might also look at the charset registry and the information that might belong there. To this end, I'm planning to go back to calling the draft "MIME and the Web" rather than just "Internet Media Types and the Web".

For example, one might consider extending the MIME charset registry itself to include information such as:

(a) For any registered charset, any alternative charsets for which there is significant content where the alternative charset is actually meant/used, with information about:

* BAD CONTENT: what percentage of content is so mislabeled (at the time of submission of the alternative), with a sample of, say, half a dozen different web pages which exhibit this mislabeling.

* BAD COMPOSERS: which deployed content producers (email composers, or web content authoring or creation tools) mislabel content intended to be in the alternative charset. Content creators should be identified by version and market share. These entries should also identify which MIME types the charset mislabeling is used for: only text/html, also text/plain, or other types?

* FORGIVING CONSUMERS: a survey (at the time of submission) of the behavior of widely deployed consumers which use the alternative charset instead of the labeled one. A "consumer" for HTML would be a web browser or a search engine.

With this information in the MIME charset registry, or in some other registry which is open for qualified submission, it will be possible to reference the MIME registry of charsets when giving advice to "forgiving consumers".

(b) "Sniffing info": I'm a little less certain about this, but given there are tables for charset info, is there a way to put this into the MIME charset registry? E.g., for any registered charset, allow including info on how that charset might be guessed, or some kind of general confidence level for deciding between alternatives?
Compared to using "magic numbers" in content, do the heuristics for charset sniffing have a wider distribution of confidence levels? If there are other charsets for which "sniffing" might be ambiguous, a pointer to those...

Just trying to see if there's some way of aligning the charset registry not only with formal definitions but also with deployed infrastructure & agents.

Larry
--
http://larry.masinter.net
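
To make the proposal in (a) a bit more concrete, here is a minimal sketch of what one extended registry entry might look like as a data structure, and how a "forgiving consumer" could consult it. Everything in it is an assumption for illustration only: the class names, field names, the `advise_consumer` helper and its `min_percent` threshold are hypothetical and are not part of any existing registry or registration procedure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alternative:
    """Hypothetical record: a charset that mislabeled content actually uses."""
    charset: str                 # the charset actually meant/used
    mislabeled_percent: float    # BAD CONTENT: share mislabeled at submission time
    sample_urls: List[str] = field(default_factory=list)          # example pages
    bad_composers: List[str] = field(default_factory=list)        # producer + version
    media_types: List[str] = field(default_factory=list)          # e.g. text/html
    forgiving_consumers: List[str] = field(default_factory=list)  # surveyed agents


@dataclass
class CharsetEntry:
    """Hypothetical extended registry entry for one registered charset."""
    name: str
    alternatives: List[Alternative] = field(default_factory=list)


def advise_consumer(entry: CharsetEntry, media_type: str, min_percent: float) -> str:
    """Return the charset a forgiving consumer might decode with.

    Falls back to the labeled charset unless some alternative is recorded
    for this media type with at least `min_percent` mislabeled content.
    """
    candidates = [
        alt for alt in entry.alternatives
        if media_type in alt.media_types and alt.mislabeled_percent >= min_percent
    ]
    if not candidates:
        return entry.name
    # Prefer the alternative backed by the largest share of mislabeled content.
    return max(candidates, key=lambda alt: alt.mislabeled_percent).charset
```

With placeholder values (not real survey data), an entry named "example-labeled" that records an alternative "example-actual" at 40 percent for text/html would yield "example-actual" for text/html at a 10 percent threshold, and the labeled charset for text/plain. The "sniffing info" in (b) could presumably be added as further fields on the same record, but that part is left out here since its shape is less clear.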
Received on Thursday, 7 October 2010 19:31:54 UTC