- From: Ian Hickson <ian@hixie.ch>
- Date: Thu, 13 Aug 2009 23:59:22 +0000 (UTC)
On Wed, 5 Aug 2009, Anne van Kesteren wrote:
> On Wed, 05 Aug 2009 02:01:59 +0200, Ian Hickson <ian at hixie.ch> wrote:
> > I'm pretty sure that character encoding support in browsers is more of
> > a "collect them all" kind of thing than really based on content that
> > requires it, to be honest.
>
> Really? I think a lot of them are actually used.

I'm pretty sure not all of them are common.

> If you know anything I'd love to trim the amount of encodings the Web
> needs to a smaller list than what we currently ship with. Ideally this
> becomes a fixed list across all Web languages.

That would be nice.

> > If someone can provide a firm list of encodings that they are
> > confident are required for a certain substantial percentage of the
> > Web, I'm happy to add the list to the spec.
>
> Can you not do a survey on your large dataset of data to find this out?
> I read somewhere also that Adam Barth was able to add code to Google
> Chrome to figure out a better algorithm for Content-Type sniffing. Maybe
> something similar could be done here?

For various reasons, my usual techniques for obtaining data aren't
suitable for encoding-related work. Could MAMA or Opera be instrumented
instead?

> We've encountered problems by the way with using the Unicode encoding
> matching algorithm. Particularly on some Asian sites. I think we need to
> switch HTML5 back to something more akin to WebKit/Gecko/Trident. I
> realize this means more magic lists, but the current algorithm does not
> seem to cut it. E.g. sites rely on the fact that EUC_JP is not a
> recognized encoding but EUC-JP is.

If you let me know what the algorithm should be, I can do that. Is it just
underscores that must not be ignored? Maybe we can just do a delta spec on
the Unicode algorithm? (i.e. say "do what Unicode says except...").

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
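
To make the EUC_JP versus EUC-JP point concrete, here is a rough sketch of the two matching styles under discussion. It assumes the Unicode-style loose match amounts to "drop everything that is not an ASCII letter or digit, lowercase, and drop zeros that do not follow a digit", and that the browser-style match is a plain case-insensitive comparison against a fixed list. The CANONICAL list and all function names are made up for illustration; they are not taken from the spec or from any browser.

```python
import re

# Hypothetical registry of recognized labels (illustration only).
CANONICAL = ["EUC-JP", "UTF-8", "Shift_JIS"]


def loose_key(label: str) -> str:
    """Assumed Unicode-style loose key: keep ASCII alphanumerics only,
    lowercase, then delete zeros that are not preceded by a digit."""
    alnum = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    return re.sub(r"(?<![0-9])0+", "", alnum)


def exact_key(label: str) -> str:
    """Assumed browser-style key: trim whitespace and lowercase,
    leaving punctuation such as '-' and '_' significant."""
    return label.strip().lower()


LOOSE = {loose_key(name): name for name in CANONICAL}
EXACT = {exact_key(name): name for name in CANONICAL}


def match_loose(label: str):
    """Return the canonical name under loose matching, or None."""
    return LOOSE.get(loose_key(label))


def match_exact(label: str):
    """Return the canonical name under exact matching, or None."""
    return EXACT.get(exact_key(label))
```

With this sketch, match_loose("EUC_JP") resolves to "EUC-JP", whereas match_exact("EUC_JP") returns None, which is the behavior the sites in question apparently depend on. A "delta" on the loose algorithm would sit somewhere between the two functions.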
Received on Thursday, 13 August 2009 16:59:22 UTC