[whatwg] Spec comments, sections 1-2

On Wed, 5 Aug 2009, Anne van Kesteren wrote:
> On Wed, 05 Aug 2009 02:01:59 +0200, Ian Hickson <ian at hixie.ch> wrote:
> > I'm pretty sure that character encoding support in browsers is more of 
> > a "collect them all" kind of thing than really based on content that 
> > requires it, to be honest.
> 
> Really? I think a lot of them are actually used.

I'm pretty sure not all of them are common.


> If you know anything, I'd love to trim the number of encodings the Web 
> needs to a smaller list than what we currently ship with. Ideally this 
> becomes a fixed list across all Web languages.

That would be nice.


> > If someone can provide a firm list of encodings that they are 
> > confident are required for a certain substantial percentage of the 
> > Web, I'm happy to add the list to the spec.
> 
> Can you not do a survey on your large dataset to find this out? 
> I read somewhere also that Adam Barth was able to add code to Google 
> Chrome to figure out a better algorithm for Content-Type sniffing. Maybe 
> something similar could be done here?

For various reasons, my usual techniques for obtaining data aren't 
suitable for encoding-related work. Could MAMA or Opera be instrumented 
instead?


> We've encountered problems, by the way, with using the Unicode encoding 
> matching algorithm, particularly on some Asian sites. I think we need to 
> switch HTML5 back to something more akin to WebKit/Gecko/Trident. I 
> realize this means more magic lists, but the current algorithm does not 
> seem to cut it. E.g. sites rely on the fact that EUC_JP is not a 
> recognized encoding but EUC-JP is.

If you let me know what the algorithm should be, I can do that. Is it just 
underscores that must not be ignored? Maybe we can just do a delta spec on 
the Unicode algorithm? (i.e. say "do what Unicode says except...").
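
To make the difference concrete, here is a rough sketch of the two kinds of 
matching being discussed. The loose key follows my reading of the charset 
alias matching rule in UTS #22 (drop everything except ASCII letters and 
digits, lowercase, then drop zeros not preceded by a digit). The strict key 
and the tiny KNOWN_LABELS table are hypothetical, purely for illustration; 
they are not the actual lists or rules used by WebKit, Gecko, or Trident.

```python
import re

# Hypothetical, deliberately tiny label table for illustration only.
KNOWN_LABELS = {"euc-jp", "utf-8"}


def uts22_key(label: str) -> str:
    """Loose key, per my reading of UTS #22 charset alias matching."""
    # Keep only ASCII letters and digits, then lowercase.
    s = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Delete zeros that are not preceded by a digit (i.e. leading zeros
    # of each number), so "utf008" becomes "utf8" but "utf80" is kept.
    return re.sub(r"(?<![0-9])0+", "", s)


def strict_key(label: str) -> str:
    """Assumed strict rule: ASCII case-insensitive, punctuation kept."""
    return label.strip().lower()


def is_recognized_loose(label: str) -> bool:
    return uts22_key(label) in {uts22_key(k) for k in KNOWN_LABELS}


def is_recognized_strict(label: str) -> bool:
    return strict_key(label) in KNOWN_LABELS


if __name__ == "__main__":
    for label in ("EUC-JP", "EUC_JP"):
        print(label, is_recognized_loose(label), is_recognized_strict(label))
```

Under the loose rule both EUC-JP and EUC_JP are recognized, because the 
hyphen and the underscore are both discarded. Under the strict rule only 
EUC-JP is, which is the behaviour the sites mentioned above appear to depend 
on. A "delta spec" would amount to changing which characters the first 
substitution is allowed to drop.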

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 13 August 2009 16:59:22 UTC