[whatwg] Entity parsing

On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
> > 
> > I don't really know how to do more research -- it's quite hard to 
> > programmatically tell when an entity should be expanded and when it 
> > shouldn't.
> 
> True, but this is not completely insurmountable -- or, rather: useful 
> information can be extracted without necessarily making these decisions 
> explicitly.
> 
> I do not know what you have done already, but something like the following
> for each entity &ref; would be useful for the discussion:
>     - total number of "&ref";
>     - number of "&ref;";
>     - number of "&ref" followed by /[a-zA-Z0-9]/;
>     - the N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9&]+/.
> 
> Without any real data, arguing, e.g., that conforming HTML 4.01 
> documents that are currently handled correctly by Firefox and Safari 
> must be handled differently in the future for the sake of backwards 
> compatibility is not really persuasive.

Sadly none of the arguments in any direction right now are particularly 
persuasive.

I'm not really convinced that the data the proposed survey would collect 
would actually help, since it doesn't tell us what the author intended. 
You'd be surprised at how often people use ampersands in text in ways that 
have nothing to do with entities but that could be interpreted as 
entities.
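
As a rough illustration, the four counts proposed in the quoted survey 
could be gathered along the following lines. This is only a sketch: the 
function name survey_entity, its parameters, and the sample string are 
hypothetical, and the regular expressions are taken directly from the 
quoted list. Such counts still say nothing about what the author meant.

```python
import re
from collections import Counter


def survey_entity(text, ref, top_n=5):
    """Collect the four proposed statistics for one entity name.

    `ref` is the bare name (e.g. "eacute"); all names here are
    hypothetical and exist only for illustration.
    """
    amp_ref = "&" + re.escape(ref)

    # Total number of "&ref", with or without a trailing semicolon.
    total = len(re.findall(amp_ref, text))

    # Number of properly terminated "&ref;".
    with_semicolon = len(re.findall(amp_ref + ";", text))

    # Number of "&ref" immediately followed by an alphanumeric character.
    followed_by_alnum = len(re.findall(amp_ref + r"(?=[a-zA-Z0-9])", text))

    # The N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9&]+/.
    contexts = Counter(
        re.findall(r"[a-zA-Z0-9]*" + amp_ref + r"[a-zA-Z0-9&]+", text)
    )

    return {
        "total": total,
        "with_semicolon": with_semicolon,
        "followed_by_alnum": followed_by_alnum,
        "top_contexts": contexts.most_common(top_n),
    }


if __name__ == "__main__":
    sample = "R&eacutesum&eacute and caf&eacute; x&eacute"
    print(survey_entity(sample, "eacute"))
```

Run over a crawl, this would yield one such record per entity name.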


> The implication seems to be that R&eacutesum&eacute can be found on the Web
> and therefore should be supported. But Google also tells us something else:
> 
>     (1) "r&eacutesum??": 572
>     (2) +r??sum??: 114,000,000
>     (3) résum&eacute -"résumés": 16,300
>     (4) +"rÃ©sumÃ©": 1,000
> 
> Actually, (1) does not only cover r&eacutesum&eacute, but also code like 
> r&eacutesumé, so the number of occurrences that can be saved by 
> parser quirks is lower than 572.

The number of occurrences of "r&eacutesumé" is at least two (the two hits 
I looked at both worked in IE and did not in Firefox).
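
To make the difference concrete, here is a minimal sketch of two possible 
expansion policies for semicolon-less references. It is not a description 
of any browser's actual parser: the function names expand_strict and 
expand_lenient and the two-entry entity table are assumptions made purely 
for illustration. The strict variant treats the whole alphanumeric run 
after "&" as the entity name; the lenient variant also accepts a known 
name as a prefix of that run.

```python
import re

# Tiny illustrative table; a real parser would use the full entity list.
ENTITIES = {"eacute": "\u00e9", "amp": "&"}


def expand_strict(text):
    """Expand only when the whole alphanumeric run is a known name."""
    def repl(match):
        name = match.group(1)
        # Unknown names are left exactly as written.
        return ENTITIES.get(name, match.group(0))

    return re.sub(r"&([a-zA-Z0-9]+);?", repl, text)


def expand_lenient(text):
    """Also expand when a known name is a prefix of the run."""
    def repl(match):
        run, semicolon = match.group(1), match.group(2)
        if run in ENTITIES:
            return ENTITIES[run]
        # Try progressively shorter prefixes, longest first.
        for end in range(len(run) - 1, 0, -1):
            if run[:end] in ENTITIES:
                return ENTITIES[run[:end]] + run[end:] + semicolon
        return match.group(0)

    return re.sub(r"&([a-zA-Z0-9]+)(;?)", repl, text)


if __name__ == "__main__":
    print(expand_strict("r&eacutesum&eacute"))   # r&eacutesumé
    print(expand_lenient("r&eacutesum&eacute"))  # résumé
    print(expand_lenient("AT&T"))                # AT&T
```

With this toy table, both policies agree on "caf&eacute;" and on text 
such as "AT&T"; they differ only where a known name runs straight into 
further letters or digits.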


Am I correct in assuming that you would like the spec changed? What would 
you like the spec changed to, exactly?

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 27 June 2007 16:43:39 UTC