- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Thu, 28 Jun 2007 01:24:07 +0200
On 26 Jun 2007, at 4:35AM, Ian Hickson wrote:

> The informal research I did when updating the spec suggests that the
> current state of the spec is what is better.

(It is difficult to say anything sensible without knowing either the nature of the research undertaken or the options under consideration.)

> I don't really know how to do more research
> -- it's quite hard to programatically tell when an entity
> should be expanded and when it shouldn't.

True, but this is not completely insurmountable - or, rather: useful information can be extracted without necessarily making these decisions explicitly. I do not know what you have done already, but something like the following for each entity &ref; would be useful for the discussion:

* total number of "&ref";
* number of "&ref;";
* number of "&ref" followed by /[a-zA-Z0-9]/;
* the N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9&]+/.

Without any real data, arguing, e.g., that conforming HTML 4.01 documents that are currently handled correctly by Firefox and Safari must be handled differently in the future for the sake of backwards compatibility is not really persuasive.

The only argument for following IE that I have been able to find in the archives is the following in a post from Simon Pieters on 14th Aug 2006 in the thread "Parsing Entities":

> I guess that for compat with IE and the Web[1] we have to treat
> "R&eacutesum&eacute;" as if it were "R&eacute;sum&eacute;". [...]
> [1] http://www.google.com/search?q=R%26eacutesum%C3%A9

The implication seems to be that R&eacutesum&eacute; can be found on the Web and therefore should be supported. But Google also tells us something else:

(1) "r&eacutesumé": 572
(2) +résumé: 114,000,000
(3) r&eacute;sum&eacute; -"r&eacute;sum&eacute;s": 16,300
(4) +"rÃ©sumÃ©": 1,000

Actually, (1) does not only cover r&eacutesum&eacute;, but also code like r&amp;eacutesumé, so the number of occurrences that can be saved by parser quirks is lower than 572. As could be expected, (1) is quite rare compared to (2), all the correctly encoded variants.
Whether 0.0005% (572 out of 114,000,000) should be regarded as significant (supposing that résumé is representative) may be a contentious issue, but it is interesting to note that other errors - unwanted conversion of & to &amp; in (3) and a typical encoding problem in (4) - are actually significantly more common, and these cannot be corrected at all.

-- 
Øistein E. Andersen
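
A minimal sketch of how the four per-entity counts suggested above might be gathered, assuming the raw page source is available as a string. The function name `entity_stats`, its parameters and the sample text are hypothetical and serve only to illustrate the measurements; note that a plain prefix match on "&ref" also catches longer entity names that happen to begin with the same letters.

```python
import re
from collections import Counter


def entity_stats(text, ref, top_n=5):
    """Count uses of one named entity (e.g. ref="eacute") in raw HTML source.

    Hypothetical helper: it only tallies textual patterns and makes no
    attempt to decide whether a given reference "should" be expanded.
    """
    name = re.escape(ref)
    # Total number of "&ref", with or without a terminating semicolon.
    total = len(re.findall("&" + name, text))
    # Number of properly terminated "&ref;".
    terminated = len(re.findall("&" + name + ";", text))
    # Number of "&ref" immediately followed by a letter or digit.
    run_on = len(re.findall("&" + name + "(?=[a-zA-Z0-9])", text))
    # The N most frequent matches of /[a-zA-Z0-9]*&ref[a-zA-Z0-9&]+/.
    contexts = Counter(
        re.findall("[a-zA-Z0-9]*&" + name + "[a-zA-Z0-9&]+", text)
    )
    return {
        "total": total,
        "terminated": terminated,
        "run_on": run_on,
        "top_contexts": contexts.most_common(top_n),
    }


if __name__ == "__main__":
    # Made-up sample, not real crawl data.
    sample = "R&eacutesum&eacute; and r&eacute;sum&eacute; caf&eacute"
    print(entity_stats(sample, "eacute"))
```

Run over a large crawl, the ratio of `run_on` to `total`, together with the most frequent contexts, would indicate how often an unterminated reference is followed by further letters, without having to decide case by case what the author intended.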
Received on Wednesday, 27 June 2007 16:24:07 UTC