Parsing: Trailing garbage in doctype FPI (was: Re: Doctype usage data)

On Thu, 28 Feb 2008 02:58:53 +0100, Philip Taylor <pjt47@cam.ac.uk> wrote:

>
> I've got some data about doctypes at  
> http://philip.html5.org/data/doctypes.html (125K pages from dmoz.org)  
> and http://philip.html5.org/data/doctypes-alexa.html (about 400 from  
> Alexa's list). I'm not entirely sure what this could be useful for, but  
> I'll point out a couple of things here.

This is very useful information for Opera. We can determine what would
break when implementing HTML5 doctype switching. Thank you for this data.


> 0.1% replaced the "...//EN" with their own language code, e.g.  
> http://www.edelweiss-reizen.nl has <!DOCTYPE html PUBLIC "-//W3C//DTD  
> XHTML 1.0 Strict//NL"  
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

The interesting cases are doctypes that would be quirky if they ended in
//EN. With a different ending they are still quirky in IE and Opera, but
not in Firefox or Safari.


http://www.nic.funet.fi/~magi/metsola/
http://www.pinocchioarredi.it/
http://www.ultimahora.es/
http://www.cinarstvi.cz/
http://5w40.de/
http://www.campingplatz-reinhardshagen.de/
http://www.deutsche-fachwerkstrasse.de/
http://www.grasdorf.de/
http://www.protz-werder.de/
http://www.schacher-immobilien.de/
http://www.cameratasantcugat.com/
http://www.hagiva.co.il/
http://www.vargagabor.hu/
http://www.ilserbatoio.it/
http://www.cab.it/
http://kgit.amu.edu.pl/
http://www.osp.os.pl/
http://stdk.narod.ru/
http://www.palkin.ru/
http://www.slm-nsk.ru/
http://www.sunwaytours.ru/
http://www.losnuevostangos.com.ar/
http://www.elesis.com.tr/
http://usinfo.state.gov/esp/home/topics/us_society_values/geografia.html
http://www.minotel.com/home.asp?xlanguage=DE
http://www.vs-aigen.salzburg.at/ (ends in "//EN conova")
http://www.eng-joheco.com/
http://www.caissepoplevis.com/
http://www.judo-store.com/
http://aziende.lab4.net/
http://powermetal.altervista.org/
http://www.architettopalladini.it/
http://deamicis-spa.com/
http://www.balparaplan.webm.ru/
http://www.taxi-office.ru/ (ends in "//RUS")
http://www.serdardenktas.com/

The pages above render better in quirks mode than in standards mode in
Opera and Firefox (though I didn't test all of them in Firefox).


http://www.quintomiglio.com/

The page above renders better in standards mode than in quirks mode in  
Opera and Firefox.


http://www.gedankenblicke.net/

This one renders better in standards mode than almost standards mode in  
Opera, but the same in Firefox, so it's probably a bug in Opera's almost  
standards mode.


The rest of the roughly 60 pages I looked at looked OK in both quirks mode
and standards mode. This means that Opera would break about 0.05% of the
pages in this sample if we implemented HTML5 doctype switching, assuming
that the remaining pages I didn't look at behave the same.


I think this is pretty convincing evidence that HTML5 needs to ignore
whatever is in place of the "EN" at the end of the FPIs. That is, instead
of matching that the FPI is e.g. -//W3C//DTD HTML 3.2//EN, check that it
starts with -//W3C//DTD HTML 3.2//.
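
As a rough sketch of the difference between the two checks (Python, purely
for illustration): the FPI list is a tiny made-up subset rather than the
real quirks list, the function names are hypothetical, and the
case-insensitive comparison is an assumption.

```python
# Illustrative subset only; the real list of quirks-mode FPIs is much longer.
QUIRKY_FPIS = [
    "-//W3C//DTD HTML 3.2//EN",
    "-//IETF//DTD HTML 2.0//EN",
]


def strip_language(fpi):
    """Drop whatever follows the last "//", keeping the "//" itself."""
    head, sep, _tail = fpi.rpartition("//")
    return head + sep


QUIRKY_PREFIXES = [strip_language(f) for f in QUIRKY_FPIS]


def is_quirky_exact(public_id):
    """Exact match: the whole FPI, including the trailing "EN", must match."""
    pid = public_id.upper()  # case-insensitive comparison is an assumption
    return any(pid == f.upper() for f in QUIRKY_FPIS)


def is_quirky_prefix(public_id):
    """Prefix match: ignore whatever stands in place of the trailing "EN"."""
    pid = public_id.upper()
    return any(pid.startswith(p.upper()) for p in QUIRKY_PREFIXES)
```

With these definitions, an FPI such as -//W3C//DTD HTML 3.2//NL is missed
by the exact check but caught by the prefix check, and so is one with
trailing text after the language code.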

For the FPIs that end in //EN//2.0 and the like, I'd suggest just dropping
them from the list, since there are equivalent FPIs that end in //EN and
the //2.0 would be treated as trailing garbage.
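
To illustrate why those entries become redundant, here is a small Python
sketch. The three FPIs are only meant as examples of the pattern, and the
helper names are hypothetical.

```python
# Example entries only, chosen to show the //EN vs. //EN//2.0 pattern.
FPIS = [
    "-//IETF//DTD HTML//EN",
    "-//IETF//DTD HTML//EN//2.0",
    "-//W3C//DTD HTML 3.2//EN",
]


def prefixes(fpis):
    """Prefixes obtained by cutting the "EN" off every entry ending in //EN."""
    return {f[: -len("EN")] for f in fpis if f.endswith("//EN")}


def redundant(fpis):
    """Entries already covered by the prefix of a sibling //EN entry."""
    pre = prefixes(fpis)
    return [
        f
        for f in fpis
        if not f.endswith("//EN") and any(f.startswith(p) for p in pre)
    ]
```

Here redundant(FPIS) reports only the //EN//2.0 entry, since it already
starts with the prefix derived from its //EN sibling.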

-- 
Simon Pieters
Opera Software

Received on Monday, 3 March 2008 08:48:37 UTC