Doctype usage data

I've got some data about doctypes at 
http://philip.html5.org/data/doctypes.html (125K pages from dmoz.org) 
and http://philip.html5.org/data/doctypes-alexa.html (about 400 from 
Alexa's list). I'm not entirely sure what this could be useful for, but 
I'll point out a couple of things here.


Summary of some of the dmoz.org data:
48% of the pages have no doctype at all.
24% have a doctype that triggers quirks mode in HTML5. ("HTML5" could 
equivalently be "Firefox", since the two have almost identical mode selection.)
23% are almost-standards (limited-quirks) mode.
5% are standards mode.
Counted separately from mode: 4% are Strict, 18% are XHTML 1.0, and 24% 
are HTML4. Only 0.2% use single quotes.
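
To make the mode buckets above concrete, here is a minimal Python sketch of 
doctype-based mode selection. It is an illustration only, not the script 
behind the data: the name classify_doctype, the regex, and the three prefix 
tables are assumptions, and the tables hold just a small subset of the 
public identifiers that the HTML5 parsing rules list. Anything the regex 
cannot parse is lumped into quirks mode, which is cruder than a real 
tokenizer.

```python
import re

# Assumed, deliberately incomplete tables (lowercased public-identifier prefixes).
QUIRKS_PREFIXES = (
    "-//w3c//dtd html 3.2//",
    "-//w3c//dtd html 3.2 final//",
    "-//w3c//dtd html 4.0 transitional//",
    "-//w3c//dtd html 4.0 frameset//",
)
LIMITED_QUIRKS_PREFIXES = (
    "-//w3c//dtd xhtml 1.0 transitional//",
    "-//w3c//dtd xhtml 1.0 frameset//",
)
# Quirks when the system identifier is missing, limited-quirks otherwise.
SYSTEM_ID_DEPENDENT_PREFIXES = (
    "-//w3c//dtd html 4.01 transitional//",
    "-//w3c//dtd html 4.01 frameset//",
)

DOCTYPE_RE = re.compile(
    r"""<!doctype\s+(?P<name>[^\s>]+)
        (?:\s+public\s*(?P<q1>["'])(?P<public>.*?)(?P=q1)
             (?:\s*(?P<q2>["'])(?P<system>.*?)(?P=q2))?
          |\s+system\s*(?P<q3>["']).*?(?P=q3)
        )?
        \s*>""",
    re.IGNORECASE | re.VERBOSE | re.DOTALL,
)


def classify_doctype(markup):
    """Return 'quirks', 'limited-quirks' or 'standards' for a page's markup."""
    match = DOCTYPE_RE.search(markup)
    # No doctype, an unparseable one, or a name other than "html": quirks.
    if match is None or match.group("name").lower() != "html":
        return "quirks"
    public = (match.group("public") or "").lower()
    has_system = match.group("system") is not None
    if public.startswith(QUIRKS_PREFIXES):
        return "quirks"
    if public.startswith(SYSTEM_ID_DEPENDENT_PREFIXES):
        return "limited-quirks" if has_system else "quirks"
    if public.startswith(LIMITED_QUIRKS_PREFIXES):
        return "limited-quirks"
    return "standards"


print(classify_doctype("<html><body>no doctype</body></html>"))
print(classify_doctype("<!doctype html>"))
```

Under these assumptions a page with no doctype lands in quirks mode and 
"<!doctype html>" lands in standards mode. The sketch does not try to 
reproduce IE7's decisions.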


The data includes a comparison of the standards/quirks mode decisions 
that IE7 and HTML5 would make. The two mostly agree; the main 
difference is the ~1% that are treated as standards mode in IE7 and as 
quirks mode in HTML5, and half of those use <!doctype html public 
"-//w3c//dtd html 4.0 transitional//en" 
"http://www.w3.org/tr/rec-html40/loose.dtd">. It would be interesting to 
see if those pages would work better if HTML5 treated them as standards 
mode instead (i.e. being more compatible with IE, less with Firefox).


http://www.thermaglaze.com/ has <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 
1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [ 
<!ATTLIST a target CDATA #IMPLIED> ]> -- some people really want to use 
<a target="_blank">, and will do anything to make it work while still 
having the validator claim their page is okay.

0.1% replaced the "...//EN" with their own language code, e.g. 
http://www.edelweiss-reizen.nl has <!DOCTYPE html PUBLIC "-//W3C//DTD 
XHTML 1.0 Strict//NL" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Lots of people end up with incorrect doctypes due to typos (e.g. about 
0.05% wrote "-//WC3/..."), escaping quotes with backslashes, globally 
search-and-replacing 'html' with 'php', and various other issues. It's 
nice that "<!doctype html>" is easy to write, since people clearly 
aren't great at copying boilerplate code.
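
The kinds of mistakes listed above can be spotted mechanically. The 
following is a minimal Python sketch, not the tooling used to produce the 
data: the function name doctype_problems, the problem labels, and the 
specific checks are all assumptions chosen to mirror the examples in this 
message (the "WC3" typo, backslash-escaped quotes, 'html' replaced with 
'php', and a language code other than EN).

```python
import re


def doctype_problems(doctype):
    """Return a list of labels for common copy-and-paste mistakes."""
    problems = []
    lowered = doctype.lower()

    # "W3C" mistyped as "WC3" at the start of the public identifier.
    if "-//wc3/" in lowered:
        problems.append("wc3-typo")

    # Quotes escaped with backslashes, e.g. pasted from a string literal.
    if '\\"' in doctype or "\\'" in doctype:
        problems.append("escaped-quotes")

    # A global search-and-replace of 'html' with 'php'.
    if "php" in lowered:
        problems.append("php-replacement")

    # Public identifier ending in a language code other than EN.
    lang = re.search(r"""//dtd [^"']*//([a-z]+)["']""", lowered)
    if lang and lang.group(1) != "en":
        problems.append("language-code")

    return problems


print(doctype_problems(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//NL" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
))
```

These checks are heuristics: they will miss other typos, and the 'php' 
check would also fire on a system identifier whose URL happens to contain 
"php".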

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Thursday, 28 February 2008 01:59:01 UTC