Re: Identifying (X)HTML without MIME

On Tue, 9 Nov 2004 08:05:01 +1100, Trejkaz Xaoza <trejkaz@xaoza.net> wrote:

> If it doesn't start with "<?xml" but has a DOCTYPE near the top, then  
> it's SGML, and you perform similar rules based on what you see after it.

As far as I can see, an XHTML document can start with any of these:

   1. <?xml ... ?>
   2. <!DOCTYPE ...>
   3. <html xmlns="http://...">

Not all of these are valid ways to start an XHTML document, though some
are still fine as plain XML documents. The XML declaration is optional, so
a valid XHTML document may start with just a DOCTYPE. So may an HTML
document, which means you actually have to parse the DOCTYPE to know which
type of (X)HTML document it is.
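
To illustrate that last point, here is a minimal Python sketch of reading the
public identifier out of a DOCTYPE. The names DOCTYPE_RE and doctype_flavour
are made up for this example, and matching on the substrings "DTD XHTML" and
"DTD HTML" is my own simplification rather than anything a specification
requires:

```python
import re

# Matches a DOCTYPE declaration and, if present, its PUBLIC identifier, e.g.
# <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "...">
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>[A-Za-z][\w.-]*)'
    r'(?:\s+PUBLIC\s+(?P<q>["\'])(?P<public_id>.*?)(?P=q))?',
    re.IGNORECASE | re.DOTALL,
)


def doctype_flavour(text):
    """Return 'xhtml', 'html', or None, judging only by the public identifier."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None  # no DOCTYPE at all
    public_id = match.group("public_id") or ""
    # Assumption: W3C identifiers contain "DTD XHTML ..." or "DTD HTML ...".
    if "DTD XHTML" in public_id:
        return "xhtml"
    if "DTD HTML" in public_id:
        return "html"
    return None  # DOCTYPE present, but not one this sketch recognises
```

A DOCTYPE without a public identifier, or with one this sketch doesn't know,
simply yields None, so the caller has to decide what to do next.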

What I'd do is the following:

   1.   Trigger XML parsing mode if:
   1.1. The document starts with <?xml ... ?>
   1.2. The document element is <html> with an attribute called 'xmlns'
        whose value is 'http://www.w3.org/1999/xhtml'.

   2.   Trigger SGML parsing mode if:
   2.1. The document starts with a DOCTYPE that says it's HTML:

        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" ...>
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" ...>
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" ...>

        You may of course cater for HTML versions other than 4.01; the
        approach is the same, just add those DOCTYPEs to your checker.

   2.2. The document element is <html> with no 'xmlns' attribute.

I could have added a point «1.3. The document starts with a DOCTYPE that
says it's XHTML», but that isn't necessary, as all XHTML documents must
have the <html> element in the XHTML namespace.

I would also do the checks in this order, so that you fall back to SGML if
the XHTML checks fail. Falling back to XML from SGML would give a much
higher failure rate, I think.
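
The checks above, in that order, could be sketched roughly like this in
Python. This is only an illustration: guess_parsing_mode and the regular
expressions are hypothetical names of my own, a regex is not a real parser
(a comment containing a tag before <html> would fool it), and the DOCTYPE
pattern only knows W3C "DTD HTML" identifiers:

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"

# First start tag; declarations (<!...) and processing instructions (<?...)
# are skipped because the name must begin with a letter.
START_TAG_RE = re.compile(r"<([A-Za-z][\w:.-]*)([^>]*)>", re.DOTALL)
# A plain xmlns="..." attribute (not xmlns:prefix="...").
XMLNS_RE = re.compile(r"""\sxmlns\s*=\s*(["'])(.*?)\1""", re.DOTALL)
# Assumption: only W3C HTML DOCTYPEs such as "-//W3C//DTD HTML 4.01//EN".
HTML_DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+html\s+PUBLIC\s+["\']-//W3C//DTD HTML ', re.IGNORECASE
)


def guess_parsing_mode(text):
    """Return 'xml', 'sgml', or None, applying the checks in list order."""
    head = text.lstrip("\ufeff \t\r\n")  # ignore a BOM and leading whitespace

    # 1.1: the document starts with an XML declaration.
    if head.startswith("<?xml"):
        return "xml"

    root = START_TAG_RE.search(head)
    is_html_root = root is not None and root.group(1).lower() == "html"

    # 1.2: <html> carries the XHTML namespace.
    if is_html_root:
        xmlns = XMLNS_RE.search(root.group(2))
        if xmlns and xmlns.group(2) == XHTML_NS:
            return "xml"

    # 2.1: the document starts with an HTML DOCTYPE.
    if HTML_DOCTYPE_RE.match(head):
        return "sgml"

    # 2.2: <html> without the XHTML namespace. A wrong xmlns value also
    # lands here, which is the "fall back to SGML" behaviour.
    if is_html_root:
        return "sgml"

    return None  # nothing recognisable; the caller decides
```

Because the XML checks run first, an XHTML document with an XHTML DOCTYPE
never reaches step 2, which is why no separate point 1.3 is needed in the
sketch either.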

-- 
Asbjørn Ulsberg         -=|=-        asbjornu@hotmail.com
«He's a loathsome offensive brute, yet I can't look away»

Received on Monday, 8 November 2004 21:41:45 UTC