- From: Asbjørn Ulsberg <asbjorn@tigerstaden.no>
- Date: Mon, 08 Nov 2004 22:42:58 +0100
- To: trejkaz@xaoza.net, "James Cerra" <jfcst24_public@yahoo.com>
- Cc: www-html@w3.org
On Tue, 9 Nov 2004 08:05:01 +1100, Trejkaz Xaoza <trejkaz@xaoza.net> wrote:
> If it doesn't start with "<?xml" but has a DOCTYPE near the top, then
> it's SGML, and you perform similar rules based on what you see after it.
As far as I can see, an XHTML document can start in any of these ways:
1. <?xml ...?>
2. <!DOCTYPE ...>
3. <html xmlns="http://...">
Not all of these are valid prologs for an XHTML document, though some are
valid for an XML document. The XML declaration is optional in any case, so
a valid XHTML document may start with just a DOCTYPE. So may an HTML
document, which means you actually have to parse the DOCTYPE to know which
type of (X)HTML document you are dealing with.
What I'd do is the following:
1. Trigger XML parsing mode if:
   1.1. The document starts with <?xml ...?>
   1.2. The document element is <html> with an attribute called 'xmlns'
        whose value is 'http://www.w3.org/1999/xhtml'.
2. Trigger SGML parsing mode if:
   2.1. The document starts with a DOCTYPE that says it's HTML:
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" ...>
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" ...>
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" ...>
        You may of course cater for more HTML versions than 4.01, but that
        would work just the same; add the DOCTYPEs to your checker.
   2.2. The document element is <html> with no 'xmlns' attribute.
I could have added the point «1.3. The document starts with a DOCTYPE that
says it's XHTML», but that isn't necessary, as all XHTML documents must
have the <html> element in the XHTML namespace.
I would also do the checks in this order, so that you fall back to SGML if
the XHTML checks fail. Falling back to XML from SGML would give a much
higher failure rate, I think.
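
The checks above, in that order, could be sketched roughly like this in
Python. This is only an illustration under some assumptions: the name
guess_parsing_mode, the regular expressions and the return values "xml",
"sgml" and None are all made up for the example, the points under each
numbered item are treated as alternatives (any one is enough), and a real
checker would have to cope with comments, unusual whitespace and the like.

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Public identifiers treated as HTML (point 2.1); add more HTML versions here.
HTML_PUBLIC_IDS = {
    "-//W3C//DTD HTML 4.01 Frameset//EN",
    "-//W3C//DTD HTML 4.01 Transitional//EN",
    "-//W3C//DTD HTML 4.01//EN",
}

# Hypothetical, deliberately naive patterns for sniffing the document start.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)
HTML_TAG_RE = re.compile(r"<html\b([^>]*)>", re.IGNORECASE)
XMLNS_RE = re.compile(r"""\bxmlns\s*=\s*(["'])(.*?)\1""")


def guess_parsing_mode(text):
    """Return "xml", "sgml", or None if none of the checks match."""
    # Ignore a byte order mark and leading whitespace.
    head = text.lstrip("\ufeff \t\r\n")

    # Locate the <html> start tag and its 'xmlns' attribute, if any.
    html_tag = HTML_TAG_RE.search(head)
    attrs = html_tag.group(1) if html_tag else ""
    xmlns = XMLNS_RE.search(attrs)

    # 1.1: the document starts with an XML declaration.
    if head.startswith("<?xml"):
        return "xml"
    # 1.2: <html> carries the XHTML namespace.
    if xmlns and xmlns.group(2) == XHTML_NS:
        return "xml"
    # 2.1: the document starts with a known HTML DOCTYPE.
    doctype = DOCTYPE_RE.match(head)
    if doctype and doctype.group(1) in HTML_PUBLIC_IDS:
        return "sgml"
    # 2.2: <html> with no 'xmlns' attribute at all.
    if html_tag and not xmlns:
        return "sgml"
    return None
```

Because the XML checks run first, a document with an XHTML DOCTYPE but no
'xmlns' on <html> ends up in SGML mode through 2.2, which is the fallback
described above.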
--
Asbjørn Ulsberg -=|=- asbjornu@hotmail.com
«He's a loathsome offensive brute, yet I can't look away»
Received on Monday, 8 November 2004 21:41:45 UTC