- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sun, 05 Sep 2004 21:09:38 +0200
- To: public-qa-dev@w3.org
Hi,

to "fix" e.g. <http://www.w3.org/Bugs/Public/show_bug.cgi?id=14> we need to revise how `check` determines how to process text/html resources. The question is how exactly it should do that, which includes whether #14 is actually a bug that should be fixed.

The HTML Working Group has been asked a number of times how to "sniff" XHTML documents and refrained from comment. For browsers they made it clear that those should not sniff for XHTML but rather ignore both the XHTML and HTML specifications and process text/html as tag soup. Well, Steven actually said "documents served as text/html should be treated as HTML and not as XHTML", but that would break most documents or cause undefined behavior due to shorttags and the like. So what he meant was tag soup.

Since we cannot do that, and they are unlikely to provide input on this matter, we need to come up with a proper algorithm on our own. So what should that look like? Using SGML::Parser::OpenSP we can do something like

    package Handler;
    use strict;
    use warnings;

    sub new { bless {}, shift }

    sub start_dtd {
      my $self = shift;
      my $doct = shift;

      # ignore specified document type declarations without
      # public or system identifier and implied document type
      # declarations (which have just a GeneratedSystemId key)
      return unless exists $doct->{ExternalId}{PublicId} or
                    exists $doct->{ExternalId}{SystemId};

      my $puid = $doct->{ExternalId}{PublicId};

      # no public identifier means HTML
      die "HTML" unless defined $puid;

      # split public identifier at //
      my @comp = split(/\/\//, $puid);

      # malformed public identifiers mean HTML
      die "HTML" unless @comp > 2;

      # we might want something different than \s and \S here
      # but it is not clear to me what exactly we should expect
      die "HTML" unless $comp[2] =~ /^DTD\s+(\S+)/;

      # the first token of the public text description must include
      # the string "XHTML", see XHTML M12N section 3.1, and see also
      # http://w3.org/mid/41584c61.156809450@smtp.bjoern.hoehrmann.de
      die "HTML" unless $1 =~ /XHTML/;

      # otherwise consider this document XHTML
      die "XHTML"
    }

    sub start_element {
      my $self = shift;
      my $elem = shift;

      # no xmlns attribute means HTML
      die "HTML" unless exists $elem->{Attributes}{XMLNS};

      my $xmlns = $elem->{Attributes}{XMLNS};

      # this should use the corresponding helper function to deal
      # with some potential edge cases but it is not in CVS yet
      die "HTML" unless $xmlns->{Defaulted} eq "specified";

      # see above
      die "HTML" unless "http://www.w3.org/1999/xhtml" eq
        join '', map { $_->{Data} } @{$xmlns->{CdataChunks}};

      die "XHTML"
    }

Instead of dying, the real code would call egp->halt() and return HTML/XHTML through other means. This assumes that our sgml.soc is passed as catalog. If we remove the "DOCTYPE html ..." entry from sgml.soc (we can and should do that if we implement doctype defaulting through doctype rewriting, which we can and should do), this will not read any document type definition and should thus be reasonably fast.

In prose: we will process a document using the HTML 4.01 SGML declaration unless, when processed using the HTML 4.01 document type declaration by default, either

  * the document has a document type declaration with a public identifier that, when split at //, has a third component which matches /^DTD\s+(\S+)/ and for which $1 matches /XHTML/, or

  * the document has no public/system identifier but has an <html> root element with an explicitly *specified* xmlns attribute with a value of "http://www.w3.org/1999/xhtml".

There are many possible variations on this rather simple algorithm that would give "better" results, for example accepting an xmlns attribute value of "http://www.w3c.org/1999/xhtml", but we need to draw a line somewhere. For example, due to default type configurations on some web servers, a document starting with

    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
      "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
    <svg width="300px" height="100px" version="1.1"
         xmlns="http://www.w3.org/2000/svg">
    ...

might be delivered as text/html, and at first sight it might make sense to treat this document as XML. But I do not really think the parse mode detection code is the proper place to suggest fixing the MIME type for the document; higher level content handlers would be a better place. For example, if we determined an HTML parse mode and the root element is not "html", we could stop further validation and just tell the user to fix the document and/or MIME type.

http://lists.w3.org/Archives/Public/www-archive/2004Sep/0007.html has a number of test cases (70) to test whatever algorithm we come up with in a `make test` fashion. I already have some more test cases locally, and I would thus like to maintain them in CVS somewhere.

I would like to know

  * whether there are any good reasons to use a different algorithm to determine the parse mode,
  * whether everyone is okay with using SGML::Parser::OpenSP to do that,
  * where I could maintain the tests in CVS, and
  * where code such as the fragment above should go at this point (CVS repository, module names, etc.)

regards.
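
The public identifier rule from the prose description can also be tried in isolation, without OpenSP. What follows is only a sketch under assumptions: mode_from_public_id is a hypothetical helper name, not part of `check` or SGML::Parser::OpenSP, and it takes the public identifier as a plain string rather than from a start_dtd event. It covers only the first of the two rules (the xmlns check on the root element is left out).

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration only): applies
# the public identifier rule to a plain string and returns the parse
# mode as "HTML" or "XHTML".
sub mode_from_public_id {
    my ($puid) = @_;

    # no public identifier means HTML
    return "HTML" unless defined $puid;

    # split the public identifier at //; fewer than three
    # components means it is malformed, which means HTML
    my @comp = split m{//}, $puid;
    return "HTML" unless @comp > 2;

    # the third component must start with "DTD" followed by a token
    my ($token) = $comp[2] =~ /^DTD\s+(\S+)/
        or return "HTML";

    # that first token must include the string "XHTML"
    return $token =~ /XHTML/ ? "XHTML" : "HTML";
}

# Try the rule on a few sample identifiers
for my $id (
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD HTML 4.01//EN",
    "-//W3C//DTD SVG 1.1//EN",
) {
    printf "%-36s %s\n", $id, mode_from_public_id($id);
}
```

Returning a string instead of calling die keeps the rule easy to exercise from a `make test` style suite; in the handler itself the result would still have to be reported after halting the parser.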
Received on Sunday, 5 September 2004 19:10:21 UTC