
Re: [PATCH] DOCTYPE Override

From: <pdf@bizfon.com>
Date: Tue, 10 Oct 2000 10:39:51 -0400
To: www-validator@w3.org
Message-ID: <85256974.00508E05.00@Enterprise>


Hehe... sorry Terje!  I'll try to keep my interesting problems to a minimum.  :)
Thanks!
-Pete





Terje Bless <link@tss.no> on 10/10/2000 12:21:50 AM

To:   W3C Validator <www-validator@w3.org>
cc:    (bcc: Peter Foti)

Subject:  [PATCH] DOCTYPE Override



Well, since Peter just couldn't leave well enough alone but just /had/ to get
me started on this; here is a patch that adds a DOCTYPE override function
to the Validator. :-)


A couple of points:

* This is seriously nasty code! Testing is essential before it should
  be used in production. It's also a quick hack and it will have to be
  completely rewritten at some point. You have been warned.
* It requires HTML::Parser. The corollary being that I dump all[0] the
  nitty details on HTML::Parser so if HTML::Parser chokes on something,
  we blow up too.
* In addition to the patch, I've attached a necessary configuration
  file for mapping version names to DOCTYPE declarations.


The flip side is that HTML::Parser looks to work great (for HTML!) so we
can probably soon[1] ditch all that skanky code that deals with content
sniffing (charset, META elements, DOCTYPEs, you name it) and use
HTML::Parser callbacks instead. We're fucked if we get fed XML, of course,
but XML support isn't that great in any case. For starters we can disable
the stuff for XML and then look at using XML::Parser to provide equivalent
functionality.
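
As an illustration of what callback-based sniffing could look like, here is a minimal sketch that pulls the charset out of a META element with an HTML::Parser start-tag handler. It is not code from the Validator: the helper name sniff_meta_charset and the regular expression are assumptions made for the example, and it only looks at http-equiv="Content-Type".

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of the Validator): return the charset named
# in <meta http-equiv="Content-Type" content="...; charset=...">, or undef.
sub sniff_meta_charset {
    my ($html) = @_;
    my $charset;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' arrives lower-cased; 'attr' is a hash reference whose
        # keys are the lower-cased attribute names.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'meta';
                return if defined $charset;    # first declaration wins
                my $equiv   = $attr->{'http-equiv'} // '';
                my $content = $attr->{content}      // '';
                if (lc $equiv eq 'content-type'
                    and $content =~ /charset\s*=\s*["']?([^\s"';]+)/i) {
                    $charset = $1;
                }
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;    # flush anything the parser is still buffering
    return $charset;
}

my $doc = <<'HTML';
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Test</title></head><body><p>Hi</p></body></html>
HTML

my $found = sniff_meta_charset($doc);
print defined $found ? "charset: $found\n" : "no charset found\n";
```

The same pattern (one handler per event, state kept in a closure) would presumably extend to DOCTYPE detection via declaration_h.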

XML support needs to be revised as soon as I can find a decent XML parser
that groks Schemas, but that's a whole new ball game.


Anyways, gotta go. It's 6am up here on the North Pole and I'm supposed to
be at work in two hours. I should know better than to start looking at
interesting problems after 8pm... :-(



[0] - And when I say "all" I really mean *all*. :-)
[1] - For certain odd definitions of the word.  :-)

--
By definition there is *no*way* any problem can be my fault. Any problems you
think you can find in my code are all in your imagination. If you continue
with such derranged imaginings then I may be forced to perform corrective
brain surgery... with an axe. - Stephen Harris <sweh@spuddy.mew.co.uk> in asr.



diff -r -u /usr/local/validator/htdocs/index.html /tmp/validator/htdocs/index.html
--- /usr/local/validator/htdocs/index.html    Fri Apr 28 11:05:35 2000
+++ /tmp/validator/htdocs/index.html    Tue Oct 10 05:37:29 2000
@@ -59,7 +59,27 @@
   </p>
 
   <form method="get" action="/check">
-    Address: <input name="uri" size="50" />
+    Address: <input name="uri" size="50" /><br />
+    Doctype:
+    <select name="doctype">
+      <option>Inline</option>
+      <option>XHTML 1.0 Strict</option>
+      <option>XHTML 1.0 Transitional</option>
+      <option>XHTML 1.0 Frameset</option>
+      <option>HTML 4.01 Strict</option>
+      <option>HTML 4.01 Transitional</option>
+      <option>HTML 4.01 Frameset</option>
+      <option>HTML 2.0</option>
+      <option>HTML 3.0 (AdvaSoft version)</option>
+      <option>HTML 3.2</option>
+      <option>HTML 3.2 + Style</option>
+      <option>HTML Pro</option>
+      <option>Spyglass HTML 2.0 Extended</option>
+      <option>HTML Level Cougar</option>
+      <option>HTML 4.0 Strict</option>
+      <option>HTML 4.0 Transitional</option>
+      <option>HTML 4.0 Frameset</option>
+    </select>
     <table cellpadding="0" cellspacing="0">
   <!--
     <tr>
diff -r -u /usr/local/validator/httpd/cgi-bin/check /tmp/validator/httpd/cgi-bin/check
--- /usr/local/validator/httpd/cgi-bin/check    Tue Oct 10 05:35:47 2000
+++ /tmp/validator/httpd/cgi-bin/check    Tue Oct 10 05:40:25 2000
@@ -22,6 +22,7 @@
 use CGI::Carp;
 use CGI qw(:cgi -newstyle_urls -private_tempfiles);
 use Text::Wrap;
+use HTML::Parser;
 
 #############################################################################
 
@@ -38,7 +39,7 @@
 #
 # Define global variables
 use vars qw($VERSION $DATE $MAINTAINER);             # Strings we need.
-use vars qw($frag $pub_ids $element_uri $file_type); # Cfg hashes.
+use vars qw($frag $pub_ids $element_uri $file_type $doctypes); # Cfg hashes.
 
 #
 # Paths and file locations
@@ -49,6 +50,7 @@
 my $fpis_db   = $html_path . 'config/fpis.cfg';
 my $frag_db   = $html_path . 'config/frag.cfg';
 my $type_db   = $html_path . 'config/type.cfg';
+my $dtds_db   = $html_path . 'config/doctypes.cfg';
 my $sgmlstuff = $html_path . 'sgml-lib';
 my $sgmldecl  = $sgmlstuff . '/REC-html40-19980424/HTML4.decl';
 my $xhtmldecl = $sgmlstuff . '/REC-xhtml1-20000126/xhtml1.dcl';
@@ -110,6 +112,7 @@
 $pub_ids     = &read_cfg($fpis_db); # Errors  -> fragment identifier
 $element_uri = &read_cfg($elem_db); # Element -> URI fragment
 $file_type   = &read_cfg($type_db); # Content -> File -type
+$doctypes    = &read_cfg($dtds_db); # Name    -> doctype
 
 #
 # Set up signal handlers.
@@ -251,7 +254,13 @@
 #  4. if there is an xmlns= attribute, check for XML well-formedness
 #  5. if there is no xmlns= attribute, validate as HTML using the doctype
 #     inferred by the check_for_doctype function
+
 #
+# Override DOCTYPE.
+if (defined $q->param('doctype') and not $q->param('doctype') =~ /Inline/i) {
+  $File->{Content} = &supress_doctype($File->{Content});
+  unshift @{$File->{Content}}, $doctypes->{$q->param('doctype')};
+}
 
 #
 # Try to extract or guess the DOCTYPE for HTML and XHTML files.
@@ -1377,4 +1386,18 @@
   $file =~ s(\015)    {\n}g; # Turn ASCII CR   into native newline.
 
   return [split /\n/, $file];
+}
+
+#
+# Supress any existing DOCTYPE by commenting it out.
+sub supress_doctype {
+  no strict 'vars';
+  my $file = shift;
+  local $HTML = '';
+
+  HTML::Parser->new(
+             default_h     => [sub {$HTML .= shift}, 'text'],
+             declaration_h => [sub {$HTML .= '<!-- ' . $_[0] . ' -->'}, 'text']
+            )->parse(join "\n", @{$file});
+  return [split /\n/, $HTML];
 }

HTML 0.0                        <!DOCTYPE html PUBLIC "-//IETF//DTD HTML Level 0//EN//2.0">
Strict HTML 0.0                 <!DOCTYPE html PUBLIC "-//IETF//DTD HTML Strict Level 0//EN//2.0">
HTML 1.0                        <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0 Level 1//EN">
Strict HTML 1.0                 <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0 Strict Level 1//EN">
Strict HTML 2.0                 <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">
HTML 2.0                        <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
HTML 2.1E                       <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.1E//EN">
HTML 3.0 (AdvaSoft version)     <!DOCTYPE html PUBLIC "-//AS//DTD HTML 3.0 asWedit + extensions//EN">
HTML 3.0 (Beta)                 <!DOCTYPE html PUBLIC "-//IETF//DTD HTML 3.0//EN">
Strict HTML 3.0 (Beta)          <!DOCTYPE html PUBLIC "-//W3O//DTD W3 HTML Strict 3.0//EN//">
Hotjava-HTML                    <!DOCTYPE html PUBLIC "-//Sun Microsystems Corp.//DTD HotJava HTML//EN">
Strict Hotjava-HTML             <!DOCTYPE html PUBLIC "-//Sun Microsystems Corp.//DTD HotJava Strict HTML//EN">
Netscape-HTML                   <!DOCTYPE html PUBLIC "-//WebTechs//DTD Mozilla HTML 2.0//EN">
Strict Netscape-HTML            <!DOCTYPE html PUBLIC "-//Netscape Comm. Corp. Strict//DTD HTML//EN">
MSIE-HTML                       <!DOCTYPE html PUBLIC "-//Microsoft//DTD Internet Explorer 2.0 HTML//EN">
Strict MSIE-HTML                <!DOCTYPE html PUBLIC "-//Microsoft//DTD Internet Explorer 2.0 HTML Strict//EN">
MSIE 3.0 HTML                   <!DOCTYPE html PUBLIC "-//Microsoft//DTD Internet Explorer 3.0 HTML//EN">
Strict MSIE 3.0 HTML            <!DOCTYPE html PUBLIC "-//Microsoft//DTD Internet Explorer 3.0 HTML Strict//EN">
ORA HTML Extended v1.0          <!DOCTYPE html PUBLIC "-//OReilly and Associates//DTD HTML Extended 1.0//EN">
ORA HTML Extended Relaxed v1.0  <!DOCTYPE html PUBLIC "-//OReilly and Associates//DTD HTML Extended Relaxed 1.0//EN">
HTML 2.2                        <!DOCTYPE html PUBLIC "-//IETF//DTD HTML V2.2//EN">
HTML 1996-01                    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 1996-01//EN">
HTML 3.2                        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
HTML 3.2 + Style                <!DOCTYPE html PUBLIC "-//W3C//DTD HTML Experimental 970421//EN">
HTML Pro                        <!DOCTYPE html PUBLIC "+//Silmaril//DTD HTML Pro v0r11 19970101//EN">
Spyglass HTML 2.0 Extended      <!DOCTYPE html PUBLIC "-//Spyglass//DTD HTML 2.0 Extended//EN">
HTML Level Cougar               <!DOCTYPE html PUBLIC "http://www.w3.org/MarkUp/Cougar/Cougar.dtd">
HTML 4.0 Strict                 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN">
HTML 4.0 Transitional           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
HTML 4.0 Frameset               <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">
HTML 4.01 Strict                <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
HTML 4.01 Transitional          <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
HTML 4.01 Frameset              <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN">
XHTML 1.0 Strict                <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
XHTML 1.0 Transitional          <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
XHTML 1.0 Frameset              <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
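
To try the override idea outside the CGI script, a standalone sketch along the following lines may help. It is not the patch itself: parse_doctype_map and override_doctype are hypothetical names, and the config format (a display name, two or more whitespace characters, then the declaration) is an assumption, since the real read_cfg routine is not shown in this message. Unlike the patch, the sketch works on a single string rather than an array of lines and calls eof so that buffered trailing text is flushed.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical stand-in for read_cfg. Assumed line format:
#   <display name><two or more whitespace chars><!DOCTYPE ...>
sub parse_doctype_map {
    my (@lines) = @_;
    my %map;
    for my $line (@lines) {
        next unless $line =~ /^(\S.*?)\s{2,}(<!DOCTYPE.*>)\s*$/;
        $map{$1} = $2;
    }
    return \%map;
}

# Comment out every declaration found in $html and put $doctype on top.
sub override_doctype {
    my ($html, $doctype) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version   => 3,
        # Everything that is not a declaration is copied through verbatim.
        default_h     => [sub { $out .= $_[0] // '' }, 'text'],
        # Declarations such as <!DOCTYPE ...> are wrapped in a comment.
        declaration_h => [sub { $out .= "<!-- $_[0] -->" }, 'text'],
    );
    $parser->parse($html);
    $parser->eof;    # flush text the parser may still be holding back
    return "$doctype\n$out";
}

my $map = parse_doctype_map(
    'HTML 4.01 Strict    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">',
    'HTML 3.2            <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">',
);

my $page = <<'HTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html><head><title>t</title></head><body><p>x</p></body></html>
HTML

my $result = override_doctype($page, $map->{'HTML 4.01 Strict'});
print $result;
```

The old declaration stays in the output as a comment, so the line count of the document shifts by exactly one (the prepended DOCTYPE), which matters when mapping validator error positions back to the original source.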
Received on Tuesday, 10 October 2000 10:40:16 GMT