[whatwg] Allow trailing slash in always-empty HTML5 elements? from Lachlan Hunt on 2006-12-02 (public-whatwg-archive@w3.org from December 2006)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Sun, 03 Dec 2006 00:27:55 +1100
Message-ID: <45717F5B.7080907@lachy.id.au>
Elliotte Harold wrote:
> Lachlan Hunt wrote:
>> HTML and XML have significantly different parsing requirements and 
>> they absolutely must be treated as significantly different file 
>> formats.  Any attempt to treat them as the same format is an extremely 
>> bad idea.
> 
> That's only true to the extent that some people seem to insist on making 
> them needlessly different. HTML is tantalizingly close to well-formed 
> XML. They both derive from SGML. They both use angle bracketed tags. 
> They both define a tree structure. Indeed in many cases an HTML document 
> is an XML document.

In many more cases, an HTML document or even an XHTML 1.0 as text/html 
document is just tag soup.

> This enables the use of the very powerful XML toolchain for processing 
> HTML.

The XHTML serialisation allows for the very powerful XML toolchain for 
processing (X)HTML.  You just need to stick an HTML serialiser on to the 
end of it.
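
The pipeline described above can be sketched roughly as follows. This is only a minimal illustration, assuming Python's standard xml.etree.ElementTree module stands in for "the XML toolchain" and its method="html" output stands in for the HTML serialiser; the sample document and the xhtml_to_html helper are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML input.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Demo</title></head>'
    '<body><p>One<br/>Two</p></body>'
    '</html>'
)

def xhtml_to_html(source):
    """Parse XHTML with an XML parser, then serialise the tree as HTML."""
    root = ET.fromstring(source)  # raises ET.ParseError if not well-formed
    for el in root.iter():
        if isinstance(el.tag, str):
            # Drop the "{namespace}" prefix so plain HTML tag names are written.
            el.tag = el.tag.split('}', 1)[-1]
    return ET.tostring(root, encoding="unicode", method="html")

html_out = xhtml_to_html(XHTML)
print(html_out)
```

The empty element written as <br/> in the XHTML input comes out as <br> in the HTML output, which is the kind of difference the serialiser at the end is there to handle.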

> In fact, prior to the widespread adoption of XML there were, near 
> as I could tell, no reliable open means of parsing HTML documents.

HTML 2.0 to 4.01 documents could, in the same way you're insisting on 
using XML tools on the back end, be reliably parsed using SGML tools. 
Now that HTML5 is no longer based on SGML, it will require the 
use of an HTML5 parser instead, but the principle is the same.  It seems 
the only thing preventing that from happening right now is the current 
lack of implementations.  But given that HTML5 is a new language still 
under development, and the fact that such tools are being developed 
right now, it won't be a problem for much longer.
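
As a rough sketch of the principle, a parser written for HTML copes with markup that no XML tool would accept. The example assumes Python's standard html.parser module as a stand-in; it is a lenient HTML tokeniser, not a conforming HTML5 parser, and the TagCollector class and sample markup are hypothetical.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records each start tag and its attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))

# Unclosed <p> elements, an unclosed <br>, and unquoted attribute values.
TAG_SOUP = '<p>One<p>Two<br><img src=photo.png alt=Photo>'

collector = TagCollector()
collector.feed(TAG_SOUP)
collector.close()

for tag, attrs in collector.start_tags:
    print(tag, attrs)
```

The tokeniser reports every tag without raising an error, where an XML parser would stop at the first unquoted attribute or unclosed element.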

> There were a few proprietary, incompatible, buggy engines locked up in various 
> browsers; and that was about it.

OpenSP, which is free software,

> What I don't understand is why some members of this working group are so 
> dead set on actively preventing HTML from being XML. The non-draconian 
> error handling I understand.

Because the fact is that when authors try to use XHTML as text/html, 
they inevitably fail to do so properly.  It takes considerable knowledge 
and skill to be aware of and handle all the issues, ranging from parsing 
and character encodings to scripts and stylesheets.

This is a list of very common mistakes made by the vast majority of 
real-world authors when they try, and fail, to use XHTML as text/html. 
Any of them would cause significant problems with an attempt to serve 
the same document as XML.

* Fatal well-formedness errors
   - Unencoded & and <
   - Unclosed elements
   - Unquoted attributes
   - etc...

* Incorrect or omitted namespace declaration (xmlns attribute), or use
   of ill-formed MS Office xmlns garbage.

* Named entity references require validating parsers (or a Mozilla-like
   hack to parse a subset of the DTD for recognised DOCTYPEs)
   - (excluding &amp; &lt; &gt; &quot; and &apos;)
   - Lack of DOCTYPE in XHTML5 means that any others would be fatal

* Encoding should be declared within the XML declaration
   - When omitted, UTF-8 or UTF-16 must be used, unless specified at
     the protocol level (usually not done).
   - Many just use ISO-8859-1, Windows-1252, etc. specified using <meta>
   - XML declaration triggers quirks mode in IE6 (text/html only).

* Badly encoded characters
   - e.g. use of Windows-1252 when ISO-8859-1 is declared

* Script and style elements are parsed differently
   - Not a problem for external scripts, but internal scripts
     are very common.

   - This *very common* technique doesn't work in XML:
     <script><!-- // Hide from older browsers
         // Script will not execute in XML
     //--></script>

   - On pages that don't use that comment, this would be fatal:
     <script>
         if (a < b & c) {
             // do something
         }
     </script>

   - This can be worked around using a CDATA section, but
     <script>//<![CDATA[
         // Authors rarely do this!
     //]]></script>

* document.write() and document.writeln() do not work.

* DOM methods are case sensitive.
   - Although HTML5 is attempting to address many DOM API differences,
     several still remain for backwards compatibility.

* XML rules for CSS differ slightly from HTML.
   - e.g. No special treatment for the body element.
   - Case sensitivity of Selectors
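
Several of the parsing items in the list above can be checked against a real XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree (a non-validating parser) as the XML processor; the xml_error helper and the sample fragments are hypothetical and only mirror the cases listed.

```python
import xml.etree.ElementTree as ET

def xml_error(source):
    """Return the parser's error message, or None if source is well-formed."""
    try:
        ET.fromstring(source)
    except ET.ParseError as exc:
        return str(exc)
    return None

SAMPLES = {
    "unencoded ampersand": "<p>Fish & chips</p>",
    "unclosed element": "<p>One<br>Two</p>",
    "unquoted attribute": "<a href=page.html>link</a>",
    "undeclared named entity": "<p>a&nbsp;b</p>",
    "inline script with < and &": "<script>if (a < b & c) { }</script>",
    "script in CDATA section":
        "<script>//<![CDATA[\nif (a < b & c) { }\n//]]></script>",
}

results = {label: xml_error(src) for label, src in SAMPLES.items()}
for label, error in results.items():
    print(f"{label}: {'OK' if error is None else 'fatal: ' + error}")

# The comment-hiding technique is well-formed, but the XML parser treats
# the whole script body as a comment, so no script text is left to run.
hidden = ET.fromstring(
    "<script><!-- // Hide from older browsers\nalert('hi');\n//--></script>"
)
script_text = hidden.text or ""
print("comment-hidden script text:", repr(script_text))
```

Only the CDATA variant parses cleanly with its script text intact; every other fragment in SAMPLES is rejected outright, and the comment-hidden script parses but ends up empty.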

Keep in mind that, although someone like yourself may be able to handle 
every single one of those issues with ease, you are in the minority. 
There is significant evidence to show that millions of authors make 
those mistakes very frequently, despite thinking they're using XHTML.
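
The encoding items in the list can be illustrated in the same way. This is a small sketch, again assuming Python's standard library as the XML processor; the byte strings are made-up samples of what an editor saving in Windows-1252 or ISO-8859-1 would produce.

```python
import xml.etree.ElementTree as ET

# Curly quotes as saved by an editor using Windows-1252.
raw = b"\x93quoted\x94"
as_cp1252 = raw.decode("cp1252")      # the encoding actually used
as_latin1 = raw.decode("iso-8859-1")  # the encoding wrongly declared

print([hex(ord(ch)) for ch in as_cp1252 if ord(ch) > 127])
print([hex(ord(ch)) for ch in as_latin1 if ord(ch) > 127])

# The same ISO-8859-1 bytes, without and with an XML declaration.
undeclared = b"<p>caf\xe9</p>"
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'

try:
    ET.fromstring(undeclared)   # parser assumes UTF-8 here
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

declared_text = ET.fromstring(declared).text
print("undeclared parses:", undeclared_ok)
print("declared text:", declared_text)
```

Decoding the Windows-1252 bytes as ISO-8859-1 yields C1 control characters rather than quotation marks, and the document without an encoding declaration is rejected because its bytes are not valid UTF-8.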

That is why I strongly believe that XHTML 1.0 Appendix C was a huge 
mistake and that continuing to allow authors to think they can use XHTML 
as text/html is extremely harmful for the future of XML, not beneficial 
to it.

> But why are you disappointed that <!DOCTYPE html> is well-formed XML?
> Why the active hostility to well-formedness?

Because it allows people like yourself to continue thinking that it's ok 
to parse HTML with an XML parser, just because the two happen to share a 
few similarities in their syntax, and despite the fact that an XML 
serialisation is being provided for exactly that purpose.

-- 
Lachlan Hunt
http://lachy.id.au/
Received on Saturday, 2 December 2006 05:27:55 UTC