Re: Why HTML should be taught as HTML without pretending it is XML from Robert Burns on 2007-07-23 (public-html@w3.org from July 2007)

From: Robert Burns <rob@robburns.com>
Date: Mon, 23 Jul 2007 10:07:53 -0500
To: Jon Barnett <jonbarnett@gmail.com>
Cc: public-html <public-html@w3.org>
Message-Id: <4B0E65A8-BE96-4D04-8C75-4FC3FF6A054C@robburns.com>
On Jul 23, 2007, at 8:30 AM, Jon Barnett wrote:

>
> On 7/21/07, Robert Burns <rob@robburns.com> wrote:
>
>>
>> It may not be a popularity  contest, but it is a relatively well-
>> beaten cowpath. By cowpath principle, authors are demonstrating that
>> an HTML syntax (and implementation enhancements) that allows writing
>> once and deploying either as XML or as text/html is desirable.
>
> If anything, authors are demonstrating that they think that they *are*
> using XML, even though they're serving it as text/html.  There is an
> astounding number of authors who write HTML for money and don't know
> the difference.  I used to be one of them.

Well, I'm not sure we can conclude from one anecdote what everyone
else thinks about their HTML. In some ways, though, I think you may
be right: authors think they're writing XHTML (they are using XML,
even if they don't serve it as such). And many of them are. Sure,
there are all sorts of subtle distinctions that might hit these  
authors if they switched to actually vend their XHTML as XML. That's  
a bit of confusion that needs to be cleared up, and I would certainly  
take every opportunity to make sure others understood that. However,
it still indicates that many authors want to author using XHTML and
follow appendix C. If authors author with XHTML and follow
appendix C, I see few difficulties they would face if they wanted to  
switch over. The biggest problems would have to do with the  
immaturity of XHTML implementations. Many fairly recent versions of
popular browsers will break on things like HTML named character  
entities. I'm talking fatal parse errors here. The other problems  
anyone might encounter (other than immature implementations) relate  
to using non-standards or not following appendix C. This to me is a
cowpath: authors are telling us in droves that they want to author
with XHTML, and that they may even want to vend that XHTML as XML.
However, the implementations have not caught up to the authors.
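
To make the named-entity problem concrete, here is a minimal sketch in
Python. It uses the standard library's xml.etree.ElementTree as a
stand-in for a strict XML parser that does not read the XHTML DTD;
that is an assumption on my part, since browsers use their own parsers
and a DTD-aware parser could resolve the entity. The helper name
parses_as_xml is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # An undefined entity is a fatal (well-formedness) error in XML.
        return False


named = "<p>caf&eacute;</p>"    # HTML named character entity
numeric = "<p>caf&#233;</p>"    # numeric character reference
literal = "<p>caf\u00e9</p>"    # the Unicode character itself

print(parses_as_xml(named))     # False
print(parses_as_xml(numeric))   # True
print(parses_as_xml(literal))   # True
```

The last two spellings are the ones that survive a switch from
text/html to XML without any help from a DTD.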

Telling authors they've somehow made a mistake because they're beating
down a cowpath that, for some strange reason, you think is misguided
does not make it any less of a cowpath. No one has ever, as far as I
am aware, explained in a logical way what could possibly be
wrong with authoring content that adheres to XHTML appendix C. It has
simply become a mantra amidst a certain web development clique.

> They are not demonstrating that they know the difference, but like to
> switch back and forth. There are better tools for that, if an author
> actually knows what he is doing and wants to do that.

I don't think they switch back and forth (though I know there are
some who inadvisably promote that). I think they just like the
authoring style of XHTML (fewer intricacies to remember) and they
would like to take advantage of all of the other features of XML  
whenever the implementations catch up to the authors.

> Henri pointed out some differences other than syntactic differences
> that authors won't catch on to.  There are plenty others including
> parsing rules, DOM functions and more.

Those are very minor differences that would only be gotchas for those  
ignoring Appendix C. Often authors are told to go with external  
stylesheets and external scripts (so that takes care of CDATA  
sections). Do that; don't count on implicit elements; use Unicode
characters instead of named character entities and stick with DOM1  
through DOM3 and you'll be fine (oh and don't count on IE consuming  
your content). There's no need to raise the Homeland Security alert  
level over XHTML. It's just a few things to understand about it   
before vending as XML. However, all that has nothing to do with the  
other reason for following an appendix C syntax: for its consistency  
and readability.
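
As a rough illustration of the "write once" idea, the Python sketch
below feeds one appendix-C-style fragment (explicit end tags, "<br />"
with a space, a literal Unicode character) to both an HTML tokenizer
and an XML parser and compares the element names each one reports.
The assumptions are loud ones: html.parser is only a tokenizer and
does not apply a browser's tree-building rules, the SAMPLE fragment is
invented, and the helper names are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class StartTagCollector(HTMLParser):
    """Record the name of every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<br />" through the default handle_startendtag.
        self.tags.append(tag)


def html_start_tags(markup):
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_element_names(markup):
    # iter() walks the parsed tree in document order.
    return [element.tag for element in ET.fromstring(markup).iter()]


# Invented sample written in the appendix C style.
SAMPLE = "<div><p>caf\u00e9 <em>au lait</em><br />next line</p></div>"

print(html_start_tags(SAMPLE))
print(xml_element_names(SAMPLE))
```

Both calls report the same four element names in the same order. A
fragment that leans on implicit end tags, such as "<p>one<p>two", would
be rejected by the XML parser instead.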

>> So I would say that the XHTML1-appendix C-like syntax is one of those
>> cowpaths we should be considering even if we aren't just trying to
>> judge a popularity contest.
>
> Appendix C is one of the reasons we now have the problem of HTML pages
> pretending to be XML, and why authors would be royally confused if
> they ever tried to actually serve those pages as XML.

What exactly is the problem with their "pretending" to be XHTML  
pages? Will the server crash? Will it give me bad breath? There's a  
lot of scare language around this issue that just has no technical  
justification.

> So, as I take it, your reason for wanting to encourage XML-like syntax
> is to smooth the conversion of an HTML document to an XHTML document.
> I contend that is a bad reason because it will encourage confusion
> between the differences between the two.  There are useful tools out
> there to convert the syntax.  Even after converting syntax, authors
> still have to contend with differences in available DOM methods,
> content models, parsing rules, and server configuration.

No, that is not really my reason for wanting an appendix-C-like  
syntax. There are many reasons, but one of them is that it is a much  
easier syntax to understand. As a related story, I think Dreamweaver  
2003 has been touted as the first authoring tool to produce valid,
well-formed HTML (this may just be advertising lingo, but there's
some truth to it). Even the authoring tools couldn't keep HTML  
minimization rules straight. With the introduction of XML and XHTML,
I sensed that a light bulb suddenly went on for web developers
around the world. "Aha!" they said "That's what proper nesting is all
about!" The authoring tools started to get it right. Authors finally  
understood.

And it's not just a pedagogical issue. XML actually separates two  
things that cannot be clearly separated in HTML: well-formedness and
validity. Take Henri's favorite example from HTML5: <p><ol></ol></p>.
In HTML5, this is perfectly valid and well-formed (presuming it's
properly placed in a larger document). It's a part of a valid DOM
tree state. It's a valid XML serialization. However, it's not  
possible to express this in HTML4 with MIME type text/html (I was  
under the impression that it would be valid in HTML5, but Henri  
suggests otherwise).

Moreover (and this is what I want to illustrate), there's no way for  
the parser to determine whether this is invalid or ill-formed (it's  
got to be one of them because text/html HTML4 forces that by its
validity and implicit </p> rules).  Is it invalid in that the author
put an ordered list in a paragraph where it didn't belong? Or is it
ill-formed in that the author included a closing </p> tag where it
didn't belong? The parser has to make a decision on this and that
decision will affect the rendering of the page. In XML this would be
a clear invalidity violation and clearly not an ill-formedness error.  
XML has basically made invalidity less of a problem because it  
separates out the worst part that was lumped together in text/html:
ill-formedness. That is why the parser error rules are so strict in
XML: because we're talking about ill-formedness errors and not simply
invalidity errors. I know HTML5 wants to address ill-formedness by
specifying a recovery for all the possible errors. However, drawing on
the existing implementations requires that most of those recovery  
techniques result in assuming ill-formedness (very unhelpful when  
trying to extend HTML's vocabulary).
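
The two-step separation can be sketched in Python like this. Step one
asks only whether the markup is well-formed XML; step two, run only if
step one passes, checks a content model. The FORBIDDEN_CHILDREN table
is a deliberately tiny, hypothetical stand-in for a real DTD or
schema, and check() is a made-up helper, not anything from a spec.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy content model: parent name -> child names not allowed.
FORBIDDEN_CHILDREN = {"p": {"ol", "ul", "p", "div"}}


def check(markup: str) -> str:
    """Classify markup as 'ill-formed', 'invalid', or 'ok'."""
    # Step 1: well-formedness. Any failure here is fatal in XML.
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return "ill-formed"
    # Step 2: validity against the toy content model above.
    for parent in root.iter():
        for child in parent:
            if child.tag in FORBIDDEN_CHILDREN.get(parent.tag, set()):
                return "invalid"
    return "ok"


print(check("<p><ol></ol></p>"))                 # invalid
print(check("<p><ol></p></ol>"))                 # ill-formed
print(check("<div><p>text</p><ol></ol></div>"))  # ok
```

Under these assumptions the first case is unambiguously a validity
problem: the tree is known before the content model is ever consulted,
so there is nothing for the parser to guess about.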

Anyway, this is getting off topic. The main thing is that there are  
many reasons to go with an XML-like syntax, even for text/html. There
are not really any of the horrifying impacts that some hint at. And I
think it would be good for HTML5 to foster this cowpath. By  
minimizing the differences between text/html and XML, HTML5 can get  
the word out better on how to handle those subtle differences. Since  
we're defining our own DOM, we can also specify what DOM APIs should  
be there. I could see breaking from text/html if we had good reason  
to deprecate those DOM calls, but it shouldn't just be based on the  
fact that some implementations just didn't implement document.write()  
for XML. It should be because we want to deprecate document.write()  
(if that's indeed what we want to do).

Take care,
Rob
Received on Monday, 23 July 2007 15:08:15 UTC