Re: Why HTML should be taught as HTML without pretending it is XML from Robert Burns on 2007-07-23 (public-html@w3.org from July 2007)

From: Robert Burns <rob@robburns.com>
Date: Mon, 23 Jul 2007 11:55:29 -0500
To: Jon Barnett <jonbarnett@gmail.com>
Cc: public-html <public-html@w3.org>
Message-Id: <FB4C5528-387E-4324-BCA0-4AB9BC3FB745@robburns.com>
On Jul 23, 2007, at 11:19 AM, Jon Barnett wrote:

>
> On 7/23/07, Robert Burns <rob@robburns.com> wrote:
>>
>> Well, I'm not sure we can conclude from one anecdote, what everyone
>> else thinks about their HTML.
>
> The measurable study would be the number of pages with XHTML DOCTYPEs,
> served as HTML, and containing markup that would have unintended
> consequences if served as XML.

This would not tell us how many people thought they were serving  
XHTML as XML. It would show us how many people were authoring to XHTML
(possibly appendix C), perhaps for validation and authoring simplicity
reasons, but vending as text/html. It might also serve as a
measure of how many pages will face some difficulties when switching
to XML. There's nothing wrong with producing XHTML code that meets  
the appendix C guidelines and serving as text/html.


> I'm afraid I can't offer anything
> other than anecdotes (experience on lots of forums, personal
> conversations, etc., the fact my college professor was teaching
> exactly what I learned to a number of other students), but the fact
> that this page exists says something:
>
> http://www.hixie.ch/advocacy/xhtml

I'm not sure it does say anything. There are lots of authors
authoring pages as XHTML and serving it as text/html (presumably  
following more or less appendix C or they would be failing now).


>> Telling authors they've somehow made a mistake because they're beating
>> down a cowpath that, for some strange reason you think is misguided,
>> does not make it any less of a cowpath.
> It's how you interpret the cowpath.  I interpret it to mean that
> authors misunderstand how XHTML actually works.  I think that teaching
> HTML as having XHTML-like syntax would lead to shock when the author
> first tries to do <p><ol></ol></p>

That's invalid HTML4.01 and invalid XHTML1.0. This will be a problem  
when we introduce HTML5 and authors run their documents through an  
HTML5 conformance checker if the conformance checker doesn't throw up  
an error. An XHTML1 validator should throw up an error for that too.

>> No one has ever, as far as I
>> am aware, ever explained in a logical way, what could possibly be
>> wrong with authoring content that adheres to XHTML appendix C. It has
>> simply become a mantra amidst a certain web development clique.
>> ...
>> Those are very minor differences that would only be gotchas for those
>> ignoring Appendix C. Often authors are told to go with external
>> stylesheets and external scripts (so that takes care of CDATA
>> sections). Do that; don't count on implicit elements; use Unicode
>> characters instead of named character entities and stick with DOM1
>> through DOM3 and you'll be fine (oh and don't count on IE consuming
>> your content). There's no need to raise the Homeland Security alert
>> level over XHTML. It's just a few things to understand about it
>> before vending as XML. However, all that has nothing to do with the
>> other reason for following an appendix C syntax: for its consistency
>> and readability.
>
> All of those things you just mentioned are caveats when serving XHTML
> as text/html, and none of them are mentioned in XHTML 1.0 Appendix
> C.

No, those are not caveats for serving XHTML1.0 as text/html.  Those  
are caveats for those who, though they are successfully serving  
XHTML1 as text/html, want to move to XML instead.

However, from appendix C[1]:
quote/
C.4. Embedded Style Sheets and Scripts
Use external style sheets if your style sheet uses < or & or ]]> or  
--. Use external scripts if your script uses < or & or ]]> or --.  
Note that XML parsers are permitted to silently remove the contents  
of comments. Therefore, the historical practice of "hiding" scripts  
and style sheets within "comments" to make the documents backward  
compatible is likely to not work as expected in XML-based user agents.

/unquote
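
The C.4 guideline quoted above can be turned into a rough check. What follows is only a minimal sketch in Python: `needs_external` is a hypothetical helper name, not part of any validator, and it does nothing more than look for the four sequences C.4 lists.

```python
# Sequences that XHTML 1.0 Appendix C.4 names as reasons to move a
# style sheet or script into an external file.
RISKY_SEQUENCES = ("<", "&", "]]>", "--")


def needs_external(source: str) -> bool:
    """Return True if the embedded script or style sheet text contains
    any of the sequences listed in C.4 (hypothetical helper)."""
    return any(seq in source for seq in RISKY_SEQUENCES)


# A script using a comparison and a decrement trips the check;
# a plain style rule does not.
print(needs_external("if (a < b) { i--; }"))
print(needs_external("p { color: red }"))
```

This is a plain substring test, so it will also flag those sequences inside string literals; that matches the conservative wording of C.4.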

The other points only relate to vending that XHTML content as XML.  
Authors vending as text/html need not concern themselves with those  
(which is why they're not mentioned in appendix C; perhaps there  
should have been an appendix E: moving your appendix C content to XML).

> To that, I'll add that document.createElement(), one of the most basic
> DOM methods, creates an element without a namespace.  If this
> quasi-XHTML eventually gets served as XHTML, even
> document.createElement would have unintended consequences.
>
>> And it's not just a pedagogical issue. XML actually separates two
>> things that cannot be clearly separated in HTML: well-formedness and
>> validity. Take Henri's favorite example from HTML5: <p><ol><//o></
>> p>.. In HTML5, this is perfectly valid and well-formed (presuming its
>> properly placed in a larger document). It's a part of a valid DOM
>> tree state. It's a valid XML serialization. However, it's not
>> possible to express this in HTML4 with MIME type text/html (I was
>> under the impression that it would be valid in HTML5, but Henri
>> suggests otherwise).
>
> It's mentioned here:
> http://www.whatwg.org/specs/web-apps/current-work/#element-restrictions

I see. I hadn't yet read that part of the spec. I definitely support  
this forking here, but we need to be extra careful about informing  
readers of our recommendation about that. We might even want to  
include some notation on the semantics chapter to help draw attention  
to these varied content models. I had been wondering how those  
content models were going to work in text/html, but I assumed that  
the testing had already been done.
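
On the document.createElement() point quoted earlier, the namespace difference is easy to see in any namespace-aware DOM. The sketch below uses Python's xml.dom.minidom purely as a stand-in for an XML DOM; it is an assumption that this is a fair analogy, and it says nothing about how a browser treats a text/html document.

```python
from xml.dom.minidom import getDOMImplementation

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Build an empty XML document whose root element is in the XHTML namespace.
impl = getDOMImplementation()
doc = impl.createDocument(XHTML_NS, "html", None)

# createElement() takes only a tag name, so the element has no namespace.
plain = doc.createElement("p")

# createElementNS() is the namespace-aware variant.
namespaced = doc.createElementNS(XHTML_NS, "p")

print(plain.namespaceURI)       # None
print(namespaced.namespaceURI)  # http://www.w3.org/1999/xhtml
```

In this DOM, a script written around createElement() would need to switch to createElementNS() to produce XHTML elements.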

>> Is it invalid in that the author
>> put an ordered list in a paragraph where it didn't belong? Or is it
>> ill-formed where the author included a closing </p> tag where it
>> didn't belong.
>
> The latter.  It's invalid (or malformed) because there's a closing
> </p> tag where it didn't belong.  The <p> element was implicitly
> closed when the parser reached the opening <ol> tag.

No, you missed the point. It is definitely not the latter. The  
author's intention was to invalidly place an ordered list into a  
paragraph. The parser mis-guesses that it's instead an ill-formed  
document fragment. That's the point of the example. Most of the HTML  
error recovery behaves this way. The point of the example is that we start
from the author's intention: here, to do something invalid. The
text/html parser cannot tell the difference, so it assumes the wrong
thing here (wrong as in not what the author intended). (Again, this is
off-topic, but that is what XML introduces: a way for the parser to
always tell the difference, though the document has to be well-formed
before it can move to the next step.)
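
The two-step separation can be sketched with an XML parser. This is a minimal illustration in Python, not a real validator: `is_well_formed` and `violates_p_content_model` are hypothetical helpers, and the second is a toy stand-in for DTD validation that only looks for an <ol> directly inside a <p>.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Step 1: does the markup parse as XML at all?"""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def violates_p_content_model(markup: str) -> bool:
    """Step 2 (toy check, only meaningful for well-formed input):
    is there an <ol> that is a direct child of a <p>?"""
    root = ET.fromstring(markup)
    return any(child.tag == "ol" for p in root.iter("p") for child in p)


nested = "<p><ol></ol></p>"    # well-formed, but breaks the content model
unclosed = "<p><ol></ol><p>"   # not well-formed: two <p> elements never close

print(is_well_formed(nested), violates_p_content_model(nested))
print(is_well_formed(unclosed))
```

The XML parser keeps the <ol> inside the <p>, so the author's invalid nesting is reported as a validity problem rather than being reinterpreted as a tag-matching problem.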

>> Anyway, this is getting off topic. The main thing is that there are
>> many reasons to go xml-like syntax: even for text/html.
>
> There is at least one good reason not to.  By teaching authors to use
> a "stricter" HTML syntax, authors expect the parser to follow that
> stricter syntax (e.g. expecting <p><ol></ol></p> to work).  This leads
> to unintended consequences.  (e.g. it gets parsed as <p></p><ol></ol>)
> These unintended consequences are analogous to the unintended
> consequences of serving XHTML as text/html.

The authors would still be validating their code. I have a feeling
those that use XHTML doctypes probably also validate their code (or those
using any doctype, for that matter). The xml-like syntax doesn't
suggest that anyone can go and create invalid documents simply because
they're not ill-formed. It's just that authors have, more and more,
used XHTML syntax and validated their code against an XHTML DTD.

> One can encourage authors to use good, consistent coding practice -
> quoted attributes, etc.  But, teaching XML-like HTML as The Way to
> write HTML would lead to those unintended consequences.

I still don't see what unintended consequences you're talking about.  
Will Hixie write them a nasty email? :-)

Take care,
Rob
Received on Monday, 23 July 2007 16:56:09 UTC