Re: Why HTML should be taught as HTML without pretending it is XML from Robert Burns on 2007-07-23 (public-html@w3.org from July 2007)

From: Robert Burns <rob@robburns.com>
Date: Mon, 23 Jul 2007 13:32:04 -0500
To: Jon Barnett <jonbarnett@gmail.com>
Cc: public-html <public-html@w3.org>
Message-Id: <97A1960A-D3B2-4411-8B7C-0908AB00FE49@robburns.com>
Hi Jon,


On Jul 23, 2007, at 1:02 PM, Jon Barnett wrote:

>
> On 7/23/07, Robert Burns <rob@robburns.com> wrote:
>>
>
>>
>> This would not tell us how many people thought they were serving
>> XHTML as XML. It would show us how many people were authoring to XHTML
>> (possibly appendix C), perhaps for validation and authoring simplicity
>> reasons, but vending as text/html. It might also serve as a
>> measure of how many pages will face some difficulties when switching
>> to XML. There's nothing wrong with producing XHTML code that meets
>> the appendix C guidelines and serving it as text/html.
>
> At risk of beating a dead horse...
>
> Is your premise that there are authors who:
> - Use an XHTML 1.0 DOCTYPE declaration

Yes.

> - Include the html/@xmlns attribute, and maybe even a <?xml?> PI.

Possibly.

> - Use XML syntax (ending empty elements with />, lowercase tags, etc)

Yes.

> - Follow the guidelines of XHTML 1.0 Appendix C

Yes.

> - Send this document as text/html

Yes.

> - Are fully aware that this document is not treated as XHTML by any
> UA, and is parsed as plain HTML by every UA

Yes (they're mostly not even concerned about this issue).

> - Never intend to switch to serving the document as
> application/xhtml+xml, or have the document parsed as XML, without
> mowing through a long list of caveats not covered by Appendix C

No. These authors jumped on the bandwagon because they're excited
about the possibilities of XHTML and XML. They want to switch some
day. However, it is not their main motivation to simply avoid running
HTMLTidy on their content. That's what makes it an even stronger
cowpath: it's an indication of authors' excitement about XML and
XHTML and not an indication that authors are stupid (as others seem
to imply).

> - ... and do all of these things solely because they prefer the
> syntax? (This is the important point, because I believe that's the
> crux of your premise)

No, not solely. It's the syntax, it's the excitement about the
promise of XML and XHTML, it's the assurance that once
implementations are ready it will be a small step to change over, etc.

> and then,
> - are aware that they would have this same feature-set by using an
> HTML 4.01 DOCTYPE and removing some / characters from the document
> (they're allowed in HTML 5)

Yes.

> By observing the cowpath, that's not what I see.

You just see some poor misguided saps? You haven't shown anything
wrong with using this appendix C syntax. Yet you think anyone using
it is doing something wrong. What are they doing wrong? It requires
authors to overlook some validation errors (xml:lang and lang; the
"/" characters, etc.). However, as I've said before, that could be
fixed by adding an appendix C DTD to the validator. These authors,
however, have gotten used to looking past those errors. They know
which ones to ignore.

> For example, I used to (and still rarely do) write XHTML documents and
> serve them as text/html.  But I only did this with the intention that
> an XML parser could read it the same way, and that I intended to serve
> it as XML in the future.

Then you must understand what I'm talking about. :-) (I'm not using  
must in the RFC 2119 sense)

>
>>
>>
>> > I'm afraid I can't offer anything
>> > other than anecdotes (experience on lots of forums, personal
>> > conversations, etc., the fact my college professor was teaching
>> > exactly what I learned to a number of other students), but the fact
>> > that this page exists says something:
>> >
>> > http://www.hixie.ch/advocacy/xhtml
>>
>> I'm not sure it does say anything. There are lots of authors
>> authoring pages as XHTML and serving it as text/html (presumably
>> following more or less appendix C or they would be failing now).
>>
>>
>> >> Telling authors they've somehow made a mistake because they're
>> >> beating down a cowpath that, for some strange reason, you think is
>> >> misguided, does not make it any less of a cowpath.
>> > It's how you interpret the cowpath.  I interpret it to mean that
>> > authors misunderstand how XHTML actually works.  I think that
>> > teaching HTML as having XHTML-like syntax would lead to shock when
>> > the author first tries to do <p><ol></ol></p>
>>
>> That's invalid HTML4.01 and invalid XHTML1.0. This will be a problem
>> when we introduce HTML5 and authors run their documents through an
>> HTML5 conformance checker if the conformance checker doesn't throw up
>> an error. An XHTML1 validator should throw up an error for that too.
>>
>> >> No one has, as far as I am aware, ever explained in a logical way
>> >> what could possibly be wrong with authoring content that adheres
>> >> to XHTML appendix C. It has simply become a mantra amidst a
>> >> certain web development clique.
>> >> ...
>> >> Those are very minor differences that would only be gotchas for
>> >> those ignoring Appendix C. Often authors are told to go with
>> >> external stylesheets and external scripts (so that takes care of
>> >> CDATA sections). Do that; don't count on implicit elements; use
>> >> Unicode characters instead of named character entities and stick
>> >> with DOM1 through DOM3 and you'll be fine (oh and don't count on IE
>> >> consuming your content). There's no need to raise the Homeland
>> >> Security alert level over XHTML. It's just a few things to
>> >> understand about it before vending as XML. However, all that has
>> >> nothing to do with the other reason for following an appendix C
>> >> syntax: for its consistency and readability.
>> >
>> > All of those things you just mentioned are caveats when serving
>> > XHTML as text/html, and none of them are mentioned in XHTML 1.0
>> > Appendix C.
>>
>> No, those are not caveats for serving XHTML1.0 as text/html.  Those
>> are caveats for those who, though they are successfully serving
>> XHTML1 as text/html, want to move to XML instead.
>>
>> However, from appendix C[1]:
>> quote/
>> C.4. Embedded Style Sheets and Scripts
>> Use external style sheets if your style sheet uses < or & or ]]> or
>> --. Use external scripts if your script uses < or & or ]]> or --.
>> Note that XML parsers are permitted to silently remove the contents
>> of comments. Therefore, the historical practice of "hiding" scripts
>> and style sheets within "comments" to make the documents backward
>> compatible is likely to not work as expected in XML-based user  
>> agents.
>>
>> /unquote
>>
>> The other points only relate to vending that XHTML content as XML.
>> Authors vending as text/html need not concern themselves with those
>> (which is why they're not mentioned in appendix C; perhaps there
>> should have been an appendix E: moving your appendix C content to  
>> XML).
>>
>> > To that, I'll add that document.createElement(), one of the most
>> > basic DOM methods, creates an element without a namespace.  If this
>> > quasi-XHTML eventually gets served as XHTML, even
>> > document.createElement would have unintended consequences.
>> >
>> >> And it's not just a pedagogical issue. XML actually separates two
>> >> things that cannot be clearly separated in HTML: well-formedness
>> >> and validity. Take Henri's favorite example from HTML5:
>> >> <p><ol></ol></p>. In HTML5, this is perfectly valid and well-formed
>> >> (presuming it's properly placed in a larger document). It's a part
>> >> of a valid DOM tree state. It's a valid XML serialization. However,
>> >> it's not possible to express this in HTML4 with MIME type text/html
>> >> (I was under the impression that it would be valid in HTML5, but
>> >> Henri suggests otherwise).
>> >
>> > It's mentioned here:
>> > http://www.whatwg.org/specs/web-apps/current-work/#element-restrictions
>>
>> I see. I hadn't yet read that part of the spec. I definitely support
>> this forking here, but we need to be extra careful about informing
>> readers of our recommendation about that. We might even want to
>> include some notation on the semantics chapter to help draw attention
>> to these varied content models. I had been wondering how those
>> content models were going to work in text/html, but I assumed that
>> the testing had already been done.
>>
>> >> Is it invalid in that the author
>> >> put an ordered list in a paragraph where it didn't belong? Or is
>> >> it ill-formed in that the author included a closing </p> tag where
>> >> it didn't belong?
>> >
>> > The latter.  It's invalid (or malformed) because there's a closing
>> > </p> tag where it didn't belong.  The <p> element was implicitly
>> > closed when the parser reached the opening <ol> tag.
>>
>> No, you missed the point. It is definitely not the latter. The
>> author's intention was to invalidly place an ordered list into a
>> paragraph. The parser mis-guesses that it's instead an ill-formed
>> document fragment. That's the point of the example. Most of the HTML
>> error recovery behaves this way. The point of the example is we start
>> from the author's intention: here to do something invalid. The
>> text/html parser cannot tell the difference. So it assumes the wrong
>> thing here (wrong as in not what the author intended). (Again, this
>> is off-topic, but that is what XML introduces: a way for the parser
>> to always tell the difference, though the document has to be
>> well-formed before it can move to the next step.)
>
> I understood the point.  You end up with an unintended consequence
> because an author understood a strict syntax and not what the parser
> would actually do.  The author must be taught HTML syntax for what it
> is to prevent this.
>
> My interpretation of the mistake is correct: the parser followed the
> parsing rules of the spec.  It's not the parser's fault if the author
> was taught XML-like HTML and not the actual rules of HTML.

I still don't think you understand my point. The point is that there
are two things (now that we can look at it from an XML perspective,
but they were there all along): validity and well-formedness. The
text/html parser can't be the measure of what's right here because it
was created with no knowledge whatsoever that these two things are
distinct. In the parser's view they are collapsed together and
inseparable. The great thing about XML is that it has separated these
two things. That creates opportunities for extensibility. It makes
schema changes easier to cope with (the sorts of schema changes we'd
like to do with HTML5, but can't because of this text/html limitation).
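
As a rough illustration of that separation, here is a sketch only, in
Python. It assumes the standard library's non-validating XML parser for
the well-formedness step and a made-up, drastically simplified content
model for the validity step. The names FORBIDDEN_CHILDREN and check are
hypothetical, and a real validator would use the full DTD or schema
instead:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified content model: element names that a
# given parent may not contain. Only <p> is modelled here.
FORBIDDEN_CHILDREN = {"p": {"ol", "ul", "p", "div", "table"}}


def check(markup):
    """Classify an XML fragment as 'ill-formed', 'invalid', or 'ok'."""
    # Step 1: well-formedness. The XML parser rejects mismatched tags
    # outright instead of guessing what the author meant.
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return "ill-formed"

    # Step 2: validity. Only reached for well-formed input; checks the
    # parsed tree against the (toy) content model.
    for parent in root.iter():
        banned = FORBIDDEN_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag in banned:
                return "invalid"
    return "ok"


print(check("<p><ol></ol></p>"))  # well-formed, but breaks the content model
print(check("<p><ol></p>"))       # mismatched tags, so not well-formed
print(check("<p>text</p>"))       # passes both steps
```

The first fragment gets through step 1 and fails only at step 2, while
the second never reaches step 2. A text/html parser, which implicitly
closes the <p> when it reaches the <ol>, has no comparable way to
report the two cases separately.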

It's an off-topic point, but it's something cool about XML.

Take care,
Rob
Received on Monday, 23 July 2007 18:32:44 UTC