Re: Why HTML should be taught as HTML without pretending it is XML from Jon Barnett on 2007-07-23 (public-html@w3.org from July 2007)

From: Jon Barnett <jonbarnett@gmail.com>
Date: Mon, 23 Jul 2007 13:02:45 -0500
To: "Robert Burns" <rob@robburns.com>
Cc: public-html <public-html@w3.org>
Message-ID: <bde87dd20707231102r5644a4d5n6c4e51d8d6623e5a@mail.gmail.com>
On 7/23/07, Robert Burns <rob@robburns.com> wrote:
>
> This would not tell us how many people thought they were serving
> XHTML as XML. It would show us how many people were authoring to XHTML
> (possibly appendix C) perhaps for validation and authoring simplicity
> reasons, but vending as text/html. It might also serve as a
> measure of how many pages will face some difficulties when switching
> to XML. There's nothing wrong with producing XHTML code that meets
> the appendix C guidelines and serving as text/html.

At risk of beating a dead horse...

Is your premise that there are authors who:
- Use an XHTML 1.0 DOCTYPE declaration
- Include the html/@xmlns attribute, and maybe even a <?xml?> PI
- Use XML syntax (ending empty elements with />, lowercase tags, etc.)
- Follow the guidelines of XHTML 1.0 Appendix C
- Send this document as text/html
- Are fully aware that this document is not treated as XHTML by any
UA, and is parsed as plain HTML by every UA
- Never intend to switch to serving the document as
application/xhtml+xml, or have the document parsed as XML, without
mowing through a long list of caveats not covered by Appendix C
- ... and do all of these things solely because they prefer the
syntax? (This is the important point, because I believe that's the
crux of your premise)
and then,
- are aware that they would have this same feature-set by using an
HTML 4.01 DOCTYPE and removing some / characters from the document
(they're allowed in HTML 5)

By observing the cowpath, that's not what I see.

For example, I used to (and occasionally still do) write XHTML documents
and serve them as text/html.  But I only did this so that an XML parser
could read them the same way, and because I intended to serve them as
XML in the future.

>
>
> > I'm afraid I can't offer anything
> > other than anecdotes (experience on lots of forums, personal
> > conversations, etc., the fact my college professor was teaching
> > exactly what I learned to a number of other students), but the fact
> > that this page exists says something:
> >
> > http://www.hixie.ch/advocacy/xhtml
>
> I'm not sure it does say anything. There are lots of authors
> authoring pages as XHTML and serving it as text/html (presumably
> following more or less appendix C or they would be failing now).
>
>
> >> Telling authors they've somehow made a mistake because they're beating
> >> down a cowpath that, for some strange reason you think is misguided,
> >> does not make it any less of a cowpath.
> > It's how you interpret the cowpath.  I interpret it to mean that
> > authors misunderstand how XHTML actually works.  I think that teaching
> > HTML as having XHTML-like syntax would lead to shock when the author
> > first tries to do <p><ol></ol></p>
>
> That's invalid HTML4.01 and invalid XHTML1.0. This will be a problem
> when we introduce HTML5 and authors run their documents through an
> HTML5 conformance checker if the conformance checker doesn't throw up
> an error. An XHTML1 validator should throw up an error for that too.
>
> >> No one has ever, as far as I
> >> am aware, ever explained in a logical way, what could possibly be
> >> wrong with authoring content that adheres to XHTML appendix C. It has
> >> simply become a mantra amidst a certain web development clique.
> >> ...
> >> Those are very minor differences that would only be gotchas for those
> >> ignoring Appendix C. Often authors are told to go with external
> >> stylesheets and external scripts (so that takes care of CDATA
> >> sections). Do that; don't count on implicit elements; use Unicode
> >> characters instead of named character entities and stick with DOM1
> >> through DOM3 and you'll be fine (oh and don't count on IE consuming
> >> your content). There's no need to raise the Homeland Security alert
> >> level over XHTML. It's just a few things to understand about it
> >> before vending as XML. However, all that has nothing to do with the
> >> other reason for following an appendix C syntax: for its consistency
> >> and readability.
> >
> > All of those things you just mentioned are caveats when serving XHTML
> > as text/html, and none of them are mentioned in XHTML 1.0 Appendix
> > C.
>
> No, those are not caveats for serving XHTML1.0 as text/html.  Those
> are caveats for those who, though they are successfully serving
> XHTML1 as text/html, want to move to XML instead.
>
> However, from appendix C[1]:
> quote/
> C.4. Embedded Style Sheets and Scripts
> Use external style sheets if your style sheet uses < or & or ]]> or
> --. Use external scripts if your script uses < or & or ]]> or --.
> Note that XML parsers are permitted to silently remove the contents
> of comments. Therefore, the historical practice of "hiding" scripts
> and style sheets within "comments" to make the documents backward
> compatible is likely to not work as expected in XML-based user agents.
>
> /unquote
>
> The other points only relate to vending that XHTML content as XML.
> Authors vending as text/html need not concern themselves with those
> (which is why they're not mentioned in appendix C; perhaps there
> should have been an appendix E: moving your appendix C content to XML).
>
> > To that, I'll add that document.createElement(), one of the most basic
> > DOM methods, creates an element without a namespace.  If this
> > quasi-XHTML eventually gets served as XHTML, even
> > document.createElement would have unintended consequences.
> >
> >> And it's not just a pedagogical issue. XML actually separates two
> >> things that cannot be clearly separated in HTML: well-formedness and
> >> validity. Take Henri's favorite example from HTML5:
> >> <p><ol></ol></p>. In HTML5, this is perfectly valid and well-formed
> >> (presuming it's properly placed in a larger document). It's a part
> >> of a valid DOM tree state. It's a valid XML serialization. However,
> >> it's not possible to express this in HTML4 with MIME type text/html
> >> (I was under the impression that it would be valid in HTML5, but
> >> Henri suggests otherwise).
> >
> > It's mentioned here:
> > http://www.whatwg.org/specs/web-apps/current-work/#element-restrictions
>
> I see. I hadn't yet read that part of the spec. I definitely support
> this forking here, but we need to be extra careful about informing
> readers of our recommendation about that. We might even want to
> include some notation on the semantics chapter to help draw attention
> to these varied content models. I had been wondering how those
> content models were going to work in text/html, but I assumed that
> the testing had already been done.
>
> >> Is it invalid in that the author
> >> put an ordered list in a paragraph where it didn't belong? Or is it
> >> ill-formed where the author included a closing </p> tag where it
> >> didn't belong.
> >
> > The latter.  It's invalid (or malformed) because there's a closing
> > </p> tag where it didn't belong.  The <p> element was implicitly
> > closed when the parser reached the opening <ol> tag.
>
> No, you missed the point. It is definitely not the latter. The
> author's intention was to invalidly place an ordered list into a
> paragraph. The parser mis-guesses that it's instead an ill-formed
> document fragment. That's the point of the example. Most of the HTML
> error recovery behaves this way. The point of the example is we start
> from the author's intention: here to do something invalid. The
> text/html parser cannot tell the difference. So it assumes the wrong
> thing here (wrong as in not what the author intended). (again this is
> off-topic, but that is what XML introduces; a way for the parser to
> always tell the difference, though it has to be well-formed before it
> can move to the next step)

I understood the point.  You end up with an unintended consequence
because an author understood a strict syntax and not what the parser
would actually do.  The author must be taught HTML syntax for what it
is to prevent this.

My interpretation of the mistake is correct: the parser followed the
parsing rules of the spec.  It's not the parser's fault if the author
was taught XML-like HTML and not the actual rules of HTML.
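
To make the <p><ol></ol></p> case concrete, here is a rough Python
sketch of the single rule involved.  It is not the real HTML parsing
algorithm: the CLOSES_P set, the ImpliedEndTagSketch class and the
trace() helper are hypothetical names made up for illustration, and the
set of tags is deliberately incomplete.  It only shows that an open <p>
is closed by the <ol> start tag, which leaves the author's </p> with
nothing to close.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete set of start tags that
# implicitly close an open <p> (an assumption for this sketch).
CLOSES_P = {"ol", "ul", "div", "p", "table", "blockquote"}


class ImpliedEndTagSketch(HTMLParser):
    """Tracks open elements and applies one rule: some start tags close <p>."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.events = []         # readable log of what the sketch decided

    def handle_starttag(self, tag, attrs):
        # If the innermost open element is <p>, a block-level start tag ends it.
        if tag in CLOSES_P and self.open_elements and self.open_elements[-1] == "p":
            self.open_elements.pop()
            self.events.append("implied </p> before <%s>" % tag)
        self.open_elements.append(tag)
        self.events.append("open <%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.open_elements:
            # Pop up to and including the matching open element.
            while self.open_elements.pop() != tag:
                pass
            self.events.append("close </%s>" % tag)
        else:
            # Nothing left to close: the end tag is stray.
            self.events.append("stray </%s> with no open <%s>" % (tag, tag))


def trace(markup):
    """Return the list of events the sketch records for the given markup."""
    parser = ImpliedEndTagSketch()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in trace("<p><ol></ol></p>"):
        print(event)
```

Under these assumptions, the trace reports the implied </p> before the
<ol> is opened and flags the final </p> as stray, which is the "closing
</p> tag where it didn't belong" reading described above.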
Received on Monday, 23 July 2007 18:02:49 UTC