RE: HTML or XHTML - why do you use it?

On Mon, 6 Jan 2003, Peter Foti (PeterF) wrote:
>
> Your argument does not seem to take into consideration the case
> where an XHTML document is meant to be treated as HTML.

Well, more specifically, my argument is that the XHTML specification
was wrong to allow that.


> <Ian>
>  * Current UAs are HTML user agents (at best) and certainly not
>    XHTML user agents (certainly not when sent as text/html), so if
>    you send them XHTML you are sending them content in a language
>    which is not native to them, and relying on their error handling.
> </Ian>
> 
> As the XHTML recommendation stated, XHTML documents are intended to
> operate in HTML 4 conforming agents.

This isn't quite accurate -- XHTML documents (or rather, Appendix C
compliant XHTML 1.0 documents) are intended to operate in HTML Tag
Soup parsers. Strictly speaking, a compliant implementation of HTML
4.01 would be well within its rights to totally reject an XHTML
document, since XHTML documents are not valid HTML 4.01.


> <Ian>
>  * <script> and <style> elements in XHTML may not have their
>    contents commented out, a trick frequently used in HTML documents
>    to hide the contents of such elements from legacy UAs. [1]
>
> [1] Because in XHTML, <script> and <style> elements are #PCDATA
> blocks, not #CDATA blocks, and therefore <!-- and --> really _are_
> comment tags, and are not ignored by the HTML parser.
> </Ian>
> 
> This is interesting, and it leads me to wonder if this is a typo in
> the recommendation.

It's not -- XML has no content model in which comment-like markup is
treated as ordinary character data rather than as a real comment.
Don't forget that in XML, parsers should get the same result whether
or not they read the DTD (with a few exceptions related to attributes
and entities).
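To make this concrete, here is the classic hiding trick (a minimal
sketch; the alert is just placeholder content):

   <script type="text/javascript">
   <!--
   alert("Hello"); // placeholder script body
   // -->
   </script>

Parsed as HTML, the content of <script> is CDATA, so the <!-- and -->
lines are mere character data (and classic JavaScript engines skip
those two lines as comments); parsed as XML, everything from <!-- to
--> is a real comment, and the script body is silently thrown away.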


> <Ian>
>  * XHTML documents that use the "/>" notation, as in "<link />", are
>    not valid HTML documents.
> </Ian>
> 
> I don't really have a good argument for this case, other than HTML
> agents are generally very forgiving regarding invalid documents.

Tag soup user agents are -- the only close-to-compliant HTML UA I am
aware of, namely the W3C HTML validator, correctly treats <link/> as
either a NET SHORTTAG construct or as invalid markup (depending on
the exact SHORTTAG settings in effect when it is tested -- it has
been known to vary).


> As stated in the HTML 4 documentation at:
>
>    http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.1
> 
> If a user agent encounters an attribute it does not recognize, it
> should ignore the entire attribute specification (i.e., the
> attribute and its value).

The slash in the form

   <foo/>

...is not an unrecognised attribute, it is the end of the start tag,
and the ">" is character data. This is known as the Null End Tag (NET)
SHORTTAG feature. See, e.g.:

   http://www.nyct.net/~aray/sgml/short/shorttag.html#NET
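For illustration, here is a sketch of how a strict SGML parser sees
NET markup:

   <p/This is a paragraph./

...is equivalent to:

   <p>This is a paragraph.</p>

...so a truly compliant parser reading "<link/>" sees a NET-enabling
start tag, and the ">" becomes character data -- which, directly
inside the head of a document, is itself invalid.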


> <Ian>
>  * Documents sent as text/html are handled as tag soup [2] by most
>    UAs. Since most authors only check their documents using one or
>    two UAs, rather than using a validator, this means that authors
>    are not checking for validity, and thus most XHTML documents on
>    the web now are invalid. Therefore the main advantage of using
>    XHTML, that it has to be valid, is lost if the document is then
>    sent as text/html.
> </Ian>
> 
> You are presuming that all authors will fail to validate their XHTML
> document.

No, I am presuming that _most_ authors will fail to do so. Given the
state of the Web, I feel this assumption is justified.


> This is an authoring issue and you can't use this as a reason why
> using text/html for XHTML is bad.

It is one of the primary reasons why, IMHO, it is bad.

The process usually goes like this:

  1. An author writes invalid XHTML (or XHTML that makes other
     assumptions that are only valid for tag soup or HTML UAs, and
     not XHTML UAs), and sends it as text/html.

  2. The author finds everything works fine.

  3. Time passes.

  4. The author decides to send the same content as
     application/xhtml+xml, because it is, after all, XHTML.

  5. The author finds the site breaks horribly.

  6. The author blames XHTML.

Steps 1 to 5 have been seen by every single person I have spoken to
who has switched to using the XHTML MIME type. The only reason step 6
didn't happen in those cases is that they were advanced authors who
understood how to fix their content.


As I say in one of the appendices, if the author in question is one
who really does validate their markup, and verifies that they haven't
commented out their script and style, and doesn't rely on
HTML-specific DOM and CSS features, etc., then sure: use XHTML.

This doesn't apply to most authors.


> <Ian>
>  * If you ever switch your XHTML documents from text/html to
>    text/xml, then you will in all likelyhood end up with a
>    considerable number of XML errors, meaning your content won't be
>    readable by users. (Most XHTML documents do not validate.)
> </Ian>
> 
> This is the same argument as the previous, just in different
> clothing. I *do* write valid XHTML documents, and since I am writing
> them to act as HTML, I *don't* want to switch them from text/html to
> text/xml.

Then this document is not for you.


> <Ian>
>  * A CSS stylesheet written for an HTML document has subtly
>    different semantics in an XHTML context (e.g. the <body> element
>    is not magical in XHTML).
> </Ian> 
> 
> I agree... and that's why I want to serve those documents as
> text/html instead of text/xml. As I just wrote, I don't want to
> switch those documents from text/html to text/xml.

So you want HTML syntax and processing rules, and you want UAs to
treat the markup as HTML.

Why not just use HTML?
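To make the <body> example concrete, consider a rule like the
following (a minimal sketch; any background value would do):

   body { background: #eee; } /* example value only */

For an HTML document, CSS treats <body> specially: if the root
element has no background of its own, the body's background is
propagated to cover the entire canvas. In an XML context that
HTML-specific treatment may not apply, so the same rule can end up
painting only the body element's own box.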


> <Ian>
>  * A script written for an HTML document has subtly different
>    semantics in an XHTML context (e.g. element names are uppercase in
>    HTML, lowercase in XHTML).
> </Ian>
> 
> I assume you are referring to the DOM for each of these? Again, this
> is not that big of an issue, especially since I have no intention of
> an HTML to XML conversion anytime soon.

Yes, I was referring to the DOM.

Note that it doesn't matter how soon you intend to move to an XML MIME
type; if you ever intend to, you'll hit the problems.
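A minimal sketch of the difference (the alert is there only to show
the value):

   <script type="text/javascript">
   // An HTML DOM reports element names in uppercase ("HTML"),
   // while an XHTML DOM reports them in lowercase ("html").
   alert(document.documentElement.nodeName);
   </script>

Any script that compares tagName or nodeName against uppercase
strings, as HTML-era scripts typically do, silently stops matching
the moment the document is processed as XHTML.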


> <Ian>
>  * If a user saves an XHTML-as-text/html document to disk and later
>    reopens it locally, triggering the content type sniffing code
>    since filesystems typically do not include file type information,
>    the document could be reopened as XML, potentially resulting in
>    validation errors, parsing differences, or styling differences.
> </Ian>
> 
> It depends on what application the user has associated with the file
> extension, does it not? If the user saves the file with a .htm
> extension, then his/her HTML User Agent will most likely be the one
> to open the file.

Yes, it depends on many things; on some platforms, it depends on the
extension. That's why I said "could".

It has happened to me several times, on both Windows and Unix.


> <Ian>
>  * The only real advantage to using XHTML rather than HTML is that
>    it is then possible to use XML tools with it. However, if tools
>    are being used, then the same tools might as well produce HTML
>    for you. Alternatively, the tools could take SGML as input
>    instead of XML.
> </Ian>
> 
> No, they should not produce HTML (I presume you mean HTML 4 with
> missing end tags, etc.).

Yes, I mean HTML 4.01.


> If they did, then the XML tool would have to guess where elements
> ended if they re-opened the generated HTML file.

So why not use the SGML tools that have existed since before XML was
even an inkling in anyone's eye?


> SGML is too loose...

Note that XML is a simplified version of SGML. What is too loose
about SGML? I grant you that it is more complicated than XML, but you
need only use an already-existing SGML tool.
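For example, James Clark's SP (and its successor OpenSP) has long
shipped a normalizer; a sketch of one invocation, using a
hypothetical file name and assuming the HTML 4.01 DTD is reachable
through your SGML catalog:

   sgmlnorm mydoc.html > normalized.html

The output is the same document with all omitted start and end tags
filled in -- precisely the "where do elements end" information the
XML tool was said to need.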


> Also, this is not the only real advantage.

What other advantages are there?


> <Ian>
>  * HTML 4.01 contains everything that XHTML contains, so there is
>    little reason to use XHTML in the real world. It appears the main
>    reason is simply "jumping on the bandwagon" of using the latest
>    and (perceived) greatest thing.
> </Ian>
> 
> True. However, documents that conform to XHTML may perform better
> than a document that conforms only to HTML 4 because all of the
> closing tags are defined.

This isn't strictly true. HTML is fully defined and not ambiguous;
even with omitted end tags, there is no ambiguity about where
elements end and start, because of the strict parsing rules. In any
case, you are using the same parser, and, as you point out below, you
don't have to omit the tags in HTML anyway.


> The browser doesn't have to do any guess work to try to figure out
> where they go.

Right... instead it has to guess what to do with these unexpected
"/" characters, these "xmlns" and "xml:lang" attributes, etc.


> And you'll probably say that HTML documents can be written with all
> of their closing tags as well, but the documents will validate
> without them, making it more likely that the developer could miss
> some and not realize it.

If the document validates, there is no ambiguity about where the
elements end. It is fully defined.

For example:

   <p>Test<ol><li></ol>

...is _exactly_ equivalent to:

   <p>Test</p><ol><li></li></ol>

...and all UAs support this correctly as far as my testing has shown.


Basically, my argument is that if you know what you're doing, then
sure, go ahead, but that most people don't, and that for them it would
be a lot easier if they used HTML 4.01 now and thus were never tempted
to convert these documents to an XML MIME type.

Incidentally, do you have a URI to your XHTML pages? I would be
interested in seeing whether there were any obvious mistakes I could
point out to demonstrate my point.

Cheers,
-- 
Ian Hickson                                      )\._.,--....,'``.    fL
"meow"                                          /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 6 January 2003 17:48:24 UTC