Re: HTML or XHTML - why do you use it? from Tantek Çelik on 2003-01-07 (www-html@w3.org from January 2003)

From: Tantek Çelik <tantek@cs.stanford.edu>
Date: Mon, 06 Jan 2003 16:37:25 -0800
To: Ian Hickson <ian@hixie.ch>, "Peter Foti (PeterF)" <PeterF@systolicnetworks.com>
CC: "'Nick Boalch'" <nick@fof.durge.org>, "'www-html@w3.org'" <www-html@w3.org>
Message-ID: <BA3F60C5.1EBD0%tantek@cs.stanford.edu>

On 1/6/03 2:48 PM, "Ian Hickson" <ian@hixie.ch> wrote:

> 
> On Mon, 6 Jan 2003, Peter Foti (PeterF) wrote:
>> 
>> Your argument does not seem to take into consideration the case
>> where an XHTML document is meant to be treated as HTML.
> 
> Well, more specifically, my argument is that the XHTML specification
> was wrong to allow that.

It might be good to send that feedback to the feedback email address
noted in the specification, so that the working group can address it as a
potential erratum or as a change for the next version.  Apologies if you
have already done that.


>> <Ian>
>>  * Current UAs are HTML user agents (at best) and certainly not
>>    XHTML user agents (certainly not when sent as text/html), so if
>>    you send them XHTML you are sending them content in a language
>>    which is not native to them, and relying on their error handling.
>> </Ian>
>> 
>> As the XHTML recommendation stated, XHTML documents are intended to
>> operate in HTML 4 conforming agents.
> 
> This isn't quite accurate -- XHTML documents (or rather, Appendix C
> compliant XHTML 1.0 documents) are intended to operate in HTML Tag
> Soup parsers. Strictly speaking, a compliant implementation of HTML
> 4.01 would be well within its rights to totally reject an XHTML
> document, since XHTML documents are not valid HTML 4.01.

Ian, I have heard this assertion before, and while I would lean towards
believing you (since I presume you would make a thorough analysis before
making such a claim), it would help significantly if you could provide
references to ALL (that you know of at least) of the precise HTML 4.01 UA
compliance requirements which would require a compliant HTML4.01 UA to
reject a valid XHTML 1.0 document that uses the Appendix C guidelines.

Please cc such references to the abovementioned HTML 4.01 feedback email
address.

IMHO the HTML WG should look at errata'ing any such HTML 4.01 UA compliance
requirements in order that a compliant HTML 4.01 UA can accept valid XHTML
1.0 documents authored with the Appendix C guidelines.


>> <Ian>
>>  * <script> and <style> elements in XHTML may not have their
>>    contents commented out, a trick frequently used in HTML documents
>>    to hide the contents of such elements from legacy UAs. [1]
>> 
>> [1] Because in XHTML, <script> and <style> elements are #PCDATA
>> blocks, not #CDATA blocks, and therefore <!-- and --> really _are_
>> comments tags, and are not ignored by the HTML parser.
>> </Ian>
>> 
>> This is interesting, and it leads me to wonder if this is a typo in
>> the recommendation.
> 
> It's not -- XML doesn't have any content model which allows comment-
> like markup to be ignored. Don't forget in XML parsers should get the
> same result whether or not they parse the DTD (with a few exceptions
> related to attributes and entities).

Agreed.
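
To make the #PCDATA versus #CDATA point concrete, here is a minimal
sketch in Python. It assumes the standard library's html.parser is an
acceptable stand-in for a UA that treats <style> content as raw text (it
is not a compliant HTML 4.01 UA), and that xml.etree.ElementTree stands in
for an XML parser. The names StyleText, html_style_text and xml_style_text
are made up for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# A style element whose rules are "hidden" inside a comment.
STYLE = "<style><!-- p { color: red } --></style>"


class StyleText(HTMLParser):
    """Collects the character data reported inside <style>."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.chunks.append(data)


def html_style_text(markup):
    """Text an HTML-style parser hands over for the style element."""
    parser = StyleText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def xml_style_text(markup):
    """Text an XML parser hands over for the same element."""
    # The default ElementTree parser discards comments, so the rules vanish.
    return ET.fromstring(markup).text or ""


print(repr(html_style_text(STYLE)))
print(repr(xml_style_text(STYLE)))
```

Under these assumptions the HTML-style parser still reports the rule text,
comment delimiters included, while the XML parser reports nothing at all
for the style element.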


>> <Ian>
>>  * XHTML documents that use the "/>" notation, as in "<link />", are
>>    not valid HTML documents.
>> </Ian>
>> 
>> I don't really have a good argument for this case, other than HTML
>> agents are generally very forgiving regarding valid documents.
> 
> Tag soup user agents are -- the only close-to compliant HTML UA I am
> aware of, namely the W3C HTML validator, correctly treats <link/> as
> either a NET SHORTTAG or as invalid markup (depending on the exact
> SHORTTAG settings that are in effect when it is tested -- it has been
> known to vary.)
> 
> 
>> As stated in the HTML 4 documentation at:
>> 
>>    http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#h-B.1
>> 
>> If a user agent encounters an attribute it does not recognize, it
>> should ignore the entire attribute specification (i.e., the
>> attribute and its value).
> 
> The slash in the form
> 
>  <foo/>
> 
> ...is not an unrecognised attribute, it is the end of the start tag,
> and the ">" is character data. This is known as the Null End Tag (NET)
> SHORTTAG feature. See, e.g.:
> 
>  http://www.nyct.net/~aray/sgml/short/shorttag.html#NET

Ok, so that's one.  Apparently, (use of) the SHORTTAG feature in HTML4.01
prevents a compliant HTML4.01 UA from accepting valid but compatible XHTML
1.0 documents.
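
As a toy illustration only of the reading Ian describes (the "/" ends the
start tag and whatever follows it is content), here is a sketch in Python.
It is not an SGML parser: it handles nothing but a bare element name
followed by "/", and the names NET_START and read_net_start_tag are
invented for this example.

```python
import re

# Toy pattern: "<", an element name, then "/" acting as the end of the
# start tag under the NET reading. Attributes are deliberately not handled.
NET_START = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/")


def read_net_start_tag(markup):
    """Return (element name, text following the start tag), or None."""
    match = NET_START.match(markup)
    if match is None:
        return None
    return match.group(1), markup[match.end():]


print(read_net_start_tag("<link/>"))
```

With this reading the start tag for "link" is complete after the slash, and
the ">" is left over as what follows the start tag, which is the character
data Ian mentions.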


>> <Ian>
>>  * Documents sent as text/html are handled as tag soup [2] by most
>>    UAs. Since most authors only check their documents using one or
>>    two UAs, rather than using a validator, this means that authors
>>    are not checking for validity, and thus most XHTML documents on
>>    the web now are invalid. Therefore the main advantage of using
>>    XHTML, that it has to be valid, is lost if the document is then
>>    sent as text/html.
>> </Ian>
>> 
>> 
>> You are presuming that all authors will fail to validate their XHTML
>> document.
> 
> No, I am presuming that _most_ authors will fail to do so. Given the
> state of the Web, I feel this assumption is justified.

I don't doubt your assumption, just your conclusion.  The advantage of being
able to more strictly validate a document is still there.


>> This is an authoring issue and you can't use this as a reason why
>> using text/html for XHTML is bad.
> 
> It is one of the primary reasons why, IMHO, it is bad.
> 
> The process usually goes like this:
> 
> 1. Authors write invalid XHTML (or XHTML that makes other
>    assumptions that are only valid for tag soup or HTML UAs, and not
>    XHTML UAs), and send it as text/html.
> 
> 2. Authors find everything works fine.
> 
> 3. Time passes.
> 
> 4. Author decides to send the same content as application/xhtml+xml,
>    because it is, after all, XHTML.
> 
> 5. Author finds site breaks horribly.
> 
> 6. Author blames XHTML.
> 
> Steps 1 to 5 have been seen by every single person I have spoken to
> who has switched to using the XHTML MIME type. The only reason step 6
> didn't happen in those cases is that they were advanced authors who
> understood how to fix their content.

And there is still hope that these trailblazers will help with fixing the
source of the problem before too much of step 3 happens.  I feel it is
premature to give up because of this.
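
One cheap way to catch step 5 before it happens is to run the content
through an XML parser before changing the MIME type. The following is a
minimal sketch in Python, assuming xml.etree.ElementTree as the XML
parser; is_well_formed and the two sample documents are made up for the
illustration. Note that it checks well-formedness only, not validity
against the XHTML 1.0 DTD, and that it would also reject documents using
named entities such as &nbsp;, since no DTD is loaded.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Works in tag soup UAs, but <br> and the <p> elements are never closed.
TAG_SOUP = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head>"
    "<body><p>One<br><p>Two</body></html>"
)

# The same content with every element explicitly closed.
CLEAN = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head>"
    "<body><p>One<br /></p><p>Two</p></body></html>"
)

print(is_well_formed(TAG_SOUP))
print(is_well_formed(CLEAN))
```

A document that fails this check would be expected to break when served
with an XML MIME type, whatever it looks like today as text/html.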


> As I say in one of the appendices, if the author in question is one
> who really does validate their markup, and verifies that they haven't
> commented out their script and style, and doesn't rely on HTML-
> specific DOM and CSS features, etc, then sure. Use XHTML.
> 
> This doesn't apply to most authors.

Agreed.


>> <Ian>
>>  * If you ever switch your XHTML documents from text/html to
>>    text/xml, then you will in all likelihood end up with a
>>    considerable number of XML errors, meaning your content won't be
>>    readable by users. (Most XHTML documents do not validate.)
>> </Ian>
>> 
>> This is the same argument as the previous, just in different
>> clothing. I *do* write valid XHTML documents, and since I am writing
>> them to act as HTML, I *don't* want to switch them from text/html to
>> text/xml.
> 
> Then this document is not for you.
> 
> 
>> <Ian>
>>  * A CSS stylesheet written for an HTML document has subtly
>>    different semantics in an XHTML context (e.g. the <body> element
>>    is not magical in XHTML).
>> </Ian> 
>> 
>> I agree... and that's why I want to serve those documents as
>> text/html instead of text/xml. As I just wrote, I don't want to
>> switch those documents from text/html to text/xml.
> 
> So you want HTML syntax and processing rules, and you want UAs to
> treat the markup as HTML.

I think the key is that there is a desire to let HTML UAs that don't
support XHTML treat the markup as HTML.  That is different from asking
all UAs to treat the markup as HTML.

> Why not just use HTML?

A good question.  For some folk XHTML provides advantages that they feel are
worth the additional costs of using it (if any).


>> <Ian>
>>  * A script written for an HTML document has subtly different
>>    semantics in an XHTML context (e.g. element names are uppercase in
>>    HTML, lowercase in XHTML).
>> </Ian>
>> 
>> I assume you are referring to the DOM for each of these? Again, this
>> is not that big of an issue, especially since I have no intention of
>> an HTML to XML conversion anytime soon.
> 
> Yes, I was referring to the DOM.
> 
> Note that it doesn't matter how soon you intend to move to an XML MIME
> type; if you ever intend to, you'll hit the problems.

True enough.  However, I believe the author can just use lower case
element/attribute names (even in the HTML documents and related scripts),
and have it just work.
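
To show the XML half of this, here is a small sketch in Python using
xml.dom.minidom as a stand-in for an XML DOM. It does not model how an
HTML DOM in a browser behaves; count_elements and the sample document are
made up for the illustration.

```python
from xml.dom import minidom

DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Hi</p></body></html>"
)


def count_elements(markup, name):
    """Number of elements whose tag name matches `name` exactly."""
    dom = minidom.parseString(markup)
    return len(dom.getElementsByTagName(name))


print(count_elements(DOC, "p"))  # lower case, as written in the document
print(count_elements(DOC, "P"))  # upper case, as older HTML scripts assume
```

In the XML DOM the lookup is case sensitive, so only the lower case name
finds the paragraph. A script that sticks to lower case names avoids this
particular difference.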


>> <Ian>
>>  * If a user saves an XHTML-as-text/html document to disk and later
>>    reopens it locally, triggering the content type sniffing code
>>    since filesystems typically do not include file type information,
>>    the document could be reopened as XML, potentially resulting in
>>    validation errors, parsing differences, or styling differences.
>> </Ian>
>> 
>> It depends on what application the user has associated with the file
>> extension, does it not? If the user saves the file with a .htm
>> extension, then his/her HTML User Agent will most likely be the one
>> to open the file.
> 
> Yes, it depends on many things; on some platforms, it depends on the
> extension. That's why I said "could".
> 
> It has happened to me several times, on both Windows and Unix.

Yes it depends on many things, which differ from platform to platform.


>> <Ian>
>>  * The only real advantage to using XHTML rather than HTML is that
>>    it is then possible to use XML tools with it. However, if tools
>>    are being used, then the same tools might as well produce HTML
>>    for you. Alternatively, the tools could take SGML as input
>>    instead of XML.
>> </Ian>
>> 
>> No, they should not produce HTML (I presume you mean HTML 4 with
>> missing end tags, etc.).
> 
> Yes, I mean HTML 4.01.
> 
> 
>> If they did, then the XML tool would have to guess where elements
>> ended if they re-opened the generated HTML file.
> 
> So why not use the SGML tools that have existed since before XML was
> even an inkling in anyone's eye?

Because they are not the same as the XML tools that exist now?


>> SGML is too loose...
> 
> Note that XML is a simplified version of SGML. What is too loose about
> it? I grant you it is more complicated than XML, but you need only use
> an already existing SGML tool.
> 
> 
>> Also, this is not the only real advantage.
> 
> What other advantages are there?
> 
> 
>> <Ian>
>>  * HTML 4.01 contains everything that XHTML contains, so there is
>>    little reason to use XHTML in the real world. It appears the main
>>    reason is simply "jumping on the bandwagon" of using the latest
>>    and (perceived) greatest thing.
>> </Ian>
>> 
>> True. However, documents that conform to XHTML may perform better
>> than a document that conforms only to HTML 4 because all of the
>> closing tags are defined.
> 
> This isn't strictly true. HTML is fully defined and not ambiguous.
> Even with omitted end tags, there is no ambiguity about where tags
> end and tags start, because of the strict parsing rules. In any case,
> you are using the same parser, and, as you point out below, you don't
> have to omit the tags in HTML anyway.
> 
> 
>> The browser doesn't have to do any guess work to try to figure out
>> where they go.
> 
> Right... instead it has to guess what to do with these unexpected "/"
> characters, these "xmlns" and xml:lang attributes, etc.

No kidding.


>> And you'll probably say that HTML documents can be written with all
>> of their closing tags as well, but the documents will validate
>> without them, making it more likely that the developer could miss
>> some and not realize it.
> 
> If the document validates, there is no ambiguity about where the
> elements end. It is fully defined.
> 
> For example:
> 
>  <p>Test<ol><li></ol>
> 
> ...is _exactly_ equivalent to:
> 
>  <p>Test</p><ol><li></li></ol>
> 
> ...and all UAs support this correctly as far as my testing has shown.
> 
> 
> Basically, my argument is that if you know what you're doing, then
> sure, go ahead, but that most people don't, and that for them it would
> be a lot easier if they used HTML 4.01 now and thus were never tempted
> to convert these documents to an XML MIME type.
> 
> Incidentally, do you have a URI to your XHTML pages? I would be
> interested in seeing whether there were any obvious mistakes I could
> point out to demonstrate my point.

Admittedly they are only test documents, but the pages linked from here:

 http://tantek.com/XHTML/Test/minimal.html

e.g.

 http://tantek.com/XHTML/Test/xhtmlwithoutprolog.html

should be valid XHTML 1.0 documents.

Thanks,

Tantek
Received on Monday, 6 January 2003 19:21:44 UTC