Re: Prevalence of ill-formed XHTML (was: Re: let authors choose text/html or application/xhtml+xml) from Robert Burns on 2007-08-31 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Fri, 31 Aug 2007 16:37:05 -0500
To: Philip Taylor <philip@zaynar.demon.co.uk>
Cc: "public-html@w3.org WG" <public-html@w3.org>
Message-Id: <EF6B3ED9-D10B-4C4B-AB13-54B9C06BF15B@robburns.com>

Hi Philip,

On Aug 31, 2007, at 3:06 PM, Philip Taylor wrote:

> Robert Burns wrote:
>> On Aug 31, 2007, at 7:50 AM, Dean Edridge wrote:
>>> How much longer do we need to go on pretending that XHTML can be  
>>> sent as text/html Dan? This is ridiculous. Hasn't the W3C learnt  
>>> it's lesson with XHTML's failure over the last 8 years.
>>>
>>> Exactly who benefits from the myth of XHTML being able to be sent  
>>> as text/html? Not you, me, the W3c or anyone, and certainly not  
>>> XHTML it self.
>> I'm not sure why you call it a myth. I'm sure we can find  
>> countless sites that serve valid XHTML files as text/html. This  
>> discussion keeps popping up, but so far no one has been able to  
>> articulate what the dangers are in doing so.
>
> There's a bigger countless number that serve invalid XHTML files as  
> text/html, and they are often invalid (in part) directly because of  
> confusion between XML syntax and HTML syntax.

There may be many pages like this, I don't know. However, I don't  
think your data bears that out. I also don't think the confusion is  
due to the differences between XML and HTML but rather the laxity of  
HTML parsers that authors use to test their pages.

> I believe that that confusion is a real danger: XHTML-as-text/html  
> has harder syntax than HTML, since you have to understand XML as  
> well as HTML-as-misunderstood-by-browsers,

I don't see how understanding XML as one writes HTML could be a bad
thing. The main difference authors need to understand when writing
XHTML-like HTML is that some elements are void and therefore must
never be closed except with a self-closing tag (e.g., if delivered as
text/html, never use <br></br>). However, the concept of a void element
is a real concept in the text/html serialization too, and including an
explicit notation for it makes the concept easier for authors to
understand. Glossing over the difference leads to confusion. In other
words, it is better to tell authors that there are HTML elements that
can have the closing tag omitted (e.g., <p>) and there are elements
that must have the closing tag omitted (e.g., <br/>). This distinction,
which is specified in the HTML schema, has a counterpart that can be
specified in the appendix C text/html syntax.
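
To make the void/non-void distinction concrete, here is a minimal sketch in Python. The function name serialize_empty is hypothetical, and the element set is my reading of the elements declared EMPTY in the HTML 4.01 DTD; treat both as assumptions rather than anything normative. It emits the appendix C style spelling for an element that has no content:

```python
# Elements declared EMPTY in the HTML 4.01 DTD (assumed list, for illustration).
VOID_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


def serialize_empty(tag: str) -> str:
    """Return appendix C style markup for an element with no content."""
    if tag.lower() in VOID_ELEMENTS:
        # Void element: never an end tag; minimized form with a space
        # before the slash so text/html parsers tolerate it.
        return f"<{tag} />"
    # Non-void element that happens to be empty: explicit end tag,
    # never the minimized form.
    return f"<{tag}></{tag}>"


if __name__ == "__main__":
    for name in ("br", "img", "p", "div"):
        print(name, "->", serialize_empty(name))
```

The point of the sketch is only that the rule is mechanical: one lookup decides which of the two spellings applies.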

> so people get it wrong more often; and encouraging people to use  
> the harder syntax will result in more errors and less happiness  
> among those who follow the advice. (I think this encouragement  
> issue is independent of DanC's proposal to permit the (discouraged)  
> practice, though.)

Since text/html is more strict in the sense that some end tags MUST
be omitted, I don't see how the one is necessarily harder than the
other. Rather, it is not the XML syntax that is more difficult, but
the XML parsers that are less forgiving. Take Atom and RSS as an
example. Those are XML formats that are parsed with forgiving parsers.

> Of the top 200 sites in Alexa's list (which I assume means there is  
> a strong bias towards professionally-designed sites), looking at  
> the front page of each (which I assume is more likely to have been  
> checked with a validator than other less-prominent pages): 67  
> looked like they were using XHTML (they contained "DTD XHTML 1."  
> somewhere); 51 were ill-formed XML (or, specifically, causing parse  
> errors in libxml).
>
> I looked in more detail at the first half,

I wasn't sure which half you were referring to here. I assume you
mean you looked at the 51 that were ill-formed XML. Is that right?

> grouping by the first reported error - see below for the list.  
> Unencoded ampersands in non-<script> contexts are errors in HTML  
> too, but I think most of the other issues are fine in HTML, and it  
> looks like many are caused by attempting to use HTML syntax in the  
> XHTML document.

As you say, this is not trying to use HTML syntax, this is attempting  
to use syntax that is forgiven by the HTML parsers (though not  
necessarily adhering to HTML syntax).
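
A rough way to see the gap between "forgiven by the parser" and "well-formed" is to feed the same markup to a lenient HTML tokenizer and to a strict XML parser. The sketch below uses Python's standard html.parser and xml.etree.ElementTree as stand-ins; they are not the browsers or the libxml used in the survey, and the sample strings and helper names are made up for illustration:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Lenient tokenizer: records start tags and never rejects input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup: str) -> list:
    """Return the start tags a forgiving HTML parser sees."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: the first relies on parser forgiveness,
# the second is written appendix C style.
FORGIVEN = '<div><img src="logo.jpg"><p>AT&T</div>'
APPENDIX_C = '<div><img src="logo.jpg" /><p>AT&amp;T</p></div>'

if __name__ == "__main__":
    for sample in (FORGIVEN, APPENDIX_C):
        print(lenient_tags(sample), is_well_formed(sample))
```

Both samples yield the same tags from the lenient tokenizer, so an author testing only with a forgiving parser gets no signal that the first one is ill-formed.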

> Of the pages without parse errors, most were valid - the only  
> exceptions were http://www.msn.com (duplicate ID value) and sort of  
> http://www.bbc.co.uk (sends invalid HTML4 to the validator, but  
> sends valid XHTML1 to me - is that location-based?).

I'm getting the invalid HTML4 and it just looks like plain-old  
invalid (just FYI).

> (I have no idea how many would actually work as
> application/xhtml+xml, given the differences in the DOM and
> document.write and everything else.)

Yeah, I don't think that's relevant to the discussion anyway. As Dean
Edridge said: "How much longer do we need to go on pretending that
XHTML can be sent as text/html?" and "Exactly who benefits from the
myth of XHTML being able to be sent as text/html?" To me, this is
about whether it is possible or troublesome to send an (appendix C
style) XML-authored document as text/html. Whether the scripts or
CSS selectors make other assumptions about the document is a
separate issue. That again is about transitioning from appendix C
style text/html to delivery as application/xhtml+xml.


> The lack of pages which are well-formed but invalid may suggest  
> that few people are actually interested in well-formedness - some  
> are just interested in validity, and fix their well-formedness  
> errors as an incidental detail. Those people would get the same  
> benefit from using HTML and an HTML validator instead.

I would say that many of these appendix C style pages are valid and
well-formed, which is what I was saying before. The pages from your
sample that are invalid or ill-formed look to be simply pages that
are not validated at all, but happen to have an XHTML doctype
declaration. These errors look to be more about the spill-over of the
debates here, where one author edits the page in an XHTML-like manner
and another author edits the page in a text/html manner. That's a
separate issue from whether it would be a good practice to author
content as (appendix C like) XHTML and serve it as text/html. To me,
the benefit there is authoring content once and not having to
pre-process it to other formats (to HTML 4.01) just to send it to a
client-side processor that can process it exactly the same without
pre-processing (as appendix C style XHTML1).

>
>
> Unencoded ampersands:

As you said, these are errors in HTML. They are also errors that would
not be there with appendix C style authoring.

> Other unencoded characters:
> * http://www.livejournal.com: for (var i = 0; i < site_k.length; i++) {
> * http://www.xunlei.com: for(var i = 0; i < productName.length && random > temp; i++) {
> * http://www.xanga.com: document.write('<scr' + 'ipt src="' + adserver + allAdTags + ad1 + ad2 + ad3 + '?" type="text/javascript">'); document.write('</scr' + 'ipt>');
> * http://www.wretch.cc: document.write("<img style=\"display:none;\" width=1 height=1 src=http://bcw1.mining.vip.tp2.yahoo.com/b?s=2022137079&make=yahoo&type=wretch&t="+random_num+">");

These look to be in scripts and other sections that should have been
marked as CDATA but were not. Again, this is not authoring XHTML and
delivering it as text/html. It's authoring Frankenstein documents and
wondering whether there's a danger in serving those as
application/xhtml+xml. That is certainly not advised. I've seen no one
dispute that.
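
For illustration, here is a small Python sketch of why a bare inline script trips an XML parser and how a CDATA section avoids it. The comment-guarded `//<![CDATA[` wrapper is a common authoring convention that I am assuming here; it is not taken from the surveyed pages, and the script body is a simplified stand-in for the loops quoted above:

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# A bare "<" inside the script is read by an XML parser as the start
# of a tag, so the document is ill-formed.
BARE = (
    '<script type="text/javascript">'
    "for (var i = 0; i < n; i++) {}"
    "</script>"
)

# The same script inside a CDATA section. The leading "//" keeps the
# CDATA markers inside JavaScript comments when read as text/html.
WRAPPED = (
    '<script type="text/javascript">//<![CDATA[\n'
    "for (var i = 0; i < n; i++) {}\n"
    "//]]></script>"
)

if __name__ == "__main__":
    print("bare:", is_well_formed(BARE))
    print("wrapped:", is_well_formed(WRAPPED))
```

Moving the script to an external file sidesteps the question entirely, since nothing inline then needs escaping.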

Unclosed *Elements*
> Unclosed tags:
> * http://www.rapidshare.com: <img src="http://images.rapidshare.com/img/rslogo.jpg">
> * http://www.sina.com.cn: <meta name="stencil" content="PGLS000022">
> * http://www.dailymotion.com/gb: <img src="/images/creative_user_logo.gif">
> * http://www.aol.com: <link rel="alternate" type="application/rss+xml" title="AOL Top Stories" href="http://xml.web.aol.com/aolportal/dynamiclead.xml">
> * http://www.hi5.com: <link href="http://images.hi5.com/images/favicon.ico" type=image/x-icon rel="shortcut icon">
> * http://www.taobao.com: <input name="f" value="D9_5_1" type="hidden">
> * http://www.tom.com: <meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Again, these are not XHTML-authored content delivered as text/html,
which is the topic of the thread.
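
Errors of this kind are easy to detect mechanically. As a sketch only, the following uses Python's standard html.parser, which reports `<img ...>` and `<img ... />` through different callbacks, to list void elements written without the self-closing slash. The class and function names are hypothetical, and the void-element set is my assumption based on the HTML 4.01 DTD:

```python
from html.parser import HTMLParser

# Elements declared EMPTY in the HTML 4.01 DTD (assumed list).
VOID_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


class UnclosedVoidFinder(HTMLParser):
    """Collects void elements that were not written as <tag ... />."""

    def __init__(self):
        super().__init__()
        self.unclosed = []

    def handle_starttag(self, tag, attrs):
        # Called only for the "<tag ...>" form, because
        # handle_startendtag is overridden below.
        if tag in VOID_ELEMENTS:
            self.unclosed.append(tag)

    def handle_startendtag(self, tag, attrs):
        # "<tag ... />" form: already acceptable as XML, nothing to record.
        pass


def find_unclosed_void(markup: str) -> list:
    """Return names of void elements lacking the self-closing slash."""
    finder = UnclosedVoidFinder()
    finder.feed(markup)
    finder.close()
    return finder.unclosed


if __name__ == "__main__":
    sample = (
        '<img src="http://images.rapidshare.com/img/rslogo.jpg">'
        "<br />"
        '<meta name="stencil" content="PGLS000022">'
    )
    print(find_unclosed_void(sample))
```

A check like this only covers one class of error; it says nothing about unencoded ampersands or unquoted attribute values.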

> Other errors:
> * http://www.deviantart.com: <![if ! lt IE 5.5]>
> * http://www.live.com: <html xmlns:web xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" class="liveApp la_en lo_gb">
> * http://www.eastmoney.com: <meta  name=keywords content="..." />

These too are not valid XHTML constructs.

Take care,
Rob
Received on Friday, 31 August 2007 21:37:20 UTC