Re: Prevalence of ill-formed XHTML from Robert Burns on 2007-09-01 (public-html@w3.org from September 2007)

From: Robert Burns <rob@robburns.com>
Date: Sat, 1 Sep 2007 00:52:13 -0500
To: Philip Taylor <philip@zaynar.demon.co.uk>
Cc: "public-html@w3.org WG" <public-html@w3.org>
Message-Id: <3ED738C9-C1DB-4D95-8179-6C5F8F2B264E@robburns.com>
Hi Philip,

On Aug 31, 2007, at 6:46 PM, Philip Taylor wrote:

>
> Robert Burns wrote:
>> On Aug 31, 2007, at 3:06 PM, Philip Taylor wrote:
>>> Robert Burns wrote:
>>>> I'm sure we can find countless sites that serve valid XHTML  
>>>> files as text/html. This discussion keeps popping up, but so far  
>>>> no one has been able to articulate what the dangers are in doing  
>>>> so.
>>>
>>> There's a bigger countless number that serve invalid XHTML files  
>>> as text/html, and they are often invalid (in part) directly  
>>> because of confusion between XML syntax and HTML syntax.
>> There may be many pages like this, I don't know. However, I don't  
>> think your data bears that out. I also don't think the confusion  
>> is due to the differences between XML and HTML but rather the  
>> laxity of HTML parsers that authors use to test their pages.
>
> I think the missing slashes in <img src="..."> indicate some level  
> of XML/HTML confusion - if the only factor was the laxity of HTML  
> parsers, people could be writing <img src="..."\> or
> <img src=... /> or <image src="..."/> since those are similarly wrong and are
> handled as the author expects by HTML parsers. But those are very  
> rare, whereas the preferred HTML syntax (<img src="...">) is quite  
> common. That suggests that people are erroneously using HTML syntax  
> in particular, rather than erroneously using any other syntax which  
> works in their browser.

I agree with your assessment there. However, that suggests that the  
errors come from authors mixing HTML and XHTML syntax. My view of  
where this pattern originates is that one author starts a page in  
XHTML1 and then another author edits the page with no awareness of  
XHTML1 and simply adds content in the HTML4 syntax they're familiar  
with (or copies and pastes from other pages, etc.). Those practices  
are entirely different from the practice of each author (or authoring
tool) creating content to XHTML 1.0 conformance and then serving the  
page as text/html.

Instead of the errors you found, I would expect to see errors such as  
<br></br>, <img src='...' ></img> and <script />. Those are the sorts  
of mistakes I would expect to see from an author who adhered to XHTML  
1 syntax but then deployed the page with a text/html Content-Type.
Perhaps you did not look for those types of errors in that data, but
what I'm suggesting is that we would need to find pages that are valid
and well-formed XHTML, are served as text/html, and turn out to be
invalid HTML (other than the DocType declaration and the self-closing
tags used as appendix C prescribes; i.e., <br/>, <hr/>, <img/>,
<meta/>, <link/>, <area/>, <col/>, <input/> and <base/>).
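
As a rough illustration of the two error classes named above (an end tag on a void element, and self-closing syntax on a non-void element), here is a minimal Python sketch. It is not taken from Philip's survey; the function name text_html_hazards and the regular expression are illustrative assumptions. It treats only the nine elements listed above as void and does not skip comments or script bodies, so it is a heuristic and not a validator.

```python
import re

# Void elements exactly as listed in the message; HTML 4.01 defines a few
# more (param, for example) that this sketch leaves out.
VOID_ELEMENTS = {"br", "hr", "img", "meta", "link", "area", "col", "input", "base"}

# Matches one start or end tag: optional leading "/", element name,
# attributes (quoted values are skipped whole), optional trailing "/".
TAG = re.compile(
    r"<(/?)([A-Za-z][A-Za-z0-9]*)"
    r"((?:[^<>\"']|\"[^\"]*\"|'[^']*')*?)"
    r"(/?)>"
)

def text_html_hazards(source):
    """Return (tag_text, reason) pairs for tags an HTML parser may misread."""
    hazards = []
    for match in TAG.finditer(source):
        is_end, name, _attrs, self_closed = match.groups()
        name = name.lower()
        if is_end and name in VOID_ELEMENTS:
            # e.g. <br></br> or <img ...></img>
            hazards.append((match.group(0), "end tag on a void element"))
        elif self_closed and name not in VOID_ELEMENTS:
            # e.g. <script /> or <div/>
            hazards.append((match.group(0), "self-closing syntax on a non-void element"))
    return hazards
```

Calling text_html_hazards on a fragment such as `<br></br><script /><br/>` would flag `</br>` and `<script />` and leave `<br/>` alone.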

>
>>> <...>
>>>
>>> I looked in more detail at the first half,
>> I wasn't sure what half you were referring to here. I assume you   
>> mean you looked at the 51 that were ill-formed XML. Is that right?
>
> Sorry, that was far too vague - I meant I looked at the XHTML pages  
> from the first half of the list of 200 pages. (Of those 100 pages,  
> that was 32 XHTML pages, of which 23 had parse errors.)

Well, that is a considerable number of invalid XHTML pages then.
I agree that authors should not send pages that are invalid
and ill-formed. That is indeed harmful (or should be considered
harmful :-) )

>> This to me is about whether it is possible or troublesome to send  
>> an (appendix C style)  XML authored document as text/html.
>
> Appendix C is a bit troublesome to follow since it misses lots of  
> cases where conforming XHTML breaks in normal HTML UAs. It's  
> clearly possible to send some XHTML documents to HTML UAs and have  
> them function properly, and it's possible for some to be conforming  
> HTML5 too; but I don't know that it's feasible to define exactly  
> which documents are safe and to cover all the relevant cases,  
> except by saying "HTML 5 documents sent as text/html must be  
> conforming HTML5 [i.e. the HTML serialisation] [regardless of  
> whether they're conforming XHTML5 too]" (in which case it doesn't  
> matter whether they're conceptually HTML5 or XHTML5 documents).

I agree with this (and with what the draft currently says). I don't
think that it's too strong to say that authors MUST deliver a document
that conforms to the text/html serialization when sending as text/html
and MUST deliver a document that conforms to XML (and XHTML) when
sending a document with an application/xhtml+xml Content-Type. I see
nothing wrong with a MUST there. I would go further and say we should
have a MUST there.

My point here was to challenge Dean Eldridge's contention that there  
is some sort of myth that a particular form of XHTML can be delivered  
as text/html without problems. That is the contention I take issue  
with. It should not be called a myth because it is true that authors  
can author content that is conforming to both XHTML 1.0 and HTML 4.01  
(or at least conforming in every way that matters). I certainly agree  
that XHTML and text/html should be valid and well-formed (that  
authors should strive to achieve that). Sending invalid XHTML to  
legacy html parsers is a bad idea. I agree 100% with that. However,  
sending valid XHTML 1.0 that also adheres to appendix C (I really  
think W3C should have an appendix C validator) is not troublesome.  
Authors working only in XHTML should not worry about the claims to  
the contrary. Yes, it's true that migrating content back and forth has
some issues that content creators need to deal with. But that is a  
separate issue from authoring HTML content as valid and well-formed  
XHTML. I've asked this before, but let me ask again. What problems  
would an author face with actual browsers if they authored valid and  
well-formed XHTML 1.0 that also adhered to the appendix C guidelines  
and then delivered that content as text/html? I cannot think of any  
and I've yet to hear any issues presented (Note that adhering to  
appendix C means there's no CData sections and <script> is always  
closed with </script>).
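
To make the two preconditions in that question concrete (the document is well-formed XML and contains no CDATA sections), here is a minimal Python sketch of a pre-check. The function name appendix_c_precheck is a hypothetical, and this covers only a small part of what a real appendix C validator would need to check. It does not validate against a DTD, and because the standard-library parser does not load the XHTML DTD, a document that uses named entities such as &nbsp; would be reported as not well-formed.

```python
import xml.etree.ElementTree as ET

def appendix_c_precheck(source):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    try:
        # Well-formedness only; no DTD validation is attempted.
        ET.fromstring(source)
    except ET.ParseError as err:
        problems.append(f"not well-formed XML: {err}")
    # Adhering to appendix C, as described above, means no CDATA sections.
    if "<![CDATA[" in source:
        problems.append("contains a CDATA section")
    return problems
```

An empty result only means this sketch found nothing; it says nothing about validity against the XHTML 1.0 DTD or about the other appendix C guidelines.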

To me, the confusion that you're talking about is more related to the  
back and forth pronouncements that "you should use XHTML" followed by
"using XHTML is harmful." Yes, that creates confusion. However, using
XHTML syntax is not harmful if one follows those XHTML 1.0 appendix C
guidelines. It requires authors to be aware of what is a void element
and what is not, but that's information authors should be familiar
with anyway. If appendix C forces this sort of familiarity, then that
is a good thing in my view.

Note that any authors who decided to go with authoring content in
XHTML 1.0, using an XHTML 1.0 DTD for validation, would not face any
of the issues raised by those who say using XHTML should be
considered harmful. The same DOM scripts those authors used on their
HTML 4.01 documents will continue to work on their XHTML 1.0
(appendix C) documents. The same CSS used on the HTML 4.01 documents
will still work on the XHTML 1.0 (appendix C) documents. The problems
commonly cited are not about using XHTML syntax for documents
delivered as text/html. The problems commonly cited are about moving
content from text/html to application/xhtml+xml. Authors will face
those problems whether they originally authored content as HTML 4.01
or XHTML 1.0, except that the ones who authored content as XHTML 1.0
will not need to convert their HTML files along with their scripts
and style sheet files.

Take care,
Rob
Received on Saturday, 1 September 2007 05:52:31 UTC