Re: Prevalence of ill-formed XHTML from Robert Burns on 2007-09-02 (public-html@w3.org from September 2007)

From: Robert Burns <rob@robburns.com>
Date: Sun, 2 Sep 2007 10:40:40 -0500
To: Kornel Lesinski <kornel@geekhood.net>
Cc: "public-html@w3.org" <public-html@w3.org>
Message-Id: <1AD7F5F4-6379-4CA4-A314-D9274EFC1D6E@robburns.com>

Hi Kornel,

On Sep 2, 2007, at 9:37 AM, Kornel Lesinski wrote:

>
> On Sat, 01 Sep 2007 21:24:31 +0100, Robert Burns <rob@robburns.com>  
> wrote:
>
>> I'm not sure what you're saying here. If you change your XSLT to a  
>> different output mode won't it output a pure HTML serialization  
>> (with no xml-isms)?
>
> It won't output XHTML as HTML. It's completely counter-intuitive,  
> but that's what the spec requires:
> "The html output method should not output an element differently  
> from the xml output method unless the expanded-name of the element  
> has a null namespace URI;"
> http://www.w3.org/TR/xslt#section-HTML-Output-Method

OK, but doesn't that just mean the XSLT has to be authored to remove
the namespace URI from those XHTML elements before they can be output
as legacy HTML? Some of the HTML5 discussion about bringing namespaces
to HTML would require a change in the XSLT recommendation. We should
be sure to liaise with that WG on that issue.
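
As a rough illustration of that idea (in Python rather than XSLT, so
only an analogue of what a stylesheet would do), the sketch below strips
the XHTML namespace from every element and then serializes with an HTML
output method. It assumes Python's standard xml.etree.ElementTree
module; the helper name strip_xhtml_namespace and the sample document
are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_xhtml_namespace(root):
    """Move every XHTML-namespaced element into the null namespace (in place)."""
    prefix = "{" + XHTML_NS + "}"
    for element in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(element.tag, str) and element.tag.startswith(prefix):
            element.tag = element.tag[len(prefix):]
    return root


# Hypothetical sample document, used only for illustration.
source = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head>"
    "<body><p>a<br/>b</p></body></html>"
)

root = strip_xhtml_namespace(ET.fromstring(source))

# With the namespace gone, the HTML serializer writes <br/> as <br>
# and emits no xmlns attribute.
html_text = ET.tostring(root, encoding="unicode", method="html")
print(html_text)
```

An XSLT stylesheet would do the equivalent by copying each element into
the null namespace before the html output method is applied.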

>
>>> I find it troublesome. The fundamental problem is that you have  
>>> to observe all restrictions of XML, but you can't use XML tools  
>>> anymore, because they don't care about additional limitations  
>>> imposed by HTML.
>>
>> I think observing the XML restrictions is a good thing.
>> I also think the treatment of void elements explicitly with  
>> something like <br/> makes it easier for authors to understand  
>> what they're doing (which is the only additional restriction for  
>> HTML I can think of).
>
> The same syntax can also be a source of confusion in the case of  
> <script src=""/>.

I don't think the syntax is at all the source of the confusion. The
difference (visible in all Appendix C code) between <script
src='...' > </script> and <meta name='...' content='...'  /> is the
very difference authors need to understand. In other words, authors
need to understand the difference between an element that happens to
be empty and one that is defined to be canonically empty. Without the
"/>" syntax, the student of HTML can more easily miss that fact. That
syntax is a very powerful pedagogical tool. I think it's a source of
understanding, not confusion.
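
To make the distinction concrete, here is a minimal sketch of a check
that flags the "/>" form on elements that are not declared EMPTY in
XHTML 1.0, which is the case Appendix C asks authors to avoid. It is a
regex-based approximation, not a real parser; the function name
misused_self_closing is hypothetical, and the element list is assumed to
be the EMPTY elements of the XHTML 1.0 Transitional DTD.

```python
import re

# Elements declared EMPTY in XHTML 1.0 (assumed list, Transitional DTD).
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}

# Matches a tag written in the minimized form, e.g. <br /> or <script src="x"/>.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def misused_self_closing(markup):
    """Return names of elements written as <x/> that are not defined as empty."""
    return [
        match.group(1)
        for match in SELF_CLOSING.finditer(markup)
        if match.group(1) not in EMPTY_ELEMENTS
    ]


sample = (
    '<meta name="a" content="b" />'
    '<script src="x.js"/>'
    '<script src="y.js"> </script>'
    "<br />"
)
print(misused_self_closing(sample))  # prints ['script']
```

Only the minimized script element is reported; the meta and br elements
are canonically empty, and the second script element merely happens to
be empty.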

>> Many of those problems relate to the immaturity of XML / XHTML  
>> implementations and not anything about the DOM APIs themselves.
>
> I disagree. If one does intend to parse a document as XML, sniffing  
> will always be required when text/html is used. Incompatibilities  
> between the HTML and XML DOMs are part of the spec: case sensitivity  
> vs case folding, forbidden document.write or implied <tbody> won't  
> change as implementations mature.

Well, since HTML5 proposes to add document.write to the XML
serialization of HTML, then yes, that's a part of implementation
immaturity. However, leaving that aside, we're talking about XHTML1
documents authored to the Appendix C guidelines. So all of the issues
you raise here do not apply. The remaining issues are entirely about
XHTML processing immaturity (except for the issue of CDATA sections
that Philip raised and Appendix C failed to deal with adequately).
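
The case-folding point can be illustrated with a small sketch that
compares the element names reported by an HTML parser and an XML parser
for the same markup. It assumes Python's standard html.parser and
xml.etree.ElementTree modules as stand-ins for the two processing
modes; the helper names html_tag_names and xml_tag_names are
hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class _TagCollector(HTMLParser):
    """Records start-tag names as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tag_names(markup):
    """Element names as seen by an HTML parser (names are case-folded)."""
    collector = _TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def xml_tag_names(markup):
    """Element names as seen by an XML parser (names are case-sensitive)."""
    return [element.tag for element in ET.fromstring(markup).iter()]


upper = "<DIV><P>text</P></DIV>"
lower = "<div><p>text</p></div>"

print(html_tag_names(upper), xml_tag_names(upper))
print(html_tag_names(lower), xml_tag_names(lower))
```

The two parsers disagree on the upper-case markup but agree on the
lower-case markup, which is the form XHTML 1.0 requires for element
names.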

>> The CSS issues are minor to non-existent for anyone following  
>> appendix C.
>
> Indeed, it's just yet another thing authors have to be aware of,  
> and it fails silently if they don't.

It doesn't fail silently if they author to the XHTML 1.0 Appendix C
guidelines. Authors testing their documents in an XML processor would
see problems immediately, even if they subsequently served those
documents as text/html to finalize and apply scripts and stylesheets.
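
A minimal sketch of that kind of test, assuming Python's standard
xml.etree.ElementTree as the XML processor (the function name
is_well_formed is hypothetical, and this checks well-formedness only,
not validity against a DTD):

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<p>a<br />b</p>"))  # prints True
print(is_well_formed("<p>a<br>b</p>"))    # prints False
```

The second fragment is acceptable to an HTML parser but is rejected at
once by the XML parser, so the error is visible before the document is
ever served as text/html.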

>>> I think that if a document will not work properly as XHTML, and  
>>> was never intended to do, it shouldn't be called XHTML.
>>
>> I'm not clear what you're saying here. Any document that is valid  
>> and well-formed XHTML 1 and also adheres to the XHTML 1.0  
>> appendix C guidelines will work properly as XHTML.
>
> Yes, if such a document adheres to appendix C (and possibly a few  
> other things) it would. The problem is that appendix C is not  
> normative; it doesn't formalize any new language. XHTML, whether  
> it's compatible or not, is allowed to be sent as text/html.

Yes, I would say the problem is that there is no Appendix C DTD.
Whether Appendix C is normative or not, there could still be an
Appendix C DTD for validation. However, despite several long threads
we still haven't identified any incompatibilities or ramifications of
sending valid XHTML 1.0 as text/html. Even with the CSS and
ECMAScript errors in the CDATA end tag and improperly added end tags,
do any browsers actually choke on these errors? If they don't, then I
don't see the problem. Even if they do, I still see no problem
with authors adhering to the set of norms that have spontaneously
developed around XHTML, Appendix C and external scripts and
stylesheets. To me this is authors saying loud and clear they want
this technology to move forward.

> This leads to a ridiculous situation where you can have valid,  
> well-formed, 100% spec-compliant XHTML that's not compatible with  
> XML mode. And this is common on the web today (unless authors fall  
> short of creating valid and/or well-formed XHTML in the first place,  
> of course :)

I have no idea what you're talking about here. Can you give an  
example where this leads to 100% spec compliant XHTML that's not  
compatible with XML processors?

> Therefore my suggestion is not to allow XHTML to be sent as  
> text/html. The migration path should come from the HTML5 side, which  
> allows appendix C-compatible syntax now. "HTML with slashes" better  
> describes what those XHTML-wannabe documents are, and there would  
> be no confusion about which media type applies to which language.

The migration path has already happened for many authors. They're
now waiting for XML processing implementations to mature and become  
widely used by their site visitors. HTML5 can certainly help with  
this. The way it could help authors is in providing a conformance  
checker that does not flag their XML-isms in their HTML. An even more  
important benefit HTML5 could bring is to clean up the mess in DOM  
APIs that are not sufficiently serialization aware and fix that for  
authors. That way the DOM issues will become as insignificant as the  
CSS issues. HTML5 could also solve the CSS issues by making the tbody
and colgroup elements a required part of the table element's content
model (with optional tag omission in the text/html serialization).
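
Until something like that exists, an author can get the same tree in
both serializations by writing tbody explicitly. The sketch below shows
one way a tool might add it to an XML tree, assuming Python's standard
xml.etree.ElementTree and un-namespaced element names; the function
name add_explicit_tbody is hypothetical, and colgroup is left out for
brevity.

```python
import xml.etree.ElementTree as ET


def add_explicit_tbody(table):
    """Wrap the direct tr children of a table element in one tbody (in place)."""
    rows = [child for child in table if child.tag == "tr"]
    if not rows:
        return table  # nothing to wrap, or tbody is already explicit
    position = list(table).index(rows[0])
    tbody = ET.Element("tbody")
    for row in rows:
        table.remove(row)
        tbody.append(row)
    table.insert(position, tbody)
    return table


table = ET.fromstring(
    "<table><caption>c</caption>"
    "<tr><td>1</td></tr><tr><td>2</td></tr></table>"
)
add_explicit_tbody(table)
print([child.tag for child in table])  # prints ['caption', 'tbody']
```

With tbody written out, a selector such as "table > tbody > tr" applies
to the same elements whether the document is parsed as HTML or as XML.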

Take care,
Rob
Received on Sunday, 2 September 2007 15:42:03 UTC