Re: edge issues with DOM, text/html, and xml serializations [was Re: handling fallback content for still images] from Robert Burns on 2007-07-09 (public-html@w3.org from July 2007)

From: Robert Burns <rob@robburns.com>
Date: Mon, 9 Jul 2007 11:50:45 -0500
To: Andrew Sidwell <takkaria@gmail.com>
Cc: James Graham <jg307@cam.ac.uk>, public-html@w3.org
Message-Id: <42ACC6A0-E539-44A0-AA9A-435C0FE0B93E@robburns.com>
On Jul 9, 2007, at 11:12 AM, Andrew Sidwell wrote:

> Robert Burns wrote:
>> On Jul 9, 2007, at 9:34 AM, James Graham wrote:
>>> Robert Burns wrote:
>>>> Despite some confusion on these issues, there isn't a single
>>>> right way to do these things and the sooner we can acknowledge
>>>> that the easier our task will be.
>>>
>>> If you're talking about XML parsing there really is only one way to do
>>> it; the DOM you get is determined by the XML spec. Any browser that
>>> does something different has a bug.
>>
>> I've been working with primarily XML for nearly a year now (CSS and DOM
>> and translation). And I can tell you it's not as unambiguous as you
>> might think. There's definitely ambiguity and there's room to clear up
>> ambiguity. The XML spec is most clear on well-formedness. After that,
>> there's wiggle room.
>
> Instead of just stating "there's wiggle room", please could you give
> examples of where such room exists?  It's very hard to understand any of
> the issues involved based on such vague statements.

Sure, sorry for the ambiguity. I've often written at great length on
a topic only to have my words dismissed with a turn of phrase. I'll
try to provide a couple of examples, off the top of my head, of things
that have been changing and continue to change in XML parsing.
First is the treatment of named character references (or character
entity references in SGML nomenclature). Early XHTML UAs would throw
fatal errors when encountering these, just as they throw fatal
errors for ill-formed elements. I imagine this has been a significant
frustration for authors trying to move seemingly well-formed code
over to XML processing. Over time, Mozilla (and I think WebKit is
moving in this direction too) has added support for them: basically
hard-wiring its knowledge of HTML. XML makes a distinction between
DTD-retrieving UAs and non-DTD-retrieving UAs. Most UAs do not
retrieve a DTD; however, that hasn't stopped them from adding
knowledge from those DTDs to the processing of XHTML.
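
As a rough illustration of the entity case, here is a minimal Python
sketch. It assumes a non-validating, non-DTD-retrieving parser (expat,
via xml.etree.ElementTree) and emulates "hard-wired" knowledge by
declaring the HTML entity names in an internal DTD subset before
parsing. The function name parse_with_hardwired_entities and the
internal-subset trick are assumptions made for the example; this is
not how Gecko or WebKit implement it, and it assumes the input has no
XML declaration or DOCTYPE of its own.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML itself predefines; they must not be naively redeclared.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def parse_with_hardwired_entities(source: str, root_name: str = "html") -> ET.Element:
    """Parse `source` with the HTML entity names declared up front.

    Hypothetical helper: the declarations are supplied in an internal
    subset, so the parser never has to retrieve an external DTD.
    """
    declarations = "".join(
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in name2codepoint.items()
        if name not in XML_PREDEFINED
    )
    return ET.fromstring(f"<!DOCTYPE {root_name} [{declarations}]>{source}")


if __name__ == "__main__":
    fragment = "<p>a&nbsp;b</p>"
    try:
        ET.fromstring(fragment)  # no declaration for nbsp is visible
    except ET.ParseError as error:
        print("fatal error without declarations:", error)
    print(repr(parse_with_hardwired_entities(fragment, "p").text))
```

The first parse fails because the reference is to an entity the parser
has never seen declared; the second succeeds only because the
declarations were supplied by the caller rather than fetched.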

The same situation arises with WebKit's treatment of XHTML and the
inferred tbody element. At some point the WebKit team decided to
infer an actual tbody element and insert it into the DOM based on its
knowledge of the HTML namespace (separate from XML requirements).
These are decisions UA developers have to make all the time.
Sometimes it breaks interoperability. Sometimes it actually fixes
interoperability. However, from our point of view, we should be
willing to consider such measures and not simply dismiss them
out-of-hand, because we're in a unique position to promote such
measures to improve interoperability and help users, authors and UA
developers alike.
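
To make the tbody case concrete, here is a small Python sketch of what
"inferring" the element could look like as a post-parse step on the
tree. It is only an illustration under stated assumptions: the
function name infer_tbody is hypothetical, the tree is an
xml.etree.ElementTree one rather than a browser DOM, and nothing here
is taken from WebKit's actual code.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
TR_TAG = f"{{{XHTML_NS}}}tr"
TBODY_TAG = f"{{{XHTML_NS}}}tbody"


def infer_tbody(table: ET.Element) -> None:
    """Wrap each run of direct <tr> children of `table` in a new <tbody>."""
    current = None  # the tbody collecting the current run of rows, if any
    for child in list(table):  # iterate over a snapshot; the tree is mutated
        if child.tag == TR_TAG:
            if current is None:
                current = ET.Element(TBODY_TAG)
                # Put the new tbody where the first row of the run was.
                table.insert(list(table).index(child), current)
            table.remove(child)
            current.append(child)
        else:
            current = None  # any other child ends the run


if __name__ == "__main__":
    table = ET.fromstring(
        f'<table xmlns="{XHTML_NS}"><caption>c</caption>'
        "<tr><td>1</td></tr><tr><td>2</td></tr></table>"
    )
    infer_tbody(table)
    print([child.tag for child in table])
```

The XML parse itself is untouched; the extra element comes purely from
namespace-specific knowledge applied afterwards, which is the kind of
decision described above.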

Do named character references belong in XHTML (i.e., are they even in
the DTD)? I don't even recall off the top of my head. However, I'm
still running into tools that obliterate my Unicode characters, and
so maybe it's too soon to drop named character references from the
HTML namespace (I know Dan reminded me they are not technically part
of the HTML namespace, but that's how we tend to think of it). Should
WebKit be inserting an inferred tbody element into the DOM? Not per
the current spec, but since we're developing the next spec, it's a
possibility we shouldn't dismiss just because it's not what XHTML1 did.

XML requires fatal errors on well-formedness errors. It does not
require failures on validity errors. Perhaps someone will cite a
passage to prove me wrong, but I don't recall reading anything in XML
that would prohibit a UA with hard-wired knowledge from repairing
invalid content by, for example, adding in a missing tbody element
(presuming that conformance required it).
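
The distinction can be seen with any non-validating parser. The
sketch below uses Python's xml.etree.ElementTree (built on expat,
which does not validate); the helper name is_well_formed and the two
sample strings are made up for the example. The first sample breaks
the XHTML 1.0 content model for ul (which allows only li children) but
is well-formed; the second has a missing end tag.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source: str) -> bool:
    """Return True if a non-validating parse of `source` succeeds."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


# Well-formed, but invalid against the XHTML 1.0 DTDs (p directly inside ul).
INVALID_BUT_WELL_FORMED = "<ul><p>not a list item</p></ul>"

# Ill-formed: the li element is never closed.
ILL_FORMED = "<ul><li>unclosed item</ul>"

if __name__ == "__main__":
    print(is_well_formed(INVALID_BUT_WELL_FORMED))
    print(is_well_formed(ILL_FORMED))
```

Only the second case is required to be fatal; what a UA does with the
first is left open, which is where the room for repair comes from.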

I'm sure if I did a little research I could come up with some other
examples. The important thing to keep in mind is that XML separates
validity from well-formedness. It requires fatal errors on
ill-formedness and not on invalidity. Certainly any DTD that includes
named character references would potentially lead to well-formedness
errors for non-DTD-retrieving UAs. But there's no reason that even
those UAs can't implement those named character entities by
hard-wiring them (like Gecko).

From what I've witnessed over the last year, XML UAs are still
figuring out what XML and XHTML conformance is. We could certainly
weigh in on that, particularly regarding HTML5/XML.

Take care,
Rob
Received on Monday, 9 July 2007 16:50:59 UTC