Re: HTML and XML

To speak as I so rarely do on W3C lists, as a feed parsing library  
vendor:


On 11 Feb 2009, at 11:01, Julian Reschke wrote:

> As far as I can tell, the Atom feed format gets away with draconian  
> error handling (minus the RFC3023 thingy) quite well.

I wouldn't say that is entirely true: besides the RFC3023 issues, Atom
also runs into the character decoding restriction, i.e.,

> It is a fatal error if an XML entity is determined (via default,  
> encoding declaration, or higher-level protocol) to be in a certain  
> encoding but contains byte sequences that are not legal in that  
> encoding.

It is fairly common for feeds to contain invalid byte sequences. Also,  
note that until work on Acid3 got underway, the only major browser to  
implement this was IE. Finally, the only other issue I can think of is  
that something along the lines of
<http://www.w3.org/TR/2008/WD-html5-20080610/parsing.html#character0>
is also needed for compatibility with the real web.
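
To illustrate the fatal-error rule quoted above, here is a minimal
Python sketch of the two behaviours: a strict decode that refuses any
illegal byte sequence, and a lenient fallback that substitutes U+FFFD
and carries on. The function name decode_feed and its return shape are
hypothetical, chosen for illustration; this is not how any particular
feed library does it.

```python
def decode_feed(data: bytes, encoding: str = "utf-8"):
    """Decode feed bytes, returning (text, had_errors).

    Strict decoding mirrors the XML rule that an illegal byte
    sequence is fatal. The fallback is an assumed lenient policy:
    replace each bad sequence with U+FFFD and keep going.
    """
    try:
        return data.decode(encoding, errors="strict"), False
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace"), True


# A Latin-1 encoded "cafe" with e-acute, mislabelled as UTF-8.
text, had_errors = decode_feed(b"caf\xe9", "utf-8")
```

A conforming XML processor would have to stop at the first branch; a
feed reader that wants to show the entry anyway ends up in the second.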

The issues with RSS aren't much greater: mainly just more people  
having HTML entities within the XML output, and a small number of  
cases of bogus content beyond the end of the document.
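
Both of those RSS problems are well-formedness errors as far as an XML
parser is concerned. A small sketch using Python's standard
xml.etree.ElementTree (the helper name is_well_formed is mine, purely
for illustration) shows how a strict parser reacts:

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if a strict XML parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# &amp; is predefined in XML, so this parses.
ok = is_well_formed("<rss><title>a&amp;b</title></rss>")

# &nbsp; is an HTML entity; without a DTD declaring it, XML rejects it.
html_entity = is_well_formed("<rss><title>a&nbsp;b</title></rss>")

# Non-whitespace content after the root element is also an error.
trailing = is_well_formed("<rss></rss>bogus trailing content")
```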

Revising RFC 3023, changing the requirement that any encoding error
is fatal, and defining how to compare encoding names (perhaps even as
leniently as Unicode TR22 defines), along with treating some encodings
as others (such as ISO-8859-1 as Windows-1252), would make it far more
possible for feed readers to comply with the relevant specifications.
Until that happens, I'll be amazed if any major feed readers fully
comply with them.
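
As a rough sketch of what lenient name comparison plus an override
could look like, the Python below normalises an encoding label in the
spirit of the TR22 alias matching rule (drop everything but ASCII
letters and digits, lowercase, drop zeros not preceded by a digit) and
then maps ISO-8859-1 to Windows-1252. The function names and the
one-entry OVERRIDES table are hypothetical; only the ISO-8859-1 case
comes from the text above.

```python
import re

# Hypothetical override table, keyed by the loosely matched label.
OVERRIDES = {"iso88591": "windows-1252"}


def loose_key(label: str) -> str:
    """Reduce an encoding label to a loose comparison key."""
    # Keep only ASCII letters and digits, then lowercase.
    key = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Delete runs of zeros that are not preceded by a digit.
    return re.sub(r"(?<![0-9])0+", "", key)


def resolve_encoding(label: str) -> str:
    """Return the encoding to decode with for a declared label."""
    return OVERRIDES.get(loose_key(label), label)


# Bytes 0x93/0x94 are curly quotes in Windows-1252 but C1 controls
# in ISO-8859-1, which is why the override matters in practice.
decoded = b"\x93hi\x94".decode(resolve_encoding("ISO_8859-1"))
```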


--
Geoffrey Sneddon
<http://gsnedders.com/>
<http://simplepie.org/>

Received on Wednesday, 11 February 2009 15:55:33 UTC