Re: Some random ideas around (broken) XML

Julian Reschke wrote:
> Karl Dubost wrote:
>> ... # PRODUCING BROKEN XML
>> 
>> The fact is that many atom feeds are broken for many reasons.
>> 
>> * edited by hand
>> * created by templating tools which are not XML producers
>> * mixing content from different sources (html, db, xml) with
>>   different encodings
>> 
>> It means when designing an atom feed consumer, implementers are
>> forced to recover the broken content to be able to make it usable
>> by the crowd (social impact). Second part of the postel laws "Be
>> liberal in what you accept". ...
> 
> Are you *really* sure about that? My understanding is that there are
> popular Atom consumers that require proper XML (except for the
> RFC3023 issue), and that falling back to handle broken XML is
> actually not needed (opposed to RSS).

Almost all of them violate the following requirement of the XML
specification (because compatibility demands it):

> It is a fatal error if an XML entity is determined (via default,
> encoding declaration, or higher-level protocol) to be in a certain
> encoding but contains byte sequences that are not legal in that
> encoding.
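
To make the trade-off concrete, here is a minimal Python sketch (not
taken from any real consumer) contrasting the strict behaviour the
quoted rule asks for with the recovery that consumers tend to do
instead. The feed fragment and the function names decode_strict and
decode_lenient are made up for illustration.

```python
# Hypothetical feed fragment: declared as UTF-8, but the "e acute" is a
# lone Latin-1 byte (0xE9), which is not a legal UTF-8 sequence.
FEED = b'<?xml version="1.0" encoding="utf-8"?><title>Caf\xe9</title>'


def decode_strict(data: bytes, encoding: str = "utf-8") -> str:
    """Treat an illegal byte sequence as a fatal error, as XML requires."""
    return data.decode(encoding, errors="strict")


def decode_lenient(data: bytes, encoding: str = "utf-8") -> str:
    """Recover instead: replace each illegal byte with U+FFFD and go on."""
    return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(FEED)
    except UnicodeDecodeError as exc:
        print("strict: fatal error:", exc.reason)
    print("lenient:", decode_lenient(FEED))
```

The lenient path loses the offending character but keeps the rest of
the feed readable, which is roughly the compatibility argument.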

Quite a lot of feed readers use the same processor for both Atom and
RSS, though, and I imagine that many don't want to maintain one
processor for each. So if you really want to be strict for Atom, you
probably have to convince people that it is in their interest to be
strict for RSS too (and for any commercial product, I expect the cost
of poorer compatibility outweighs whatever is gained by being strict).

Probably the only thing really needed for RSS but not for Atom is the
set of predefined entities that were present in RSS 0.91 (Netscape),
which arguably should be solved simply by increasing the number of
predefined entities in XML itself.

Out of incidental interest, I did try shipping a release of SimplePie
(which, counting downstream users, has millions of users) that was
strict with character encodings, but that quite quickly turned out to
be unworkable on the real web. To this day it is strict with entities,
and that causes around one bug report or support issue per month. I
have on plenty of occasions been tempted to prefix all documents with a
DOCTYPE containing the entities present in RSS 0.91 (Netscape), but
have always found some technical reason not to, owing to
implementation complexity.
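
For what it's worth, the DOCTYPE idea can be sketched in a few lines of
Python. This is only an illustration, not SimplePie code. The function
name add_entity_doctype and the three-entry ENTITIES table are
hypothetical, and the table is a tiny hand-picked subset rather than
the actual RSS 0.91 (Netscape) list. It also assumes an
ASCII-compatible encoding with no byte order mark, and it simply gives
up if the document already has a DOCTYPE, which hints at where the
complexity comes from.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative subset only, not the full RSS 0.91 (Netscape) entity list.
ENTITIES = {"nbsp": 160, "copy": 169, "eacute": 233}

# An XML declaration, if present, must be the very first thing in the file.
XML_DECL = re.compile(rb"^<\?xml\s[^>]*\?>")


def add_entity_doctype(data: bytes, root: bytes = b"rss") -> bytes:
    """Insert a DOCTYPE declaring ENTITIES, unless the feed has one already."""
    if b"<!DOCTYPE" in data:
        # Crude check; merging with an existing DOCTYPE is not attempted.
        return data
    decls = b"".join(
        b'<!ENTITY %s "&#%d;">' % (name.encode("ascii"), codepoint)
        for name, codepoint in ENTITIES.items()
    )
    doctype = b"<!DOCTYPE " + root + b" [" + decls + b"]>"
    match = XML_DECL.match(data)
    # The DOCTYPE has to come after the XML declaration, not before it.
    split = match.end() if match else 0
    return data[:split] + doctype + data[split:]


if __name__ == "__main__":
    feed = b'<?xml version="1.0"?><rss><title>Caf&eacute;</title></rss>'
    # Without the DOCTYPE, a strict parser rejects &eacute; as undefined.
    print(ET.fromstring(add_entity_doctype(feed)).findtext("title"))
```

Even this toy version has to special-case the XML declaration and an
existing DOCTYPE; UTF-16 input, byte order marks, and "<!DOCTYPE"
appearing inside comments or CDATA sections are all left unhandled.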

-- 
Geoffrey Sneddon — Opera Software
<http://gsnedders.com/>
<http://www.opera.com/>

Received on Wednesday, 18 November 2009 10:31:06 UTC