Re: HTML and XML from Geoffrey Sneddon on 2009-02-11 (www-tag@w3.org from February 2009)

From: Geoffrey Sneddon <foolistbar@googlemail.com>
Date: Wed, 11 Feb 2009 16:07:22 +0000
To: elharo@metalab.unc.edu
Cc: Henri Sivonen <hsivonen@iki.fi>, "Henry S.Thompson" <ht@inf.ed.ac.uk>, Anne van Kesteren <annevk@opera.com>, David Orchard <orchard@pacificspirit.com>, www-tag@w3.org
Message-Id: <20B69E7B-07A9-480D-A805-292CA9BD6A40@googlemail.com>

On 11 Feb 2009, at 14:53, Elliotte Harold wrote:

> I do agree that the state of XML serialization is rather pathetic,  
> though. XML is more complex than it appears and the amount of bad  
> XML generating and escaping code out there is a problem. I tend to  
> think the response is better libraries, and perhaps integrating some  
> checks into static analysis tools.

But has this not been the response for the past eleven years? It  
remains true, eleven years (and one day!) after XML 1.0 was published,  
that serializers by and large make it possible to output a byte-stream  
that does not match the XML production. What is to say this will  
improve over the next eleven years?

I know that at least one major issue in PHP (which, to my knowledge,
has no fully working serializer) is with the DOM extension, which
simply implements DOM Level 3 Load and Save, a specification that
actually goes as far as to state:

> For nodes of type Document or Entity, well-formed XML will be  
> created when possible (well-formedness is guaranteed if the document  
> or entity comes from a parse operation and is unchanged since it was  
> created).

When we have W3C-specified serializers that do not guarantee
well-formedness, what hope is there?
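
To make the quoted caveat concrete, here is a minimal sketch in Python. It uses Python's xml.dom.minidom purely as a stand-in for a DOM-style serializer; it is not PHP's DOM extension, and the helper name is_well_formed is made up for illustration. A tree built through the API, rather than produced by a parse, serializes without complaint even though the result cannot be parsed back:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Build a document through the DOM API instead of parsing one.
doc = minidom.Document()
root = doc.createElement("root")
# U+0001 is not allowed by the XML 1.0 Char production.
root.appendChild(doc.createTextNode("bad \x01 char"))
doc.appendChild(root)

# Serialization succeeds; minidom escapes markup characters only.
serialized = doc.toxml()


def is_well_formed(xml_text):
    """Hypothetical helper: True if expat can parse xml_text."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False
```

Checking is_well_formed(serialized) afterwards shows that the serializer's own output is rejected by the parser, which is exactly the gap the "when possible" wording leaves open.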

I would guess that the majority of XML produced dynamically online is
generated by PHP. PHP 5 has no working serializer, let alone the PHP 4
that the majority of PHP software still supports (there, the closest
one gets to XML serializing, short of non-standard,
cannot-be-relied-upon extensions, is string concatenation!), which
leaves XML output on the web in a far from brilliant state.
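
As a rough illustration of why serializing by string concatenation goes wrong, here is a small Python sketch (Python rather than PHP, and the function names naive_item, escaped_item and well_formed are invented for this example). The first function pastes text straight into markup; the second escapes it with the standard library's xml.sax.saxutils.escape:

```python
from xml.parsers.expat import ParserCreate, ExpatError
from xml.sax.saxutils import escape


def naive_item(title):
    """Serialize by plain concatenation, with no escaping at all."""
    return "<item>" + title + "</item>"


def escaped_item(title):
    """Same, but escape &, < and > in the text first."""
    return "<item>" + escape(title) + "</item>"


def well_formed(xml_text):
    """Hypothetical helper: True if expat accepts xml_text."""
    try:
        ParserCreate().Parse(xml_text, True)
        return True
    except ExpatError:
        return False
```

With a title such as "Fish & Chips", naive_item yields output that expat rejects, while escaped_item yields output it accepts. Escaping markup characters is only part of the job, though: escape does nothing about characters that XML forbids outright.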

In PHP's case at least, there is no native Unicode support, so
implementing many of the character restrictions would be a fair
amount of work (even if only UTF-8 were supported, the whole overhead
of handling that would still be needed) and would carry a fair
computational overhead. With PHP 6 (which will add native Unicode
support) still a fair way off and likely to have fairly slow uptake,
and with the majority of PHP software supporting six-year-old versions
of the interpreter, there is little likelihood of this changing any
time soon. Maybe it'll be possible in eleven years' time…
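
For a sense of what checking the character restrictions involves, here is a minimal Python sketch of the XML 1.0 Char production applied to UTF-8 input. The function names are invented for illustration, and Python's built-in UTF-8 decoder does the work that would have to be written by hand in a language without native Unicode support:

```python
def is_xml_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_invalid_char(data):
    """Return the code point index of the first disallowed character
    in UTF-8 encoded bytes, or None if every character is allowed.
    Malformed UTF-8 raises UnicodeDecodeError."""
    text = data.decode("utf-8")
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return index
    return None
```

Every character of the output has to pass through a check like this, which is where the computational overhead comes from.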

--
Geoffrey Sneddon
<http://gsnedders.com/>

Received on Wednesday, 11 February 2009 16:08:09 UTC