Re: [automattic] False Marketing from Geoffrey Sneddon on 2008-12-12 (www-archive@w3.org from December 2008)

From: Geoffrey Sneddon <geoffers@gmail.com>
Date: Fri, 12 Dec 2008 19:00:00 +0000
To: Matt Mullenweg <m@mullenweg.com>
Cc: www-archive <www-archive@w3.org>
Message-Id: <6D1CF1F0-76DF-40F5-B982-EDDF6887F082@gmail.com>
On 16 Nov 2008, at 00:20, Matt Mullenweg wrote:

> Geoffrey Sneddon wrote:
>> I can see no way to improve the XML export without breaking  
>> backwards compatibility as WXR is so far from XML. It is also  
>> impossible to change the version in the URI you use as a namespace,  
>> as currently even unknown major versions are attempted to be parsed.
>
> How would making it better-formed break backward compatibility, in  
> practice?

Looking more closely at what's done, it seems anything that I thought  
would be broken is already broken (try exporting custom fields with  
both name and value set as "This < foo & bar", this doesn't round-trip  
and causes the export to not be XML). However, if we removed the CDATA
sections (which are undesirable, as there is no way to have them
contain "]]>" as a literal string, whereas outside of a CDATA section
"]]&gt;" works fine), then existing versions of WP would break on the
input in places. Where CDATA is used, the importer seems to rely on
it for all escaping so that it doesn't have to parse entities, which
is another violation of the XML spec. Just bumping the version doesn't
help (despite the comment saying it's there for when we might break
compat.), as the current importer completely ignores the version
number. Shipping something that starts caring doesn't fix the
backwards compat. issue either, as you still have the millions of
copies of WP that have already been shipped with support for WXR.
Regardless, bumping the version in such a case would seem extremely
kludgy, as the format would be entirely compatible with what it was
before; it would just be working around a bug in the current de-facto
implementation. Either
backwards compatibility has to be broken, or WXR has to admit to not  
being XML.
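
As a rough illustration of the CDATA point above (a sketch in Python
rather than WP's PHP; the `value` element and the helper names are made
up for the example and are not part of WXR), naive CDATA wrapping
breaks as soon as the data contains "]]>", while escaped text survives:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(document):
    """Return True if a conforming XML parser accepts the string."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def as_cdata(value):
    # Naive wrapping: nothing is done about "]]>" inside the value,
    # so the first "]]>" in the data ends the CDATA section early.
    return "<value><![CDATA[" + value + "]]></value>"


def as_escaped_text(value):
    # escape() rewrites "&", "<" and ">", so "]]>" comes out as "]]&gt;".
    return "<value>" + escape(value) + "</value>"


tricky = "x ]]> This < foo & bar"

print(is_well_formed(as_cdata(tricky)))
print(ET.fromstring(as_escaped_text(tricky)).text == tricky)
```

The CDATA form is rejected by the parser, whereas the escaped form
parses and gives back the original string unchanged.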

> We can be liberal in what we accept, conservative in what we output.

If you want to follow Postel's Law, then XML certainly isn't what you  
want. XML is absolutely clear that any errors should be fatal. We  
can't be liberal in what we accept (and that means disallowing some  
edge case backups that are currently supported such as the above  
example).

> Any patches to improve our output are always welcome - I know we're  
> far from perfect in the content we output sometimes but that doesn't  
> mean we shouldn't strive to be.

Is there any interest in moving over to a fully fledged XML serializer  
(which would at the very least mean any XML conformance bug would be  
in one place)? This could be used not only for WXR but also for  
generic RSS/Atom. <http://hsivonen.iki.fi/producing-xml/> covers most  
of the advantages for using a serializer.
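
To make the serializer suggestion concrete, here is a minimal sketch
using Python's standard ElementTree (WP itself would need a PHP
equivalent). The element names `item`, `meta`, `key` and `value` are
placeholders chosen for the example, not the real WXR vocabulary. All
escaping happens inside the serializer, so the custom field case from
earlier round-trips:

```python
import xml.etree.ElementTree as ET


def serialize_custom_field(key, value):
    # Build a tree and let the serializer do all of the escaping,
    # rather than concatenating strings and CDATA sections by hand.
    item = ET.Element("item")
    meta = ET.SubElement(item, "meta")
    ET.SubElement(meta, "key").text = key
    ET.SubElement(meta, "value").text = value
    return ET.tostring(item, encoding="unicode")


def parse_custom_field(document):
    # A conforming parser undoes the escaping; no special cases needed.
    item = ET.fromstring(document)
    return item.findtext("meta/key"), item.findtext("meta/value")


field = "This < foo & bar"
document = serialize_custom_field(field, field)
print(document)
print(parse_custom_field(document) == (field, field))
```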

The importer should be easier: WP already has a built-in RSS parser
(in the form of MagpieRSS, which doesn't comply with XML either (esp.
the Namespaces in XML spec), but is a lot lot lot closer), and I am
still unable to think of any reason why it wasn't used initially
(which would have avoided most of the issues that now arise), as it
would involve a lot less code. See trac ticket #7400 for this.

> BTW Movable Type has a well-working WXR importer (or so they claim)  
> so obviously it isn't impossible to make something else work from it.

I guess provided you ignore edge cases you can pretty much get away
with it from an import POV. From an exporter POV, you would need to
very carefully make sure you stay within the subset of XML-like
byte-streams that the importer supports.

Also, you said back in August 2007
(<http://ma.tt/2007/08/movabletype-4-vs-wordpress-22/#div-comment-424413>):

> However I still do plan to get a spec doc up for it one day. If that  
> were a condition of them [MT] supporting it I’d happily prioritize it.

When I brought this up on wp-hackers in July (2008), I received the  
reply (from Otto):

> If you want it documented, then look at it and write a document for  
> it.

Either there's a disconnect, or there's a change of plan. Is it still  
the case that there is a spec coming? I know the main reason why  
Habari to this day does not have a WXR importer is because it is, as  
it stands, an undefined XML-look-alike byte-stream. On the principle
of trying to implement edge cases first (thereby making all more
normal cases work fine), supporting an export that isn't XML is very
hard. If it
were the case that it was possible to be reasonably certain that the  
output will be XML (which, IMO, it isn't) then I would be willing to  
write a spec sometime (as then it could be done in terms of a DOM, and  
not in terms of a byte-stream, massively simplifying it), though it  
would probably be unlikely to happen until March/April '09 (I would  
guess it would probably be around a day's work).
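
A small sketch of why a tree-level definition is simpler than a
byte-level one (Python's ElementTree again; the namespace URI and the
`slug` element are hypothetical, picked only for the example). Two
exports that differ byte-for-byte, in prefix and in CDATA versus
entity escaping, are the same document once parsed, so a spec would
only have to describe the tree:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, for illustration only.
EXPORT_NS = "http://example.org/export/1.0/"


def read_slugs(document):
    # Elements are matched on namespace URI plus local name, so the
    # prefix an exporter happened to choose is irrelevant.
    root = ET.fromstring(document)
    return [el.text for el in root.iter("{%s}slug" % EXPORT_NS)]


doc_a = (
    '<channel xmlns:wp="http://example.org/export/1.0/">'
    "<item><wp:slug><![CDATA[a & b]]></wp:slug></item>"
    "</channel>"
)
doc_b = (
    '<channel xmlns:x="http://example.org/export/1.0/">'
    "<item><x:slug>a &amp; b</x:slug></item>"
    "</channel>"
)

print(read_slugs(doc_a) == read_slugs(doc_b))
```

A byte-stream importer would need separate handling for each of these
spellings; here both go through the same code path.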

That said, I think having some standardized format would be better,
and something based upon RSS probably isn't a good way to do that.
This is mainly because RSS is uselessly vague, so WXR would almost
certainly have to be defined as an extension of a subset of it (e.g.,
is the title element text or HTML? Different implementations do
different things here, some performing heuristics to try and determine
the answer), but also because Atom has far more of what is needed
already standardized.


--
Geoffrey Sneddon
<http://gsnedders.com/>
Received on Friday, 12 December 2008 19:00:38 UTC