- From: Geoffrey Sneddon <geoffers@gmail.com>
- Date: Fri, 12 Dec 2008 19:00:00 +0000
- To: Matt Mullenweg <m@mullenweg.com>
- Cc: www-archive <www-archive@w3.org>
On 16 Nov 2008, at 00:20, Matt Mullenweg wrote:

> Geoffrey Sneddon wrote:
>> I can see no way to improve the XML export without breaking
>> backwards compatibility as WXR is so far from XML. It is also
>> impossible to change the version in the URI you use as a namespace,
>> as currently even unknown major versions are attempted to be parsed.
>
> How would making it better-formed break backward compatibility, in
> practice?

Looking more closely at what's done, it seems anything that I thought would be broken is already broken (try exporting custom fields with both name and value set as "This < foo & bar"; this doesn't round-trip and causes the export to not be XML).

However, if we removed the CDATA sections (which are undesirable, as there is no way to have a CDATA section contain "]]>" as a literal string, whereas outside a CDATA section it can simply be escaped as "]]&gt;"), then existing versions of WP would break on the input in places. Where CDATA is used, the importer seems to rely on CDATA for doing all escaping so it doesn't have to do any parsing of entities, which is another violation of the XML spec.

Just bumping the version doesn't help (despite the comment saying it's there for when we might break compat.), as the current importer completely ignores the version number (and shipping something that starts caring doesn't fix the backwards compat. issue, as you still have the millions of copies of WP that have already been shipped with support for WXR). Regardless, bumping the version in such a case would seem extremely kludgy, as the format would be entirely compatible with itself before; it would just be working around a bug in the current de-facto implementation.

Either backwards compatibility has to be broken, or WXR has to admit to not being XML.

> We can be liberal in what we accept, conservative in what we output.

If you want to follow Postel's Law, then XML certainly isn't what you want. XML is absolutely clear that any errors should be fatal.
We can't be liberal in what we accept (and that means disallowing some edge-case backups that are currently supported, such as the above example).

> Any patches to improve our output are always welcome - I know we're
> far from perfect in the content we output sometimes but that doesn't
> mean we shouldn't strive to be.

Is there any interest in moving over to a fully fledged XML serializer (which would at the very least mean any XML conformance bug would be in one place)? This could be used not only for WXR but also for generic RSS/Atom. <http://hsivonen.iki.fi/producing-xml/> covers most of the advantages of using a serializer.

The importer should be easier: WP already has a built-in RSS parser (in the form of MagpieRSS, which doesn't comply with XML either (esp. the Namespaces in XML spec), but is a lot, lot, lot closer), and I am still unable to think of any reason why it wasn't used initially (which would have avoided most of the issue that now arises), as it would involve a lot less code. See trac ticket #7400 for this.

> BTW Movable Type has a well-working WXR importer (or so they claim)
> so obviously it isn't impossible to make something else work from it.

I guess provided you ignore edge cases you can pretty much get away with it from an import POV; from an exporter POV you would need to very carefully make sure you follow the subset of XML-like byte-streams that the importer supports.

Also, you said back in August 2007 (<http://ma.tt/2007/08/movabletype-4-vs-wordpress-22/#div-comment-424413>):

> However I still do plan to get a spec doc up for it one day. If that
> were a condition of them [MT] supporting it I’d happily prioritize it.

When I brought this up on wp-hackers in July (2008), I received the reply (from Otto):

> If you want it documented, then look at it and write a document for
> it.

Either there's a disconnect, or there's a change of plan. Is it still the case that there is a spec coming?

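To make the serializer argument above concrete, here is a minimal sketch in Python (not WordPress code). The element names `item`, `postmeta`, `meta_key` and `meta_value` are simplified stand-ins without the namespaces a real export would use, and `export_meta`, `import_meta` and `is_well_formed` are hypothetical helpers invented for this illustration. It only shows the general point: a serializer escapes the awkward custom-field value so that it round-trips, while naive string concatenation produces something a conforming parser must reject.

```python
import xml.etree.ElementTree as ET


def export_meta(key, value):
    """Serialize one custom field through an XML serializer (no CDATA)."""
    # Element names are illustrative stand-ins, not the real WXR vocabulary.
    item = ET.Element("item")
    meta = ET.SubElement(item, "postmeta")
    ET.SubElement(meta, "meta_key").text = key
    ET.SubElement(meta, "meta_value").text = value
    # The serializer is responsible for escaping "<", "&" and ">".
    return ET.tostring(item, encoding="unicode")


def import_meta(xml_text):
    """Parse the custom field back with a real XML parser."""
    item = ET.fromstring(xml_text)
    meta = item.find("postmeta")
    return meta.findtext("meta_key"), meta.findtext("meta_value")


def is_well_formed(xml_text):
    """Return True only if a conforming parser accepts the input."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


tricky = "This < foo & bar"
with_cdata_end = "a literal ]]> in text"

# Serializer output: well-formed, and the values survive the round trip.
serialized = export_meta(tricky, with_cdata_end)
round_tripped = import_meta(serialized)

# Naive concatenation, as a template-style exporter might do it.
naive = "<item><meta_key>" + tricky + "</meta_key></item>"

if __name__ == "__main__":
    print(serialized)
    print(round_tripped == (tricky, with_cdata_end))
    print(is_well_formed(naive))
```

Note that the serialized form contains no CDATA section at all, which is exactly what an importer that relies on CDATA for escaping would fail to read back.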
I know the main reason why Habari to this day does not have a WXR importer is because it is, as it stands, an undefined XML-look-alike byte-stream. This makes it very hard to support any export that isn't XML, and on the principle of trying to implement edge cases first (thereby making all more normal cases work fine), supporting something that isn't XML is hard.

If it were possible to be reasonably certain that the output will be XML (which, IMO, it isn't), then I would be willing to write a spec sometime (as then it could be done in terms of a DOM, and not in terms of a byte-stream, massively simplifying it), though it would be unlikely to happen until March/April '09 (I would guess it would probably be around a day's work).

That said, I think having some standardized format would be better, and something based upon RSS probably isn't a good way to do that (mainly because RSS is uselessly vague, and WXR would almost certainly have to be defined as an extension of a subset of that (e.g., is the title element text or HTML? Different implementations do different things here, some performing heuristics to try and determine the answer), but also because Atom has far more of what is needed already standardized).

--
Geoffrey Sneddon
<http://gsnedders.com/>
Received on Friday, 12 December 2008 19:00:38 UTC