CSV to XML conversion requirement

Hi -

Following feedback from Liam Quin [1] it occurred to me that we already had a hook for the CSV to XML translation in the form of use case #1 Digital Preservation of Government Records <http://w3c.github.io/csvw/use-cases-and-requirements/index.html#UC-DigitalPreservationOfGovernmentRecords> where XML conversion was used as an interim step.

I've updated the Use Case to include some of the general points from Liam ... and this gives us a CSV to XML requirement <http://w3c.github.io/csvw/use-cases-and-requirements/index.html#R-CsvToXmlTransformation>!

Jeremy


[1] Liam's thoughts on XML are provided below:

In general XML is used so widely and for such a diverse range of applications that people tend to take one of several freely-available XSLT transformations or XQuery CSV parsers and modify as needed.

Some ways they are used -

1. data oriented XML, e.g. for a Web Service or database interchange.

Here you read the CSV file and maybe generate a document, or a sequence of SQL statements to insert it into a database, or you recombine the tuples into objects or XML elements and put them into a ForestStore, e.g. using XQuery Update.

2. report generation

Use XSLT to read CSV data and produce a histogram of student test results compared to shoe size.

3. text

Use a CSV file as a poor boy's database, e.g. given a spreadsheet or relational database dump of stock exchange data -
  exchange-id, ticker symbol, company name, company Web site process an XHTML document and surround
  <span class="ts nyx">SUNW</span>
with
  <a href="http://www.sun.com/"><span class="ts">SUNW</span> <span class="cd">(Sun Microsystems Inc.)</span></a>

In some cases (e.g. RSS as CSV) there might even be whole paragraphs of text in the data to be reassembled in some order.

4. conversion

convert CSV in one format into another by parsing with XSLT or XQuery and then writing out with another off-the-shelf module.

Once people doing this note that column names are typically not suitable for use as XML element names, there's generally no difficulty in finding an ad hoc mapping. But that insight is important, as otherwise code gets locked into place that's hard to change later. E.g. you can have spaces in CSV column names but not in XML element names.

In most cases, the choice is simply whether to repeat column names for each row or simply have, <row><item>Department for Business Innovation and Skills</item><item>...
and so forth, the conversion software (e.g. XSLT) automatically escaping < as &lt; and & as &amp; of course.

It is worth noting that relying on CDATA sections to avoid excaping does NOT work -

<row><item><![CDATA[ data value unescaped here ]]></item>....

because the data could include
   ]]></item><item>....</item></row><row>..... <![CDATA[ and the result would be to inject extra rows into the result - this is called a CDATA Injection Attack.

Producing CSV from arbitrary XML documents is meaningless - I don't see a role for a standard there. You could do it, just as there's a way to turn any XML document into RDF (each text character becomes an anonymous node so it can be sequenced...) but both are 100% useless in practice except for testing round trip conversions perhaps.

Received on Wednesday, 21 May 2014 12:02:47 UTC