PDF and "CSV for the Web" from Larry Masinter on 2016-09-07 (public-pdf-open-data@w3.org from September 2016)

From: Larry Masinter <masinter@adobe.com>
Date: Wed, 7 Sep 2016 19:11:23 +0000
To: "public-pdf-open-data@w3.org" <public-pdf-open-data@w3.org>
Message-ID: <83CDB5FC-BC01-4C4A-8361-F3AC0003D286@adobe.com>

I had a good discussion yesterday with Gregg Kellogg (new group member) and I thought I would report it.

Gregg worked on the CSV on the Web working group, and in some ways we’re trying to do for PDF what the CSVW group did for CSV: find a way of letting PDF data be five-star open data.

There are all kinds of data one might want to get out of a PDF, but for lots and lots of use cases, the important data is in tables.
CSV (comma-separated values) is a common, simple way of communicating the values in a table, and it can be read into a spreadsheet directly.

The CSVW group defined a way of representing the metadata you need to know to transform the data in the CSV file into RDF triples.
https://www.w3.org/standards/techs/csv
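
To make the idea concrete, here is a minimal sketch of how a metadata description can turn CSV rows into triples. It is not the full CSVW conversion algorithm: it only handles a header row, one `aboutUrl` template, and a `propertyUrl` per column. The example.org URLs, the column names, and the sample rows are assumptions made up for illustration.

```python
import csv
import io
import json

# Illustrative metadata in the general shape of a CSVW table description.
# The example.org URLs and the column names are assumptions for this sketch.
METADATA = json.loads("""
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "trees.csv",
  "tableSchema": {
    "aboutUrl": "http://example.org/trees/{GID}",
    "columns": [
      {"name": "GID", "titles": "GID",
       "propertyUrl": "http://example.org/vocab#gid"},
      {"name": "on_street", "titles": "On Street",
       "propertyUrl": "http://example.org/vocab#street"},
      {"name": "species", "titles": "Species",
       "propertyUrl": "http://example.org/vocab#species"}
    ]
  }
}
""")

# Made-up sample rows, only loosely in the style of a street-tree inventory.
CSV_TEXT = (
    "GID,On Street,Species\n"
    "1,ADDISON AV,Celtis australis\n"
    "2,EMERSON ST,Liquidambar styraciflua\n"
)


def rows_to_triples(csv_text, metadata):
    """Return (subject, predicate, literal) tuples, one per cell."""
    schema = metadata["tableSchema"]
    columns = schema["columns"]
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    triples = []
    for row in reader:
        # Map column names to cell values by position.
        values = {col["name"]: cell for col, cell in zip(columns, row)}
        # Expand the subject template with simple {name} substitution.
        subject = schema["aboutUrl"]
        for name, value in values.items():
            subject = subject.replace("{" + name + "}", value)
        for col in columns:
            triples.append((subject, col["propertyUrl"], values[col["name"]]))
    return triples


if __name__ == "__main__":
    for triple in rows_to_triples(CSV_TEXT, METADATA):
        print(triple)
```

The point is that the CSV itself stays plain; everything needed to interpret it lives in the JSON next to it.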

Gregg developed a Note about embedding CSV inside HTML:
http://www.w3.org/TR/csvw-html/

The reasons are the same kinds of reasons we have for PDF: keep the data with the report that describes it, and keep the existing workflows that have grown up around having a single file.

So: suppose, for each table in a PDF file with (useful) data in it, we add an attachment of CSV and JSON metadata. (There are open questions about which of the two points to the other, whether one table could have multiple CSV fragments, and what URL to use to get to parts of the CSV file, but these seem workable.)
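
As a rough sketch of what the per-table attachments could look like, the following builds a CSV file and a matching metadata file as named byte strings, ready to be handed to whatever PDF library does the actual embedding (that step is not shown). It assumes the metadata points at the CSV through its `url` property and reuses the `<name>.csv-metadata.json` naming seen in the CSVW examples; the function name and the `table_id` parameter are hypothetical.

```python
import csv
import io
import json


def build_table_attachments(table_id, header, rows):
    """Return {filename: bytes} for one table: a CSV plus its metadata."""
    csv_name = f"{table_id}.csv"
    meta_name = f"{csv_name}-metadata.json"

    # Serialize the table as CSV text.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

    # Minimal metadata: here the JSON points to the CSV (one possible answer
    # to the "which points to which" question, chosen only for this sketch).
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_name,
        "tableSchema": {"columns": [{"titles": title} for title in header]},
    }

    return {
        csv_name: buf.getvalue().encode("utf-8"),
        meta_name: json.dumps(metadata, indent=2).encode("utf-8"),
    }


if __name__ == "__main__":
    files = build_table_attachments(
        "table1", ["GID", "Species"], [["1", "Celtis australis"]]
    )
    for name, data in files.items():
        print(name, len(data), "bytes")
```

Keeping the pair keyed by a shared table identifier would also leave room for several CSV fragments per table, should that turn out to be needed.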

Gregg has some web utilities that do useful things with RDF and CSV. One takes a URI of a CSV and produces other formats.

The CSVW GitHub repo has a lot of examples and test cases (under http://w3c.github.io/csvw/).

Lots of the samples are based on “Palo Alto trees” (one of the CSVW use cases).

For example
https://raw.githubusercontent.com/w3c/csvw/gh-pages/examples/tree-ops.csv-metadata.json

has the metadata for tree-ops.

I am curious as to whether the PDF fragment identifier syntax could be extended to allow pointers into file attachments.
See https://tools.ietf.org/html/draft-hardy-pdf-mime-04.pdf
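
To illustrate the kind of extension I have in mind: PDF fragment identifiers are already written as `name=value` parameters joined by `&` (for example `page=3`). The sketch below parses such a fragment and treats an `attachment` parameter as a pointer to an embedded file. The `attachment` parameter is purely hypothetical and is not defined in any current draft; the example URL is also made up.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_pdf_fragment(url):
    """Split a URL's fragment into a dict of name=value parameters."""
    fragment = urlsplit(url).fragment
    return dict(parse_qsl(fragment, keep_blank_values=True))


if __name__ == "__main__":
    # "page" is an existing parameter; "attachment" is a hypothetical
    # extension used here only to show what a pointer might look like.
    params = parse_pdf_fragment(
        "http://example.org/report.pdf#page=3&attachment=table1.csv"
    )
    print(params)
```

A consumer that does not understand `attachment` could simply ignore it and still honor `page`, which is one reason a parameter-style extension seems attractive.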

We’re going to get together Friday afternoon in San Jose. I hope we can talk about the charter and see how far we can get with the PDFData tools (https://github.com/Aiybe/PDFData).

If you’d like to join, let me know.

Larry
--
http://larry.masinter.net

Received on Wednesday, 7 September 2016 19:11:56 UTC