Re: thoughts on file attachments

The CSVW (CSV on the Web) specs call for CSV files to be annotated by a metadata file, which is located using the location of the CSV. Alternatively, the metadata file can be used to find associated CSV files. One standard way to do this is to append “-metadata.json” to the location of the CSV.

In the case of a PDF, the “location” of the CSV is more likely represented using a fragment identifier on the location of the PDF itself, so this mechanism isn’t too useful. We might invent a new default location for use in embedded scenarios, but it is better not to rely on locating metadata from an attached CSV, and instead to locate the CSV files from an attached metadata file. However, there is no reason that a metadata file could not also be found using the default rules for locating metadata based on the path of the PDF (so, http://example.org/tree-data.pdf might have metadata located at http://example.org/tree-data.pdf-metadata.json).
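
To make the default rule concrete, here is a minimal sketch in Python. The function name is hypothetical, and dropping any fragment before appending the suffix is my assumption, not something the specs spell out for PDFs.

```python
from urllib.parse import urldefrag


def default_metadata_location(resource_url: str) -> str:
    """Return the conventional metadata URL for a resource (CSV or PDF)."""
    # Assumption: a fragment addresses something inside the resource,
    # so it is dropped before the suffix is appended.
    location, _fragment = urldefrag(resource_url)
    return location + "-metadata.json"


print(default_metadata_location("http://example.org/tree-data.pdf"))
```

For the example above this prints http://example.org/tree-data.pdf-metadata.json.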

The procedure for extracting such information from a PDF with attached semantic file content might be the following:

1) scan attached files for those having a MimeType of “application/csvw+json”.
2) extract associated file and process in accordance with the Tabular Data Model [1].
3) locate associated tables using the “url” property for each table, relative to the location of the containing PDF file. (Note, metadata may set @base for resolving relative URLs).
  3.1) If the URL is only a fragment (e.g., #ef=“tree-data.csv”), it identifies another attached file in accordance with draft-hardy-pdf-mime [2], and MUST locate an attached tabular data file (CSV or TSV, based on its MimeType).
  3.2) Otherwise, if it locates a PDF file, it MUST include a fragment identifier locating an embedded file as in 3.1.
  3.3) Otherwise, it MUST locate a tabular data file as defined in [1].
4) Generate an RDF graph (or, JSON file) based on this process.
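
Step 3 could be sketched roughly as follows, in Python. This is only an illustration under assumptions: the function name and the returned labels are hypothetical, a target is treated as a PDF when its path ends in ".pdf" (a real processor would more likely check the media type), and the fragment is parsed as name=value pairs with an "ef" parameter, following the example in 3.1.

```python
from urllib.parse import parse_qs, urldefrag, urljoin, urlsplit


def classify_table_url(table_url, pdf_location, base=None):
    """Classify a table "url" per steps 3.1-3.3.

    Returns (kind, target, embedded_name), where kind is one of the
    hypothetical labels "same-pdf", "other-pdf" or "external".
    """
    # Resolve against @base if the metadata set one, else against the PDF.
    resolved = urljoin(base or pdf_location, table_url)
    target, fragment = urldefrag(resolved)
    pdf_target, _ = urldefrag(pdf_location)

    # Assumption: the embedded file is named by an "ef" fragment parameter;
    # surrounding quotes, if any, are stripped.
    embedded_name = None
    if fragment:
        names = parse_qs(fragment).get("ef", [])
        if names:
            embedded_name = names[0].strip('"\u201c\u201d')

    if target == pdf_target:
        # 3.1: a fragment on the containing PDF names another attachment.
        if embedded_name is None:
            raise ValueError("URL locates the PDF but names no embedded file")
        return ("same-pdf", target, embedded_name)

    # Assumption: a ".pdf" path extension stands in for a media type check.
    if urlsplit(target).path.lower().endswith(".pdf"):
        # 3.2: another PDF MUST carry a fragment locating an embedded file.
        if embedded_name is None:
            raise ValueError("PDF URL lacks an embedded-file fragment")
        return ("other-pdf", target, embedded_name)

    # 3.3: anything else is an ordinary tabular data file.
    return ("external", resolved, None)


pdf = "http://example.org/tree-data.pdf"
for url in ("#ef=tree-data.csv", "other.pdf#ef=a.csv", "trees.csv"):
    print(url, "->", classify_table_url(url, pdf))
```

Fetching the external file or pulling the named attachment out of the PDF is left out; that part depends on whichever PDF library is in use.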

Note: if starting with a URL containing a fragment locating the CSVW metadata, this will produce a single result. If scanning for all embedded files, multiple CSVW metadata files may be located, resulting in a merged RDF graph or multiple JSON results.

Also, note that attached files may have a MimeType associated with another RDF format, such as “text/turtle”, which allows alternate ways to include RDF data within a PDF. The Metadata object may also be used as a source of RDF triples, as it is generally a subset of JSON-LD.
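
As a rough illustration of scanning by MimeType (step 1, plus the alternate RDF formats just mentioned), here is a small Python sketch. The Attachment tuple and the group names are hypothetical stand-ins for whatever a PDF library returns when listing embedded files, and the set of RDF types is deliberately not exhaustive.

```python
from typing import NamedTuple


class Attachment(NamedTuple):
    """Hypothetical stand-in for an embedded file extracted from a PDF."""
    name: str
    media_type: str
    data: bytes


CSVW_METADATA = "application/csvw+json"
RDF_TYPES = {"text/turtle"}  # other RDF media types could be added here
TABULAR_TYPES = {"text/csv", "text/tab-separated-values"}


def group_attachments(attachments):
    """Sort attachments into metadata, RDF, tabular and other groups."""
    groups = {"metadata": [], "rdf": [], "tabular": [], "other": []}
    for attachment in attachments:
        # Ignore parameters such as "; charset=utf-8" and letter case.
        media_type = attachment.media_type.split(";", 1)[0].strip().lower()
        if media_type == CSVW_METADATA:
            groups["metadata"].append(attachment)
        elif media_type in RDF_TYPES:
            groups["rdf"].append(attachment)
        elif media_type in TABULAR_TYPES:
            groups["tabular"].append(attachment)
        else:
            groups["other"].append(attachment)
    return groups


files = [
    Attachment("trees.json", "application/csvw+json", b"{}"),
    Attachment("tree-data.csv", "text/csv; charset=utf-8", b"a,b\n"),
    Attachment("extra.ttl", "text/turtle", b""),
]
for kind, found in group_attachments(files).items():
    print(kind, [a.name for a in found])
```

Each entry in the "metadata" group would then be processed per step 2, and each "rdf" entry could be parsed directly into the resulting graph.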

This behavior is generally compatible with the note described for embedding CSVW metadata in HTML files [3], which might be extended to include embedding text/csv content in a script element.

Gregg Kellogg
gregg@greggkellogg.net

[1] http://www.w3.org/TR/tabular-data-model/
[2] https://tools.ietf.org/html/draft-hardy-pdf-mime-04#page-3
[3] http://www.w3.org/TR/csvw-html/

> On Sep 11, 2016, at 12:27 PM, Larry Masinter <LMM@acm.org> wrote:
> 
> I met with Gregg Friday; I think we made some progress on the “embed CSV for tabular data” case.
> 
> We talked about where to put metadata for each table, and just 
> 
> I think we came to a preference for attaching multiple files for each table: the CSV(s) and a metadata file.
> 
> It looks like there are lots of utilities for manipulating (adding, extracting, deleting) PDF file attachments; you don’t need Acrobat. (All seem to deal with them only at the top level?)
> 
> Anyway, let’s say we give data-metadata files a special pattern:
> METADATA-<n>-<descriptive name>.json
> The CSV files can be named anything. They’re linked from the metadata files.
> (The data doesn’t even have to be in the PDF!)
> 
> To extract data, just run a “pull out file attachments” utility, or use something fancier.
> Look for METADATA files, and use them to manipulate the data.
> 
> Embedded files can use relative URLs to talk about other embedded files.
> You don’t need to set base.
> 
> If you unpack all the attachments, they’ll work relative to ‘file’.
> 
> I’m being a little terse so I hope what I’m saying is clear.
> I’ll make some examples.

Received on Sunday, 11 September 2016 19:58:22 UTC