Re: thoughts on file attachments

First, let me be clear that we’re not making decisions. We’re just exploring the space for background. We haven’t settled on requirements or agreed on use cases. The discussion is interesting because it helps make requirements explicit.

 

Second, not all data is tabular. Embedding is just one possible solution for one special use case. I’m looking for something with near-zero explanation or deployment friction, that is, something where the embedded data can be made to pass the “machine readable” test.

 

To pass that test, the availability of easy-to-use tools is an important part of “machine readable”. We need to address tooling availability, especially for reading.

 

 

> The CSVW (CSV on the Web) specs call for CSV files to be annotated by a metadata file, which is located using the location of the CSV.

 

> Alternatively, the metadata file can be used to find associated CSV files. One standard way to do this is to append “-metadata.json” to the location of the CSV.

 

This confused me. By “one standard way to do this”, I don’t think you mean that naming the metadata file “-metadata.json” is a way of using a metadata file to find associated CSV data. (It’s a way of naming metadata files so that they can be found, are reasonably unlikely to conflict with unintended usage, and are self-explanatory to someone browsing inside a PDF.)
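
To make the direction of the lookup concrete, here is a minimal Python sketch of what I understand the convention to be: start from the location of the data and derive the metadata location by appending the suffix. The helper name is made up for this message, and the sketch ignores anything beyond simple string concatenation.

```python
def default_metadata_location(data_url: str) -> str:
    """Derive a metadata location by appending the conventional suffix.

    Hypothetical helper for illustration; not taken from any spec text.
    """
    return data_url + "-metadata.json"


# Example: going from the data location to the metadata location.
print(default_metadata_location("http://example.org/tree-data.csv"))
```

Note that this goes from the CSV to the metadata, not from the metadata to the CSV.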

 

It’s probably useful to be careful about “content” and “location”. 

 

I proposed: the metadata and CSV files should be attached as close as possible to the visual representation of the tabular data. If you can attach them to the structure, fine. If all you can do is attach them to the document, that’s OK.

 

> In the case of a PDF, the “location” of the CSV is more likely represented using a fragment identifier on the location of the PDF itself,

 

I don’t think there’s a problem treating embedded data in a package as having URLs relative to a base. I know it’s controversial to do so, since it means assigning a new URL scheme as ‘base’ just so the CSV data can be accessed using relative links.

 

> so this mechanism isn’t too useful. We might invent a new default location for use in embedded scenarios, but better to not rely on locating metadata from an attached CSV, but locating the CSV files from an attached metadata.

 

I agree with this; but I think this is simple.

 

> However, there is no reason that a metadata file not be found using the default rules for locating metadata based on the path of the PDF (so, http://example.org/tree-data.pdf might have metadata located at http://example.org/tree-data.pdf-metadata.json).

 

Is it useful to support a situation where the CSV is embedded but the metadata is not? If not, why add complexity?

 

> The procedure for extracting such information from a PDF with attached semantic file content might be the following:

 

> 1) scan attached files for those having a MimeType of “application/csvw+json”.

 

Better to scan based on file name than media type, especially if the metadata name is unique.
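
As a rough illustration of what I mean, here is a Python sketch that selects metadata attachments by name. The `Attachment` type and the sample entries are hypothetical stand-ins for whatever a PDF library would hand back, and the generic media type on the second entry is my assumption about how files might get attached in practice.

```python
from typing import NamedTuple


class Attachment(NamedTuple):
    name: str        # file name of the embedded file
    media_type: str  # declared media type (assumed to be unreliable here)


def find_metadata(attachments: list[Attachment]) -> list[Attachment]:
    """Select metadata attachments by file name rather than by media type."""
    return [a for a in attachments if a.name.endswith("-metadata.json")]


attachments = [
    Attachment("tree-data.csv", "text/csv"),
    Attachment("tree-data.csv-metadata.json", "application/octet-stream"),
]
print([a.name for a in find_metadata(attachments)])
```

A scan for “application/csvw+json” would miss the second entry in this example; the name-based scan does not.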

 

>  2) extract associated file and process in accordance with the Tabular Data Model [1].

> 3) locate associated tables using the “url” property for each table, relative to the location of the containing PDF file. (Note, metadata may set @base for resolving relative URLs).

 

Having the default base be the location of the containing PDF file means you have to use #ef=subname URLs in the metadata file for links when the data is embedded, and a different metadata file when it’s not.
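
A quick way to see this is with ordinary relative-reference resolution, sketched here with Python’s `urllib.parse.urljoin`. The URLs are examples only, and the fragment form simply follows the #ef=subname shape used above.

```python
from urllib.parse import urljoin

# Example location of the containing PDF, used as the base.
pdf_url = "http://example.org/tree-data.pdf"

# A plain relative reference resolves to a sibling of the PDF, outside the package.
sibling = urljoin(pdf_url, "tree-data.csv")

# A fragment-only reference stays on the PDF, so it can name an attachment.
embedded = urljoin(pdf_url, "#ef=tree-data.csv")

print(sibling)
print(embedded)
```

So with the PDF as base, the same metadata file cannot use “tree-data.csv” for both the embedded and the non-embedded case.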

 

> 3.1) If URL location is a fragment (e.g., #ef=“tree-data.csv”), it identifies another attached file, identified in accordance with draft-hardy-pdf-mime [2] which MUST locate an attached tabular data file (CSV or TSV based on its MimeType).

 

There might be a point in including the content type along with the URL of the CSV/TSV.

It might be possible to treat the URLs of embedded files as if the embedding were just the PDF file. That is, if A.PDF has attachments B.csv and B.csv-nnnn-metadata.json, then URLs in B assume a ‘base’ of the PDF file, with “..” treated as a relative path to the PDF.

> 3.2) Otherwise, if it locates a PDF file, it MUST include a fragment identifier locating an embedded file as in 3.1.

> 3.3) Otherwise, it MUST locate a tabular data file as defined in [1].

3.3 covers 3.2.

 

> 4) Generate an RDF graph (or, JSON file) based on this process.

 

> Note: if starting with a URL containing a fragment locating the CSVW metadata, this will result in a single result. If scanning for all embedded files, multiple CSVW metadata files may be located, which result in a merged RDF Graph, or multiple JSON results.

 

I thought it was the other way around, that a single metadata file could point to multiple CSV files. What’s the point of multiple metadata files for a single CSV?
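
For what it’s worth, here is how I read the quoted steps 1 to 3 as a rough Python sketch, with one metadata file pointing at several embedded CSV files. Everything here is an assumption for illustration: `attachments` is a plain dict standing in for the PDF’s embedded files, metadata is found by file name, only “#ef=” table URLs are handled, and step 4 (producing RDF or JSON) is left out.

```python
import csv
import io
import json


def tables_from_attachments(attachments: dict[str, bytes]) -> dict[str, list[list[str]]]:
    """Return the rows of each embedded table referenced by embedded metadata.

    `attachments` maps attachment names to their bytes (hypothetical stand-in
    for what a PDF library would return).
    """
    result = {}
    for name, content in attachments.items():
        if not name.endswith("-metadata.json"):
            continue  # step 1: pick out metadata files (by name in this sketch)
        metadata = json.loads(content)  # step 2: read the metadata
        # A table group lists tables under "tables"; a single table has "url" itself.
        for table in metadata.get("tables", [metadata]):
            url = table.get("url", "")
            if url.startswith("#ef="):  # step 3.1: fragment names another attachment
                target = url[len("#ef="):]
                # A missing attachment raises KeyError here.
                text = attachments[target].decode("utf-8")
                result[target] = list(csv.reader(io.StringIO(text)))
            # Non-fragment URLs (steps 3.2 and 3.3) are not handled in this sketch.
    return result


attachments = {
    "trees-metadata.json": json.dumps(
        {"tables": [{"url": "#ef=trees.csv"}, {"url": "#ef=parks.csv"}]}
    ).encode("utf-8"),
    "trees.csv": b"id,species\n1,oak\n",
    "parks.csv": b"id,name\n7,Central\n",
}
tables = tables_from_attachments(attachments)
print(sorted(tables))
```

In this reading a single metadata file yields two tables, which is the direction I would expect.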

 

> Also, note that attached files may have a MimeType associated with another RDF format, such as “text/turtle”, which allows alternate ways to include RDF data within a PDF. The Metadata object may also be used as a source of RDF triples, as it is generally a subset of RDF/XML.

 

So the format and meaning of the metadata file is crucial. I thought the metadata object was in JSON, not XML, so how is it a subset of RDF/XML?

 

> This behavior is generally compatible with the note described for embedding CSVW metadata in HTML files [3], which might be extended to include embedding text/csv content in a script element.

Gregg Kellogg

gregg@greggkellogg.net

 

[1] http://www.w3.org/TR/tabular-data-model/

[2] https://tools.ietf.org/html/draft-hardy-pdf-mime-04#page-3

[3] http://www.w3.org/TR/csvw-html/

 

Received on Monday, 12 September 2016 04:26:19 UTC