Re: thoughts on file attachments from Gregg Kellogg on 2016-09-12 (public-pdf-open-data@w3.org from September 2016)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Mon, 12 Sep 2016 10:05:32 -0700
To: Larry Masinter <LMM@acm.org>
Cc: public-pdf-open-data@w3.org
Message-Id: <35BAD68A-5056-417C-9B9F-EFFA5601C861@greggkellogg.net>
> On Sep 11, 2016, at 9:25 PM, Larry Masinter <LMM@acm.org> wrote:
> 
> First, let me be clear that we’re not making decisions. We’re just exploring the space for background. We haven’t settled on requirements or agreed on use cases. The discussion is interesting because it helps make requirements explicit.
>  
> Second, not all data is tabular. Using embedding is just one possible solution for one special use case. I’m looking for something with near-zero explanation or deployment friction.  That is, the embedded data can be made to pass the “machine readable” test. 

Yes, that’s why I added the paragraph about detecting other known RDF formats via MimeType, such as text/turtle.

> To pass that, the availability of easy-to-use tools is an important part of “machine readable”. We need to address tooling availability, especially for reading.
>  
>  
> Ø  The CSVW (CSV on the Web) specs call for CSV files to be annotated by a metadata file, which is located using the location of the CSV. 
>  
> Ø  Alternatively, the metadata file can be used to find associated CSV files. One standard way to do this is to append “-metadata.json” to the location of the CSV.
>  
> This confused me. By “one standard way to do ‘this’” I think you don’t mean that naming the metadata files -metadata.json is a way to use a metadata file to find associated CSV data. (It’s a way of naming metadata files so they can be found, are reasonably unlikely to conflict with unintended usage, and are self-explanatory to someone browsing inside a PDF.)

The Tabular Data Model spec outlines different ways to find metadata associated with a CSV [4]:

* Using the HTTP Link header with rel=“describedby”
* The default locations {+url}-metadata.json or csv-metadata.json, or using patterns defined in /.well-known/csvm
* Embedded metadata (which requires a separate specification for how metadata is embedded in a CSV).
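
As a rough illustration of the default locations in the second bullet, here is a minimal Python sketch. The function name and the naive handling of the {+url} template are my own assumptions, not anything defined in [4]; the Link header and /.well-known/csvm cases are left out.

```python
from urllib.parse import urldefrag, urljoin

# Default templates named in the list above.
DEFAULT_TEMPLATES = ["{+url}-metadata.json", "csv-metadata.json"]


def default_metadata_locations(data_url, templates=DEFAULT_TEMPLATES):
    """Return candidate metadata URLs for a data file, in template order."""
    url, _fragment = urldefrag(data_url)  # ignore any fragment on the data URL
    candidates = []
    for template in templates:
        # Naive expansion: only the {+url} variable is handled here.
        expanded = template.replace("{+url}", url)
        # Resolve the expanded template against the data file's own URL.
        candidates.append(urljoin(url, expanded))
    return candidates


for candidate in default_metadata_locations("http://example.org/tree-data.csv"):
    print(candidate)
```

For http://example.org/tree-data.csv this yields http://example.org/tree-data.csv-metadata.json followed by http://example.org/csv-metadata.json.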

It is also a requirement of metadata files that each table contain a “url” property, which locates the associated CSV.

This really comes down to how the data is shared: by the URL of the CSV file or by the URL of the metadata. The latter is more robust.

Embedding data and/or metadata has its own issues, and the above mechanisms aren’t as useful, although a PDF-specific specification could describe alternatives. For example, I understand that an embedded file can have an AFRelationship to another file, which could be one way of explicitly associating an attached CSV file with attached metadata.

> It’s probably useful to be careful about “content” and “location”. 
>  
> I proposed: The metadata and CSV files should be attached as close as they can to the visual representation of the tabular data; If you can attach it to the structure, fine. If all you can do is attach to the document, that’s ok. 
>  
> Ø  In the case of a PDF, the “location” of the CSV is more likely represented using a fragment identifier on the location of the PDF itself, 
>  
> I don’t think there’s a problem treating embedded data in a package as having URLs relative to a base. I know it’s controversial to do so, since it means assigning a new URL scheme as ‘base’ just so the CSV data can be accessed using relative links.

The “@base” context element within the metadata certainly allows an alternative base to be specified, including using a different URI scheme. We may want to call this out, and identify specific useful schemes. However, CSVW is a web format, and HTTP(S) was given greater consideration during the development process.
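
A minimal sketch of how a processor might pick up “@base”, assuming the context is either the plain CSVW namespace string or an array whose object member carries “@base”, and assuming a relative “@base” is resolved against the metadata location. The function name and example URLs are hypothetical.

```python
from urllib.parse import urljoin


def effective_base(metadata, metadata_url):
    """Return the base URL used to resolve relative URLs in a metadata document."""
    context = metadata.get("@context")
    if isinstance(context, list):
        for entry in context:
            if isinstance(entry, dict) and "@base" in entry:
                # A relative @base is resolved against the metadata location.
                return urljoin(metadata_url, entry["@base"])
    # No @base given: fall back to the location of the metadata itself.
    return metadata_url


metadata = {
    "@context": ["http://www.w3.org/ns/csvw", {"@base": "http://example.org/data/"}],
    "url": "tree-data.csv",
}
base = effective_base(metadata, "http://example.org/tree-data.pdf")
print(urljoin(base, metadata["url"]))
```

One caveat with this sketch: Python's urljoin only performs relative resolution for schemes it knows about, so a base in a newly minted scheme would need its own join logic.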

If you consider that the location of a PDF is given using a URI, and a particular embedded file is identified/located using a fragment identifier, then URI join semantics as provided in RFC3986 come into play. To say that one URI is relative to another typically refers to the path components; this chiefly comes into play when looking at default metadata locations and referenced CSV URLs. If relative URLs restrict themselves to the fragment, then I don’t see an issue with locating content relative to the base URL established for the PDF.
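
To make the difference concrete, here is a small sketch using Python's standard urljoin, which follows RFC3986 reference resolution. The example URLs and the unquoted #ef= fragment form are assumptions for illustration only.

```python
from urllib.parse import urldefrag, urljoin

pdf_url = "http://example.org/tree-data.pdf"

# A fragment-only reference keeps the PDF's own URL and swaps in the fragment.
embedded = urljoin(pdf_url, "#ef=tree-data.csv")

# A path-relative reference replaces the last path segment, so it names a
# sibling resource on the server rather than anything inside the PDF.
sibling = urljoin(pdf_url, "tree-data.csv")

print(embedded)  # http://example.org/tree-data.pdf#ef=tree-data.csv
print(sibling)   # http://example.org/tree-data.csv

# Stripping the fragment from the embedded reference gives back the PDF URL.
print(urldefrag(embedded).url == pdf_url)  # True
```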

> Ø  so this mechanism isn’t too useful. We might invent a new default location for use in embedded scenarios, but better to not rely on locating metadata from an attached CSV, but locating the CSV files from an attached metadata.
>  
> I agree with this; but I think this is simple.
>  
> Ø   However, there is no reason that a metadata file cannot be found using the default rules for locating metadata based on the path of the PDF (so, http://example.org/tree-data.pdf might have metadata located at http://example.org/tree-data.pdf-metadata.json).
>  
> is it useful to support  a situation where the CSV is embedded but the metadata is not embedded?  If not, why add complexity?

This falls out of locating metadata as described in [4]; I don’t see a reason to specifically forbid it, but we may want to define a set of best practices for embedding CSVW in PDFs that encourage both data and metadata to be included as attached files within the PDF.

> > The procedure for extracting such information from a PDF with attached semantic file content might be the following:
>  
> > 1) scan attached files for those having a MimeType of “application/csvw+json”.
>  
> Better to scan based on file name than media type, especially if the metadata name
> is unique. 

If using a URL with a fragment identifier, then absolutely, use that to locate a specific attached file. However, if you’re trying to extract all semantic information from a PDF, then you need to scan attached files, and MimeType is a more reliable way of identifying such data than a filename and a presumed match on file extension.
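
A sketch of the two scanning strategies, assuming the attachments have already been pulled out of the PDF into simple records. The Attachment class, both function names, and the sample file names are hypothetical, and no real PDF library is involved.

```python
from dataclasses import dataclass

CSVW_METADATA_TYPE = "application/csvw+json"


@dataclass
class Attachment:
    name: str
    mime_type: str
    data: bytes


def find_by_mime_type(attachments, mime_type=CSVW_METADATA_TYPE):
    """Select attachments whose declared type matches, ignoring parameters."""
    wanted = mime_type.lower()
    return [a for a in attachments
            if a.mime_type.split(";")[0].strip().lower() == wanted]


def find_by_name(attachments, suffix="-metadata.json"):
    """Select attachments by file name suffix alone."""
    return [a for a in attachments if a.name.endswith(suffix)]


attachments = [
    Attachment("tree-data.csv-metadata.json", "application/csvw+json", b"{}"),
    Attachment("tree-data.csv", "text/csv", b""),
    Attachment("notes-metadata.json", "application/json", b"{}"),
    Attachment("tables.json", "application/csvw+json; charset=utf-8", b"{}"),
]

print([a.name for a in find_by_mime_type(attachments)])
print([a.name for a in find_by_name(attachments)])
```

In this made-up set, the name-based scan picks up notes-metadata.json, which is not CSVW metadata, and misses tables.json, which is.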

> >  2) extract associated file and process in accordance with the Tabular Data Model [1].
> > 3) locate associated tables using the “url” property for each table, relative to the location of the containing PDF file. (Note, metadata may set @base for resolving relative URLs).
>  
> Having the default base be the location of the containing PDF file means you have to use #ef=subname external URLs in the metadata files for links when embedded and a metadata when it’s not. 
>  
> Ø  3.1) If URL location is a fragment (e.g., #ef=“tree-data.csv”), it identifies another attached file, identified in accordance with draft-hardy-pdf-mime [2] which MUST locate an attached tabular data file (CSV or TSV based on its MimeType).
>  
> There might be a point to include the content-type along with the URL in the CSV/TSV.
> It might be possible to treat the URL for embedded files as if the embedding were just the PDF file, so you’d say that if A.PDF has attachments B.csv and B.csv-nnnn-metadata.json that URLs in B assume ‘base’ of the PDF file’s, while treating “..” as a relative path to the pdf.

You’ll need to expand on this; how might the content-type be included? Otherwise, I think this is essentially correct, but “..” shouldn’t be necessary, just “#…” to get to other embedded content.

> Ø   3.2) Otherwise, if it locates a PDF file, it MUST include a fragment identifier locating an embedded file as in 3.1.
> Ø  3.3) Otherwise, it MUST locate a tabular data file as defined in [1].
> 3.3 covers 3.2.
>  
> 4) Generate an RDF graph (or, JSON file) based on this process.
>  
> Ø  Note: if starting with a URL containing a fragment locating the CSVW metadata, this will result in a single result. If scanning for all embedded files, multiple CSVW metadata files may be located, which result in a merged RDF Graph, or multiple JSON results.
>  
> I thought it was the other way around, that a single metadata file could point to multiple CSV files. What’s the point of multiple metadata files for a single CSV?

It is conceivable that a PDF might contain multiple tables, and that each table has associated CSV and metadata files, so that a single PDF might have multiple CSVW metadata files.
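
A sketch of that case: several attached metadata documents, each pointing at its own table(s) through the “url” property. The function names, file names, and the unquoted #ef= fragment form are assumptions; “@base” handling is deliberately omitted, and pdf_url is assumed to carry no fragment of its own.

```python
import json
from urllib.parse import urldefrag, urljoin


def table_urls(metadata):
    """Yield the "url" of each table in a table group or a single table."""
    if "tables" in metadata:
        for table in metadata["tables"]:
            yield table["url"]
    elif "url" in metadata:
        yield metadata["url"]


def locate_tables(pdf_url, metadata_documents):
    """Resolve every table URL against the PDF and note where it points."""
    located = []
    for name, text in metadata_documents.items():
        metadata = json.loads(text)
        for url in table_urls(metadata):
            resolved = urljoin(pdf_url, url)
            target, fragment = urldefrag(resolved)
            # Same PDF plus a fragment: another attached file in this PDF.
            kind = "embedded" if target == pdf_url and fragment else "external"
            located.append((name, resolved, kind))
    return located


documents = {
    "a-metadata.json": '{"@context": "http://www.w3.org/ns/csvw", "url": "#ef=a.csv"}',
    "b-metadata.json": '{"@context": "http://www.w3.org/ns/csvw", '
                       '"tables": [{"url": "#ef=b.csv"}, {"url": "c.csv"}]}',
}
for row in locate_tables("http://example.org/report.pdf", documents):
    print(row)
```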

> Ø  Also, note that attached files may have a MimeType associated with another RDF format, such as "text/turtle”, which allows alternate ways to include RDF data within a PDF. The Metadata object may also be used as a source of RDF triples, as it is generally a subset of RDF/XML.
>  
> So the format and meaning of the metadata file is crucial. I thought the metadata object was in JSON, not XML, so how is it a subset of RDF/XML?

Here I was referring to the PDF metadata object, which AFAIK uses a subset of RDF/XML as its serialization format. Other attached files may use arbitrary MimeTypes associated with RDF serialization formats.

Gregg

> Ø  This behavior is generally compatible with the note described for embedding CSVW metadata in HTML files [3], which might be extended to include embedding text/csv content in a script element.
> Gregg Kellogg
> gregg@greggkellogg.net
>  
> [1] http://www.w3.org/TR/tabular-data-model/
> [2] https://tools.ietf.org/html/draft-hardy-pdf-mime-04#page-3
> [3] http://www.w3.org/TR/csvw-html/
[4] http://www.w3.org/TR/2015/REC-tabular-data-model-20151217/#locating-metadata
Received on Monday, 12 September 2016 17:06:06 UTC