Re: readers and writers and options from Gregg Kellogg on 2016-09-24 (public-pdf-open-data@w3.org from September 2016)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Sat, 24 Sep 2016 12:41:08 -0700
To: Larry Masinter <masinter@adobe.com>
Cc: Gregg Kellogg <gregg@greggkellogg.net>, "public-pdf-open-data@w3.org" <public-pdf-open-data@w3.org>
Message-Id: <05D67852-7BBD-4424-BF35-60A36FC5CDB3@greggkellogg.net>
> On Sep 22, 2016, at 8:09 PM, Larry Masinter <masinter@adobe.com> wrote:
> 
> I wrote this a while ago, thinking I could make it shorter, but it’s just getting longer….
>  
> ========
>  
> Part of my thinking, when talking about options of “where to stash (links to) data” for “Open” data, is that ease-of-use for readers is a higher priority than “ease of use for writers”.
>  
> Protocol 101:
> Whenever you specify a protocol/API/File-format, you (should) specify the rules of engagement/practice/conformance for each of the various actors/agents engaged; a good standard is one where conforming agents will interoperate, but be resilient when operated against non-conforming agents.
>  
> For file formats/profiles, the roles are (at a minimum) (a) readers & (b) writers, so a spec should specify (implicitly or explicitly) rules for writers and rules for readers, such that conforming writers write things that conforming readers can read.
>  
> (Recent tradition in HTML5 is to specify the rules for conforming readers, and implicitly note that writers don’t really follow specs, but should only write things that the readers they want to reach can read.)
>  
> If you have a spec with options, then the more options you allow writers (the more flexibility and implementation possibilities), the more work you’re making for readers, since every option for writers is a “must read” for readers.  For example, if you offer options for readers (MAY implement one of A or B or C), then you constrain writers to write things that work no matter which option the reader implemented – that is, writers have to write data that simultaneously meets A and B and C requirements.
>  
> Conclusion: the more options we allow for “best practice data-in-PDF”, the harder we make it for widespread deployment of PDF-data-readers.
>  
> For a PDF-with-data to be eligible for “machine-readable” and its second star, we must demonstrate widespread availability of easy-to-use tools for “reading” – pulling the data out.
>  
> Thus I want to minimize the mandatory-to-implement options for writers unless the option represents a really different possibility.  I also want a really simple writer option for people to use if they have any old PDF in hand.
>  
> One possibility is to encode the metadata about intrinsic data in the PDF structure tree. This is the natural way to write the data found in HTML-with-RDFA into a tagged PDF. It has the advantage that the data would survive a PDF -> HTML export, as well as PDF editing.  But this data is hard to write after the fact unless the PDF already has structure.
>  
> Data in form fields, document metadata in XMP, and document text as data are also methods that we should allow writers to write, and require readers to read, besides the options of “embed json-ld metadata and CSV data or text/turtle”.
>  
> There are some conventions worked out by the CSV-on-web working group about looking for CSV and metadata locations, but the PDF-embedding (any embedding) is different from linking, and I don’t know if embedding fits.

I think a data-in-PDF spec should specify how CSV metadata is located, which is perfectly in keeping with the intention of CSV on the Web. But CSVW offers a choice between finding metadata given a CSV and starting processing from the metadata. Do we want to extend that choice to data-in-PDF, or choose a particular method? In any case, the metadata MUST reference the CSV(s) it relates to, through a URL.
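
To make the "start from the metadata" direction concrete, here is a minimal sketch in Python. It assumes CSVW-style metadata in which a table group carries a "tables" array and each table description carries a "url"; the function name, the file names, and the base URL are made up for illustration. How an embedded file inside a PDF would actually be addressed by URL is exactly the open question, so the base URL here is only a placeholder.

```python
from urllib.parse import urljoin


def referenced_csv_urls(metadata: dict, base_url: str) -> list:
    """Return absolute URLs of the CSV files a CSVW-style metadata document references.

    Hypothetical helper: handles a table group ("tables" array) or a single
    table description that carries "url" directly.
    """
    # A table group lists its tables; otherwise treat the document as one table.
    tables = metadata.get("tables", [metadata])
    urls = []
    for table in tables:
        url = table.get("url")
        if url is None:
            # Mirrors the requirement that metadata MUST reference its CSV by URL.
            raise ValueError("table description has no 'url'")
        # Resolve relative references against the location of the metadata.
        urls.append(urljoin(base_url, url))
    return urls


# Illustrative metadata; the file names are invented for this example.
example_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [{"url": "budget-2016.csv"}, {"url": "data/staff.csv"}],
}

# Placeholder base URL; a data-in-PDF spec would have to define the real one.
example_base = "https://example.org/reports/report.pdf"
example_urls = referenced_csv_urls(example_metadata, example_base)
```

A reader that starts from the metadata only needs this one lookup; a reader that starts from a CSV needs a separate discovery rule, which is the extra option the spec would have to decide on.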

> I don’t know if I’m belaboring the obvious here, or if any of this is controversial, or if I’m imagining requirements that aren’t really there--which is why I’m sending it; have at it :)
>  
>  
>  
> It would be timely and useful to review the Candidate Recommendation
> https://www.w3.org/TR/dwbp/
>  
> “Data on the Web best practices” to see how any or all of these could be accomplished using PDFs with data, or if some of their use cases would be natural examples.

Indeed, CSVW was informed by work going on in parallel in DWBP, even though CSVW came out first.

Gregg

> Larry
> --
> http://larry.masinter.net
>
Received on Saturday, 24 September 2016 19:41:40 UTC