RE: linked open data and PDF

From: Larry Masinter <masinter@adobe.com>
Date: Fri, 30 Jan 2015 15:48:14 +0000
To: Sarven Capadisli <info@csarven.ca>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <DM2PR0201MB09606A5C24706109FD9E4A69C3310@DM2PR0201MB0960.namprd02.prod.outlook.com>
((sorry, this sat in my drafts box, incomplete, and out of sequence now))

In reply to Sarven Capadisli's email of 2015-01-19 12:20:

> First off, I totally acknowledge your interest to improve the state of
> things for PDF.

> I'm welcome to be proven wrong, but for the "big picture", I don't
> believe that LaTeX/XMP/PDF is the way to go for LD-friendly - perhaps
> efforts for that better invested elsewhere.

People choose document formats for a variety of reasons:
access to tools, downstream workflows, and community
features. I don't think there is a single "way to go".
Some requirements lead some people to choose PDF
(or Excel or JPEG or PNG or ePub) for some use cases of
document distribution. There is no good reason to
avoid enhancing the LD-friendliness for those cases.

Of course you are free to put your effort on future
document formats, or other use cases.

>  There are a number of issues
> and shortcomings with the PDF approach which in the end will not play
> well with the Web is intended to be, nor how it functions.

I think I have familiarity with what the Web is intended to
be, and how it functions, and I disagree.

One of the earliest advances of the web (from Tim's
original HTML-only design) was the introduction of 
support for multiple formats, including image and
document representations.  It would greatly
improve the open data initiative to not restrict LD
to HTML.

Your other points:
> not fault tolerant

What kinds of faults do you think should be tolerated
but are not? I looked through

but I'm still not sure what you mean.

> machine-friendly (regardless of what can be stuffed into XMP), 

I think "machine-friendly", in the LOD context, means that there are
readily available tools to add, extract, and manipulate the data.

And it should be possible to annotate any format that
is suitable.

> and will not scale. 

This baffles me. What scaling do you have in mind?
I've worked with 2000-page PDF files, which, when
served from HTTP servers with range retrieval,
can incrementally display quickly. There may be
some performance goals specific to 'data'?
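
As a rough illustration of the range retrieval mentioned above, here is
a minimal sketch of how a server might resolve a single HTTP Range
header into byte offsets within a large PDF. The function name
resolve_byte_range is hypothetical and not taken from any particular
server; multi-range requests are deliberately left out.

```python
from typing import Tuple


def resolve_byte_range(header: str, size: int) -> Tuple[int, int]:
    """Resolve a single-range HTTP Range header to inclusive (start, end) offsets.

    Hypothetical helper for illustration only; handles "bytes=a-b",
    "bytes=a-" and the suffix form "bytes=-n".
    """
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("only a single bytes range is handled in this sketch")
    first, _, last = spec.strip().partition("-")
    if first == "":
        # Suffix form "bytes=-n": the final n bytes of the file.
        length = int(last)
        start = max(size - length, 0)
        end = size - 1
    else:
        start = int(first)
        # An open-ended or overlong range is clamped to the end of the file.
        end = min(int(last), size - 1) if last else size - 1
    if start > end or start >= size:
        raise ValueError("range not satisfiable")
    return start, end
```

A viewer that first asks for the tail of the file and then for the byte
ranges of individual pages only transfers what it needs, which is why a
very long document can begin displaying before it has fully downloaded.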

> At the end of the day, PDF is a silo-document, 

There are hyperlinks in, and hyperlinks out. Embedding. 
Except that HTML can be source-edited as text, I am
not sure what you mean by 'silo', then.

> its rendering is a resource-hog in different devices,

Rendering time for PDF, as for any other format, is determined
more by the quality of the creation tools. Surely there are
bloated singing dancing HTML5 documents which hog resources too.

These are publisher choices, whether they invest
in optimization.

>  and it is not a ubiquitous reading/interactive
> experience in different devices.

More consistent than other choices by design. Perhaps
not as widely available as HTML and JPEG, but close.

> For XMP/PDF to work, I presume you are going to end up dealing with

One possibility, although not exclusively, see "attachment" above.

> and an appropriate interface for authors to mark their
> statements with.

The use cases I have in mind are for data-driven document generation
scenarios, not primarily individual authors. And even for individual
documents, PDF is not currently a common authoring format,
except perhaps when filling out PDF forms.

> Keep in mind that, this will most likely treat the data
> as a separate island, disassociated from the context in which it appears
> in.

Except for annotations, or form-data for PDF forms, 
I'm not sure what you see as the problem. Yes, one might
imagine someone updating a PDF using editing tools without
updating the corresponding data, but I don't imagine
this a common pattern. I’m thinking rather that data
identification and markup would happen in a source format,
and subsequently extracted from the source and re-injected
into the PDF as part of the publication process, if not
preserved in the conversion process (which depends on
the tools in use for PDF production).

> May I invite you to read:
> http://csarven.ca/enabling-accessible-knowledge 
> It covers my position in sufficient depth - not intended to be overly
> technical, but rather covering the ground rules and ongoing work.

Thank you; your paper brings up additional considerations.

> While you are at it, please do a quick print-view from your Web
> browser (preferably in Firefox) or print to PDF.

I tried Chrome and Internet Explorer; I'm not sure what I’m supposed
to see. The style is entirely different, of course. Having worked
on content-adaptation since the 80s, I can say my experience is 
users don't like the surprises of visual incongruity among 
content-negotiated renderings.

> The RDF bits are visible here:
> http://www.w3.org/2012/pyRdfa/extract?uri=http%3A%2F%2Fcsarven.ca%2Fenabling-accessible-knowledge&rdfa_lite=false&vocab_expansion=false&embedded_rdf=true&validate=yes&space_preserve=true&vocab_cache_report=false&vocab_cache_bypass=false

Is there a way of generating triples in Turtle or RDF/XML?
I was experimenting with JSON because I could get it:


and (by hand, just to see what it would look like) merge it with the PDF's XMP.
Triples about the document itself fit into XMP document metadata, while other
data fits into a linked-data namespaced JSON-Data element.
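
The split described above can be sketched roughly as follows, assuming
the triples are available as plain (subject, predicate, object) tuples.
The function name split_triples and the "s"/"p"/"o" keys are
hypothetical, and the actual writing of XMP is not shown; this only
illustrates which triples would go where.

```python
import json


def split_triples(triples, document_uri):
    """Partition triples into those about the document and everything else.

    Hypothetical helper: triples whose subject is the document itself are
    returned as a list (candidates for XMP document metadata); the rest are
    serialized to a JSON string (a candidate payload for a namespaced
    JSON data element). No XMP is read or written here.
    """
    about_document = [t for t in triples if t[0] == document_uri]
    other = [t for t in triples if t[0] != document_uri]
    payload = json.dumps(
        [{"s": s, "p": p, "o": o} for s, p, o in other],
        sort_keys=True,
    )
    return about_document, payload
```

For example, a title statement whose subject is the document URI would
land in the first group, while a statement about a cited work would end
up in the JSON payload.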

You might also want to consider XMP for video.

> I will spare you the details on what's going on there, unless you really
> want to know, but to put it in a nutshell: it covers statements dealing
> with sections, provenance, references/citations..

I think these can be covered too. The ways this kind of structure might be useful are harder to imagine. Perhaps for data about formal documents like legislation and court records.

> Here is another paper: http://linked-reseach.270a.info/ (which can just
> as well be a PDF - after all, PDF is just a view), which in addition to
> above, includes more atomic things like hypothesis, variables,
> workflows, ..

The amount of 'data' you can annotate and extract is heady,
but volume is not the best metric, not even a good metric.
I'd rather start from practical use cases where the data available
has clear, incremental value.  

> The work is based on Linked Research:
> https://github.com/csarven/linked-research

I'll just say I have a different perspective, which I've blogged about over the years, e.g.,



> If you are comfortable with your browser's developer toolbar, try
> changing the stylesheet lncs.css in <head> to acm.css.
> There is a whole behavioural/interactive layer which I'll skip over now,
> but you can take a look at it if you fancy JavaScript.
> As you may have already noticed, the HTML template is flexible enough
> for "blog" posts and "papers" - again, this is about separating the
> structure/content from the other layers: presentation, and behaviour.


Surely it's not what you do, but it's the way that you do it (that determines who you are).

Received on Friday, 30 January 2015 15:48:43 UTC
