Re: linked open data and PDF

On 2015-01-30 16:48, Larry Masinter wrote:
>>   There are a number of issues
>> and shortcomings with the PDF approach which in the end will not play
>> well with what the Web is intended to be, nor how it functions.
>
> I think I have familiarity with what the Web is intended to
> be, and how it functions, and I disagree.
>
> One of the earliest advances of the web (from Tim's
> original HTML-only design) was the introduction of
> support for multiple formats, including image and
> document representations.  It would greatly
> improve the open data initiative to not restrict LD
> to HTML.

No one is restricting LD to HTML. Evidently that is not the case, nor 
should it be. FYI, RDF serializations lead the way in LD, but for the 
human end-user, all roads almost always lead to HTML.

Multiple formats are indeed supported, but their mileage varies 
depending on how we get hold of them. We have HTML, which tries to 
address their accessibility and discoverability. It is clear that PDFs 
are data-silos, since we do not hop from one (binary document) to 
another. While linking is possible, at the end of the day there is a UX 
problem. There is no ubiquitous experience which allows one to switch 
between PDF and HTML resources in a given device, operating system, and 
software (e.g., Web browser, PDF reader). Jumping between them is 
awkward, and for the sake of what? How or why would that UX be 
preferable for the user? Surely that can be improved; as you well know, 
Web browsers can display PDFs nowadays. But still, that's just an 
annoyance (or, depending on who you ask, a convenience).

Surely, you also know why timbl decided not to use TeX as the language 
to author and exchange documents on the Web.

I stand by my original point that HTML is a good bet. The burden of 
proof that PDF is somehow Web or LD "friendly" lies on the shoulders of 
enthusiasts and stakeholders. Make it so.

This is not to discourage any format striving to be more open and 
machine-friendly on the Web.

> Your other points:
>> not fault tolerant
>
> What are kinds of faults you think should be tolerated
> but are not? I looked through
>   http://csarven.ca/enabling-accessible-knowledge
> but I'm still not sure what you mean.

Open up a (La)TeX/Word/PDF file and remove a non-content character - I 
hope we don't have to debate which character. Is the document still 
"useful"? What kind of error-handling is there in the corresponding 
readers, or in anything that can make an HTTP call and display the 
response to a human? Compare that with HTML.
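
To make the contrast concrete, here is a minimal sketch (my own 
illustration, not from the paper) using Python's standard-library HTML 
parser. Feed it a document with unclosed elements and a truncated end, 
and it still recovers the text; corrupting a byte in a PDF's 
cross-reference table will, at best, force the reader to attempt to 
rebuild the file:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect text content, whatever state the markup is in."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    # Deliberately malformed: unclosed <b>, unclosed <p>, truncated document.
    broken = "<html><body><p>Hello <b>world </p><p>still readable"
    extractor = TextExtractor()
    extractor.feed(broken)
    print("".join(extractor.chunks))  # prints: Hello world still readable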

>> machine-friendly (regardless of what can be stuffed into XMP),
>
> I think "machine-friendly", in the LOD context, means that there are
> readily available tools to add, extract, manipulate.
>
> And it should be possible to annotate any format that
> is suitable.

By that line of reasoning, practically anything is machine-friendly; 
not to mention that it is something we are striving for anyway. For 
instance, an image of text is certainly machine-friendly if OCR can be 
applied to it, or if one can point a camera at some text on a wall and 
have the words translated. But I suspect that many would argue over 
whether an image is machine-friendly or not in the LD context. Is there 
a fundamental difference between PDF and, say, a JPEG in the context of 
LD? I'm ignorant on this matter, as I have difficulty spotting one.

>> and will not scale.
>
> This baffles me, what scaling do you have in mind?
> I've worked with 2000-page PDF files, which, when
> served from HTTP servers with range retrieval,
> can incrementally display quickly. There may be
> some performance goals specific to 'data'?

First, I'm not suggesting that PDF is not widely used in a (desktop) 
environment with pre-installed software, but rather that access to it 
over the Web is not that great.

This also relates to the ease of creating, publishing, and maintaining 
PDF documents. If PDF had a strong case, I would argue that we'd see a 
different Web than the one we are using now.

>> At the end of the day, PDF is a silo-document,
>
> There are hyperlinks in, and hyperlinks out. Embedding.
> Except that HTML can be source-edited as text, I am
> not sure what you mean by 'silo', then.

I've touched on data-silos earlier. Yes, certainly parts of a PDF can 
be linked to, and it can link out, but again, how good and reliable is 
that UX across devices, OSes, and viewers?

>>   and it is not a ubiquitous reading/interactive
>> experience in different devices.
>
> More consistent than other choices by design. Perhaps
> not as widely available as HTML and JPEG, but close.

I suppose we should define pixel accuracy, but I agree with you on 
consistency. I do not think that PDF is anywhere "close" to HTML's 
penetration across devices, but if you have the numbers for that, I'd 
be happy to change my view on this particular point.

>> Keep in mind that this will most likely treat the data
>> as a separate island, disassociated from the context in which it
>> appears.
>
> Except for annotations, or form-data for PDF forms,
> I'm not sure what you see as the problem. Yes, one might
> imagine someone updating a PDF using editing tools without
> updating the corresponding data, but I don't imagine
> this a common pattern. I’m thinking rather that data
> identification and markup would happen in a source format,
> and subsequently extracted from the source and re-injected
> into the PDF as part of the publication process, if not
> preserved in the conversion process (which depends on
> the tools in use for PDF production).

As you like. I think there are too many points of failure in that 
workflow, but I won't argue against the initiative.

>> May I invite you to read:
>> http://csarven.ca/enabling-accessible-knowledge
>>
>> It covers my position in sufficient depth - not intended to be overly
>> technical, but rather covering the ground rules and ongoing work.
>
> Thank you; your paper brings up additional considerations.
>
>> While you are at it, please do a quick print-view from your Web
>> browser (preferably in Firefox) or print to PDF.
>
> I tried Chrome and Internet explorer, not sure what I’m supposed
> to see. The style is entirely different, of course. Having worked
> on content-adaptation since the 80s, I can say my experience is
> users don't like the surprises of visual incongruity among
> content-negotiated renderings.

Users do not like surprises in general :) Unless, of course, the UX is 
designed with that in mind and the user is aware of it, which can be 
fun; games often do this, for example. In any case, it is trivial to 
point out that the Web we have is far from pixel-perfect, and far from 
well-formed documents. Yet the net result is that information is 
disseminated just fine.

If we are going to discuss UI issues with HTML(+CSS...) documents, then 
we should also discuss PDF's. Here is a snippet which I actually left 
out of enabling-accessible-knowledge (because I couldn't find an 
appropriate place for it - it is still in the source, commented out):

The Nielsen Norman Group, an internationally well-known UI and UX 
consulting firm, has conducted a number of UI evaluations on Web 
usability over the years (most recently in 2010), and has repeatedly 
reported that PDF is "unfit for human consumption":

http://www.nngroup.com/articles/pdf-unfit-for-human-consumption

One of the points the group emphasizes is that PDF is great for one 
thing: printing documents. Moreover, they state that "forcing users to 
browse PDF files makes usability approximately 300% worse compared to 
HTML pages", with a variety of accompanying usability studies 
summarized as "users hate PDF":

http://www.nngroup.com/articles/avoid-pdf-for-on-screen-reading

Having said that, I'm sure it is trivial to find many studies favouring 
as well as disapproving of both HTML and PDF for users in different 
contexts. What is true is that both are useful for different users, 
needs, and environments.

>> The RDF bits are visible here:
>>
>> http://www.w3.org/2012/pyRdfa/extract?uri=http%3A%2F%2Fcsarven.ca%2Fenabling-accessible-knowledge&rdfa_lite=false&vocab_expansion=false&embedded_rdf=true&validate=yes&space_preserve=true&vocab_cache_report=false&vocab_cache_bypass=false
>
> Is there a way of generating triples in Turtle or RDF/XML?
> I was experimenting with JSON because I could get it:
>
> http://graphite.ecs.soton.ac.uk/rdf2json/index.php/converted.js?uri=http%3A%2F%2Fcsarven.ca%2Fenabling-accessible-knowledge

> and (by hand, just to see what it would look like) merge it with the PDF's XMP.
> Triples about the document itself fit into XMP document metadata, while other
> data fits into a linked-data namespaced JSON-Data element.

Sure, there are many parsers and enhancers; try some of these:

* http://www.w3.org/2012/pyRdfa/
* http://rdf.greggkellogg.net/distiller
* http://linkeddata.uriburner.com/

I would also recommend rapper, which is great from the command-line:

https://github.com/dajobe/raptor
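
For example (assuming your Raptor build includes the RDFa parser), 
something along these lines should answer your Turtle question directly:

    rapper -i rdfa -o turtle http://csarven.ca/enabling-accessible-knowledge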

>> I will spare you the details on what's going on there, unless you really
>> want to know, but to put it in a nutshell: it covers statements dealing
>> with sections, provenance, references/citations..
>
> I think these can be covered too. The ways this kind of structure might be useful are harder to imagine. Perhaps for data about formal documents like legislation and court records.
>
>> Here is another paper: http://linked-reseach.270a.info/ (which can just
>> as well be a PDF - after all, PDF is just a view), which in addition to
>> above, includes more atomic things like hypothesis, variables,
>> workflows, ..
>
> It's heady the amount of 'data' you can annotate and extract,
> but volume is not the best metric, not even a good metric.
> I'd rather start from practical use cases where the data available
> has clear, incremental value.

I can come to agree with you that volume is not a good metric, but may 
I refer you to: “Quantity has a quality all of its own” — Joseph Stalin 
(https://twitter.com/csarven/status/69174259058085888).

First of all, if we didn't care about quantity, we'd probably stop at 
the title of a document. Second, I would say that what I have in that 
document is not "a lot", nor something to brag about. It merely covers 
the initial argument line. There is a whole suite of vocabularies and 
ontologies which cover concepts at fine granularity. Authors need only 
use them as needed ("pay as you go"). For instance, if I want someone 
to discover and be able to refer to the hypothesis of a research paper, 
I'd make sure that's possible using the technologies that are available 
to me. If I want to (dis)agree with someone else's claim, I can relate 
mine to theirs. In fact, the "relation" concept is nothing new. It has 
been here from day one:

http://www.w3.org/History/1989/proposal.html

As for the use case, how about this one, which I've mentioned in 
another thread (actually, I mention something along these lines quite 
often on this mailing list):

https://lists.w3.org/Archives/Public/public-lod/2015Jan/0089.html

"Example: I want to discover the variables that are declared in the 
hypothesis of papers."
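
As a rough sketch of what such a query could look like (my own 
illustration; the ex: vocabulary below is a hypothetical placeholder, 
and any suitable ontology could stand in), using Python's rdflib:

    from rdflib import Graph

    # Toy data standing in for what would be extracted (e.g., via RDFa)
    # from a research paper; ex: is a hypothetical vocabulary.
    data = """
    @prefix ex: <http://example.org/vocab#> .

    <http://example.org/paper#hypothesis> a ex:Hypothesis ;
        ex:isPartOf <http://example.org/paper> ;
        ex:declaresVariable <http://example.org/paper#variable-x> .
    """

    g = Graph()
    g.parse(data=data, format="turtle")

    # "Discover the variables that are declared in the hypothesis of papers."
    results = g.query("""
        PREFIX ex: <http://example.org/vocab#>
        SELECT ?paper ?variable WHERE {
            ?hypothesis a ex:Hypothesis ;
                        ex:isPartOf ?paper ;
                        ex:declaresVariable ?variable .
        }
    """)
    for paper, variable in results:
        print(paper, variable)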

Having said all that, I can step back from my preferred technologies 
and environments, but let me ask you whether an "Acid Test" like the 
one I've proposed:

http://csarven.ca/enabling-accessible-knowledge#acid-test

is something that you (as well as others) can agree on - at least for 
"research" documents. If so, then we can probably have a more fruitful 
discussion, as we can focus on the same goals and strive for 
interoperability.

-Sarven
http://csarven.ca/#i
