- From: Sarven Capadisli <info@csarven.ca>
- Date: Tue, 03 Feb 2015 17:39:18 +0100
- To: Larry Masinter <masinter@adobe.com>
- CC: "public-lod@w3.org" <public-lod@w3.org>
On 2015-01-30 16:48, Larry Masinter wrote:

>> There are a number of issues and shortcomings with the PDF approach
>> which in the end will not play well with what the Web is intended to
>> be, nor how it functions.
>
> I think I have familiarity with what the Web is intended to be, and
> how it functions, and I disagree.
>
> One of the earliest advances of the web (from Tim's original
> HTML-only design) was the introduction of support for multiple
> formats, including image and document representations. It would
> greatly improve the open data initiative to not restrict LD to HTML.

No one is restricting LD to HTML, nor should it be that way. FYI, RDF serializations lead the way in LD. But, for the human/end-user, all roads almost always lead to HTML.

Multiple formats are indeed supported, but the mileage varies on how we get hold of them. HTML tries to address their accessibility and discoverability. It is clear that PDFs are data silos, since we do not hop from one (binary document) to another. While linking is possible, at the end of the day there is a UX problem: there is no ubiquitous experience that allows one to switch between PDF and HTML resources across devices, operating systems, and software (e.g., Web browser, PDF reader). Jumping between them is awkward, and for the sake of what? How or why would that UX be preferable for the user?

Surely that can be improved; as you well know, Web browsers can display PDFs nowadays. But still, that's just an annoyance (or, depending on who you ask, a convenience). Surely you also know why timbl decided not to use TeX as the language to author and exchange documents on the Web.

I stand by my original point that HTML is a good bet. The burden of proof that PDF is somehow Web- or LD-"friendly" lies on the shoulders of its enthusiasts and stakeholders. Make it so.
This is not to discourage any format striving to be more open and machine-friendly on the Web.

> Your other points:
>
>> not fault tolerant
>
> What are kinds of faults you think should be tolerated but are not?
> I looked through http://csarven.ca/enabling-accessible-knowledge
> but I'm still not sure what you mean.

Open up a (La)TeX/Word/PDF file and remove a non-content character - I hope we don't have to debate which character. Is the document still "useful"? What kind of error handling is there in the corresponding readers, or in anything that can make an HTTP call and display the response for a human? Compare that with HTML.

>> machine-friendly (regardless of what can be stuffed into XMP),
>
> I think "machine-friendly" in LOD context, means that there are
> readily available tools to add, extract, manipulate.
>
> And it should be possible to annotate any format that is suitable.

With that line of reasoning, practically anything is machine-friendly - not to mention that it is something we are striving for anyway. For instance, an image of text is certainly machine-friendly if OCR can be applied to it, or if one can point a camera at some text on a wall and have it translate the words. But I suspect that many would argue over whether an image is machine-friendly in the LD context. Is there a fundamental difference between PDF and, say, a JPEG in the context of LD? I'm ignorant on this matter, as I have difficulty spotting one.

>> and will not scale.
>
> This baffles me, what scaling do you have in mind? I've worked with
> 2000-page PDF files, which, when served from HTTP servers with range
> retrieval, can incrementally display quickly. There may be some
> performance goals specific to 'data'?

First, I'm not suggesting that PDF is not widely used in a (desktop) environment with pre-installed software, but rather that its access over the Web is not that great.
This relates to the ease of creating, publishing, and maintaining PDF documents. If PDF had a strong case, I would argue that we'd see a different Web than the one we are using now.

>> At the end of the day, PDF is a silo-document,
>
> There are hyperlinks in, and hyperlinks out. Embedding. Except that
> HTML can be source-edited as text, I am not sure what you mean by
> 'silo', then.

I've touched on data silos earlier. Yes, certainly parts of a PDF can be linked to, and it can link out, but again, how good and reliable is that UX across devices, OSs, and viewers?

>> and it is not a ubiquitous reading/interactive experience in
>> different devices.
>
> More consistent than other choices by design. Perhaps not as widely
> available as HTML and JPEG, but close.

I suppose we should define pixel accuracy, but I agree with you on consistency. I do not think that PDF is anywhere "close" to HTML's penetration across devices; but if you have the numbers for that, I'd be happy to change my view on this particular point.

>> Keep in mind that this will most likely treat the data as a
>> separate island, disassociated from the context in which it appears.
>
> Except for annotations, or form-data for PDF forms, I'm not sure
> what you see as the problem. Yes, one might imagine someone updating
> a PDF using editing tools without updating the corresponding data,
> but I don't imagine this a common pattern. I'm thinking rather that
> data identification and markup would happen in a source format, and
> subsequently extracted from the source and re-injected into the PDF
> as part of the publication process, if not preserved in the
> conversion process (which depends on the tools in use for PDF
> production).

As you like. I think there are too many points of failure in that workflow, but I won't argue against the initiative.
>> May I invite you to read:
>> http://csarven.ca/enabling-accessible-knowledge
>>
>> It covers my position in sufficient depth - not intended to be
>> overly technical, but rather covering the ground rules and ongoing
>> work.
>
> Thank you; your paper brings up additional considerations.
>
>> While you are at it, please do a quick print-view from your Web
>> browser (preferably in Firefox) or print to PDF.
>
> I tried Chrome and Internet Explorer, not sure what I'm supposed to
> see. The style is entirely different, of course. Having worked on
> content-adaptation since the 80s, I can say my experience is users
> don't like the surprises of visual incongruity among
> content-negotiated renderings.

Users do not like surprises in general :) Unless, of course, the UX is designed with that in mind and the user is aware of it, which can be fun - games often do this. In any case, it is trivial to point out that the Web we have is far from pixel perfection, and far from well-formed documents. Yet the net result is that information is disseminated just fine. If we are going to discuss UI issues with HTML(+CSS...) documents, then we should also discuss PDF's.

Here is a snippet which I actually left out of enabling-accessible-knowledge (because I couldn't find an appropriate place for it - it is still in the source, commented out): The Nielsen Norman Group, an internationally well-known UI and UX consulting firm, has conducted a number of UI evaluations over the years (most recently in 2010) on Web usability, and has repeatedly reported that PDF is "unfit for human consumption": http://www.nngroup.com/articles/pdf-unfit-for-human-consumption

One point the group emphasizes is that PDF is great for one thing: printing documents.
Moreover, they state that "forcing users to browse PDF files makes usability approximately 300% worse compared to HTML pages", accompanied by a variety of usability studies summarized as "users hate PDF": http://www.nngroup.com/articles/avoid-pdf-for-on-screen-reading

Having said that, I'm sure it is trivial to find many studies favouring as well as disfavouring both HTML and PDF for users in different contexts. What is true is that both are useful for different users, needs, and environments.

>> The RDF bits are visible here:
>>
>> http://www.w3.org/2012/pyRdfa/extract?uri=http%3A%2F%2Fcsarven.ca%2Fenabling-accessible-knowledge&rdfa_lite=false&vocab_expansion=false&embedded_rdf=true&validate=yes&space_preserve=true&vocab_cache_report=false&vocab_cache_bypass=false
>
> Is there a way of generating triples in Turtle or RDF/XML? I was
> experimenting with JSON because I could get it:
>
> http://graphite.ecs.soton.ac.uk/rdf2json/index.php/converted.js?uri=http%3A%2F%2Fcsarven.ca%2Fenabling-accessible-knowledge
>
> and (by hand, just to see what it would look like) merge it with the
> PDF's XMP. Triples about the document itself fit into XMP document
> metadata, while other data fits into a linked-data namespaced
> JSON-Data element.

Sure, there are many parsers and enhancers; try some of these:

* http://www.w3.org/2012/pyRdfa/
* http://rdf.greggkellogg.net/distiller
* http://linkeddata.uriburner.com/

I would also recommend rapper, which is great from the command line: https://github.com/dajobe/raptor

>> I will spare you the details on what's going on there, unless you
>> really want to know, but to put it in a nutshell: it covers
>> statements dealing with sections, provenance, references/citations..
>
> I think these can be covered too. The ways this kind of structure
> might be useful are harder to imagine. Perhaps for data about formal
> documents like legislation and court records.
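To give a feel for what those extraction tools emit, here is a hand-written sketch of Turtle output for the document above. The triples are illustrative only - the actual properties depend on the vocabularies the document uses in its RDFa, and the values shown here are assumptions, not a verbatim dump:

```turtle
# Illustrative sketch of RDFa-extracted triples serialized as Turtle.
# Property choices (dcterms, schema.org) and values are assumed, not
# copied from the actual extraction.
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .

<http://csarven.ca/enabling-accessible-knowledge>
    a schema:ScholarlyArticle ;
    dcterms:title "Enabling Accessible Knowledge" ;
    schema:hasPart <http://csarven.ca/enabling-accessible-knowledge#acid-test> .
```

Any of the tools listed above (or rapper) can produce Turtle or RDF/XML of this shape from the same source HTML+RDFa, which answers the "Turtle or RDF/XML" question directly.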
>> Here is another paper: http://linked-reseach.270a.info/ (which can
>> just as well be a PDF - after all, PDF is just a view), which in
>> addition to above, includes more atomic things like hypothesis,
>> variables, workflows, ..
>
> It's heady the amount of 'data' you can annotate and extract, but
> volume is not the best metric, not even a good metric. I'd rather
> start from practical use cases where the data available has clear,
> incremental value.

I can come to agree with you that volume is not a good metric, but may I refer you to: "Quantity has a quality all of its own" - Joseph Stalin ( https://twitter.com/csarven/status/69174259058085888 ).

First of all, if we didn't care about quantity, we'd probably stop at the title of a document. Second, I would say that what I have in that document is not "a lot", nor something to brag about. It merely covers the initial line of argument. There is a whole suite of vocabularies and ontologies which cover concepts at fine granularity. Authors only need to use them as they need to ("pay as you go"). For instance, if I want someone to be able to discover and refer to the hypothesis of a research paper, I'd make sure that's possible using the technologies available to me. If I want to (dis)agree with someone else's claim, I can relate mine to theirs. In fact, the "relation" concept is nothing new. It has been here from day one: http://www.w3.org/History/1989/proposal.html

As for the use case, how about this one that I've mentioned in another thread (actually, I mention something along these lines quite often on this mailing list): https://lists.w3.org/Archives/Public/public-lod/2015Jan/0089.html

"Example: I want to discover the variables that are declared in the hypothesis of papers."
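If papers expose their structure as RDF, that use case reduces to a single query. A sketch in SPARQL, where the `ex:` terms are hypothetical placeholders for whatever vocabulary a paper actually uses to mark up its hypothesis and variables:

```sparql
# Sketch only: "ex:" is a hypothetical vocabulary, standing in for the
# real terms a paper uses to describe its hypothesis and variables.
PREFIX ex: <http://example.org/vocab#>

SELECT ?paper ?variable
WHERE {
  ?paper      ex:hypothesis       ?hypothesis .  # the paper's stated hypothesis
  ?hypothesis ex:declaresVariable ?variable .    # variables declared within it
}
```

No equivalent query is possible over a corpus of PDFs without first extracting and modelling that structure - which is precisely the point about machine-friendliness.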
Having said all that, I can step back from my preferred technologies and environments, but let me ask whether an "Acid Test" like the one I've proposed:

http://csarven.ca/enabling-accessible-knowledge#acid-test

is something that you (as well as others) can agree on - at least for "research" documents. If so, then we can probably have a more fruitful discussion, as we can focus on the same goals and strive for interoperability.

-Sarven
http://csarven.ca/#i
Received on Tuesday, 3 February 2015 16:39:50 UTC