- From: Larry Masinter <masinter@adobe.com>
- Date: Fri, 30 Jan 2015 15:48:14 +0000
- To: Sarven Capadisli <info@csarven.ca>
- CC: "public-lod@w3.org" <public-lod@w3.org>
((sorry, this sat in my drafts box, incomplete, and out of sequence now))

In reply to Sarven Capadisli's email of 2015-01-19 12:20:

> First off, I totally acknowledge your interest to improve the state of
> things for PDF.

Thanks.

> I'm welcome to be proven wrong, but for the "big picture", I don't
> believe that LaTeX/XMP/PDF is the way to go for LD-friendly - perhaps
> efforts for that better invested elsewhere.

People choose document formats for a variety of reasons: for access to tools, downstream workflows, and community features. I don't think there is a single "way to go". Some requirements lead some people to choose PDF (or Excel or JPEG or PNG or ePub) for some use cases of document distribution. There is no good reason to avoid enhancing the LD-friendliness of those cases. Of course you are free to put your effort into future document formats, or other use cases.

> There are a number of issues
> and shortcomings with the PDF approach which in the end will not play
> well with what the Web is intended to be, nor how it functions.

I think I have familiarity with what the Web is intended to be, and how it functions, and I disagree. One of the earliest advances of the web (from Tim's original HTML-only design) was the introduction of support for multiple formats, including image and document representations. It would greatly improve the open data initiative to not restrict LD to HTML.

Your other points:

> not fault tolerant

What kinds of faults do you think should be tolerated but are not? I looked through http://csarven.ca/enabling-accessible-knowledge but I'm still not sure what you mean.

> machine-friendly (regardless of what can be stuffed into XMP),

I think "machine-friendly", in the LOD context, means that there are readily available tools to add, extract, and manipulate the data. And it should be possible to annotate any format that is suitable.

> and will not scale.

This baffles me; what scaling do you have in mind? I've worked with 2000-page PDF files which, when served from HTTP servers with range retrieval, can incrementally display quickly. There may be some performance goals specific to 'data'?

> At the end of the day, PDF is a silo-document,

There are hyperlinks in, and hyperlinks out. Embedding. Other than the fact that HTML can be source-edited as text, I am not sure what you mean by 'silo'.

> its rendering is a resource-hog in different devices,

Rendering time for PDF, as for any other format, is determined more by the quality of the creation tools. Surely there are bloated, singing-and-dancing HTML5 documents which hog resources. These are publisher choices: whether they invest in optimization.

> and it is not a ubiquitous reading/interactive
> experience in different devices.

More consistent than other choices, by design. Perhaps not as widely available as HTML and JPEG, but close.

> For XMP/PDF to work, I presume you are going to end up dealing with
> RDF/XML

One possibility, although not exclusively; see "attachment" above.

> and an appropriate interface for authors to mark their
> statements with.

The use cases I have in mind are data-driven document generation scenarios, not primarily individual authors. And even for individual documents, PDF is not currently a common authoring format, except perhaps when filling out PDF forms.

> Keep in mind that, this will most likely treat the data
> as a separate island, disassociated from the context in which it appears
> in.

Except for annotations, or form data for PDF forms, I'm not sure what you see as the problem.
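To make the "readily available tools" point above concrete: here is a minimal sketch, assuming Python with the pikepdf library (one of several libraries that expose a PDF's XMP packet), and a hypothetical paper.pdf. Nothing about it is specific to your paper; it only shows that adding and reading document-level statements is scriptable.

```python
# Minimal sketch, not a recommendation: pikepdf is just one tool that can
# read and write a PDF's XMP packet. "paper.pdf" and the property values
# below are hypothetical.
import pikepdf

with pikepdf.open("paper.pdf") as pdf:
    with pdf.open_metadata() as meta:
        # Document-level statements map naturally onto XMP / Dublin Core.
        meta["dc:title"] = "Enabling Accessible Knowledge"
        meta["dc:creator"] = ["Sarven Capadisli"]
        meta["dc:identifier"] = "http://csarven.ca/enabling-accessible-knowledge"
    pdf.save("paper-with-xmp.pdf")

# Reading it back (exiftool, Acrobat, etc. would show the same packet):
with pikepdf.open("paper-with-xmp.pdf") as pdf:
    meta = pdf.open_metadata()
    print(meta["dc:title"])
```

Whether embedded statements like these stay in sync with the document is, I take it, the real "island" concern.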
Yes, one might imagine someone updating a PDF using editing tools without updating the corresponding data, but I don't imagine this is a common pattern. I'm thinking rather that data identification and markup would happen in a source format, and the data would subsequently be extracted from the source and re-injected into the PDF as part of the publication process, if it is not preserved in the conversion process (which depends on the tools in use for PDF production).

> May I invite you to read:
> http://csarven.ca/enabling-accessible-knowledge
>
> It covers my position in sufficient depth - not intended to be overly
> technical, but rather covering the ground rules and ongoing work.

Thank you; your paper brings up additional considerations.

> While you are at it, please do a quick print-view from your Web
> browser (preferably in Firefox) or print to PDF.

I tried Chrome and Internet Explorer; I'm not sure what I'm supposed to see. The style is entirely different, of course. Having worked on content adaptation since the 80s, I can say my experience is that users don't like the surprise of visual incongruity among content-negotiated renderings.

> The RDF bits are visible here:
>
> http://www.w3.org/2012/pyRdfa/extract?uri=http%3A%2F%2Fcsarven.ca%2Fenabling-accessible-knowledge&rdfa_lite=false&vocab_expansion=false&embedded_rdf=true&validate=yes&space_preserve=true&vocab_cache_report=false&vocab_cache_bypass=false

Is there a way of generating triples in Turtle or RDF/XML? I was experimenting with JSON because I could get it:

http://graphite.ecs.soton.ac.uk/rdf2json/index.php/converted.js?uri=http%3A%2F%2Fcsarven.ca%2Fenabling-accessible-knowledge

and (by hand, just to see what it would look like) merged it with the PDF's XMP. Triples about the document itself fit into XMP document metadata, while other data fits into a linked-data-namespaced JSON-Data element. (A small sketch of the kind of conversion I mean is at the end of this message.) You might also want to consider XMP for video:

http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/DynamicMediaXMPPartnerGuide.pdf

> I will spare you the details on what's going on there, unless you really
> want to know, but to put it in a nutshell: it covers statements dealing
> with sections, provenance, references/citations..

I think these can be covered too. The ways this kind of structure might be useful are harder to imagine; perhaps for data about formal documents like legislation and court records.

> Here is another paper: http://linked-reseach.270a.info/ (which can just
> as well be a PDF - after all, PDF is just a view), which in addition to
> above, includes more atomic things like hypothesis, variables,
> workflows, ..

It's heady, the amount of 'data' you can annotate and extract, but volume is not the best metric, nor even a good one. I'd rather start from practical use cases where the data available has clear, incremental value.

> The work is based on Linked Research:
>
> https://github.com/csarven/linked-research

I'll just say I have a different perspective, which I've blogged about over the years, e.g.:

http://masinter.blogspot.com/2011/08/expert-system-scalability-and-semantic.html
http://masinter.blogspot.com/2014/11/ambiguity-semantic-web-speech-acts.html

> If you are comfortable with your browser's developer toolbar, try
> changing the stylesheet lncs.css in <head> to acm.css.
>
> There is a whole behavioural/interactive layer which I'll skip over now,
> but you can take a look at it if you fancy JavaScript.
> As you may have already noticed, the HTML template is flexible enough
> for "blog" posts and "papers" - again, this is about separating the
> structure/content from the other layers: presentation, and behaviour.

https://www.youtube.com/watch?v=Jg_uoixTsbY

Surely it's not what you do, but it's the way that you do it (that determines who you are).
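P.S. On my Turtle/RDF/XML question above: a minimal sketch of the kind of conversion I had in mind, assuming Python with rdflib and an RDF/XML dump of the extracted triples saved locally. The file name is hypothetical, and nothing here is specific to the pyRdfa service.

```python
# Minimal sketch: re-serialize a set of triples with rdflib.
# "extracted.rdf" is a hypothetical RDF/XML file holding the extractor's output.
from rdflib import Graph

g = Graph()
g.parse("extracted.rdf", format="xml")   # RDF/XML in

print(g.serialize(format="turtle"))      # Turtle out, for reading
print(g.serialize(format="json-ld"))     # JSON-LD out (bundled in rdflib 6+),
                                         # e.g. to merge by hand as I did above
```

The same graph could then be trimmed down to the document-level statements destined for XMP, with the rest carried separately.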
Received on Friday, 30 January 2015 15:48:43 UTC