- From: Stephen Williams <sdw@lig.net>
- Date: Tue, 07 May 2013 01:40:52 -0700
- To: RebholzSchuhmann <d.rebholz.schuhmann@gmail.com>
- CC: beyond-the-pdf@googlegroups.com, Steve Pettifer <steve.pettifer@manchester.ac.uk>, Leonard Rosenthol <lrosenth@adobe.com>, Linking Open Data <public-lod@w3.org>, SW-forum <semantic-web@w3.org>
- Message-ID: <5188BE14.3090403@lig.net>
Generally, I'd rather have semantically tagged, reflowable, CSS-enabled XHTML documents, EPUB-like. However, PDFs serve a useful purpose too, and in some restricted cases it's hard to see how a particular goal can be achieved differently.

An interesting option is to do something like a "Hybrid PDF": store the original editable document and/or alternate forms (XHTML+CSS+semantic markup) in the PDF, automatically and reliably sensing those alternates at any point. LibreOffice includes this feature now:
http://blogs.computerworlduk.com/simon-says/2012/03/the-magic-of-editable-pdfs/index.htm

It's possible, at least to a large extent, to associate particular segments of data with particular rendered elements. OCR programs make use of this to place the resulting text in the same position as the graphic version of the text in a scanned page. This could allow copy and paste of semantically tagged data from a PDF, just like from an RDFa web page.

sdw

On 5/7/13 1:32 AM, RebholzSchuhmann wrote:
> Hi,
>
> I have seen similar discussions before.
>
> I guess, we look at two different use cases:
> (1) PDF: layout oriented, but could (and will, hopefully) carry a lot more semantics information. The key achievement is and will be to have optimal layout, and on the other side the overhead for processing / exploitation / reuse goes up for everybody who is NOT PDF-savvy.
> (2) the other open formats (Html, Xml, Pdf): allow easy-to-go exploitation, processing, and enrichment, and stand for the spirit of the open web and reuse of data.
>
> Listening to publishers, certainly layout matters. I am not only talking about the big five or ten who would have the resources to go a different direction, I am talking about the 1,000 smaller publishers who have to serve their community. They would struggle more to comply with the other "standards" and still deliver an appealing product.
>
> I guess, some clever thinking and collaborative work is required to bring both together.
>
> Hope this helps.
>
> -drs-
>
> On 07/05/2013 09:17, Steve Pettifer wrote:
>>> I assume most authors don't actually format their documents by selecting a font size for every single heading and so on.
>> This is a tempting assumption to make, especially if you come from computer science / maths / physics and related disciplines (as I do). But my experience in the life sciences is that authors do 'paint' their manuscripts by hand, painstakingly selecting the font and format for every bit of their document. Even using the 'semantic' features of wordprocessors (such as 'Heading 1') is something that's not commonplace. So before we get too carried away with expecting people to write HTML / LaTeX or even markup, we'll need to take into account the working practises of the vast majority of academics outside of the more 'semantically aware' bits of science.
>>
>>> They work in a format that utilizes semantically meaningful information about the work: to identify a title, headings, math blocks, illustrations, plots, etc.
>> No, they really don't. I wish they did. But, outside of a certain area of science, they don't.
>>
>> Steve
>>
>
> --
> D. Rebholz-Schuhmann - mailto:d.rebholz.schuhmann@gmail.com

--
Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com
LinkedIn: http://sdw.st/in
V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet
Resume: http://sdw.st/gres Personal: http://sdw.st
facebook.com/sdwlig twitter.com/scienteer
Received on Tuesday, 7 May 2013 08:41:21 UTC