- From: Hammond, Tony <t.hammond@nature.com>
- Date: Thu, 12 Feb 2009 10:13:46 +0000
- To: <semantic-web@w3.org>
- Message-ID: <C5B9A6DA.18009%t.hammond@nature.com>
Hi All:

> If I had to annotate uneditable PDFs
> But so far, IMHO the PDF remains not so open

@Paul, @Alex: I do not want in any way to be an apologist for PDF, but the simple fact is that it is an open published specification. That is not the issue here. I have myself hand-built a very rudimentary editor for PDFs so I know that it is "do-able". (I did not say it was necessarily easy. PDF is a very straightforward format - just layers upon layers of structure to deal with, which does make it appear "difficult".) One should not mistake complexity for lack of transparency.

> Scan the file looking for "<rdf:RDF " and then invoke an RDF/XML parser (til the closing </rdf:RDF>).
> XMP being a single separate component of the document,

@Jeremy, @John: XMP is not a singleton. The main metadata for a PDF document is expressed in the XMP packet referenced from the "/Metadata" entry in the document catalog object. Other XMP packets may (and do) occur within a PDF, for example XMP packets for graphics files - JPEGs, GIFs, PNGs, etc - embedded within the PDF. An XMP packet may be associated with any object in a PDF file. It is simply inserted as a PDF stream object (containing the XMP packet, which itself wraps an RDF/XML document).

[[ And yes, there are some restrictions placed on the RDF/XML profile - but that is a separate subject. :) ]]

The correct way to retrieve the main (or document) XMP packet from a PDF is to navigate the PDF object structure. Alternatively there are simple heuristics for raw packet scanning which will return the correct XMP packet.

[[ There is a special byte order marker char - the Unicode "zero width non-breaking space character" (U+FEFF) - in an XMP packet that facilitates alignment of the packet within arbitrary byte streams. This is one of the key features of the XMP value proposition. ]]

> Thanks John, tagging the atomic content, not the pdf as a whole

@Alex: Confess I had missed that aspect of your original query.
But in principle XMP may still be a viable technology for semantically tagging parts of the whole as I have indicated above.

Cheers,

Tony

On 12/2/09 00:17, "Alexander Garcia Castro" <alexgarciac@gmail.com> wrote:

> Thanks to all of you for your replies. Thanks John, tagging the atomic
> content, not the pdf as a whole, is exactly what I would like to do. How is
> this related to the SW? easy, papers have concepts, concepts are in
> ontologies, ontologies can point to resources capable of consuming those
> concepts. This is particularly true in Life Sciences.
>
> The actual "why" for my email: I am doing research on the intersection between
> folksonomies and the semantic web in digital libraries. So far, I have not
> found a realistic way to use a PDF in an open manner, similar to the way one
> could use a latex file. All those libraries, APIs, XMLs, etc etc are great,
> some of them facilitate by a lot whatever one wants to do with the PDF. But so
> far, IMHO the PDF remains not so open, and also IMHO is not part of what we
> could classify as generative technology -which is what could make the
> difference in the success of the SW, see futureoftheinternet.org/ for
> generative tech.
>
> again thanks a lot to all of you.
>
> On Thu, Feb 12, 2009 at 1:06 AM, John Graybeal <graybeal@mbari.org> wrote:
>> All the responses to date do not seem to address the thrust of the request,
>> which is tagging *atomic content* of the PDF (not tagging the whole
>> document).
>>
>> XMP being a single separate component of the document, I don't see how it
>> helps, unless there is an obvious way to refer to any element within the
>> document. But it would be nice to know of a way (other than "learn how to
>> read/write PDF") that atomic PDF elements could be tagged.
>>
>> john
>>
>> --------------
>> John Graybeal <mailto:graybeal@mbari.org> -- 831-775-1956
>> Monterey Bay Aquarium Research Institute
>> Marine Metadata Interoperability Project: http://marinemetadata.org
>>
>>
>> On Feb 11, 2009, at 10:53 AM, Jeremy Carroll wrote:
>>
>>>
>>> [[
>>>
>>>> annotating PDFs, as in tagging not the file but the information within the
>>>> file, is not possible by means different from those provided by ADOBE.
>>>
>>> Not so. The standard means of annotating PDFs, i.e. adding metadata, is to
>>> use XMP, the Extensible Metadata Platform [2], an initiative from Adobe for
>>> labelling arbitrary binary (and text) files.
>>> [2] http://www.adobe.com/products/xmp/
>>>
>>> ]]
>>>
>>> My understanding is that the following method generally works for reading
>>> XMP within an arbitrary file (e.g. a PDF file).
>>>
>>> Scan the file looking for "<rdf:RDF " and then invoke an RDF/XML parser (til
>>> the closing </rdf:RDF>).
>>>
>>> Not necessarily perfect - unclear how the metadata and the data relate for
>>> example, but ...
>>>
>>> If I have ever actually used this method it was several years ago (and not
>>> lodged in my memory, I sort of have a vague recollection ...).
>>> In RDF Core WG we took care to ensure that RDF 2004 was compatible with XMP
>>> which was based on RDF 1999.
>>>
>>> Jeremy
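
[[ Illustrative note: the raw packet scanning discussed in this thread might be sketched roughly as follows in Python. This is only a heuristic sketch, not the object-structure navigation described as the correct approach. It assumes the packets are stored uncompressed and UTF-8 encoded, and that the rdf namespace is declared on the rdf:RDF element itself; neither is guaranteed for every PDF. The function names (scan_xmp_packets, descriptions) and the sample bytes are made up for illustration. Because a PDF may carry several packets, the first one found is not necessarily the document-level one. ]]

```python
import re
import xml.etree.ElementTree as ET
from typing import List

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
# The begin attribute carries U+FEFF (bytes EF BB BF in UTF-8).
# Assumption: UTF-8 packets only; UTF-16/32 packets will not match.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)
RDF_RE = re.compile(rb"<rdf:RDF[\s>].*?</rdf:RDF>", re.DOTALL)


def scan_xmp_packets(data: bytes) -> List[bytes]:
    """Return the RDF/XML payload of each uncompressed XMP packet in data."""
    payloads = []
    for packet in PACKET_RE.finditer(data):
        rdf = RDF_RE.search(packet.group(1))
        if rdf:
            payloads.append(rdf.group(0))
    return payloads


def descriptions(rdf_xml: bytes) -> List[ET.Element]:
    """Parse one RDF/XML payload and return its rdf:Description elements."""
    root = ET.fromstring(rdf_xml)
    return root.findall(f"{{{RDF_NS}}}Description")


# Hypothetical byte stream standing in for a PDF with one embedded packet.
sample = (
    b"%PDF-1.4\nstream\n"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    b'dc:format="application/pdf"/>'
    b"</rdf:RDF></x:xmpmeta>"
    b'<?xpacket end="w"?>'
    b"\nendstream\n"
)

for payload in scan_xmp_packets(sample):
    for desc in descriptions(payload):
        print(desc.get(f"{{{DC_NS}}}format"))
```

[[ In real use the bytes would come from reading the file in binary mode, and each returned payload could be handed to a full RDF/XML parser rather than to ElementTree. ]]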
Received on Thursday, 12 February 2009 10:17:31 UTC