Re: pdf and the semantic web from Hammond, Tony on 2009-02-12 (semantic-web@w3.org from February 2009)

From: Hammond, Tony <t.hammond@nature.com>
Date: Thu, 12 Feb 2009 10:13:46 +0000
To: <semantic-web@w3.org>
Message-ID: <C5B9A6DA.18009%t.hammond@nature.com>
Hi All:

> If I had to annotate uneditable PDFs

> But so far, IMHO the PDF remains not so open

@Paul, @Alex:

I do not want in any way to be an apologist for PDF, but the simple fact is
that it is an open published specification. That is not the issue here.

I have myself hand-built a very rudimentary editor for PDFs so I know that
it is "do-able". (I did not say it was necessarily easy. PDF is a very
straightforward format - just layers upon layers of structure to deal with,
which does make it appear "difficult".)

One should not mistake complexity for lack of transparency.

> Scan the file looking for "<rdf:RDF " and then invoke an RDF/XML parser (til
> the closing </rdf:RDF>).

> XMP being a single separate component of the document, .

@Jeremy, @John:

XMP is not a singleton. The main metadata for a PDF document is expressed in
the XMP packet referenced from the "/Metadata" entry in the document catalog
object. Other XMP packets may (and do) occur within a PDF, for example XMP
packets for graphics files - JPEGs, GIFs, PNGs, etc - embedded within the
PDF.

An XMP packet may be associated with any object in a PDF file. It is simply
inserted as a PDF stream object (containing the XMP packet which itself
wraps an RDF/XML document).

[[ And yes, there are some restrictions placed on the RDF/XML profile - but
that is a separate subject. :) ]]

The correct way to retrieve the main (or document) XMP packet from a PDF is
to navigate the PDF object structure. Alternatively there are simple
heuristics for raw packet scanning which will return the correct XMP packet.
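To make the raw scanning idea concrete, here is a minimal sketch in Python. It
is an illustration only, not a description of any particular tool: the function
name find_xmp_packets is hypothetical, and it assumes the packets are stored
uncompressed and in an ASCII-compatible encoding such as UTF-8. It returns
every packet it finds in file order, and deciding which one is the document
packet is deliberately left open.

```python
import re
from typing import List

# Matches the <?xpacket begin=... ?> header and <?xpacket end=... ?> trailer
# that wrap an XMP packet. Assumption: the packet is stored uncompressed and
# in an ASCII-compatible encoding, so a plain byte pattern can see it.
PACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)

def find_xmp_packets(data: bytes) -> List[bytes]:
    """Return the body of every XMP packet found in data, in file order."""
    return [m.group(1).strip() for m in PACKET_RE.finditer(data)]
```

Each returned body can then be handed to an XML or RDF/XML parser. A packet
that sits inside a compressed or encrypted stream will not be found this way,
which is one reason navigating the object structure is the more reliable route.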

[[ There is a special byte order marker char - the Unicode "zero width
non-breaking space character" (U+FEFF) - in an XMP packet that facilitates
alignment of the packet within arbitrary byte streams. This is one of the
key features of the XMP value proposition. ]]
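As a rough illustration of why that marker is useful: U+FEFF serializes to a
different byte sequence in each Unicode encoding, so the bytes found at the
marker position reveal how the rest of the packet is encoded. The sketch below
is Python, the name encoding_from_marker is hypothetical, and it assumes the
caller has already located the marker bytes.

```python
from typing import Optional

# U+FEFF as serialized in each encoding. The 4-byte forms come first because
# the UTF-32LE form begins with the same two bytes as the UTF-16LE form.
BOM_TO_ENCODING = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def encoding_from_marker(marker: bytes) -> Optional[str]:
    """Map the raw bytes of the U+FEFF marker to a Python codec name."""
    for bom, name in BOM_TO_ENCODING:
        if marker.startswith(bom):
            return name
    return None  # marker bytes not recognised
```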

> Thanks John, tagging the atomic content, not the pdf as a whole

@Alex:

Confess I had missed that aspect of your original query. But in principle
XMP may still be a viable technology for semantically tagging parts of the
whole as I have indicated above.

Cheers,

Tony





On 12/2/09 00:17, "Alexander Garcia Castro" <alexgarciac@gmail.com> wrote:

> Thanks to all of you for your replies. Thanks John, tagging the atomic
> content, not the pdf as a whole, is exactly what I would like to do. How is
> this related to the SW? easy, papers have concepts, concepts are in
> ontologies, ontologies can point to resources capable of consuming those
> concepts. This is particularly true in Life Sciences.
> 
> The actual "why" for my email: I am doing research on the intersection between
> folksonomies and the semantic web in digital libraries. So far, I have not
> found a realistic way to use a PDF in an open manner, similar to the way one
> could use a latex file. All those libraries, APIs, XMLs, etc etc are great,
> some of them facilitate by a lot whatever one wants to do with the PDF. But so
> far, IMHO the PDF remains not so open, and also IMHO is not part of what we
> could classify as generative technology -which is what could make the
> difference in the success of the SW, see futureoftheinternet.org/ for
> generative tech. 
> 
> again thanks a lot to all of you.
> 
> On Thu, Feb 12, 2009 at 1:06 AM, John Graybeal <graybeal@mbari.org> wrote:
>> All the responses to date do not seem to address the thrust of the request,
>> which is tagging *atomic content* of the PDF (not tagging the whole
>> document).
>> 
>> XMP being a single separate component of the document, I don't see how it
>> helps, unless there is an obvious way to refer to any element within the
>> document.  But it would be nice to know of a way (other than "learn how to
>> read/write PDF") that atomic PDF elements could be tagged.
>> 
>> john
>> 
>> --------------
>> John Graybeal   <mailto:graybeal@mbari.org>  -- 831-775-1956
>> Monterey Bay Aquarium Research Institute
>> Marine Metadata Interoperability Project: http://marinemetadata.org
>> 
>> 
>> On Feb 11, 2009, at 10:53 AM, Jeremy Carroll wrote:
>> 
>>> 
>>> [[
>>> 
>>>> annotating PDFs, as in tagging not the file but the information within the
>>>> file, is not possible by means different from those provided by ADOBE.
>>> 
>>> Not so. The standard means of annotating PDFs, i.e. adding metadata, is to
>>> use XMP, the Extensible Metadata Platform [2], an initiative from Adobe for
>>> labelling arbitrary binary (and text) files.
>>> [2] http://www.adobe.com/products/xmp/
>>> 
>>> ]]
>>> 
>>> My understanding is that the following method generally works for reading
>>> XMP within an arbitrary file (e.g. a PDF file).
>>> 
>>> Scan the file looking for "<rdf:RDF " and then invoke an RDF/XML parser (til
>>> the closing </rdf:RDF>).
>>> 
>>> Not necessarily perfect - unclear how the metadata and the data relate for
>>> example, but ...
>>> 
>>> If I have ever actually used this method it was several years ago (and not
>>> lodged in my memory, I sort of have a vague recollection ...).
>>> In RDF Core WG we took care to ensure that RDF 2004 was compatible with XMP
>>> which was based on RDF 1999.
>>> 
>>> Jeremy
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 



Received on Thursday, 12 February 2009 10:17:31 UTC