Re: linked open data and PDF from Norman Gray on 2015-01-21 (public-lod@w3.org from January 2015)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Wed, 21 Jan 2015 17:16:11 +0000
To: Paul Houle <ontology2@gmail.com>
Cc: Herbert Van de Sompel <hvdsomp@gmail.com>, "jschneider@pobox.com" <jschneider@pobox.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-Id: <2F4906DF-DFDE-434F-903D-DC3352ED148D@astro.gla.ac.uk>

Paul and Rod, hello.

> On 2015 Jan 21, at 16:32, Paul Houle <ontology2@gmail.com> wrote:
> 
>       I think the world needs a survey of XMP metadata in the field.  Only by inspection of a large set of diverse files can we say how good or bad the situation actually is.

Rod's link at <http://rossmounce.co.uk/2013/01/06/pdf-metadata-using-exiftool/> is very interesting, and possibly encouraging.

>       There ought to be a tool that gives XMP-annotated documents a point score for metadata quality;  you ought to get a lot of points for having the simple things that were missing in the document exported from word like the title, author,  copyright,  etc.

Now, that's a _Really_ good idea!  And just to prove how simple it is to do something crude:

% ./extract-xmp test-xmp.pdf | rapper -irdfxml -ontriples - test-xmp.pdf | python score-rdf.py 
...
11 triples found; metadata-goodness-score=12

This is with the python script included at the bottom.
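
The extract-xmp step in that pipeline is not shown in this message, so how it actually works is unknown here. As a rough stand-in only: XMP packets are wrapped in `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions, so when the packet is stored uncompressed and unencrypted (an assumption, and not true of every file) a plain byte scan will find it. The name find_xmp_packets and the sample bytes below are made up for illustration.

```python
import re

# Match from the opening <?xpacket begin=...?> to the closing
# <?xpacket end=...?>, lazily, so that several packets in one file
# are returned separately.  Assumes the packet is stored uncompressed.
XPACKET = re.compile(
    rb'<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>',
    re.DOTALL)

def find_xmp_packets(data):
    """Return the body of every XMP packet found in a byte string."""
    return [m.group(1).strip() for m in XPACKET.finditer(data)]

# Made-up sample: some binary-ish padding around one tiny packet.
sample = (b'%PDF-1.4 binary junk '
          b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
          b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
          b'<?xpacket end="w"?> more binary junk')

for packet in find_xmp_packets(sample):
    print(packet.decode('utf-8'))
```

Reading a real file would just be `find_xmp_packets(open(path, 'rb').read())`; a compressed or encrypted metadata stream would need a proper PDF library instead.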

>       Note it is not just about PDF but many kinds of media files that are tagged with this,  so it really is about XMP,  not just PDF.

Very much so.

(Also, it's not even really about XMP: there are all sorts of ways of scraping metadata out of objects and turning it into something which an RDF parser can read, and from that point you can start being imaginative.  This is of course stupidly obvious to everyone on this list, but it's an aha! that many people haven't got yet.)
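
To make the "turn it into something an RDF parser can read" step concrete, here is a minimal sketch, assuming the metadata has already been scraped into a plain dictionary by some other means. The field names, the PREDICATES mapping and the example.org subject are all hypothetical; the output is one N-Triples line per recognised field, which is the form the scoring script reads.

```python
# Hypothetical mapping from scraped field names to predicate URIs.
PREDICATES = {
    'title': 'http://purl.org/dc/elements/1.1/title',
    'author': 'http://purl.org/dc/elements/1.1/creator',
}

def escape_literal(s):
    """Escape the characters N-Triples does not allow raw in a literal."""
    return (s.replace('\\', '\\\\')
             .replace('"', '\\"')
             .replace('\n', '\\n')
             .replace('\r', '\\r'))

def to_ntriples(subject, fields):
    """Return N-Triples lines for every field with a known predicate."""
    lines = []
    for key, value in sorted(fields.items()):
        pred = PREDICATES.get(key)
        if pred is None:
            continue  # unknown field: silently skipped in this sketch
        lines.append('<{}> <{}> "{}" .'.format(
            subject, pred, escape_literal(value)))
    return lines

# Made-up scraped metadata for a made-up document.
scraped = {'title': 'A "quoted" title', 'author': 'N. Gray', 'pages': '12'}
for line in to_ntriples('http://example.org/doc.pdf', scraped):
    print(line)
```

Piping those lines into the scoring script would then work in the same way as the rapper output does.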

All the best,

Norman



#! /usr/bin/python

# score RDF for metadata goodness
#
# Usage:
#
#    ./extract-xmp test-xmp.pdf | rapper -irdfxml -ontriples - test-xmp.pdf | python score-rdf.py 

import sys, re

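# One N-Triples line: subject (<URI> or _:bnode), predicate <URI>,
# then the rest of the line (object and final dot)
ntline = re.compile(r'(?:<([^>]*)>|(_:[^ ]*)) *<([^>]*)> *(.*)')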

scores = {'http://purl.org/dc/elements/1.1/creator': 1,
          'http://purl.org/dc/elements/1.1/title': 1,
          'http://purl.org/dc/elements/1.1/created': 1,
          'http://purl.org/dc/elements/1.1/abstract': 2,
          'http://ns.adobe.com/xap/1.0/rights/Marked': 3,
          'http://creativecommons.org/ns#license': 4
          }

no_triples = 0
score = 0

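for line in sys.stdin:
    m = ntline.match(line)
    if m:
        # groups: subject URI, blank-node label, predicate URI, rest of line
        bits = m.groups()
        print('{}  /  {}\n\t{}\n\t{}\n'.format(*bits))

        no_triples += 1
        pred = bits[2]
        # predicates not listed in scores contribute nothing
        score += scores.get(pred, 0)
    else:
        # strip the trailing newline so the message stays on one line
        print("---didn't match {}".format(line.rstrip('\n')))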

print('{} triples found; metadata-goodness-score={}'.format(no_triples, score))

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK

Received on Wednesday, 21 January 2015 17:16:37 UTC