Standoff experiment plus observations from Felix Sasaki on 2013-02-25 (public-multilingualweb-lt@w3.org from February 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 25 Feb 2013 19:01:08 +0100
To: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <512BA6E4.3060401@w3.org>

Hi all,

Christian, Marcis and Tadej know this (apologies for the repetition) - 
but I thought others might be interested too.

I played a bit with the NERD API
http://nerd.eurecom.fr/documentation#nerdapi

1) I generated ITS "tan" via 4 annotation engines that can be accessed 
through the API: dbpedia spotlight, extractiv, lupedia, yahoo.

2 a) I also created a *non* ITS "tan" standoff version, see 
multiple-ann-with-id-plus-standoff.html . It relies on ID attributes, 
and the standoff annotations point to the IDs. This is the approach that 
we had discussed a while ago on the mailing list.
2 b) The file multiple-ann-with-standoff-refs-script.html uses our 
current localization quality issue and provenance standoff approach, 
that is: pointing from the content to annotations, here via an 
artificial x-t-ref attribute.


From 2), I learned various things:

- Making sure that standoff works requires a known workflow, "known" 
esp. with regard to white space handling. Otherwise the different 
annotation engines produce different character offsets for the same 
text. So a recommendation to leave standoff processing to NIF makes a 
lot of sense.
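
To illustrate the white space point, here is a minimal sketch in Python. 
The sample sentence and the helper names (find_offsets, 
collapse_whitespace) are made up for illustration and are not taken from 
the attached files or from any of the engines. It only shows that the 
same mention gets different character offsets depending on whether white 
space was collapsed before annotation.

```python
import re


def find_offsets(text, mention):
    """Return (start, end) character offsets of the first occurrence of mention."""
    start = text.index(mention)
    return (start, start + len(mention))


def collapse_whitespace(text):
    """Collapse every run of white space into a single space (one possible normalization)."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical input: the same sentence as one engine might see it (raw markup
# white space) and as another might see it (white space collapsed).
raw = "The mass was\n   held in  St Peter's Basilica."
mention = "St Peter"

offsets_raw = find_offsets(raw, mention)
offsets_collapsed = find_offsets(collapse_whitespace(raw), mention)

print(offsets_raw, offsets_collapsed)
```

Unless all tools in the workflow agree on one normalization, offsets like 
these cannot be compared or merged safely.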

- The non ITS standoff representation (see 
multiple-ann-with-id-plus-standoff.html) has the merit that a human 
consumer who doesn't know anything about NIF et al. (= somebody in an 
XML based localization workflow or looking at an HTML document) can look 
into the annotations and choose: Hover over the green spans of text, 
e.g. over "St Peter" as part of " held in St Peter's Basilica. ". The 
annotation from extractiv holds a more specific "its-class-ref" than the 
one from dbpedia spotlight. But only dbpedia spotlight holds an 
"its-ident-ref". So a human user consuming these annotations gets the 
most value by combining them.
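
Combining the annotations could look roughly like the following Python 
sketch. The URIs are placeholders on example.org, not the values the 
engines actually returned, and merge_annotations and the priority list 
are hypothetical names. The idea is simply: for each attribute, take the 
value from the first engine in a chosen priority order that provides one.

```python
def merge_annotations(annotations, priority):
    """Merge per-engine attribute dicts for one span.

    For each attribute, the value from the first engine in `priority`
    that provides it wins; later engines only fill in missing attributes.
    """
    merged = {}
    for engine in priority:
        for key, value in annotations.get(engine, {}).items():
            merged.setdefault(key, value)
    return merged


# Placeholder annotations for the span "St Peter" (values are assumptions).
annotations = {
    "dbpedia spotlight": {
        "its-class-ref": "http://example.org/class/Place",
        "its-ident-ref": "http://example.org/resource/St_Peter",
    },
    "extractiv": {
        "its-class-ref": "http://example.org/class/Building",
    },
}

# Prefer extractiv (assumed here to carry the more specific class),
# then fall back to dbpedia spotlight for whatever is still missing.
combined = merge_annotations(annotations, ["extractiv", "dbpedia spotlight"])
print(combined)
```

Here the class comes from extractiv and the identifier from dbpedia 
spotlight, which is the combination a human reader would pick by hand.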

- Developing applications based on the output of multiple engines is 
pretty straightforward for non NLP / NIF people if you have the output 
represented in an easy to digest format (JSON, XML, ...). I won't argue 
for standardizing that format and creating ITS "tan" standoff (we had 
that discussion). I'm mentioning this just because the merit of the 
annotations in the long term might grow if Web developers face a low 
barrier for widespread app development.

- A thought I had during today's discussion of the XLIFF mapping: having 
the external standoff pointing to IDs might be a way to solve the XLIFF 
representation issue of "mrk": here the issue is again (it seems) that 
you want to apply multiple annotations to the same span of text (the 
content of "mrk") - but you can't since the "type" attribute can only 
be used once. Externalizing the annotations solves that problem.

- During the discussion of multiple annotations a while ago we also 
touched upon the "direction" of the standoff: from outside to IDs (see 
multiple-ann-with-id-plus-standoff.html and 2a) above), or from the 
document to the standoff (current loc quality issue / provenance, see 
2b) above). Pointing from the document (= 2b) has the drawback in HTML 
that you need a separate "script" element for each target - whereas in 
the case of 2a) you only need one script element. So in total 2a) has 
58 elements, and 2b) has 101 elements.
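
The difference between the two directions can be modelled with a small 
Python sketch. It does not parse the attached HTML files; the span IDs, 
annotation values and function names are invented, and each returned 
"block" stands in for one "script" element.

```python
def standoff_pointing_to_ids(spans):
    """Approach 2a: one external block whose entries point at span IDs."""
    block = [
        {"target": span_id, **annotation}
        for span_id, anns in spans.items()
        for annotation in anns
    ]
    return [block]  # a single container holds every annotation


def standoff_referenced_from_content(spans):
    """Approach 2b: each span carries a reference to its own block."""
    refs = {}
    blocks = {}
    for span_id, anns in spans.items():
        block_id = f"ann-{span_id}"
        refs[span_id] = "#" + block_id   # what the span's ref attribute would hold
        blocks[block_id] = list(anns)    # one container per annotated span
    return refs, blocks


# Hypothetical annotated spans: span ID -> list of annotations from engines.
spans = {
    "s1": [{"engine": "extractiv"}, {"engine": "dbpedia spotlight"}],
    "s2": [{"engine": "lupedia"}],
    "s3": [{"engine": "yahoo"}],
}

blocks_2a = standoff_pointing_to_ids(spans)
refs_2b, blocks_2b = standoff_referenced_from_content(spans)
print(len(blocks_2a), len(blocks_2b))
```

In this model the number of containers stays at one for 2a) and grows 
with the number of annotated spans for 2b), which matches the direction 
of the element counts above.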

FYI: with the above observations I won't push for anything - just 
sharing my experience to see what others think.

Best,

Felix

Attachments

text/html attachment: multiple-ann-with-id-plus-standoff.html
text/html attachment: multiple-ann-with-standoff-refs-script.html

Received on Monday, 25 February 2013 18:01:38 UTC