- From: Felix Sasaki <fsasaki@w3.org>
- Date: Mon, 25 Feb 2013 19:01:08 +0100
- To: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <512BA6E4.3060401@w3.org>
Hi all,

Christian, Marcis and Tadej know this already (apologies for the repetition), but I thought others might be interested too.

I played a bit with the NERD API
http://nerd.eurecom.fr/documentation#nerdapi

1) I generated ITS "tan" via four annotation engines that can be accessed through the API: DBpedia Spotlight, Extractiv, Lupedia, Yahoo.

2 a) I also created a *non* ITS "tan" standoff version, see multiple-ann-with-id-plus-standoff.html. It relies on ID attributes, and the standoff annotations point to the IDs. This is the approach that we had discussed a while ago on the mailing list.

2 b) The file multiple-ann-with-standoff-refs-script.html uses our current localization quality issue and provenance standoff approach, that is: pointing from the content to the annotations, here via an artificial x-t-ref attribute.

From 2), I learned various things:

- Making sure that standoff works requires a known workflow, "known" especially with regard to white space handling. Otherwise the annotation engines end up with different character offsets for the same text. Given this, a recommendation to leave standoff processing to NIF makes a lot of sense.

- The non-ITS standoff representation (see multiple-ann-with-id-plus-standoff.html) has the merit that a human consumer who doesn't know anything about NIF et al. (= somebody in an XML-based localization workflow or looking at an HTML document) can look into the annotations and choose. Hover over the green spans of text, e.g. over "St Peter" as part of " held in St Peter's Basilica. ": the annotation from Extractiv holds a more specific "its-class-ref" than the one from DBpedia Spotlight, but only DBpedia Spotlight holds an "its-ident-ref". So a human user consuming these annotations gets the most value by combining them.

- Developing applications based on the output of multiple engines is pretty straightforward for non-NLP / non-NIF people if you have the output represented in an easy-to-digest format (JSON, XML, ...). I won't argue for standardizing that format and creating ITS "tan" standoff (we had that discussion). I'm mentioning this just because the merit of the annotations might grow in the long term if Web developers face a low barrier to widespread app development.

- A thought I had during today's discussion of the XLIFF mapping: having the external standoff point to IDs might be a way to solve the XLIFF representation issue of "mrk". Here the issue is again (it seems) that you want to apply multiple annotations to the same span of text (the content of "mrk"), but you can't, since the "type" attribute can be used only once. Externalizing the annotations solves that problem.

- During the discussion of multiple annotations a while ago we also touched upon the "direction" of the standoff: from outside to IDs (see multiple-ann-with-id-plus-standoff.html and 2a) above), or from the document to the standoff (current loc quality issue / provenance, see 2b) above). Pointing from the document (= 2b) has the drawback in HTML that you need a separate "script" element for each target, whereas in the case of 2a) you only need one "script" element. So in total 2a) has 58 elements and 2b) has 101 elements.

FYI: with the above observations I won't push for anything, I'm just sharing my experience to see what others think.

Best,

Felix
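
The combination step described for 2a) above can be sketched roughly as follows. This is only an illustration and not the format used in the attached files: the record layout, the target ID "t1", the engine labels and the example.org values are all hypothetical. It assumes each standoff record points to an element ID and carries whatever attributes the engine produced.

```python
from collections import defaultdict

# Hypothetical standoff records; each one points to an element ID in the document.
# The keys "its-class-ref" and "its-ident-ref" follow the attribute names used in
# the message; the values are placeholders, not real engine output.
standoff = [
    {"engine": "dbpedia-spotlight", "target": "t1",
     "its-class-ref": "http://example.org/class/Place",
     "its-ident-ref": "http://example.org/entity/St_Peters_Basilica"},
    {"engine": "extractiv", "target": "t1",
     "its-class-ref": "http://example.org/class/Church"},
]


def group_by_target(records):
    """Collect all annotations that point to the same element ID."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["target"]].append(record)
    return dict(grouped)


def combine(records):
    """For each attribute, map every distinct value to the engines that produced it."""
    combined = defaultdict(dict)
    for record in records:
        for key, value in record.items():
            if key in ("engine", "target"):
                continue  # bookkeeping fields, not annotations
            combined[key].setdefault(value, []).append(record["engine"])
    return dict(combined)


# One combined view per annotated span, keyed by element ID.
merged = {target: combine(recs)
          for target, recs in group_by_target(standoff).items()}

for target, attributes in merged.items():
    for name, values in attributes.items():
        for value, engines in values.items():
            print(target, name, value, "from", ", ".join(engines))
```

Because the annotations are keyed by ID rather than by character offset, the white space problem mentioned above does not arise in this step, and any number of engines can annotate the same span.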
Attachments
- text/html attachment: multiple-ann-with-id-plus-standoff.html
- text/html attachment: multiple-ann-with-standoff-refs-script.html
Received on Monday, 25 February 2013 18:01:38 UTC