Re: Standoff experiment plus observations from Phil Ritchie on 2013-03-19 (public-multilingualweb-lt@w3.org from March 2013)

From: Phil Ritchie <philr@vistatec.ie>
Date: Tue, 19 Mar 2013 15:56:45 +0000
To: Felix Sasaki <fsasaki@w3.org>
Cc: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <OFFCC96CAF.9E585940-ON80257B33.0056933A-80257B33.00579828@vistatec.ie>
Felix, All,

A question: does the id of an enclosing <script> element need to be the 
same as the id of the ITS element it encloses? For example:

<script type="application/its+xml" id="lq0">
        <its:locQualityIssues xmlns:its="http://www.w3.org/2005/11/its"
                              xml:id="lq0">
                <its:locQualityIssue locQualityIssueType="non-conformance"
                                     locQualityIssueSeverity="75.7961783439491"></its:locQualityIssue>
        </its:locQualityIssues>
</script>

I suspect not.

That being the case, I'm not convinced that having the script-enclosed 
metadata point to the spans saves a significant amount of serialized 
footprint.

Phil.





From:   Felix Sasaki <fsasaki@w3.org>
To:     "public-multilingualweb-lt@w3.org" 
<public-multilingualweb-lt@w3.org>, 
Date:   25/02/2013 18:02
Subject:        Standoff experiment plus observations



Hi all,

Christian, Marcis and Tadej know this (apologies for the repetition) - but 
I thought others might be interested too.

I played a bit with the NERD API 
http://nerd.eurecom.fr/documentation#nerdapi 

1) I generated ITS "tan" via 4 annotation engines that can be accessed 
through the API: dbpedia spotlight, extractiv, lupedia, yahoo. 

2 a) I also created a *non* ITS "tan" standoff version, see 
multiple-ann-with-id-plus-standoff.html . It relies on ID attributes, and 
the standoff annotations point to the IDs. This is the approach that we 
had discussed a while ago on the mailing list.
2 b) The file multiple-ann-with-standoff-refs-script.html uses our current 
localization quality issue and provenance standoff approach, that is: 
pointing from the content to annotations, here via an artificial x-t-ref 
attribute.
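
To make the two pointing directions concrete, here is a minimal Python 
sketch. It is only an illustration under stated assumptions: the attached 
HTML files are not reproduced here, so the dict layout, the IDs ("s1", 
"a1"), the example.org URIs and the function names are all hypothetical. 
Only the x-t-ref idea and the its-class-ref / its-ident-ref attribute names 
come from this thread.

```python
# 2a) Standoff -> content: each annotation names the ID of the span it describes.
# All IDs, URIs and the dict layout below are hypothetical.
standoff_2a = [
    {"target": "s1", "engine": "extractiv",
     "its-class-ref": "http://example.org/class/Church"},
    {"target": "s1", "engine": "dbpedia spotlight",
     "its-ident-ref": "http://example.org/resource/St_Peter"},
]


def annotations_2a(span_id, standoff):
    """Collect every standoff entry whose target is the given span ID."""
    return [a for a in standoff if a["target"] == span_id]


# 2b) Content -> standoff: the span carries a reference (modelled on x-t-ref)
# to a separate annotation block, so each target needs its own block.
span_refs_2b = {"s1": "#a1"}  # hypothetical span ID -> x-t-ref value
standoff_2b = {
    "a1": [
        {"engine": "extractiv",
         "its-class-ref": "http://example.org/class/Church"},
        {"engine": "dbpedia spotlight",
         "its-ident-ref": "http://example.org/resource/St_Peter"},
    ],
}


def annotations_2b(span_id, span_refs, standoff):
    """Follow the span's reference to the annotation block it points at."""
    block_id = span_refs[span_id].lstrip("#")
    return standoff[block_id]
```

In this sketch both lookups yield the same two annotations for "s1"; the 
difference is where the pointer lives, and therefore how many separate 
annotation blocks the document has to carry.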


From 2), I learned various things: 

- Making sure that standoff works requires a known workflow, "known" esp. 
with regard to white space handling. Otherwise the multiple annotation 
engines produce diverging character offsets. Given this, a 
recommendation to leave standoff processing to NIF makes a lot of sense. 

- The non ITS standoff representation (see 
multiple-ann-with-id-plus-standoff.html) has the merit that a human 
consumer who doesn't know anything about NIF et al. (= somebody in an XML 
based localization workflow or looking at an HTML document) can look into 
the annotations and choose: Hover over the green spans of text, e.g. over 
"St Peter" as part of " held in St Peter's Basilica. ". The annotation 
from extractiv holds a more specific "its-class-ref" than the one from 
dbpedia spotlight. But only dbpedia spotlight holds an "its-ident-ref". So 
a human user consuming these annotations gets the most value by combining 
them. 

- Developing applications based on the output of multiple engines is 
pretty straightforward for non NLP / NIF people if you have the output 
represented in an easy to digest format (JSON, XML, ...). I won't argue 
for standardizing that format and creating ITS "tan" standoff (we had that 
discussion). I'm mentioning this just because the merit of the annotations 
in the long term might grow if Web developers face a low barrier for 
widespread app development. 

- A thought I had during today's discussion of the XLIFF mapping: having 
the external standoff pointing to IDs might be a way to solve the XLIFF 
representation issue of "mrk": here the issue is again (it seems) that you 
want to apply multiple annotations to the same span of text (the content 
of "mrk") - but you can't since the "type" attribute can be only used 
once. Externalizing the annotations solves that problem.

- During the discussion of multiple annotations a while ago we also 
touched upon the "direction" of the standoff: from outside to IDs (see 
multiple-ann-with-id-plus-standoff.html and 2a), or from the document to 
the standoff (current loc quality issue / provenance, see 2b) above). 
Pointing from the document (= 2b) has the drawback in HTML that you need a 
separate "script" element for each target - whereas in the case of 2a) you 
only need one script element. So for 2a) in total there are 58 elements, 
and 2b) has 101 elements.
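
Two of the observations above (offsets depending on white space handling, 
and the value of combining engines) can be illustrated with a short Python 
sketch. This is an assumption-laden illustration, not what any of the 
engines or NIF actually do: the sample sentence is made up around the "St 
Peter's Basilica" phrase, the example.org URIs are placeholders, and the 
"first value wins" merge rule is just one possible policy.

```python
import re


def normalize_whitespace(text):
    """Collapse each run of white space to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample text; only the "St Peter's Basilica" phrase is from the thread.
raw = "  The mass was\n    held in St Peter's Basilica. "

# The same mention gets a different character offset depending on whether
# the engine saw the raw or the white-space-normalized text.
offset_raw = raw.index("St Peter")
offset_normalized = normalize_whitespace(raw).index("St Peter")


def merge_annotations(annotations):
    """Union the attributes of several annotations of one span.

    Assumed policy: the first engine in the list wins on conflicts.
    """
    merged = {}
    for ann in annotations:
        for key, value in ann.items():
            if key != "engine":
                merged.setdefault(key, value)
    return merged


# Hypothetical annotation values for the "St Peter" span.
combined = merge_annotations([
    {"engine": "extractiv",
     "its-class-ref": "http://example.org/class/Church"},
    {"engine": "dbpedia spotlight",
     "its-class-ref": "http://example.org/class/Place",
     "its-ident-ref": "http://example.org/resource/St_Peter"},
])
```

Here the two offsets differ, which is the mismatch a known workflow has to 
rule out, and the merged result carries the class reference from the first 
engine together with the identifier that only the second engine supplies.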

FYI: with the above observations I won't push for anything - just sharing 
my experience to see what others think.

Best,

Felix [attachment "multiple-ann-with-id-plus-standoff.html" deleted by 
Phil Ritchie/VISTATEC] [attachment 
"multiple-ann-with-standoff-refs-script.html" deleted by Phil 
Ritchie/VISTATEC] 
Received on Tuesday, 19 March 2013 15:57:21 UTC