Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup from Dave Lewis on 2013-01-28 (public-multilingualweb-lt@w3.org from January 2013)

From: Dave Lewis <dave.lewis@cs.tcd.ie>
Date: Mon, 28 Jan 2013 00:23:59 +0000
To: public-multilingualweb-lt@w3.org
Message-ID: <5105C51F.4020007@cs.tcd.ie>
Hi Felix,
Some thoughts on this proposal, primarily in comparison to the existing 
stand-off mechanisms:

1) If I'm understanding this right, you seem to invert the reference 
mechanism between the element being annotated and the local stand-off 
element when compared to the similar mechanisms we already have for 
locQualityIssue and provenance. i.e. the standoff element here 
references the annotated element rather than the other way around in the 
current standoff mechanisms.

Could you clarify why this different approach is needed?

As it stands I see the following problems with this inverted approach:
i) it means implementors potentially need to support two mechanisms for 
handling standoff mark-up in different data categories (and therefore 
introduces a lot of uncertainty for the LC compared to reusing the 
existing mechanism)
ii) this would get complex if other (non-ITS) functions are 
creating/rewriting the id values
iii) you lose the ability to associate standoff elements and content 
through global ITS rules, and hence lose the ability to annotate 
content in attributes.
iv) assuming the confidence attribute stays optional (or the confidence 
applies to several occurrences), for compactness you may want to refer to 
several elements where the annotated text reoccurs from the same 
textAnalyticsAnnotation - I don't think this approach allows that
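
To make the two reference directions in point 1 concrete, here is a minimal sketch on a toy XML document. It assumes the existing pattern is expressed as an its:locQualityIssuesRef attribute on the content pointing at an its:locQualityIssues block, and the proposed pattern as a ref attribute on its:textAnalyticsAnnotation pointing at the content's xml:id. The toy document and both helper functions are hypothetical and are not part of ITS or of Felix's proposal.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Toy document (hypothetical): one paragraph annotated with both patterns.
DOC = f"""<doc xmlns:its="{ITS}">
  <p its:locQualityIssuesRef="#lq1" xml:id="a8">Dublin</p>
  <its:locQualityIssues xml:id="lq1">
    <its:locQualityIssue locQualityIssueType="typographical"/>
  </its:locQualityIssues>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"/>
  </its:textAnalyticsAnnotations>
</doc>"""


def index_by_id(root):
    """Map every xml:id in the document to its element."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}


def content_to_standoff(root):
    """Existing pattern: the content element points at its standoff block."""
    by_id = index_by_id(root)
    pairs = []
    for el in root.iter():
        ref = el.get(f"{{{ITS}}}locQualityIssuesRef")
        if ref:
            block = by_id[ref.lstrip("#")]
            pairs.append((el.text, block.tag.split("}")[1]))
    return pairs


def standoff_to_content(root):
    """Proposed pattern: the standoff element points at the content element."""
    by_id = index_by_id(root)
    pairs = []
    for ann in root.iter(f"{{{ITS}}}textAnalyticsAnnotation"):
        target = by_id.get(ann.get("ref"))
        if target is not None:
            pairs.append((target.text, ann.get("its-tan-type")))
    return pairs


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    print(content_to_standoff(root))
    print(standoff_to_content(root))
```

In the first function the link lives on the content, so several content elements can share one standoff block. In the second, each standoff element carries exactly one ref, which is the limitation raised in iv).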

On the other hand we still don't have a clear idea of how to apply 
annotatorsRef for multiple annotations with the current standoff pattern 
from lqi and provenance, and we can't duck that here because it's needed 
when confidence scores are used. One approach could be to apply 
annotatorsRef only to the mtConfidence score, and use a dedicated 
tan-annotator attribute here.

2) a more minor issue, in your processing expectation for adding the 
annotation, you state that if there isn't an inline attribute then you 
should add it inline before adding a new textAnalyticsAnnotation. 
However, with the current stand-off approaches we don't mandate this. 
You could in fact put _all_ your annotations in the stand-off, and for 
the XLIFF mapping for lqi and provenance we need to keep that option 
available to implementors.
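
A minimal sketch of the producer-side difference in point 2, assuming the its-tan-* attribute names and the ref/id linkage from Felix's examples. The annotate function, its standoff_only flag and the new_id parameter are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"
WRAPPER = f"{{{ITS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS}}}textAnalyticsAnnotation"


def annotate(root, target, attrs, new_id, standoff_only=False):
    """Attach its-tan-* attributes to `target`, inline or as standoff.

    standoff_only=False follows the expectation in the proposal: write
    inline when the element carries no inline annotation yet.
    standoff_only=True (hypothetical flag) keeps every annotation in the
    standoff block, as the current lqi/provenance pattern permits.
    """
    has_inline = target.get("its-tan-type") is not None
    if not has_inline and not standoff_only:
        target.attrib.update(attrs)
        return "inline"
    # Make sure the target has an id, then add a standoff entry that
    # points back at it.
    ref = target.get("id") or new_id
    target.set("id", ref)
    wrapper = root.find(WRAPPER)
    if wrapper is None:
        wrapper = ET.SubElement(root, WRAPPER)
    ET.SubElement(wrapper, ITEM, {"ref": ref, **attrs})
    return "standoff"


if __name__ == "__main__":
    root = ET.fromstring("<body><span>Dublin</span></body>")
    span = root.find("span")
    term = {"its-tan-type": "term", "its-tan-confidence": "1.0"}
    entity = {"its-tan-type": "entity", "its-tan-confidence": "0.7"}
    print(annotate(root, span, term, "a8"))
    print(annotate(root, span, entity, "a8"))
```

With standoff_only=True the first annotation also goes into the standoff block and the span gains nothing but an id, which is the option the XLIFF mapping needs.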

3) one general, more philosophical point. You correctly note we didn't 
explicitly discuss whether ITS annotation mechanisms were suitable for 
term and disambig. We've implicitly limited ourselves to data categories 
that make sense with the existing annotation mechanisms. We've stretched 
this a bit with local standoff and annotatorsRef individually, but 
combining these new mechanisms is not something we've figured out how to 
do yet (a symptom of that stretching). The approach you suggest here 
seems to add a new type of annotation pattern, stretching us further.

Perhaps it's better to restrict ourselves to what makes sense to do with 
existing, tested ITS mechanisms while adding pointers to external formats 
that can be used for the more complicated cases that these can't handle. 
We took this approach with provenance, where we support simple agent 
provenance inline and provide an external link which gives us the 
possibility to build best practice for combining ITS and the W3C PROV 
model for more complex cases (see ISSUE-71).

So we could allow a link to NIF to deal with cases where we have 
multiple annotations for the same text, or nested annotations or 
overlapping annotations, which NIF is designed to deal with. With some 
sensible best practice, we could use a combination of termInfoRef and 
NIF to deal with many of these more complex cases.

Essentially I'm arguing for the status quo here: living with the limited 
but still useful scope of the current inline term+disambig, and using best 
practice to give us the time to work out upgrade paths from this to 
supporting more complex use cases with term+disambig+NIF.

cheers,
Dave

On 27/01/2013 07:24, Felix Sasaki wrote:
> Hi all,
>
> sorry, this is going to be long ... but please have a look, esp. the 
> implementers (both consumers and producers) of terminology and 
> disambiguation.
>
> in the last 10 1/2 months, since Tadej's presentation at the Dublin 
> workshop, we had a lot of discussions on disambiguation, and sometimes 
> (as now) including terminology. But it seems that we never discussed 
> whether the ITS2 approach of selection (global, local, inheritance, 
> overriding (partial or not)...) is suitable for this type of information.
>
> By "this type" I mean annotation of linguistic information. Most ITS2 
> and ITS1 data categories are process related (e.g. "Don't translate 
> this ..."), but both terminology and what's now called disambiguation 
> are information that you find in linguistic corpora and processing 
> tools. Now, my point is that both in such natural language 
> processing tool chains and in related corpora, a representation of 
> information *inline per document node* is rather the exception. Mostly 
> you have *standoff information*, that is a complete separation of 
> information from actual content - as in NIF.
>
> Why is that? In linguistic annotation it is common that you have 
> several layers of information, like our lexical, ontological etc. 
> information. Some of these might be complex in itself (e.g. named 
> entities), some of these might be related to others (e.g. an 
> ontological concept related to a lexical item). I won't try to define 
> these layers here - but my point is that due to the complexity of 
> representing such information inline, nearly nobody is trying to 
> represent several layers at the same time inline. The common approach 
> is rather to have a base layer, and then pointers from the various 
> annotation layers.
>
> In a sense you can describe NIF as an approach of taking character 
> offsets as the implicit base layer (implicit because characters don't 
> need explicit anchors). The TEI here
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> provides an example for an offset using words as the base unit, with 
> explicit xml:id attributes.
>
> So far we haven't taken this approach for terminology or 
> disambiguation. This is why we had to come up with 16+ attributes: if 
> you want to do everything "inline", you need to differentiate 
> attribute names and come up with a monster data category. Inline 
> annotations are just not suitable for such information.
>
> So, the first idea behind the approach below is: if you want to represent 
> just one linguistic layer (or "qualifier" in Christian's mail at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
> ), you use the "tan-type" attribute to differentiate annotations. That 
> leads to the following inline models:
>
> 1) A term has its-tan-type with value "term" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Comparison to current ITS1 "Terminology":
> its-tan-type="term" plays the role of term="yes". its-tan-info-ref 
> plays the role of termInfoRef.  its-tan-ident-ref links to a term data 
> base. its-tan-confidence provides confidence information.
> (Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, 
> I'm just trying to exemplify the annotation approach here)
>
> 2) An entity has its-tan-type with value "entity" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
>
> So the above differs from the current "Terminology" and Disambiguation 
> only in naming. Below is now the standoff approach. The processing 
> expectation for tools *producing the annotation* is like this:
> - If there is no inline annotation, just create it (e.g. 1) or 2))
> - If there is inline annotation, check if there is an id attribute (in 
> HTML) or xml:id (if XML serialization of HTML is used and with lower 
> precedence compared to id). For formats other than HTML, add xml:id if 
> possible or use the id attribute appropriate for that format.
>
> Then, for creating standoff annotations, add an 
> "its:textAnalyticsAnnotations" element to the document, e.g. in HTML 
> "script" if needed.
>
> Let's assume before annotation we have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0"/>
> </its:textAnalyticsAnnotations>
>
> Let's now assume that before annotation we have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7"/>
> </its:textAnalyticsAnnotations>
>
> Now, if several "entity" annotation tools have been used, we could 
> also have
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
> </its:textAnalyticsAnnotations>
>
> The above approach would also influence the consumption of this data 
> category, and of annotatorsRef:
>
> - A consuming tool goes through the document and gathers all 
> textAnalyticsAnnotations elements
> - It then goes through the document. For each element node
> * check for existing inline markup. If it's available, add it to the 
> list of annotations for that node. Assume the inline version up in the 
> document tree of annotatorsRef to be responsible for the annotation of 
> that markup.
> * check the accumulated standoff textAnalyticsAnnotations elements for 
> occurrences of IDs that match the node. If there is such an ID, add 
> the related annotation to the list for the node, including the 
> additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
>
>
> In summary, this standoff tries to solve several issues:
>
> - avoid the 16+ inline attribute monster data category
> - allow for multiple annotations of the same span, with different tools
> - avoid the ITS1/2 or general inline annotation issues with 
> inheritance and overriding - as with the standoff approach 
> exemplified at
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> annotation information is just accumulated for a given base item (in 
> our case, element nodes with an ID).
>
> I'm not yet asking for this change, but I see it as a way forward that 
> could make the life of both annotation producers (Marcis and Tadej) 
> and consumers (Yves et al.) simpler. So I'm eager to hear thoughts on 
> this :)
>
> Thoughts?
>
> - Felix
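
The consumption steps in the quoted proposal (gather all textAnalyticsAnnotations elements, then walk the element nodes and accumulate inline and standoff annotations per node) can be sketched roughly as follows. This is only an illustration under assumptions: annotatorsRef is read as a plain, un-namespaced attribute, targets are matched on the id attribute only, and the toy document, the tool-a value and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"
WRAPPER = f"{{{ITS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS}}}textAnalyticsAnnotation"

# Toy document (hypothetical): inline "term" plus two standoff "entity" entries.
DOC = f"""<body xmlns:its="{ITS}" annotatorsRef="tan|tool-a">
  <span id="a8" its-tan-type="term" its-tan-confidence="1.0">Dublin</span>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</body>"""


def gather_standoff(root):
    """Step 1: index every standoff annotation by the id it references."""
    index = {}
    for wrapper in root.iter(WRAPPER):
        for item in wrapper.findall(ITEM):
            # Fall back to annotatorsRef on the wrapper element if the
            # individual annotation does not carry one.
            tool = item.get("annotatorsRef", wrapper.get("annotatorsRef"))
            entry = (item.get("its-tan-type"), tool)
            index.setdefault(item.get("ref"), []).append(entry)
    return index


def collect(node, index, inherited=None, out=None):
    """Step 2: walk element nodes and accumulate annotations per node id."""
    out = {} if out is None else out
    if node.tag == WRAPPER:
        return out  # standoff blocks are not content nodes
    # The closest annotatorsRef up the tree is responsible for inline markup.
    inherited = node.get("annotatorsRef", inherited)
    found = []
    if node.get("its-tan-type"):
        found.append((node.get("its-tan-type"), inherited))
    node_id = node.get("id")
    if node_id:
        found.extend(index.get(node_id, []))
    if found:
        out[node_id or node.tag] = found
    for child in node:
        collect(child, index, inherited, out)
    return out


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    for key, annotations in collect(root, gather_standoff(root)).items():
        print(key, annotations)
```

Annotations are only accumulated per node here. Nothing is overridden, which matches the stated aim of avoiding the inline inheritance and overriding issues.
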
Received on Monday, 28 January 2013 00:24:40 UTC