issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi all,

sorry, this is going to be long ... but please have a look, esp. the 
implementers (both consumers and producers) of terminology and 
disambiguation.

In the last 10 1/2 months, since Tadej's presentation at the Dublin 
workshop, we have had a lot of discussions on disambiguation, sometimes 
(as now) including terminology. But it seems that we never discussed 
whether the ITS2 approach of selection (global, local, inheritance, 
overriding (partial or not)...) is suitable for this type of information.

By "this type" I mean annotation of linguistic information. Most ITS2 
and ITS1 data categories are process related (e.g. "Don't translate this 
..."), but both terminology and what's now called disambiguation are 
information that you find in linguistic corpora and processing tools. 
Now, my point is that in both natural language processing tool chains 
and related corpora, a representation of information *inline per 
document node* is rather the exception. Mostly you have *standoff 
information*, that is, a complete separation of information from actual 
content - as in NIF.

Why is that? In linguistic annotation it is common that you have several 
layers of information, like our lexical, ontological etc. information. 
Some of these might be complex in themselves (e.g. named entities), some of 
these might be related to others (e.g. an ontological concept related to 
a lexical item). I won't try to define these layers here - but my point 
is that due to the complexity of representing such information inline, 
nearly nobody is trying to represent several layers at the same time 
inline. The common approach is rather to have a base layer, and then 
pointers from the various annotation layers.

In a sense you can describe NIF as an approach of taking character 
offsets as the implicit base layer (implicit because characters don't 
need explicit anchors). The TEI here
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
provides an example of an offset using words as the base unit, with 
explicit xml:id attributes.
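
To make the "base layer plus pointers" idea concrete, here is a minimal 
sketch in Python. It is not NIF syntax, just an illustration of the 
principle: the text is left untouched, and each layer points into it by 
character offsets. The sentence, the layer names and the helper 
annotations_for are assumptions made up for this example.

```python
# Base layer: the plain text itself. Character offsets need no explicit anchors.
text = "Welcome to Dublin"
start = text.index("Dublin")
end = start + len("Dublin")

# Each annotation layer is kept apart from the text and only points into it.
# An entry is (start offset, end offset, identifier of the annotated thing).
layers = {
    "term": [(start, end, "http://termdatabase.example.com/entry37")],
    "entity": [(start, end, "http://dbpedia.org/resource/Dublin")],
}


def annotations_for(layers, start, end):
    """Return {layer name: [identifier, ...]} for annotations on exactly this span."""
    found = {}
    for name, spans in layers.items():
        idents = [ident for s, e, ident in spans if (s, e) == (start, end)]
        if idents:
            found[name] = idents
    return found


print(text[start:end], annotations_for(layers, start, end))
```

Adding a further layer means adding another entry to layers; the text 
and the existing layers do not change.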

So far we haven't taken this approach for terminology or disambiguation. 
This is why we had to come up with 16+ attributes: if you want to do 
everything "inline", you need to differentiate attribute names and come 
up with a monster data category. Inline annotations are just not 
suitable for such information.

So, the first idea behind the approach below is: if you want to represent 
just one linguistic layer (or "qualifier" in Christian's mail at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
), you use the "tan-type" attribute to differentiate annotations. That 
leads to the following inline models:

1) A term has its-tan-type with value "term" and optional 
its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
<span its-tan-type="term" 
its-tan-ident-ref="http://termdatabase.example.com/entry37" 
its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0">Dublin</span>
Comparison to current ITS1 "Terminology":
its-tan-type="term" plays the role of term="yes". its-tan-info-ref plays 
the role of termInfoRef.  its-tan-ident-ref links to a term data base. 
its-tan-confidence provides confidence information.
(Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, I'm 
just trying to exemplify the annotation approach here)

2) An entity has its-tan-type with value "entity" and optional 
its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
<span its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
its-tan-confidence="0.7">Dublin</span>

So the above is only a different naming compared to the current 
"Terminology" and Disambiguation. Below is the standoff approach. The 
processing expectation for tools *producing the annotation* is like this:
- If there is no inline annotation, just create it (e.g. 1) or 2))
- If there is inline annotation, check if there is an id attribute (in 
HTML) or xml:id (if the XML serialization of HTML is used, and with lower 
precedence compared to id). If there is none, add one. For formats other 
than HTML, add xml:id if possible or use the id attribute appropriate for 
that format.

Then, for creating standoff annotations, add an 
"its:textAnalyticsAnnotations" element to the document, e.g. inside a 
"script" element in HTML if needed.

Let's assume before annotation we have
<span its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
its-tan-confidence="0.7">Dublin</span>
Then after annotation we would have
<span its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
its-tan-confidence="0.7" *id="a8"*>Dublin</span>
and this:
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term" 
its-tan-ident-ref="http://termdatabase.example.com/entry37" 
its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0"/>
</its:textAnalyticsAnnotations>
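
The producer steps and the example above could be sketched like this in 
Python. This is only an illustration under assumptions, not a proposed 
implementation: the function names annotate and fresh_id, the "a1", 
"a2", ... id scheme, and the choice to append the standoff container to 
the root element are all made up here. The element and attribute names 
are the ones proposed in this mail.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)
CONTAINER = "{%s}textAnalyticsAnnotations" % ITS_NS
STANDOFF = "{%s}textAnalyticsAnnotation" % ITS_NS


def fresh_id(root):
    """Return an id of the form a1, a2, ... not yet used in the document (assumed scheme)."""
    used = {e.get("id") for e in root.iter() if e.get("id")}
    n = 1
    while "a%d" % n in used:
        n += 1
    return "a%d" % n


def annotate(root, target, attrs, tool=None):
    """Annotate target inline if it is not annotated yet, otherwise as standoff."""
    if "its-tan-type" not in target.attrib:
        target.attrib.update(attrs)  # no inline annotation: just create it
        return None
    # Inline annotation exists: make sure there is an id to point at.
    node_id = target.get("id")
    if node_id is None:
        node_id = fresh_id(root)
        target.set("id", node_id)
    # Reuse the standoff container if present, else add one to the root.
    container = root.find(CONTAINER)
    if container is None:
        container = ET.SubElement(root, CONTAINER)
    standoff = ET.SubElement(container, STANDOFF, {"ref": node_id, **attrs})
    if tool:
        standoff.set("annotatorsRef", "tan|" + tool)
    return standoff


root = ET.fromstring(
    '<p><span its-tan-type="entity" '
    'its-tan-ident-ref="http://dbpedia.org/resource/Dublin" '
    'its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" '
    'its-tan-confidence="0.7">Dublin</span></p>'
)
span = root.find("span")
term = {
    "its-tan-type": "term",
    "its-tan-ident-ref": "http://termdatabase.example.com/entry37",
    "its-tan-info-ref": "http://termdatabase.example.com/entry37/description",
    "its-tan-confidence": "1.0",
}
standoff = annotate(root, span, term, tool="tool-x")
print(ET.tostring(root, encoding="unicode"))
```

The existing inline "entity" annotation is left untouched; the only 
change to the span is the added id.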

Let's now assume that before annotation we have
<span its-tan-type="term" 
its-tan-ident-ref="http://termdatabase.example.com/entry37" 
its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0">Dublin</span>
Then after annotation we would have
<span its-tan-type="term" 
its-tan-ident-ref="http://termdatabase.example.com/entry37" 
its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0" *id="a8"*>Dublin</span>
and this:
<its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
<its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
its-tan-confidence="0.7"/>
</its:textAnalyticsAnnotations>

Now, if several "entity" annotation tools have been used, we could also have
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
<its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
</its:textAnalyticsAnnotations>

The above approach would also influence the consumption of this data 
category, and of annotatorsRef:

- A consuming tool goes through the document and gathers all 
textAnalyticsAnnotations elements
- It then goes through the document. For each element node
* check for existing inline markup. If it's available, add it to the 
list of annotations for that node. Assume that the closest inline 
annotatorsRef, looking up the document tree, identifies the tool 
responsible for that markup.
* check the accumulated standoff textAnalyticsAnnotations elements for 
occurrences of IDs that match the node. If there is such an ID, add the 
related annotation to the list for the node, including the additional 
annotatorsRef tool, e.g. tool-x or tool-y in the above case.
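
The consumer steps above could be sketched like this in Python. Again 
this is only an illustration under assumptions: the function names 
collect and tan_attrs are made up, "its-annotators-ref" is assumed as 
the inline HTML spelling of annotatorsRef, "tool-a" is a hypothetical 
tool responsible for the inline markup, and a standoff annotation 
without its own annotatorsRef is assumed to take the one on its 
container.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ITS_NS = "http://www.w3.org/2005/11/its"
CONTAINER = "{%s}textAnalyticsAnnotations" % ITS_NS
STANDOFF = "{%s}textAnalyticsAnnotation" % ITS_NS


def tan_attrs(elem):
    """Return the its-tan-* attributes of an element as a dict."""
    return {k: v for k, v in elem.attrib.items() if k.startswith("its-tan-")}


def collect(root):
    """Return a list of (element, [annotation, ...]) for annotated element nodes."""
    # Step 1: gather all standoff annotations, keyed by the id they point to.
    standoff = defaultdict(list)
    for container in root.iter(CONTAINER):
        for ann in container.iter(STANDOFF):
            ref = ann.get("ref")
            if ref is None:
                continue
            tool = ann.get("annotatorsRef") or container.get("annotatorsRef")
            standoff[ref].append({**tan_attrs(ann), "annotatorsRef": tool})

    # Step 2: walk the element nodes, accumulating inline and standoff annotations.
    results = []

    def walk(elem, inherited):
        if elem.tag in (CONTAINER, STANDOFF):
            return
        tool = elem.get("its-annotators-ref", inherited)  # closest one up the tree
        annotations = []
        inline = tan_attrs(elem)
        if inline:
            annotations.append({**inline, "annotatorsRef": tool})
        annotations.extend(standoff.get(elem.get("id"), []))
        if annotations:
            results.append((elem, annotations))
        for child in elem:
            walk(child, tool)

    walk(root, None)
    return results


DOC = """<div xmlns:its="http://www.w3.org/2005/11/its" its-annotators-ref="tan|tool-a">
<span id="a8" its-tan-type="term"
  its-tan-ident-ref="http://termdatabase.example.com/entry37"
  its-tan-confidence="1.0">Dublin</span>
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
  its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
  its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
  its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
  its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
</its:textAnalyticsAnnotations>
</div>"""

results = collect(ET.fromstring(DOC))
for elem, annotations in results:
    for a in annotations:
        print(elem.get("id"), a["its-tan-type"], a["annotatorsRef"])
```

Nothing is inherited or overridden here: the annotations for a node are 
simply accumulated into one list, each with its own tool.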


In summary, this standoff approach tries to solve several issues:

- avoid the 16+ inline attribute monster data category
- allow for multiple annotations of the same span, with different tools
- avoid the ITS1/2 or general inline annotation issues with inheritance 
and overriding - as with the standoff approach exemplified at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
annotation information is just accumulated for a given base item (in our 
case, element nodes with an ID).

I'm not yet asking for this change, but I see it as a way forward that 
could make the life of both annotation producers (Marcis and Tadej) and 
consumers (Yves et al.) simpler. So I'm eager to hear thoughts on this :)

Thoughts?

- Felix

Received on Sunday, 27 January 2013 07:25:09 UTC