Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup from Felix Sasaki on 2013-01-28 (public-multilingualweb-lt@w3.org from January 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 28 Jan 2013 08:44:13 +0100
To: "Lieske, Christian" <christian.lieske@sap.com>
CC: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <51062C4D.1090101@w3.org>
Hi Christian,

On 28.01.13 08:33, Lieske, Christian wrote:
>
> Hi Felix,
>
> Thanks for all of the work you put into this. Your analysis and 
> suggestions have informed my own understanding of the challenge and 
> possible solution.
>

Thanks for the nice feedback.

> Although details of your suggestion (e.g.  what I would phrase as 
> "only generate stand-off if the inline place already has been taken") 
> may require additional discussions, it would be great if at least the 
> Working Group members would be positive about them.
>

No worries about that - I'm very happy about the feedback received from 
Yves and Dave so far. And as said before, I'd encourage everybody on 
the implementers' side (consumers + producers of disambig+term) and on 
the users / "policy" side to chime in on the discussion. The more feedback 
the better :)

Best,

Felix


> Cheers,
>
> Christian
>
> *From:*Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Sunday, 27 January 2013 08:25
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* issue-68 from an annotation representation point of view, 
> with potential implications for annotatorsRef and standoff markup
>
> Hi all,
>
> sorry, this is going to be long ... but please have a look, esp. the 
> implementers (both consumers and producers) of terminology and 
> disambiguation.
>
> In the last 10 1/2 months, since Tadej's presentation at the Dublin 
> workshop, we have had a lot of discussions on disambiguation, sometimes 
> (as now) including terminology. But it seems that we never discussed 
> whether the ITS2 approach of selection (global, local, inheritance, 
> overriding (partial or not)...) is suitable for this type of information.
>
> By "this type" I mean annotation of linguistic information. Most ITS2 
> and ITS1 data categories are process related (e.g. "Don't translate 
> this ..."), but both terminology and what's now called disambiguation 
> are information that you find in linguistic corpora and processing 
> tools. Now, my point is that in both such natural language 
> processing tool chains and related corpora, a representation of 
> information *inline per document node* is rather the exception. Mostly 
> you have *standoff information*, that is, a complete separation of 
> information from actual content - as in NIF.
>
> Why is that? In linguistic annotation it is common that you have 
> several layers of information, like our lexical, ontological etc. 
> information. Some of these might be complex in themselves (e.g. named 
> entities), some of these might be related to others (e.g. an 
> ontological concept related to a lexical item). I won't try to define 
> these layers here - but my point is that due to the complexity of 
> representing such information inline, hardly anybody tries to 
> represent several layers inline at the same time. The common approach 
> is rather to have a base layer, and then pointers from the various 
> annotation layers.
>
> In a sense you can describe NIF as an approach of taking character 
> offsets as the implicit base layer (implicit because characters don't 
> need explicit anchors). The TEI here
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> provides an example for an offset using words as the base unit, with 
> explicit xml:id attributes.
>
> So far we haven't taken this approach for terminology or 
> disambiguation. This is why we had to come up with 16+ attributes: if 
> you want to do everything "inline", you need to differentiate 
> attribute names and you end up with a monster data category. Inline 
> annotations are just not suitable for such information.
>
> So, the first idea behind the approach below is: if you want to represent 
> just one linguistic layer (or "qualifier" in Christian's mail at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
> ), you use a "tan-type" attribute to differentiate annotations. That 
> leads to the following inline models:
>
> 1) A term has its-tan-type with value "term" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Comparison to current ITS1 "Terminology":
> its-tan-type="term" plays the role of term="yes". its-tan-info-ref 
> plays the role of termInfoRef. its-tan-ident-ref links to a term data 
> base. its-tan-confidence provides confidence information.
> (Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, 
> I'm just trying to exemplify the annotation approach here)
>
> 2) An entity has its-tan-type with value "entity" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
>
> So the above is only a different naming compared to the current 
> "Terminology" and Disambiguation. Below is the standoff approach. The 
> processing expectation for tools *producing the annotation* is like this:
> - If there is no inline annotation, just create it (e.g. 1) or 2))
> - If there is inline annotation, check if there is an id attribute (in 
> HTML) or xml:id (if the XML serialization of HTML is used, and with lower 
> precedence compared to id). For formats other than HTML, add xml:id if 
> possible or use the id attribute appropriate for that format.
>
> Then, for creating standoff annotations, add an 
> "its:textAnalyticsAnnotations" element to the document, e.g. in HTML 
> "script" if needed.
>
> Let's assume before annotation we have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" id="a8">Dublin</span>
> and this:
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0"/>
> </its:textAnalyticsAnnotations>
>
> Let's now assume that before annotation we have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0" id="a8">Dublin</span>
> and this:
> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7"/>
> </its:textAnalyticsAnnotations>
>
> Now, if several "entity" annotation tools have been used, we could 
> also have
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
> </its:textAnalyticsAnnotations>
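
The producer-side expectation described above ("only generate standoff if the inline place has already been taken") could be sketched roughly as follows in Python. This is only an illustration under assumptions, not an agreed processing model: the function name `annotate`, its parameters, the return values and the ITS namespace binding are made up for the sketch, the element names are the ones proposed in this mail, and only the plain `id` attribute is handled (not `xml:id`).

```python
import xml.etree.ElementTree as ET

# Assumption: the its: prefix is bound to the ITS namespace URI.
ITS_NS = "http://www.w3.org/2005/11/its"
CONTAINER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"
ET.register_namespace("its", ITS_NS)


def annotate(root, target, annotation, tool, new_id):
    """Attach `annotation` (a dict of its-tan-* attributes) to `target`.

    Hypothetical helper: returns "inline" or "standoff" to say where
    the annotation ended up.
    """
    has_inline = any(name.startswith("its-tan-") for name in target.attrib)
    if not has_inline:
        # The inline place is free: write the attributes onto the element.
        # `tool` is not recorded here; inline annotatorsRef is inherited.
        target.attrib.update(annotation)
        return "inline"

    # The inline place is taken: make sure the element can be referenced.
    if target.get("id") is None:
        target.set("id", new_id)

    # Reuse an existing standoff container, otherwise create one.
    container = next(root.iter(CONTAINER), None)
    if container is None:
        container = ET.SubElement(root, CONTAINER)

    ET.SubElement(
        container,
        ITEM,
        {"ref": target.get("id"), "annotatorsRef": tool, **annotation},
    )
    return "standoff"
```

Calling `annotate` twice on the same span, first with a "term" annotation and then with an "entity" annotation, would leave the term inline and move the entity into the standoff container, which matches the second before/after example above.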
>
> The above approach would also influence the consumption of this data 
> category, and of annotatorsRef:
>
> - A consuming tool goes through the document and gathers all 
> textAnalyticsAnnotations elements
> - It then goes through the document. For each element node
> * check for existing inline markup. If it's available, add it to the 
> list of annotations for that node. Assume the inline annotatorsRef 
> found up in the document tree to be responsible for the annotation of 
> that markup.
> * check the accumulated standoff textAnalyticsAnnotations elements for 
> occurrences of IDs that match the node. If there is such an ID, add 
> the related annotation to the list for the node, including the 
> additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
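
The consumer steps above could be sketched in Python roughly like this. It is a minimal illustration under assumptions: the function names, the sample document and "tool-a" are invented for the sketch, the inline annotators reference is assumed to be spelled `its-annotators-ref` as in HTML, the ITS namespace binding is assumed, and only the plain `id` attribute is matched.

```python
import xml.etree.ElementTree as ET

# Assumption: the its: prefix is bound to the ITS namespace URI.
ITS_NS = "http://www.w3.org/2005/11/its"
CONTAINER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"


def tan_attributes(element):
    """Return the its-tan-* attributes of an element as a plain dict."""
    return {k: v for k, v in element.attrib.items() if k.startswith("its-tan-")}


def collect_annotations(root):
    """Return a list of (element, annotations) pairs for annotated nodes."""
    # Step 1: gather all standoff annotations, keyed by the id they point to.
    standoff = {}
    for container in root.iter(CONTAINER):
        container_tool = container.get("annotatorsRef")
        for item in container.iter(ITEM):
            ref = item.get("ref")
            if ref is None:
                continue  # nothing to attach this annotation to
            record = tan_attributes(item)
            record["annotatorsRef"] = item.get("annotatorsRef", container_tool)
            standoff.setdefault(ref, []).append(record)

    # Step 2: walk the element nodes, carrying the inherited annotators ref.
    result = []

    def visit(element, inherited_tool):
        if element.tag == CONTAINER:
            return  # standoff containers are not content nodes
        tool = element.get("its-annotators-ref", inherited_tool)
        found = []
        inline = tan_attributes(element)
        if inline:
            inline["annotatorsRef"] = tool
            found.append(inline)
        node_id = element.get("id")
        if node_id is not None:
            found.extend(standoff.get(node_id, []))
        if found:
            result.append((element, found))
        for child in element:
            visit(child, tool)

    visit(root, None)
    return result


SAMPLE = """
<body xmlns:its="http://www.w3.org/2005/11/its" its-annotators-ref="tan|tool-a">
  <p>Welcome to <span id="a8" its-tan-type="term"
      its-tan-ident-ref="http://termdatabase.example.com/entry37"
      its-tan-confidence="1.0">Dublin</span>.</p>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</body>
"""

if __name__ == "__main__":
    for element, annotations in collect_annotations(ET.fromstring(SAMPLE.strip())):
        for annotation in annotations:
            print(element.get("id"), annotation["its-tan-type"],
                  annotation["annotatorsRef"])
```

Note that annotations are only accumulated per node here; nothing is inherited or overridden except the inline annotators reference.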
>
>
> In summary, this standoff approach tries to solve several issues:
>
> - avoid the 16+ inline attribute monster data category
> - allow for multiple annotations of the same span, with different tools
> - avoid the ITS1/2 or general inline annotation issues with 
> inheritance and overriding - as with the standoff approach 
> exemplified at
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> annotation information is just accumulated for a given base item (in 
> our case, element nodes with an ID).
>
> I'm not yet asking for this change, but I see it as a way forward that 
> could make the life of both annotation producers (Marcis and Tadej) 
> and consumers (Yves et al.) simpler. So I'm eager to hear thoughts on 
> this :)
>
> Thoughts?
>
> - Felix
>
Received on Monday, 28 January 2013 07:44:41 UTC