RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup from Mārcis Pinnis on 2013-01-28 (public-multilingualweb-lt@w3.org from January 2013)

From: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
Date: Mon, 28 Jan 2013 12:18:50 +0200
To: Felix Sasaki <fsasaki@w3.org>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
CC: Artūrs Vasiļevskis <arturs.vasilevskis@Tilde.lv>
Message-ID: <AC6FD4BB9BB02540AC7322091A6C3B5472B0DE2A1A@postal.Tilde.lv>
Hi Felix, all,

I see that there have been a lot of opinion exchanges on the proposal brought up by Felix.
I have some comments to add. I am now speaking as a data producer and later maybe also a data consumer (and I am not speaking as a linguist! ... that has to be understood as well).

First of all, I would like to ask whether we have agreed that ITS 2.0 should be able to represent data in the same structure as TEI, NIF, XCES or other NLP-related standards do. As far as I understand, that is the direction in which this discussion is heading. Should ITS 2.0 try to re-invent these data standards? I would be inclined to say no!

Secondly, as we are in the last call phase, I understand that such a significant change to the ITS 2.0 data categories would effectively rewrite them (maybe this will become clearer when you read my comments to the end). As a data producer, I would now have to rewrite my parsers and data-producing systems just to accommodate the "stand-off" mechanism. From a content provider's and content consumer's perspective, that is a fundamentally different kind of change from just adding independent attributes or renaming attributes (which was actually the initial proposal by Tadej and me). I would like others to understand that this solution asks for re-development rather than simple adjustments.

Other comments are inline below...

After reading the comments here is a summary:

In my understanding, the proposal complicates data production and consumption significantly, as it creates room for a lot of ambiguity, which I guess is the opposite of what the disambiguation data category was initially meant to do(!). At least in our use case, it requires a revision of the parser logic and the ITS 2.0 metadata annotation logic.
However, I will discuss this with my colleagues in order to estimate how many changes our use case would require from a development perspective.

I also understand that this proposal wants to fuse all types of possible NLP-related text analyses together, but I did not have the feeling that ITS 2.0 should be used as a TEI, XCES, NIF, etc. clone. That is where I see the changes leading us.
However, I am not saying that this is a bad thing... we would definitely make linguists happier, but I as a content provider, and later also a consumer, would have difficulties working with the data, as I would have to accept uncertainty/ambiguity in the ITS 2.0 metadata by default (except for external resources, as those are agreed between consumers and producers and not defined by ITS 2.0).

Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Sunday, January 27, 2013 9:25 AM
To: public-multilingualweb-lt@w3.org
Subject: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi all,

sorry, this is going to be long ... but please have a look, esp. the implementers (both consumers and producers) of terminology and disambiguation.

in the last 10 1/2 months, since Tadej's presentation at the Dublin workshop, we have had a lot of discussions on disambiguation, sometimes (as now) including terminology. But it seems that we never discussed whether the ITS2 approach of selection (global, local, inheritance, overriding (partial or not)...) is suitable for this type of information.

By "this type" I mean annotation of linguistic information. Most ITS2 and ITS1 data categories are process related (e.g. "Don't translate this ..."), but both terminology and what's now called disambiguation are information that you find in linguistic corpora and processing tools. Now, my point is that in both such natural language processing tool chains and related corpora, a representation of information *inline per document node* is rather the exception. Mostly you have *standoff information*, that is, a complete separation of information from actual content, as in NIF.

Mārcis:
Parsing and understanding the mark-up (how overriding and inheritance work) is the main difference that calls for this "stand-off" mechanism for "this type" of annotation. If there were only flat annotation, we would not be having this discussion at all. Also, "stand-off" is only good if you really have to add a lot of complex data, but here we have to add just a flag or a reference (to put it simply). In Prague, Tadej and I discussed that if hierarchical information is needed, it should be encoded in the external resources.

If I understand correctly, stand-off mark-up has no inheritance and no overriding; it just describes a span? If so, I assume that with your proposal we are back at requiring hierarchical annotation, overlapping annotation and contradictory annotation, which will allow all kinds of text analysis annotations (without restricted types such as term, entity, ontology, lexical, etc.). This will require data consumers to re-think their data consumption strategies, as they will have to disambiguate the "disambiguation-style" annotations (which means that in the end we do not help data consumers, but rather make their life more difficult).

In the current ITS 2.0 draft the annotation is flat - it is simple to parse, simple to consume, simple to produce – it is not hierarchical and it does not overlap.

From this perspective, the proposed change is a complete overhaul of the two data categories into something different.

Also, we do require the flag. That is something that will be heavily complicated by the "stand-off" mechanism (that has to be understood), or won't be possible at all?! Having a simple inline attribute is the simplest you can achieve. Having a "stand-off" mechanism, on the other hand, is the most complex you can achieve.

And ... if I remember correctly, we did not want to make life difficult for producers/consumers if they did not care about the other data categories?


Why is that? In linguistic annotation it is common that you have several layers of information, like our lexical, ontological etc. information. Some of these might be complex in itself (e.g. named entities), some of these might be related to others (e.g. an ontological concept related to a lexical item). I won't try to define these layers here - but my point is that due to the complexity of representing such information inline, nearly nobody is trying to represent several layers at the same time inline. The common approach is rather to have a base layer, and then pointers from the various annotation layers.

In a sense you can describe NIF as an approach of taking character offsets as the implicit base layer (implicit because characters don't need explicit anchors). The TEI here
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
provides an example for an offset using words as the base unit, with explicit xml:id attributes.

So far we haven't taken this approach for terminology or disambiguation. This is why we had to come up with 16+ attributes: if you want to do everything "inline", you need to differentiate attribute names and come up with a monster data category. Inline annotations are just not suitable for such information.

Mārcis:
I disagree that 16+ attributes are the difficulty here. The difficulty from the beginning were the questions: 1) how many types of annotation should be supported (we narrowed the list down to 4 – terminology, named entities, ontology concepts, lexical concepts)? 2) should overlapping be supported? 3) should hierarchical annotation be supported? 4) should contradicting annotation be supported?

Also ... data producers would have to worry about a maximum of 5 attributes at a time and would be able to ignore the rest. For instance, I have no use for the attributes for the disambiguation categories. Although I would agree to write a parser that parses all these attributes (just for compliance with the data category), as a consumer I would consume only the ones related to terminology, and as a producer I would produce only those related to terminology. I would neither consume nor produce the disambiguation-related attributes. From that perspective, I disagree that the attribute scenario is complex.

For terminology I require a flagging mechanism (with the possibility to add either a reference, a confidence score, or both).

I do agree that we are limiting the annotation with having separate attributes, but then again ... ITS 2.0 does not have to represent every possible text analysis annotation type. It is supposed to aid in localisation processes and not all text analysis types have a valid use case (or a necessary or even a potentially useful use case) in localisation.

Also ... if we are re-inventing terminology and disambiguation, maybe we should analyse which other data categories fall under the type "text analysis"? Domain is a suitable candidate as well (and if we create a suitable text analysis category, maybe domain analysis can be subcategorized under it as well, in order to support automated domain analysis solutions; EuroVoc has an automated domain classifier, for instance). With this I would like to emphasize that overgeneralization is not the best approach, as we are creating data categories for different consumption scenarios.


So, the first idea behind the approach below is: if you want to represent just one linguistic layer (or "qualifier" in Christian's mail at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
), you use the "tan-type" attribute to differentiate annotations. That leads to the following inline models:

1) A term has its-tan-type with value "term" and optional its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
<span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0">Dublin</span>
Comparison to current ITS1 "Terminology":
its-tan-type="term" plays the role of term="yes". its-tan-info-ref plays the role of termInfoRef. its-tan-ident-ref links to a term database. its-tan-confidence provides confidence information.
(Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, I'm just trying to exemplify the annotation approach here)

Mārcis:
Also, one thing I tried to emphasize at lunchtime in Prague: TermInfoRef is not necessarily an identity reference. It does not always point to something unique (if we understand that a set is not unique). You can have multiple term entries from multiple user collections in a term bank relating to one semantic term. If you do not specify a domain, you could end up with a reference that points to totally different (even contrasting) terms, and if you do not specify a target language, you may end up with multiple entries, because most of the collections are bilingual and not multilingual. Why is that so? It is because a term bank is not a disambiguator; it acts (more or less) like a search engine. The disambiguation of the "external" information (the meaning; the term unithood is defined by the flag term="yes" itself) has to be done by the consumers (translation engines or human translators). In most cases (as in the biggest term banks, IATE and ETB) a term bank does not have a hierarchical understanding of terms as some lexical (WordNet, for instance) or ontological resources may have. For MT engines, term="yes" alone is already valuable information, as it defines the term unithood, which means that the term should be translated as a non-breakable phrase. So ... the MT engine could ignore the TermInfoRef altogether if it does not have a suitable disambiguation module, and maybe leave the disambiguation to human post-editors...

So ... “ident” is misleading (at least in the case of Terminology annotation)!

Also important: HOW WOULD YOU REPRESENT term="no"? This is a very important feature of the flag-type annotation.

Would you say: its-tan-type="not-a-term"? That would require data producers to handle higher-complexity annotation!

2) An entity has its-tan-type with value "entity" and optional its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
<span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7">Dublin</span>

So the above is only a different naming compared to the current "Terminology" and "Disambiguation". Below is now the standoff approach. The processing expectation for tools *producing the annotation* is like this:
- If there is no inline annotation, just create it (e.g. 1) or 2))
- If there is inline annotation, check if there is an id attribute (in HTML) or xml:id (if the XML serialization of HTML is used, and with lower precedence compared to id), and add one if it is missing. For formats other than HTML, add xml:id if possible or use the id attribute appropriate for that format.

Then, for creating standoff annotations, add an "its:textAnalyticsAnnotations" element to the document, e.g. in HTML "script" if needed.

Let's assume before annotation we have
<span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7">Dublin</span>
Then after annotation we would have
<span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" id="a8">Dublin</span>
and this:
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0"/>
</its:textAnalyticsAnnotations>

Let's now assume that before annotation we have
<span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0">Dublin</span>
Then after annotation we would have
<span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0" id="a8">Dublin</span>
and this:
<its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7"/>
</its:textAnalyticsAnnotations>

Now, if several "entity" annotation tools have been used, we could also have
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
</its:textAnalyticsAnnotations>
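
The producer-side expectation described above could be sketched roughly as follows. This is only an illustration under assumptions, not part of the proposal: the function name `annotate`, the generated ID scheme ("tan1", "tan2", ...) and the choice to append the standoff container as the last child of the root are all hypothetical, and the sketch handles XML only (for HTML the container would go into a "script" element).

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)
CONTAINER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
STANDOFF = f"{{{ITS_NS}}}textAnalyticsAnnotation"


def annotate(root, target, attrs, id_prefix="tan"):
    """Add one annotation (a dict of its-tan-* attributes) to `target`.

    Hypothetical helper: inline if the node is not annotated yet,
    otherwise as a standoff entry pointing at the node's id.
    """
    # Case 1: no inline annotation yet, so just create it inline.
    if "its-tan-type" not in target.attrib:
        target.attrib.update(attrs)
        return target

    # Case 2: inline annotation exists, so make sure the node has an id.
    if target.get("id") is None:
        existing = {e.get("id") for e in root.iter() if e.get("id")}
        n = 1
        while f"{id_prefix}{n}" in existing:
            n += 1
        target.set("id", f"{id_prefix}{n}")

    # Reuse the standoff container if present, otherwise create it (assumed
    # here to be a direct child of the root element).
    container = root.find(CONTAINER)
    if container is None:
        container = ET.SubElement(root, CONTAINER)

    entry = ET.SubElement(container, STANDOFF, {"ref": target.get("id")})
    entry.attrib.update(attrs)
    return entry


root = ET.fromstring("<doc><p><span>Dublin</span></p></doc>")
span = root.find(".//span")
annotate(root, span, {"its-tan-type": "term", "its-tan-confidence": "1.0"})
annotate(root, span, {"its-tan-type": "entity", "its-tan-confidence": "0.7"})
print(ET.tostring(root, encoding="unicode"))
```

The first call writes the term annotation inline; the second leaves the inline markup untouched, gives the span an id and records the entity annotation as a standoff entry.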

The above approach would also influence the consumption of this data category, and of annotatorsRef:

- A consuming tool goes through the document and gathers all textAnalyticsAnnotations elements
- It then goes through the document. For each element node
* check for existing inline markup. If it's available, add it to the list of annotations for that node. Assume that the closest inline annotatorsRef up the document tree identifies the tool responsible for that markup.
* check the accumulated standoff textAnalyticsAnnotations elements for occurrences of IDs that match the node. If there is such an ID, add the related annotation to the list for the node, including the additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
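
To make the accumulation concrete, the consumer steps above could look roughly like this. It is a minimal sketch under assumptions: the function names `tan_attrs` and `collect_annotations` and the sample document are made up for illustration, the input is XML rather than HTML, and the lookup of an inherited inline annotatorsRef up the document tree is left out.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
CONTAINER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
STANDOFF = f"{{{ITS_NS}}}textAnalyticsAnnotation"

# Illustrative document (assumption): one inline term, two standoff entities.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <p><span id="a8" its-tan-type="term"
           its-tan-ident-ref="http://termdatabase.example.com/entry37"
           its-tan-confidence="1.0">Dublin</span></p>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</doc>"""


def tan_attrs(elem):
    """Return the its-tan-* attributes (plus annotatorsRef) of an element."""
    return {k: v for k, v in elem.attrib.items()
            if k.startswith("its-tan-") or k == "annotatorsRef"}


def collect_annotations(root):
    # Step 1: gather every standoff annotation, grouped by the id it points to.
    standoff = {}
    for container in root.iter(CONTAINER):
        for entry in container.findall(STANDOFF):
            record = tan_attrs(entry)
            # annotatorsRef on the container applies unless the entry has its own.
            if container.get("annotatorsRef"):
                record.setdefault("annotatorsRef", container.get("annotatorsRef"))
            standoff.setdefault(entry.get("ref"), []).append(record)

    # Step 2: walk the content nodes; accumulate inline first, then standoff.
    result = []
    for elem in root.iter():
        if elem.tag in (CONTAINER, STANDOFF):
            continue
        annotations = []
        if "its-tan-type" in elem.attrib:
            annotations.append(tan_attrs(elem))
        node_id = elem.get("id") or elem.get(XML_ID)  # id wins over xml:id
        if node_id is not None:
            annotations.extend(standoff.get(node_id, []))
        if annotations:
            result.append((elem, annotations))
    return result


for elem, annotations in collect_annotations(ET.fromstring(DOC)):
    print(elem.text, [a["its-tan-type"] for a in annotations])
```

Note that nothing is overridden here: every annotation found for a node is kept, so a consumer still has to decide what to do when the accumulated entries disagree.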


Mārcis:
Do I understand you correctly that we may also end up having contradicting annotations, for instance term="yes" and term="no"? This would require a data consumer to be able to handle a lot of ambiguity in the data.

In summary, this standoff tries to solve several issues:

- avoid the 16+ inline attribute monster data category
Mārcis:
Again, I did not understand why this is worse than having a heavy "stand-off" mechanism.

- allow for multiple annotations of the same span, with different tools
Mārcis:
In Prague Tadej and I had a discussion whether there is a use case for using two tools producing contradicting mark-up and we came to the conclusion that neither of us would produce such data and if such a scenario exists, then the content producer should fuse (disambiguate) the outputs of the two separate tools prior to ITS 2.0 metadata application. I am talking about the same type (for instance, two term annotation tools on the same span) of annotation, not two separate types.

Then my question: does such a scenario exist? Who is implementing it?
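
As an illustration of what "fusing" the outputs of two tools before applying the ITS 2.0 metadata could mean, here is a minimal sketch. The policy (keep the highest-confidence candidate per span and annotation type), the function name `fuse_annotations` and the record layout are assumptions made for this example only; a real producer may well fuse differently.

```python
def fuse_annotations(candidates):
    """Keep one annotation per (span id, type): the one with the highest confidence.

    Assumed policy for illustration; ties keep the first candidate seen.
    """
    best = {}
    for cand in candidates:
        key = (cand["ref"], cand["type"])
        if key not in best or cand["confidence"] > best[key]["confidence"]:
            best[key] = cand
    return list(best.values())


# Two entity tools disagree on confidence for the same span; one term annotation.
candidates = [
    {"ref": "a8", "type": "entity", "confidence": 0.7, "tool": "tool-x"},
    {"ref": "a8", "type": "entity", "confidence": 0.4, "tool": "tool-y"},
    {"ref": "a8", "type": "term", "confidence": 1.0, "tool": "tool-x"},
]
fused = fuse_annotations(candidates)
for annotation in fused:
    print(annotation["ref"], annotation["type"], annotation["tool"])
```

After such a step there is at most one annotation of each type per span, so the result can still be written as flat inline markup.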

- avoid the ITS1/2 or general inline annotation issues with inheritance and overriding - as with the standoff approach exemplified at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
annotation information is just accumulated for a given base item (in our case, element nodes with an ID).
Mārcis:
So ... at the end, with this method we would allow:
1) Hierarchical annotation
2) Contradicting annotation
3) (possibly also) overlapping annotation

I'm not yet asking for this change, but I see it as a way forward that could make the life of both annotation producers (Marcis and Tadej) and consumers (Yves et al.) simpler. So I'm eager to hear thoughts on this :)
Mārcis:
As I understand the proposal, it is the complete opposite of being simple (or of simplifying things compared to now, with Terminology and Disambiguation kept separate). It complicates things significantly from the Terminology standpoint: I no longer see where term="yes" fits in; we have to deal with contradicting annotation (whether to allow or prohibit it is now a question for the consumers; I as a consumer would ask to prohibit it, as I do not see a use case for term="yes" and term="no" at the same time); and, what is more, we have to re-implement the parsers so that instead of overriding and inheritance they work with accumulated information (which is a complete revision of the parser logic for the Terminology data category).


Thoughts?

- Felix
Received on Monday, 28 January 2013 10:19:28 UTC