RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup from Mārcis Pinnis on 2013-01-29 (public-multilingualweb-lt@w3.org from January 2013)

From: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
Date: Tue, 29 Jan 2013 20:26:07 +0200
To: Felix Sasaki <fsasaki@w3.org>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <AC6FD4BB9BB02540AC7322091A6C3B5472B0F011B3@postal.Tilde.lv>
Hi Felix, all,

I have replied inline (in a color close to cyan).

Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Tuesday, January 29, 2013 7:24 PM
To: public-multilingualweb-lt@w3.org
Subject: Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi Mārcis, all,


Am 29.01.13 15:48, schrieb Mārcis Pinnis:
Hi Felix,

My comments are inline.

Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Tuesday, January 29, 2013 11:27 AM
To: Mārcis Pinnis
Cc: public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
Subject: Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi Mārcis, all,

even if this discussion has now continued in a different thread, let me give further feedback here too - it may help to clarify things, and to continue the discussion in general.


Am 28.01.13 11:18, schrieb Mārcis Pinnis:
Hi Felix, all,

I see that there have been a lot of opinion exchanges on the proposal brought up by Felix.
I have some comments to add. I am now speaking as a data producer and later maybe also a data consumer (and I am not speaking as a linguist! ... that has to be understood as well).

First of all, I would like to ask whether we agreed that ITS 2.0 should be able to represent data in the structure as TEI, NIF, XCES or other NLP related standards do – that is, as far as I understand, the direction where this discussion is heading. Should ITS 2.0 try to re-invent these data standards? I would incline to saying – no!


As far as I understand, these standards are not yet implemented in localization tool chains. However, the "multilayer annotation" proposal brought one feature from these standards into such tool chains: the standoff mechanism. I'd rather see this as a value than a problem: bringing NLP friendly representations into localization workflows. Would you disagree?

Mārcis: From the perspective of adding all kinds of annotation, overlapping, contradicting, hierarchical, it certainly is beneficial (I do agree in this aspect).
Mārcis: From the perspective of implementation:
Mārcis: 1) for consumers you suggest reading only known mark-up. I do agree that if we would care only about one tool then we could ignore the rest. But this asks consumers to know who produced the mark-up. By having a flat level flag the consumers did not have to worry about who produced the annotation (also – human and machine users could apply annotation and have an effect on the data); they just read the annotation and used it as is. This is not possible in the stand-off mechanism – the consumer has to know which producer to trust in order to consume the data; otherwise the consumer has to have a disambiguation module at hand that tries to find some reason in all the annotations.
Mārcis: 2) for producers the stand-off mark-up requires adaptation (more than just adding attributes, but still adaptation), which probably is not a big issue.

Mārcis: We (Tilde) are doing both right now – we consume and we produce Terminology. But ... we could switch to just producing and not consuming (which is the part that worries me more...). So we would not have to deal with the disambiguation of the stand-off mark-up and also which annotator to trust or not.

Secondly, as we are in a last call phase, I understand that such a significant change to the ITS 2.0 data categories would rewrite them (and maybe it will get clearer when you read my comments till the end). I as a data producer would now have to rewrite my parsers and data-producing systems just to accommodate the „stand-off” mechanism, which from a content provider's and content consumer's perspective is a fundamentally different change from just adding additional independent attributes or changing the names of attributes (which was actually the initial proposal by Tadej and me). I would like others to understand that this solution asks for re-development rather than simple adjustments.

I agree - this would be quite some work, and we need to justify the benefit clearly.

Mārcis: The change will affect our Showcase the most as right now we rely on inline mark-up. If we won’t have the inline mark-up at the end (or we will have additional stand-off mark-up) then we will have to re-think the Showcase design and the visualisation possibilities in the Showcase.


Is this because of CSS used for visualization? I agree that with standoff markup visualization gets more complicated - but not that much. See the JavaScript bit used for localization quality issue here
http://www.w3.org/TR/2012/WD-its20-20121206/#EX-locQualityIssue-html5-local-2
http://www.w3.org/TR/2012/WD-its20-20121206/examples/html5/EX-locQualityIssue-html5-local-2.html

The resolution of the ID is only a few lines of javascript code.

Mārcis: Thank you for the hint to the javascript code. If we will have an agreement that the stand-off mechanism has to be applied, we will definitely look into this (if our developers won’t already have a solution).

Other comments are inline below...

After reading the comments here is a summary:

In my understanding the proposal complicates data production and consumption significantly as it creates possibilities for a lot of ambiguity, which I guess is the opposite of what initially was meant by the disambiguation data category(!) and at least in our Use Case it requires revision of parser logics and ITS 2.0 metadata annotation logics.

The proposal basically says: here is a way to represent ambiguity, created by several tools annotating the same document. However, I'd see this as a value, not a problem: with separate "its:textAnalyticsAnnotations" elements, including each its own annotatorsRef, you can clearly identify which tool created what annotation. This may be even clearer than the current annotatorsRef.


Mārcis: I do agree, however see above for my comment related to consumers. For them the consumption is different – they will have to know whom to trust. If you think that consumers have to know who produced the data all the time then it is fine by me... (but it is a change from the Terminology as it is right now).

I got your point about trust. And - trying to bring the discussion back - the initial comment was not about trust or no trust. It was rather about unifying terminology and disambiguation - and by reflecting the levels mentioned in disambiguation in different annotation levels, I tried to find a work-around for this. But that work-around and multilayer annotation are not the main topic.

So, if we drop disambiguation granularity, keep the term yes / no requirement, we may have this representation as a unified approach for both data categories:


<span its-tan-confidence="0.7" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
its-term="no">Dublin</span>

If we define that its-term="yes" triggers its-tan-ident-ref to be interpreted as a reference to a termDB, we would have unified the data categories. Well, I guess I'm trying to use an axe for moving this forward ... but let's see what you think.
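
The unification rule described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the attributes of a span are already available as a dictionary; the function name interpret_span and the keys of the returned dictionary are hypothetical and not taken from any ITS 2.0 draft.

```python
def interpret_span(attrs):
    """Interpret the unified attribute set discussed in this thread.

    Assumption: its-term="yes" makes its-tan-ident-ref a reference into a
    term database; in every other case it is read as an entity identifier.
    """
    ident = attrs.get("its-tan-ident-ref")
    confidence = attrs.get("its-tan-confidence")
    result = {
        "is_term": attrs.get("its-term") == "yes",
        "confidence": float(confidence) if confidence is not None else None,
    }
    if result["is_term"]:
        # The class reference is not consulted for terms in this sketch.
        result["term_db_entry"] = ident
    else:
        result["entity"] = ident
        result["entity_class"] = attrs.get("its-tan-class-ref")
    return result


# The "Dublin" span from the example above, as an attribute dictionary.
dublin = interpret_span({
    "its-tan-confidence": "0.7",
    "its-tan-class-ref": "http://nerd.eurecom.fr/ontology#Place",
    "its-tan-ident-ref": "http://dbpedia.org/resource/Dublin",
    "its-term": "no",
})
```

With its-term="no" the identifier ends up under "entity"; switching the flag to "yes" would move the same reference under "term_db_entry" without changing any attribute name.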

Mārcis: 1. Let’s have a summary on this:

1.1. We have inline mark-up and stand-off mark-up

a. The stand-off mark-up has no inheritance and no overriding, but works on spans, allowing hierarchical annotation and also conflicting annotations - OK

b. The inline mark-up has no inheritance, which means that nested mark-up is basically treated differently than in the alternative stand-off mark-up (won’t we create two ways to understand the annotation? It seems that the inline mark-up is not compatible with the stand-off mark-up)

1.2. We drop the granularity and let the class-ref or the ident-ref carry the information about what the disambiguation categories are (only if there isn’t a term=”yes”) - OK

1.3. If there is term=”yes”, we ignore the class-ref as that is not necessary, have the confidence optional and have the ident-ref optional. If the ident-ref is used, it points to a term-bank. - OK

2. There are generally 3 possibilities for its-term as I see it if we do not have conflicting annotations (but I understand that there may be):

2.1. It is not given at all – that means that we can apply mark-up if we would like to

2.2. It says its-term=”yes” – we do not apply mark-up, because existing mark-up already says that a phrase is a term (we might however try adding a reference to a term base if the reference does not exist and the borders of the spans match)

2.3. It says its-term=”no” – we do not apply mark-up, because existing mark-up already says that a phrase is definitely not a term

3. Keeping in mind that we may have the stand-off principle, this creates the following possibilities for data production (somewhat different from the previous ones):

3.1. If there is no mark-up – we apply it if we want to

3.2.a) If there is mark-up – we ignore it and apply our mark-up if we want to (this way we would create the conflicting mark-ups), or:

3.2.b) If there is mark-up – we do not apply mark-up as there already is existing mark-up for a given span (this way we would accumulate all stand-off mark-ups and inline mark-ups and mark only the fragments that have no span under a term mark-up)

For consumers –3.2.a is better if we ask consumers to trust producers, 3.2.b is better if we do not have to express trust to producers.
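
The production options 3.1, 3.2.a and 3.2.b can be written down as a small decision function. This is a rough sketch only: the Policy enum and the function name producer_should_annotate are made up for illustration, and "if we want to" is simplified to "yes".

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical names for the two producer strategies in item 3.2."""
    ADD_ALONGSIDE = "3.2.a"   # annotate anyway, accepting conflicting mark-up
    SKIP_ANNOTATED = "3.2.b"  # leave spans alone that already carry mark-up


def producer_should_annotate(span_has_markup, policy):
    """Decide whether a producer adds its own annotation to a span."""
    if not span_has_markup:
        return True  # option 3.1: nothing there yet
    return policy is Policy.ADD_ALONGSIDE
```

Under Policy.ADD_ALONGSIDE the consumer later has to pick a producer to trust; under Policy.SKIP_ANNOTATED every span carries at most one annotation and no such choice is needed.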

In a summary of the whole comment: 1) to simply get rid of the „red” 1.1.b. issue above, we would apply only stand-off mark-up, ignoring in-line mark-up possibilities; 2) depending on the decision on whether consumers have to trust producers or not, we would produce data according to 3.2.a) or 3.2.b); and 3) we would consume data trusting only our annotation tool if we have to trust producers, or consolidate annotations if trust is not needed.

This is how we would proceed if this scenario moves forward.

Best,

Felix





However, I will have a discussion with my colleagues in order to estimate how much changes would be required to our use case from a development perspective.

I also understand that this proposal wants to fuse all types of possible NLP-related text analyses together, but I did not have the feeling that ITS 2.0 should be used as a TEI, XCES, NIF, etc. clone? This is how I see where the changes will lead us.
However, I also do not say that that is a bad thing... we would definitely make linguists happier, but I as a content provider and later also a consumer would have difficulties working with the data, as I would have to agree to accept uncertainty/ambiguity in the ITS 2.0 metadata by default (except external resources, as those are defined between consumers/producers and not ITS 2.0).

Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Sunday, January 27, 2013 9:25 AM
To: public-multilingualweb-lt@w3.org
Subject: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi all,

sorry, this is going to be long ... but please have a look, esp. the implementers (both consumers and producers) of terminology and disambiguation.

in the last 10 1/2 months, since Tadej's presentation at the Dublin workshop, we had a lot of discussions on disambiguation, and sometimes (as now) including terminology. But it seems that we never discussed whether the ITS2 approach of selection (global, local, inheritance, overriding (partial or not)...) is suitable for this type of information.

By "this type" I mean annotation of linguistic information. Most ITS2 and ITS1 data categories are process related (e.g. "Don't translate this ..."), but both terminology and what's now called disambiguation are information that you find in linguistic corpora and processing tools. Now, my point is that both in such natural language processing tool chains and in related corpora, a representation of information *inline per document node* is rather the exception. Mostly you have *standoff information*, that is a complete separation of information from actual content - as in NIF.

Mārcis:
Parsing and understanding of the mark-up is the main difference (how overriding and inheritance work) that requires this „stand-off” mechanism for „this type” of annotation. If there were only flat level annotation, we would not have this discussion at all. Also, “stand-off” is only good if you really have to add a lot of complex data, but here we have to add just a flag or a reference (if put in simple words). In Prague Tadej and I discussed that if hierarchical information is needed, that should be encoded in the external resources.

If I understand correctly, stand-off mark-up has no inheritance and it has no overriding – it describes a span?

Correct.



If so, I assume that with your proposal we are back at requiring hierarchical annotation, overlapping annotation and contradictory annotation, which will allow all kinds of text analysis annotations (without restricted types – term, entity, ontology, lexical, etc.). This will require data consumers to re-think their data consumption strategies as they will have to disambiguate the “disambiguation-style” annotations (which means that in the end we do not help data consumers, but rather make their life more difficult).

As said above: if a consumer doesn't want to deal with several layers of annotations, it can just say: I want to consume the annotations made by Tilde or by JSI. This is guaranteed by the annotatorsRef attribute.

Mārcis: Again, see my comment above about the consumer having to know whom to trust.

The current status quo creates this situation: if Tilde already has annotated a text, and JSI wants to add annotations, and you want to compare them: how to do this? You can say "one creates terminology markup, the other disambiguation markup". But what about even more tools?


Mārcis: I agree, currently you can have only one Terminology annotation tool (the disambiguation is not a nice example), but in the current version we acknowledge that there can be only one Terminology annotator for a single phrase (I am fine with that). I understand that the stand-off is a possible way how to solve this issue, but for the previous consumer this will create non-comparable annotations (or he will have to update to consuming the ambiguous annotations or just trust one of the annotations). For future consumers this might as well be acceptable.

In the current ITS 2.0 draft the annotation is flat - it is simple to parse, simple to consume, simple to produce – it is not hierarchical and it does not overlap.

See above - if a consumer does not want to consume relations between annotation tools or levels, you don't have to, and annotatorsRef gives you the ability to differentiate the annotations.

Btw., current disambiguation and terminology also don't inherit, see the table at
http://www.w3.org/TR/2012/WD-its20-20121206/#datacategories-defaults-etc
that is: the annotations of both data categories don't inherit to nested markup. So we could resolve the issue also via something like this:
Input before annotation: Dublin
First annotation: <span its-term="yes">Dublin</span>
Second annotation: <span its-term="yes"><span its-term="no">Dublin</span></span>


Mārcis: As I understand this disambiguates Terminology mark-up from different producers automatically? This is possible in the current version.
The non-existent inheritance, of course, requires the marked span not to contain other mark-up. This, of course, is a limitation of the current version and, I agree, might be solved by the stand-off mechanism.

But that has the annotatorsRef issue if several "term annotation" tools have been used.

Mārcis: In the current version we as a consumer treat all Terminology annotators as equally important, thus the annotatorsRef for us is not necessary. However, for data tracking purposes this might be important and with the stand-off mechanism the annotatorsRef becomes mandatory (at least in our consumption scenario).

From this perspective, the proposed change is a complete overhaul of the 2 data categories into something different.

Also – we do require the flag. That is something that will be heavily complicated with the “stand-off” mechanism (that has to be understood), or won’t be possible at all?!

Setting the type would give you the flag. I know that in the proposal at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0212.html
Tadej dropped the flag. But we could have, instead of fixed values, e.g. "term", a URI. You could then interpret that URI as a term flag, e.g.
<span tan-type="http://example.com/term">

Mārcis: I agree that it gives term=“yes”, but not term=”no”.


Having a simple attribute inline is the simplest you can achieve. Having a “stand-off” on the other hand is the most complex you can achieve.

And ... if I remember correctly, we did not want to make life difficult for producers/consumers if they did not care about the other data categories?


Correct, but here we have the situation that two data categories might be just too similar for keeping everything as is.





Why is that? In linguistic annotation it is common that you have several layers of information, like our lexical, ontological etc. information. Some of these might be complex in itself (e.g. named entities), some of these might be related to others (e.g. an ontological concept related to a lexical item). I won't try to define these layers here - but my point is that due to the complexity of representing such information inline, nearly nobody is trying to represent several layers at the same time inline. The common approach is rather to have a base layer, and then pointers from the various annotation layers.

In a sense you can describe NIF as an approach of taking character offsets as the implicit base layer (implicit because characters don't need explicit anchors). The TEI here
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
provides an example for an offset using words as the base unit, with explicit xml:id attributes.

So far we haven't taken this approach for terminology or disambiguation. This is why we had to come up with 16+ attributes: if you want to do everything "inline", you need to differentiate attribute names and come up with a monster data category. Inline annotations are just not suitable for such information.

Mārcis:
I disagree that 16+ attributes are the difficulty here. The difficulty from the beginning were the questions: 1) how many types of annotation should be supported (we narrowed the list down to 4 – terminology, named entities, ontology concepts, lexical concepts)? 2) should overlapping be supported? 3) should hierarchical annotation be supported? 4) should contradicting annotation be supported?

about 1): no type at all would be one solution, but the term identifier issue would come up. about 2): if a consumer just takes up one annotation, e.g. the output of Tilde's tool, there is no need to process overlap. And we can leave that to consumers IMO. 3): same as 2). 4): same as 2).

Mārcis: 1) I agree that the type itself is important. I know that Tadej said that a Ref URI might have the Type embedded, but for Terminology we do not always have the URI available.
Mārcis: 2-4) I agree, but only if you ask the consumers to know which producer to trust. If that is not an issue, then it is fine by me (it is a compromise as we lose the ability to not have to trust anyone at all)

Also ... data producers would have to worry just about a maximum of 5 attributes simultaneously and they would be able to ignore the rest. For instance, I have no use for the attributes for disambiguation categories.

I think that's the heart of issue-68: there are two quite similar pieces of information, but consumers separate them.




Although I would agree to writing a parser that parses all these attributes (just for compliance with the data category), I would as a consumer consume only the ones related to terminology and I as a producer would produce only those related to terminology. I would neither consume nor produce the disambiguation related attributes.

That wouldn't work if we have one data category: our conformance requirements say: you implement it global or local or both. You then can also decide whether you implement it in HTML or XML or both. But you cannot cherry pick attributes for consumption. We don't say anything wrt production - but our schema helps us to verify that the "right data" has been produced.


Mārcis: I think you are misunderstanding – parsing content for consumption and production is a totally different architectural level than the logics that makes any use of the content. So ... in my understanding we are not failing on conformance. We are if there is a requirement that we have to really consume and really produce the other types of data in the application logics layers. Is this the case?



From that perspective, I disagree with the complexity in the attribute scenario.

I think part of the disagreement comes from the "free spirit" you have as data producer and consumer, see above.





For terminology I require a flagging mechanism (with the possibility to add either a reference, a confidence score, or both).

I do agree that we are limiting the annotation with having separate attributes, but then again ... ITS 2.0 does not have to represent every possible text analysis annotation type. It is supposed to aid in localisation processes and not all text analysis types have a valid use case (or a necessary or even a potentially useful use case) in localisation.

Also ... if we are re-inventing terminology and disambiguation, maybe we should analyse which other data categories fall under the type “text analysis”? Domain is a suitable candidate as well (and if we create a suitable text analysis category, maybe domain analysis can be subcategorized under that as well in order to support automated domain analysis solutions (EuroVoc has an automated domain classifier, for instance))?).


Here I would disagree: our domain data category is just for transporting domain information between content and tools, including a potential mapping of domain identifiers in between. The "terminology vs disambiguation" discussion came from the observation that two data categories in ITS2 have a huge overlap. I don't see that situation for domain.


Mārcis: I do not agree that domain identification is in structure different than terminology annotation or named entity recognition or even sentence breaking, but fair enough ... I brought this up to show that in general domain annotation is also annotation... an equal in structure task as term tagging or named entity recognition (usually just in bigger spans – but not necessarily).

With this I would like to emphasize that overgeneralization is not the best approach as we are creating data categories for different consumption scenarios.

But are they so different? It sounds to me rather that in your scenario, many opportunities are lost because you don't consume disambiguation at all ... so having one umbrella data category might even give you more data consumption opportunities.

Mārcis: :) ... of course, the more annotation, the more possibilities (this is a philosophical truth – I cannot argue). But for producing Terminology in our Use Case we do not require knowledge on the purely semantic lexical, ontological or entity level. Our Use Case uses knowledge from the tools that are applied in the process and they do not ask for information supplied by those data categories (we do require domain, language and others though ... that we, of course, use). The use of Disambiguation data categories would require re-thinking of the modules that do not deal with ITS 2.0 explicitly – the term extraction, term weighing, term retrieval methods which are out-of-scope in this project.


So, the first idea behind below approach is: if you want to represent just one linguistic layer (or "qualifier" in Christian's mail at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
) , you use the "tan-type" attribute to differentiate annotations. That leads to the following inline models:

1) A term has its-tan-type with value "term" and optional its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
<span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0">Dublin</span>
Comparison to current ITS1 "Terminology":
its-tan-type="term" plays the role of term="yes". its-tan-info-ref plays the role of termInfoRef.  its-tan-ident-ref links to a term data base. its-tan-confidence provides confidence information.
(Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, I'm just trying to exemplify the annotation approach here)



Mārcis:
Also one thing I tried to emphasize at lunchtime in Prague, TermInfoRef is not necessarily an identity reference. It does not always point to something unique (if we understand that a set is not unique). You can have multiple term entries from multiple user collections in a term bank relating to one semantic term. In the case if you do not specify a domain you could end up having a reference that points to totally different (also contrasting) terms or if you do not specify a target language you may end up having multiple entries because most of the collections are bilingual and not multilingual. Why is that so? It is because a term-bank is not a disambiguator – it acts like a search engine (more or less) – the disambiguation for the “external” information (the meaning; the term unithood is defined by the flag term=”yes” itself) has to be done by the consumers (translation engines or human translators). In most cases (as in the biggest term-banks – IATE, ETB) it does not have a hierarchical understanding of terms as some lexical (WordNet, f.i.) or ontological resources may have. For MT engines a valuable information is already – term=“yes” as that defines the term unithood, which means that the term should be translated as a non-breakable phrase. So ... the MT engine could ignore the TermInfoRef at all if it does not have a suitable disambiguation module and maybe leave the disambiguation to human post-editors...

So ... “ident” is misleading (at least in the case of Terminology annotation)!

Also important: HOW WOULD YOU REPRESENT term=”no”? This is a very important feature of the flag type annotation.

would you say: its-tan-type="not-a-term"? That would require data producers to handle higher complexity annotation!


I don't have a clear answer to above questions - others, feel free to chime in if you do.


Mārcis: This is important to understand. Will this be dropped at all or will there be an alternative mechanism?

2) An entity has its-tan-type with value "entity" and optional its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
<span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7">Dublin</span>

So above is only different naming compared to current "Terminology" and Disambiguation. Below is now the standoff approach. The processing expectation for tools *producing the annotation* is like this:
- If there is no inline annotation, just create it (e.g. 1) or 2))
- If there is inline annotation, check if there is an id attribute (in HTML) or xml:id (if XML serialization of HTML is used and with lower precedence compared to id). For formats other than HTML, add xml:id if possible or use the id attribute appropriate for that format.

Then, for creating standoff annotations, add an "its:textAnalyticsAnnotations" element to the document, e.g. in HTML "script" if needed.

Let's assume before annotation we have
<span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7">Dublin</span>
Then after annotation we would have
<span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" id="a8">Dublin</span>
and this:
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0"/>
</its:textAnalyticsAnnotations>

Let's now assume that before annotation we have
<span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0">Dublin</span>
Then after annotation we would have
<span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0" id="a8">Dublin</span>
and this:
<its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7"/>
</its:textAnalyticsAnnotations>

Now, if several "entity" annotation tools have been used, we could also have
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
</its:textAnalyticsAnnotations>
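
The producer-side expectation described above (annotate inline if the node is still unannotated, otherwise make sure the node has an ID and add a standoff entry) could look roughly like this in Python. It is a sketch under several assumptions: the input is well-formed XML parsed with xml.etree.ElementTree, only the plain id attribute is handled, and the helper names fresh_id and add_annotation as well as the generated IDs "a1", "a2", ... are invented for the example.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)
GROUP = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"


def fresh_id(root):
    """Return an ID of the form a1, a2, ... that is not yet used."""
    used = {e.get("id") for e in root.iter() if e.get("id") is not None}
    n = 1
    while f"a{n}" in used:
        n += 1
    return f"a{n}"


def add_annotation(root, target, attrs, tool):
    """Add its-tan-* attributes inline, or as standoff if inline is taken."""
    if not any(name.startswith("its-tan-") for name in target.attrib):
        # No inline annotation yet: just create it. The tool is not recorded
        # here; it is assumed to be declared elsewhere in the document.
        target.attrib.update(attrs)
        return "inline"
    if target.get("id") is None:
        target.set("id", fresh_id(root))
    group = root.find(f".//{GROUP}")
    if group is None:
        group = ET.SubElement(root, GROUP)
    ET.SubElement(group, ITEM,
                  {"ref": target.get("id"), "annotatorsRef": tool, **attrs})
    return "standoff"


root = ET.fromstring("<body><p><span>Dublin</span></p></body>")
span = root.find(".//span")
first = add_annotation(root, span,
                       {"its-tan-type": "term", "its-tan-confidence": "1.0"},
                       "tan|tool-a")
second = add_annotation(root, span,
                        {"its-tan-type": "entity", "its-tan-confidence": "0.7"},
                        "tan|tool-x")
```

The first call writes the term annotation inline; the second call finds the span already annotated, gives it an id and appends an its:textAnalyticsAnnotation element that points back to it.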

Above approach would also influence the consumption of this data category, and of annotatorsRef:

- A consuming tool goes through the document and gathers all textAnalyticsAnnotations elements
- It then goes through the document. For each element node
* check for existing inline markup. If it's available, add it to the list of annotations for that node. Assume that the annotatorsRef in effect for that node (declared on the node itself or further up in the document tree) is responsible for the annotation of that markup.
* check the accumulated standoff textAnalyticsAnnotations elements for occurrences of IDs that match the node. If there is such an ID, add the related annotation to the list for the node, including the additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
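
The consumption steps listed above can be sketched as follows in Python. This is a minimal illustration, not a conformance-level implementation: it assumes well-formed XML input, uses the ITS namespace http://www.w3.org/2005/11/its for the standoff elements, and assumes that the inline tool information is carried by an its-annotators-ref attribute on the node or an ancestor. The function names and the "annotator" key are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ITS_NS = "http://www.w3.org/2005/11/its"
GROUP = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"


def tan_attributes(elem):
    """Return only the its-tan-* attributes of an element."""
    return {k: v for k, v in elem.attrib.items() if k.startswith("its-tan-")}


def gather_standoff(root):
    """Step 1: index all standoff annotations by the ID they point to."""
    by_ref = defaultdict(list)
    for group in root.iter(GROUP):
        group_tool = group.get("annotatorsRef")
        for item in group.iter(ITEM):
            record = tan_attributes(item)
            # An annotatorsRef on the item wins over one on the container.
            record["annotator"] = item.get("annotatorsRef", group_tool)
            by_ref[item.get("ref")].append(record)
    return by_ref


def collect(root):
    """Step 2: walk the element nodes and list the annotations for each."""
    standoff = gather_standoff(root)
    found = []

    def visit(elem, inherited_tool):
        if elem.tag == GROUP:
            return  # standoff containers are not content nodes
        tool = elem.get("its-annotators-ref", inherited_tool)
        annotations = []
        inline = tan_attributes(elem)
        if inline:
            inline["annotator"] = tool
            annotations.append(inline)
        node_id = elem.get("id")
        if node_id is not None:
            annotations.extend(standoff.get(node_id, []))
        if annotations:
            found.append((elem, annotations))
        for child in elem:
            visit(child, tool)

    visit(root, None)
    return found


SAMPLE = """<body xmlns:its="http://www.w3.org/2005/11/its"
      its-annotators-ref="tan|tool-a">
  <p><span its-tan-type="term" its-tan-confidence="1.0" id="a8">Dublin</span></p>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</body>"""

found = collect(ET.fromstring(SAMPLE))
```

A consumer that only wants the output of one tool can then filter each node's list on the "annotator" value, e.g. keep only the records where it equals "tan|tool-x", and ignore the other layers.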




Mārcis:
Do I understand you correctly that we may end up having contradicting annotations also, for instance term=”yes” and term=”no”? This would require a data consumer to be able to handle a lot of ambiguity in the data.


Sure. But they could identify the ambiguity with a multilayer annotation that clearly identifies the tool used, via annotatorsRef.
Currently, what would you do with this
<span its-term="yes"><span its-term="no">screwdriver</span></span>
how would you resolve the ambiguity here? "Terminology" has no inheritance. This makes sense, otherwise in the following
<span its-term="yes"><span class="em">screw</span>driver</span>
the embedded "span" element would constitute a term. But that leads to this test suite output for
<span its-term="yes"><span its-term="no">screwdriver</span></span>
/span[1] term="yes"
/span[1]/span[1] term="no"
and both "span" nodes contain the same string "screwdriver". So how do you resolve the ambiguity here?


Mārcis: I do not see the issue in the above example. As you said, Terminology does not inherit, therefore, the only thing that is stated is that the “screwdriver” is not a term.
Mārcis: However, there is one thing I have not understood so far: is there a limit on how many annotations can be made by the same producer (human or machine)? Even annotatorsRef, in my understanding, does not always resolve contradictions. Or is there a precedence rule if there are equal but contradicting stand-off annotations, for instance (I made this up to simplify the understanding):
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" annotatorsRef="tan|annotator-1"/>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Person" its-tan-confidence="0.4" annotatorsRef="tan|annotator-1"/>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Organisation" its-tan-confidence="0.4" annotatorsRef="tan|annotator-1"/>
</its:textAnalyticsAnnotations>
Mārcis: Here “Dublin” can be all three (Place, Person, Organisation) simultaneously, right?
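
To make the question concrete, a consumer could at least detect such cases before deciding what to do with them. The following Python sketch is only an illustration: the flat dictionaries are a made-up, already-parsed form of the three annotations above, the function name is hypothetical, and no precedence rule is implied, since none has been defined in this thread.

```python
from collections import defaultdict

# Hypothetical parsed form of the three standoff annotations shown above.
parsed = [
    {"ref": "a8", "annotatorsRef": "tan|annotator-1",
     "class": "http://nerd.eurecom.fr/ontology#Place", "confidence": 0.7},
    {"ref": "a8", "annotatorsRef": "tan|annotator-1",
     "class": "http://nerd.eurecom.fr/ontology#Person", "confidence": 0.4},
    {"ref": "a8", "annotatorsRef": "tan|annotator-1",
     "class": "http://nerd.eurecom.fr/ontology#Organisation", "confidence": 0.4},
]


def find_conflicts(items):
    """Report (ref, annotatorsRef) pairs that carry more than one class."""
    classes = defaultdict(set)
    for item in items:
        classes[(item["ref"], item["annotatorsRef"])].add(item["class"])
    # Keep only the pairs where the same annotator gave differing classes.
    return {key: sorted(found) for key, found in classes.items() if len(found) > 1}


conflicts = find_conflicts(parsed)
for (ref, tool), found in conflicts.items():
    print(ref, tool, found)
```

Here annotatorsRef does not help to separate the three annotations, because all of them come from the same annotator; the consumer would have to fall back on something else, such as the confidence values.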

In summary, this standoff tries to solve several issues:

- avoid the 16+ inline attribute monster data category
Mārcis:
Again, I did not understand why this is worse than having a heavy “stand-off” mechanism.

- allow for multiple annotations of the same span, with different tools
Mārcis:
In Prague, Tadej and I discussed whether there is a use case for two tools producing contradicting mark-up. We came to the conclusion that neither of us would produce such data and that, if such a scenario exists, the content producer should fuse (disambiguate) the outputs of the two separate tools before applying ITS 2.0 metadata. I am talking about annotations of the same type (for instance, two term annotation tools on the same span), not two separate types.

Then my question: does such a scenario exist? Who is implementing it?

If both you and Tadej would agree on one data category: everybody who wants to use both your tools would implement it. And this has the value that people could compare the outcome of the tools.

Mārcis: So you would ask the consumers to disambiguate or choose (in this way they would not use both if both would produce Terminology), right? If yes, it is totally fine, I just want to make sure I understand your idea.



- avoid the ITS1/2 or general inline annotation issues with inheritance and overriding - as with the standoff approach exemplified at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
annotation information is just accumulated for a given base item (in our case, element nodes with an ID).
Mārcis:
So ... at the end, with this method we would allow:
1) Hierarchical annotation
2) Contradicting annotation
3) (possibly also) overlapping annotation


Correct.




I'm not yet asking for this change, but I see it as a way forward that could make the life of both annotation producers (Marcis and Tadej) and consumers (Yves et al.) simpler. So I'm eager to hear thoughts on this :)
Mārcis:
As I understand the proposal, it is the complete opposite of simple (or of simplifying things as they are right now, with Terminology and Disambiguation kept separate). It complicates things significantly from the Terminology standpoint: I no longer see where term=”yes” fits in, and we have to deal with contradicting annotation (whether to allow or prohibit it is now a question for the consumers; as a consumer I would ask to prohibit it, as I do not see a use case for term=”yes” and term=”no” at the same time). What is more, we would have to re-implement the parsers so that, instead of overriding and inheritance, they work with accumulated information, which is a complete revision of the parser logic for the Terminology data category.

I understand the burden on implementation you emphasize - but it seems that one scenario - annotation using different tools even for the terminology data category, see the nested "terminology" annotations above - is not resolved by your proposal. You say this would not be implemented before ITS2 annotation - but if the tool providers are not from the same organization?

Mārcis: Our proposal did not allow nested annotations. Nor does the current ITS 2.0 version. Also, and this was my question: is there a necessity to produce 2 Terminology annotations or 2 named entity annotations on top of each other? I see that you are saying: yes, there is.


Best,

Felix






Thoughts?

- Felix
Received on Tuesday, 29 January 2013 18:26:44 UTC