Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 29 Jan 2013 18:24:24 +0100
Message-ID: <510805C8.5020103@w3.org>
To: public-multilingualweb-lt@w3.org

Hi Mārcis, all,


On 29.01.13 15:48, Mārcis Pinnis wrote:
>
> Hi Felix,
>
> My comments are inline.
>
> Best regards,
>
> Mārcis ;o)
>
> *From:*Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Tuesday, January 29, 2013 11:27 AM
> *To:* Mārcis Pinnis
> *Cc:* public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
> *Subject:* Re: issue-68 from an annotation representation point of 
> view, with potential implications for annotatorsRef and standoff markup
>
> Hi Mārcis, all,
>
> even if this discussion has now continued in a different thread, let
> me give further feedback here too - it may help to clarify things,
> and to continue the discussion in general.
>
>
> On 28.01.13 11:18, Mārcis Pinnis wrote:
>
>     Hi Felix, all,
>
>     I see that there have been a lot of opinion exchanges on the
>     proposal brought up by Felix.
>
>     I have some comments to add. I am now speaking as a data producer
>     and later maybe also a data consumer (and I am not speaking as a
>     linguist! ... that has to be understood as well).
>
>     First of all, I would like to ask whether we agreed that ITS 2.0
>     should be able to represent data in the structure as TEI, NIF,
>     XCES or other NLP related standards do – that is, as far as I
>     understand, the direction where this discussion is heading. Should
>     ITS 2.0 try to re-invent these data standards? I would incline to
>     saying – no!
>
>
>
> As far as I understand, these standards are not yet implemented in
> localization tool chains. However, the "multilayer annotation" 
> proposal brought one feature from these standards into such tool 
> chains: the standoff mechanism. I'd rather see this as a value than a 
> problem: bringing NLP friendly representations into localization 
> workflows. Would you disagree?
>
> Mārcis: From the perspective of adding all kinds of annotation,
> overlapping, contradicting, hierarchical, it certainly is beneficial
> (I do agree in this aspect).
>
> Mārcis: From the perspective of implementation:
>
> Mārcis: 1) for consumers you suggest reading only known mark-up. I do
> agree that if we would care only about one tool then we could ignore 
> the rest. But this asks consumers to know who produced the mark-up. By 
> having a flat level flag the consumers did not have to worry about who 
> produced the annotation (also – human and machine users could apply 
> annotation and have an effect on the data); they just read the 
> annotation and used it as is. This is not possible in the stand-off 
> mechanism – the consumer has to know which producer to trust in order 
> to consume the data; otherwise the consumer has to have a 
> disambiguation module at hand that tries to find some reason in all 
> the annotations.
>
> Mārcis: 2) for producers the stand-off mark-up requires adaptation
> (more than just adding attributes, but still adaptation), which
> probably is not a big issue.
>
> Mārcis: We (Tilde) are doing both right now – we consume and we
> produce Terminology. But ... we could switch to just producing and not 
> consuming (which is the part that worries me more...). So we would not 
> have to deal with the disambiguation of the stand-off mark-up and also 
> which annotator to trust or not.
>
>     Secondly, as we are in a last call phase, I understand that such
>     significant change to the ITS 2.0 data categories would rewrite
>     them (and maybe it will get clearer when you read my comments till
>     the end). I as a data producer now will have to rewrite my parsers
>     and data producing systems just to accommodate the „stand-off”
>     mechanisms, which is, from a content provider's and content consumer's
>     perspective, a diametric change from just adding additional
>     independent attributes or changing the names of attributes (which
>     was actually the initial proposal by Tadej and me). I would like
>     for others to understand that this solution asks for
>     re-development rather than simple adjustments.
>
>
> I agree - this would be quite some work, and we need to justify the 
> benefit clearly.
>
> Mārcis: The change will affect our Showcase the most as right now we
> rely on inline mark-up. If we won’t have the inline mark-up at the end 
> (or we will have additional stand-off mark-up) then we will have to 
> re-think the Showcase design and the visualisation possibilities in 
> the Showcase.
>


Is this because of the CSS used for visualization? I agree that with
standoff markup visualization gets more complicated - but not that
much. See the JavaScript bit used for the localization quality issue here
http://www.w3.org/TR/2012/WD-its20-20121206/#EX-locQualityIssue-html5-local-2
http://www.w3.org/TR/2012/WD-its20-20121206/examples/html5/EX-locQualityIssue-html5-local-2.html

The resolution of the ID is only a few lines of javascript code.
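
To illustrate how small that resolution step is, here is a minimal sketch of the same idea in Python rather than JavaScript. The document, the un-namespaced "annotations" / "annotation" element names and the attribute names are made up for this illustration only; they are not taken from the ITS 2.0 draft or from the linked example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one inline span carrying an id, plus one standoff
# record pointing at it. All element and attribute names are illustrative.
DOC = """<doc>
  <p>Welcome to <span id="a8">Dublin</span>.</p>
  <annotations>
    <annotation ref="a8" type="entity" confidence="0.7"/>
  </annotations>
</doc>"""


def resolve_standoff(root):
    """Pair each standoff record with the text of the element it points at."""
    # Index every element that carries an id attribute.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    resolved = []
    for ann in root.iter("annotation"):
        target = by_id.get(ann.get("ref"))
        if target is not None:  # silently skip dangling references
            resolved.append((target.text, dict(ann.attrib)))
    return resolved


root = ET.fromstring(DOC)
result = resolve_standoff(root)
print(result)
```

The lookup table is built once, so each standoff record is resolved with a single dictionary access; a visualization layer would then only have to style the resolved elements.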

>
>     Other comments are inline below...
>
>     After reading the comments here is a summary:
>
>     In my understanding the proposal complicates data production and
>     consumption significantly as it creates possibilities for a lot of
>     ambiguity, which I guess is the opposite of what initially was
>     meant by the disambiguation data category(!) and at least in our
>     Use Case it requires revision of parser logics and ITS 2.0
>     metadata annotation logics.
>
>
> The proposal basically says: here is a way to represent ambiguity, 
> created by several tools annotating the same document. However, I'd 
> see this as a value, not a problem: with separate 
> "its:textAnalyticsAnnotations" elements, including each its own 
> annotatorsRef, you can clearly identify which tool created what 
> annotation. This may be even clearer than the current annotatorsRef.
>
> Mārcis: I do agree, however see above for my comment related to
> consumers. For them the consumption is different – they will have to 
> know whom to trust. If you think that consumers have to know who 
> produced the data all the time then it is fine by me... (but it is a 
> change from the Terminology as it is right now).
>

I got your point about trust. And - trying to bring the discussion back
- the initial comment was not about trust or no trust. It was rather
about unifying terminology and disambiguation - and by reflecting the
levels mentioned in disambiguation in different annotation levels, I tried
to find a work-around for this. But that work-around and multilayer
annotation are not the main topic.

So, if we drop disambiguation granularity, keep the term yes / no 
requirement, we may have this representation as a unified approach for 
both data categories:

<span its-tan-confidence="0.7"
its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
its-term="no">Dublin</span>

If we define that its-term="yes" triggers its-tan-ident-ref to be 
interpreted as a reference to a termDB, we would have unified the data 
categories. Well, I guess I'm trying to use an axe for moving this 
forward ... but let's see what you think.
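
A rough sketch of how a consumer might apply that rule, in Python. The function name and the two labels "term-db" and "entity" are invented here for illustration; the only thing taken from the proposal above is that its-term="yes" switches the interpretation of its-tan-ident-ref. Treating a missing its-term like "no" is an assumption of this sketch.

```python
def interpret_ident_ref(attrs):
    """Classify its-tan-ident-ref depending on the its-term flag.

    Assumption of this sketch: "yes" means the reference points into a
    term database; any other value (or a missing flag) means it is a
    disambiguation-style identity reference.
    """
    ref = attrs.get("its-tan-ident-ref")
    if ref is None:
        return None  # nothing to interpret
    kind = "term-db" if attrs.get("its-term") == "yes" else "entity"
    return (kind, ref)


# The attributes of the "Dublin" span shown above.
dublin = {
    "its-tan-confidence": "0.7",
    "its-tan-class-ref": "http://nerd.eurecom.fr/ontology#Place",
    "its-tan-ident-ref": "http://dbpedia.org/resource/Dublin",
    "its-term": "no",
}
print(interpret_ident_ref(dublin))
```

With its-term="no" the DBpedia URI stays an entity reference; flipping the flag to "yes" would make the same attribute a term database reference.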

Best,

Felix

>
>
>     However, I will have a discussion with my colleagues in order to
>     estimate how much changes would be required to our use case from a
>     development perspective.
>
>     I also understand that this proposal wants to fuse all types of
>     possible NLP-related text analyses together, but I did not have
>     the feeling that ITS 2.0 should be used as a TEI, XCES, NIF, etc.
>     clone? This is how I see where the changes will lead us.
>
>     However, I also do not say that that is a bad thing... we would
>     definitely make linguists happier, but I as a content
>     provider and later also a consumer would have difficulties working
>     with the data as I would have to agree accepting
>     uncertainty/ambiguity in the ITS 2.0 metadata by default (except
>     external resources as those are defined between
>     consumers/producers and not ITS 2.0).
>
>     Best regards,
>
>     Mārcis ;o)
>
>     *From:*Felix Sasaki [mailto:fsasaki@w3.org]
>     *Sent:* Sunday, January 27, 2013 9:25 AM
>     *To:* public-multilingualweb-lt@w3.org
>     <mailto:public-multilingualweb-lt@w3.org>
>     *Subject:* issue-68 from an annotation representation point of
>     view, with potential implications for annotatorsRef and standoff
>     markup
>
>     Hi all,
>
>     sorry, this is going to be long ... but please have a look, esp.
>     the implementers (both consumers and producers) of terminology and
>     disambiguation.
>
>     in the last 10 1/2 months, since Tadej's presentation at the
>     Dublin workshop, we had a lot of discussions on disambiguation,
>     and sometimes (as now) including terminology. But it seems that we
>     never discussed whether the ITS2 approach of selection (global, local,
>     inheritance, overriding (partial or not)...) is suitable for this
>     type of information.
>
>     By "this type" I mean annotation of linguistic information. Most
>     ITS2 and ITS1 data categories are process related (e.g. "Don't
>     translate this ..."), but both terminology and what's now called
>     disambiguation are information that you find in linguistic corpora
>     and processing tools. Now, my point is that in both in such
>     natural language processing tool chains and related corpora, a
>     representation of information *inline per document node* is rather
>     the exception. Mostly you have *standoff information*, that is a
>     complete seperation of information from actual content - as in NIF.
>
>     Mārcis:
>
>     Parsing and understanding of the mark-up is the main difference
>     (how overriding and inheritance work) that requires this
>     „stand-off” mechanism for „this type” of annotation. If there
>     would be only flat level annotation, we would not have this
>     discussion at all. Also, “stand-off” is only good if you really
>     have to add a lot of complex data, but here we have to add just a
>     flag or a reference (if put in simple words). In Prague me and
>     Tadej discussed that if hierarchical information is needed, that
>     should be encoded in the external resources.
>
>     If I understand correctly, stand-off mark-up has no inheritance
>     and it has no overriding – it describes a span?
>
>
> Correct.
>
>
>     If so, I assume that with your proposal we are back at requiring
>     hierarchical annotation, overlapping annotation and contradictive
>     annotation, which will allow all kinds of text analysis
>     annotations (without restricted types – term, entity, ontology,
>     lexical, etc.). This will require data consumers to re-think their
>     data consumption strategies as they will have to disambiguate the
>     “disambiguation-style” annotations (which means that at the end we
>     do not help data consumers, but make the life rather more difficult).
>
>
> As said above: if a consumer doesn't want to deal with several layers 
> of annotations, it can just say: I want to consume the annotations 
> made by Tilde or by JSI. This is guaranteed by the annotatorsRef 
> attribute.
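
A minimal sketch of such a "consume only what I trust" filter, in Python. The dictionaries and the tool names tool-x and tool-y are hypothetical; the sketch only assumes that annotatorsRef values look like "datacategory|tool", as in the examples of this thread.

```python
def select_trusted(annotations, trusted):
    """Keep only annotations whose annotatorsRef tool is in `trusted`."""
    kept = []
    for ann in annotations:
        # Split "datacategory|tool" and keep the tool part.
        _, _, tool = ann.get("annotatorsRef", "").partition("|")
        if tool in trusted:
            kept.append(ann)
    return kept


# Two hypothetical annotations of the same span by different tools.
annotations = [
    {"type": "entity", "confidence": 0.7, "annotatorsRef": "tan|tool-x"},
    {"type": "entity", "confidence": 0.4, "annotatorsRef": "tan|tool-y"},
]
print(select_trusted(annotations, {"tool-x"}))
```

The filter itself is trivial; the open question raised in this thread is who decides what goes into the trusted set.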
>
>
> Mārcis: Again, see my comment above about the consumer having to know
> whom to trust.
>
>
> The current status quo creates this situation: if Tilde already has
> annotated a text, and JSI wants to add annotations, and you want to 
> compare them: how to do this? You can say "one creates terminology 
> markup, the other disambiguation markup". But what about even more tools?
>
> Mārcis: I agree, currently you can have only one Terminology
> annotation tool (the disambiguation is not a nice example), but in the 
> current version we acknowledge that there can be only one Terminology 
> annotator for a single phrase (I am fine with that). I understand that 
> the stand-off is a possible way how to solve this issue, but for the 
> previous consumer this will create non-comparable annotations (or he 
> will have to update to consuming the ambiguous annotations or just 
> trust one of the annotations). For future consumers this might as well 
> be acceptable.
>
>     In the current ITS 2.0 draft the annotation is flat - it is simple
>     to parse, simple to consume, simple to produce – it is not
>     hierarchical and it does not overlap.
>
>
> See above - if a consumer does not want to consume relations between 
> annotation tools or levels, you don't have to, and annotatorsRef gives 
> you the ability to differentiate the annotations.
>
> Btw., current disambiguation and terminology also don't inherit, see 
> the table at
> http://www.w3.org/TR/2012/WD-its20-20121206/#datacategories-defaults-etc
> that is: the annotations of both data categories don't inherit to 
> nested markup. So we could resolve the issue also via something like this:
> Input before annotation: Dublin
> First annotation: <span its-term="yes">Dublin</span>
> Second annotation: <span its-term="yes"><span 
> its-term="no">Dublin</span></span>
>
> Mārcis: As I understand this disambiguates Terminology mark-up from
> different producers automatically? This is possible in the current
> version.
>
> The non-existent inheritance, of course, requires the marked span not
> to contain other mark-up. This is, of course, a limitation of the
> current version and, I agree, might be solved by the stand-off mechanism.
>
>
> But that has the annotatorsRef issue if several "term annotation" 
> tools have been used.
>
> Mārcis: In the current version we as a consumer treat all Terminology
> annotators as equally important, thus the annotatorsRef for us is not 
> necessary. However, for data tracking purposes this might be important 
> and with the stand-off mechanism the annotatorsRef becomes mandatory 
> (at least in our consumption scenario).
>
>     From this perspective, the proposed change is a complete overhaul
>     of the 2 data categories in something different.
>
>     Also – we do require the flag. That is something that will be
>     heavily complicated with the “stand-off” mechanism (that has to be
>     understood), or won’t be possible at all?!
>
>
> Setting the type would give you the flag. I know that in the proposal at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0212.html
> Tadej dropped the flag. But we could have, instead of fixed values,
> e.g. "term", a URI. You could then interpret that URI as a term flag,
> e.g.
> <span tan-type="http://example.com/term">
>
> Mārcis: I agree that it gives term=“yes”, but not term=”no”.
>
>     Having a simple attribute inline is the simplest you can achieve.
>     Having a “stand-off” on the other hand is the most complex you can
>     achieve.
>
>     And ... if I remember correctly, we did not want to make life
>     difficult for producers/consumers if they did not care about the
>     other data categories?
>
>
>
> Correct, but here we have the situation that two data categories might 
> be just too similar for keeping everything as is.
>
>
>
>
>     Why is that? In linguistic annotation it is common that you have
>     several layers of information, like our lexical, ontological etc.
>     information. Some of these might be complex in itself (e.g. named
>     entities), some of these might be related to others (e.g. an
>     ontological concept related to a lexical item). I won't try to
>     define these layers here - but my point is that due to the
>     complexity of representing such information inline, nearly nobody
>     is trying to represent several layers at the same time inline. The
>     common approach is rather to have a base layer, and then pointers
>     from the various annotation layers.
>
>     In a sense you can describe NIF as an approach of taking character
>     offsets as the implicit base layer (implicit because characters
>     don't need explicit anchors). The TEI here
>     http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
>     provides an example for an offset using words as the base unit,
>     with explicit xml:id attributes.
>
>     So far we haven't taken this approach for terminology or
>     disambiguation. This is why we had to come up with 16+ attributes:
>     if you want to do everything "inline", you need to differentiate
>     attribute names and come up with a monster data category. Inline
>     annotations are just not suitable for such information.
>
>     Mārcis:
>
>     I disagree that 16+ attributes are the difficulty here. The
>     difficulty from the beginning were the questions: 1) how many
>     types of annotation should be supported (we narrowed the list down
>     to 4 – terminology, named entities, ontology concepts, lexical
>     concepts)? 2) should overlapping be supported? 3) should
>     hierarchical annotation be supported? 4) should contradicting
>     annotation be supported?
>
>
> about 1): no type at all would be one solution, but the term identifier
> issue would come up. about 2): if a consumer just takes up one 
> annotation, e.g. the output of Tilde's tool, there is no need to 
> process overlap. And we can leave that to consumers IMO. 3): same like 
> 2). 4) Same like 2).
>
> Mārcis: 1) I agree that the type itself is important. I know that
> Tadej said that a Ref URI might have the Type embedded, but for
> Terminology we do not always have the URI available.
>
> Mārcis: 2-4) I agree, but only if you ask the consumers to know which
> producer to trust. If that is not an issue, then it is fine by me (it 
> is a compromise as we lose the ability to not have to trust anyone at all)
>
>     Also ... data producers would have to worry just about a maximum
>     of 5 attributes simultaneously and they would be able to ignore
>     the rest. For instance, I have no use for the attributes for
>     disambiguation categories.
>
>
> I think that's the heart of issue-68: there are two quite similar 
> pieces of information, but consumers separate them.
>
>
>
>     Although I would agree writing a parser that parses all these
>     attributes (just for compliancy with the data category), I would
>     as a consumer consume only the ones related to terminology and I
>     as a producer would produce only those related to terminology. I
>     would neither consume nor produce the disambiguation related attributes.
>
>
> That wouldn't work if we have one data category: our conformance 
> requirements say: you implement it global or local or both. You then 
> can also decide whether you implement it in HTML or XML or both. But 
> you cannot cherry pick attributes for consumption. We don't say 
> anything wrt production - but our schema helps us to verify that the 
> "right data" has been produced.
>
> Mārcis: I think you are misunderstanding – parsing content for
> consumption and production is a totally different architectural level 
> than the logics that makes any use of the content. So ... in my 
> understanding we are not failing on conformance. We are if there is a 
> requirement that we have to really consume and really produce the 
> other types of data in the application logics layers. Is this the case?
>
>
>
>     From that perspective, I disagree to the complexity in the
>     attribute scenario.
>
>
> I think part of the disagreement comes from the "free spirit" you have 
> as data producer and consumer, see above.
>
>
>
>     For terminology I require a flagging mechanism (with the
>     possibility to add either a reference, a confidence score, or both).
>
>     I do agree that we are limiting the annotation with having
>     separate attributes, but then again ... ITS 2.0 does not have to
>     represent every possible text analysis annotation type. It is
>     supposed to aid in localisation processes and not all text
>     analysis types have a valid use case (or a necessary or even a
>     potentially useful use case) in localisation.
>
>     Also ... if we are re-inventing terminology and disambiguation,
>     maybe we should analyse which other data categories fall under the
>     type “text analysis”? Domain is a suitable candidate as well (and
>     if we create a suitable text analysis category, maybe domain
>     analysis can be subcategorized under that as well in order to
>     support automated domain analysis solutions (EuroVoc has an
>     automated domain classifier, for instance))?).
>
>
>
> Here I would disagree: our domain data category is just for 
> transporting domain information between content and tools, including a 
> potential mapping of domain identifiers inbetween. The "terminology vs 
> disambiguation" discussion came from the observations that two data 
> categories in ITS2 have a huge overlap. I don't see that situation for 
> domain.
>
> Mārcis: I do not agree that domain identification is in structure
> different than terminology annotation or named entity recognition or 
> even sentence breaking, but fair enough ... I brought this up to show 
> that in general domain annotation is also annotation... an equal in 
> structure task as term tagging or named entity recognition (usually 
> just in bigger spans – but not necessarily).
>
>     With this I would like to emphasize that overgeneralization is not
>     the best approach as we are creating data categories for different
>     consumption scenarios.
>
>
> But are they so different? It sounds to me rather that in your 
> scenario, many opportunities are lost because you don't consume 
> disambiguation at all ... so having one umbrella data category might
> even give you more data consumption opportunities.
>
> Mārcis: :) ... of course, the more annotation, the more possibilities
> (this is a philosophical truth – I cannot argue). But for producing 
> Terminology in our Use Case we do not require knowledge on the purely 
> semantic lexical, ontological or entity level. Our Use Case uses 
> knowledge from the tools that are applied in the process and they do 
> not ask for information supplied by those data categories (we do 
> require domain, language and others though ... that we, of course, 
> use). The use of Disambiguation data categories would require 
> re-thinking of the modules that do not deal with ITS 2.0 explicitly – 
> the term extraction, term weighing, term retrieval methods which are 
> out-of-scope in this project.
>
>
>
>     So, the first idea behind below approach is: if you want to
>     represent just one linguistic layer (or "qualifier" in Christian's
>     mail at
>     http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
>     ) , you use "tan-type" attribute to differentiate annotations.
>     That leads to the following inline models:
>
>     1) A term has its-tan-type with value "term" and optional
>     its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
>     <span its-tan-type="term"
>     its-tan-ident-ref="http://termdatabase.example.com/entry37"
>     its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>     its-tan-confidence="1.0">Dublin</span>
>     Comparison to current ITS1 "Terminology":
>     its-tan-type="term" plays the role of term="yes". its-tan-info-ref
>     plays the role of termInfoRef. its-tan-ident-ref links to a term
>     data base. its-tan-confidence provide confidence information.
>     (Esp. at Marcis: I know that "Dublin" is a bad candidate for a
>     term, I'm just trying to exemplify the annotation approach here)
>
>
>     Mārcis:
>
>     Also one thing I tried to emphasize at lunchtime in Prague,
>     TermInfoRef is not necessarily an identity reference. It does not
>     always point to something unique (if we understand that a set is
>     not unique). You can have multiple term entries from multiple user
>     collections in a term bank relating to one semantic term. In the
>     case if you do not specify a domain you could end up having a
>     reference that points to totally different (also contrasting)
>     terms or if you do not specify a target language you may end up
>     having multiple entries because most of the collections are
>     bilingual and not multilingual. Why is that so? It is because a
>     term-bank is not a disambiguator – it acts like a search engine
>     (more or less) – the disambiguation for the “external” information
>     (the meaning; the term unithood is defined by the flag term=”yes”
>     itself) has to be done by the consumers (translation engines or
>     human translators). In most cases (as in the biggest term-banks –
>     IATE, ETB) it does not have a hierarchical understanding of terms
>     as some lexical (WordNet, f.i.) or ontological resources may have.
>     For MT engines a valuable information is already – term=“yes” as
>     that defines the term unithood, which means that the term should
>     be translated as a non-breakable phrase. So ... the MT engine
>     could ignore the TermInfoRef at all if it does not have a suitable
>     disambiguation module and maybe leave the disambiguation to human
>     post-editors...
>
>     So ... “ident” is misleading (at least in the case of Terminology
>     annotation)!
>
>     Also important: HOW WOULD YOU REPRESENT term=”no”? This is a very
>     important feature of the flag type annotation.
>
>     would you say: its-tan-type="not-a-term"? That would require data
>     producers to handle higher complexity annotation!
>
>
>
> I don't have a clear answer to above questions - others, feel free to 
> chime in if you do.
>
> Mārcis: This is important to understand. Will this be dropped at all
> or will there be an alternative mechanism?
>
>     2) An entity has its-tan-type with value "entity" and optional
>     its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
>     <span its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.7">Dublin</span>
>
>     So above is only different naming compared to current
>     "Terminology" and Disambiguation. Below is now the standoff
>     approach. The processing expectation for tools *producing the
>     annotation* is like this:
>     - If there is no inline annotation, just create it (e.g. 1) or 2))
>     - If there is inline annotation, check if there is an id attribute
>     (in HTML) or xml:id (if XML serialization of HTML is used and
>     with lower precedence compared to id). For formats other than
>     HTML, add xml:id if possible or use the id attribute appropriate
>     for that format.
>
>     Then, for creating standoff annotations, add an
>     "its:textAnalyticsAnnotations" element to the document, e.g. in
>     HTML "script" if needed.
>
>     Let's assume before annotation we have
>     <span its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.7">Dublin</span>
>     Then after annotation we would have
>     <span its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.7"
>     *id="a8"*>Dublin</span>
>     and this:
>     <its:textAnalyticsAnnotations>
>     <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term"
>     its-tan-ident-ref="http://termdatabase.example.com/entry37"
>     its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>     its-tan-confidence="1.0"/>
>     </its:textAnalyticsAnnotations>
>
>
>     Let's now assume that before annotation we have
>     <span its-tan-type="term"
>     its-tan-ident-ref="http://termdatabase.example.com/entry37"
>     its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>     its-tan-confidence="1.0">Dublin</span>
>     Then after annotation we would have
>     <span its-tan-type="term"
>     its-tan-ident-ref="http://termdatabase.example.com/entry37"
>     its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>     its-tan-confidence="1.0" *id="a8"*>Dublin</span>
>     and this:
>     <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
>     <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.7"/>
>     </its:textAnalyticsAnnotations>
>
>     Now, if several "entity" annotation tools have been used, we could
>     also have
>     <its:textAnalyticsAnnotations>
>     <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.7"
>     annotatorsRef="tan|tool-x"/>
>     <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.4"
>     annotatorsRef="tan|tool-y"/>
>     </its:textAnalyticsAnnotations>
>
>     Above approach would also influence the consumption of this data
>     category, and of annotatorsRef:
>
>     - A consuming tool goes through the document and gathers all
>     textAnalyticsAnnotations elements
>     - It then goes through the document. For each element node
>     * check for existing inline markup. If it's available, add it to
>     the list of annotations for that node. Assume the inline version
>     up in the document tree of annotatorsRef to be responsible for the
>     annotation of that markup.
>     * check the accumulated standoff textAnalyticsAnnotations elements
>     for occurrences of IDs that match the node. If there is such an
>     ID, add the related annotation to the list for the node, including
>     the additional annotatorsRef tool, e.g. tool-x or tool-y in the
>     above case.
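
The two consumption steps above can be sketched roughly as follows, in Python. This is only an illustration under simplifying assumptions: the element names are used without the its: namespace prefix, the sample document and the tool names tool-inline, tool-x and tool-y are invented, and only the nearest annotatorsRef up the tree is considered for inline markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one inline annotation plus two standoff records.
DOC = """<doc annotatorsRef="tan|tool-inline">
  <span id="a8" its-tan-type="term" its-tan-confidence="1.0">Dublin</span>
  <textAnalyticsAnnotations>
    <textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </textAnalyticsAnnotations>
</doc>"""


def collect_annotations(root):
    # Step 1: gather all standoff records, grouped by the id they reference.
    standoff = {}
    for ann in root.iter("textAnalyticsAnnotation"):
        if ann.get("ref"):
            standoff.setdefault(ann.get("ref"), []).append(dict(ann.attrib))

    # Step 2: walk the content nodes, tracking the inherited annotatorsRef.
    collected = []

    def visit(el, inherited_ref):
        if el.tag.startswith("textAnalyticsAnnotation"):
            return  # standoff containers are not content nodes
        ref = el.get("annotatorsRef", inherited_ref)
        found = []
        if el.get("its-tan-type"):  # inline markup on this node
            inline = {k: v for k, v in el.attrib.items()
                      if k.startswith("its-tan-")}
            inline["annotatorsRef"] = ref
            found.append(inline)
        node_id = el.get("id")
        if node_id in standoff:  # standoff records pointing at this node
            found.extend(standoff[node_id])
        if found:
            collected.append((el.text, found))
        for child in el:
            visit(child, ref)

    visit(root, None)
    return collected


result = collect_annotations(ET.fromstring(DOC))
for text, anns in result:
    print(text, [a["annotatorsRef"] for a in anns])
```

The consumer ends up with one list per node in which every entry names its annotator, so any choice between contradicting entries is left to the consumer.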
>
>
>     Mārcis:
>
>     Do I understand you correctly that we may end up having
>     contradicting annotations also, for instance term=”yes” and
>     term=”no”? This would require a data consumer to be able to handle
>     a lot of ambiguity in the data.
>
>
> Sure. But they could identify the ambiguity with a multilayer 
> annotation that clearly identifies the tool used, via annotatorsRef.
> Currently, what would you do with this
> <span its-term="yes"><span its-term="no">screwdriver</span></span>
> how would you resolve the ambiguity here? "Terminology" has no 
> inheritance. This makes sense, otherwise in the following
> <span its-term="yes"><span class="em">screw</span>driver</span>
> the embedded "span" element would constitute a span. But that leads to 
> this test suite output for
> <span its-term="yes"><span its-term="no">screwdriver</span></span>
> /span[1] term="yes"
> /span[1]/span[1] term="no"
> and both "span" nodes contain the same string "screwdriver". So how do 
> you resolve the ambiguity here?
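
The per-node result described above can be reproduced with a short Python sketch. It assumes only what the message states, namely that Terminology does not inherit, plus the ITS default of term="no"; the `term_values` helper and its path format are made up for illustration and are not the test suite's actual code.

```python
import xml.etree.ElementTree as ET


def term_values(xml_text):
    """List (path, term, text) for every element, without inheritance."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(node, path):
        # Only the element's own its-term attribute counts; default is "no".
        term = node.get("its-term", "no")
        rows.append((path, term, "".join(node.itertext())))
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, "%s/%s[%d]" % (path, child.tag, counts[child.tag]))

    walk(root, "/%s[1]" % root.tag)
    return rows


if __name__ == "__main__":
    nested = '<span its-term="yes"><span its-term="no">screwdriver</span></span>'
    for path, term, text in term_values(nested):
        print(path, 'term="%s"' % term, text)
```

For the nested example, both rows carry the same text "screwdriver" but different term values, which is the ambiguity under discussion.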
>
> Mârcis: I do not see the issue in the above example. As you said, 
> Terminology does not inherit, therefore, the only thing that is stated 
> is that the “screwdriver” is not a term.
>
> Mârcis: However, one thing I have not understood so far – is there a 
> limitation of how many annotations can be done by the same producer 
> (human or machine). Even the annotatorsRef in my understanding does 
> not always resolve contradictions. Or ... is there a precedence rule 
> if there are equal, but contradicting stand-off annotations, for 
> instance (I made this up to simplify understanding):
>
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-confidence="0.7"
> annotatorsRef="tan|annotator-1"/>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Person"
> its-tan-confidence="0.4"
> annotatorsRef="tan|annotator-1"/>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Organisation"
> its-tan-confidence="0.4"
> annotatorsRef="tan|annotator-1"/>
> </its:textAnalyticsAnnotations>
> Mârcis: Here “Dublin” can be all three (Place, Person, Organisation) 
> simultaneously, right?
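
The thread does not settle a precedence rule for this case, so the Python sketch below only surfaces the ambiguity: it groups standoff annotations by target and annotator and reports groups with more than one class. The dictionary layout and the `conflicting_classes` helper are assumptions for illustration; ordering by confidence is one possible consumer policy, not something the proposal prescribes.

```python
from collections import defaultdict

NERD = "http://nerd.eurecom.fr/ontology#"

# The three annotations from the example above, as plain dictionaries.
ANNOTATIONS = [
    {"ref": "a8", "class_ref": NERD + "Place", "confidence": 0.7,
     "annotator": "tan|annotator-1"},
    {"ref": "a8", "class_ref": NERD + "Person", "confidence": 0.4,
     "annotator": "tan|annotator-1"},
    {"ref": "a8", "class_ref": NERD + "Organisation", "confidence": 0.4,
     "annotator": "tan|annotator-1"},
]


def conflicting_classes(annotations):
    """Map (ref, annotator) to its annotations when their classes disagree."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[(ann["ref"], ann["annotator"])].append(ann)

    conflicts = {}
    for key, group in grouped.items():
        if len({a["class_ref"] for a in group}) > 1:
            # Highest confidence first; ties keep their document order.
            conflicts[key] = sorted(group, key=lambda a: -a["confidence"])
    return conflicts


if __name__ == "__main__":
    for (ref, annotator), group in conflicting_classes(ANNOTATIONS).items():
        print(ref, annotator, [a["class_ref"].split("#")[1] for a in group])
```

A consumer could take the first entry of each group, keep all of them, or reject the document; which of these is appropriate is exactly the open question raised here.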
>
>     In summary, this standoff tries to solve several issues:
>
>     - avoid the 16+ inline attribute monster data category
>
>     Mârcis:
>
>     Again, I did not understand why this is worse than having a heavy
>     “stand-off” mechanism.
>
>
>     - allow for multiple annotations of the same span, with different
>     tools
>     Mârcis:
>
>     In Prague Tadej and I had a discussion whether there is a use case
>     for using two tools producing contradicting mark-up and we came to
>     the conclusion that neither of us would produce such data and if
>     such a scenario exists, then the content producer should fuse
>     (disambiguate) the outputs of the two separate tools prior to ITS
>     2.0 metadata application. I am talking about the same type (for
>     instance, two term annotation tools on the same span) of
>     annotation, not two separate types.
>
>     Then my question: does such a scenario exist? Who is implementing it?
>
>
> If both you and Tadej would agree on one data category: everybody who 
> wants to use both your tools would implement it. And this has the 
> value that people could compare the outcome of the tools.
>
> Mârcis: So you would ask the consumers to disambiguate or choose (in 
> this way they would not use both if both would produce Terminology), 
> right? If yes, it is totally fine, I just want to make sure I 
> understand your idea.
>
>
>     - avoid the ITS1/2 or general inline annotation issues with
>     inheritance and overriding - as with the standoff approach
>     exemplified at
>     http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
>     annotation information is just accumulated for a given base item
>     (in our case, element nodes with an ID).
>     Mârcis:
>
>     So ... at the end, with this method we would allow:
>
>     1) Hierarchical annotation
>
>     2) Contradicting annotation
>
>     3) (possibly also) overlapping annotation
>
>
>
> Correct.
>
>
>
>     I'm not yet asking for this change, but I see it as a way forward
>     that could make the life of both annotation producers (Marcis and
>     Tadej) and consumers (Yves et al.) simpler. So I'm eager to hear
>     thoughts on this :)
>     Mârcis:
>
>     As I understand the proposal – it is the complete opposite from
>     being simple (or simplifying things as they are right now having
>     Terminology and Disambiguation separately), it complicates things
>     significantly from the Terminology standpoint as now I do not see
>     where term=”yes” fits in, we have to deal with contradicting
>     annotation (allow or prohibit it is now a question to the
>     consumers – I as a consumer would ask to prohibit it as I do not
>     see a use case for term=”yes” and term=”no” at the same time), and
>     what is more, we have to re-implement the parsers so that instead
>     of overriding and inheritance they would work with accumulating
>     information (and this is a complete revision of the parser logic
>     for the Terminology data category).
>
>
> I understand the burden on implementation you emphasize - but it seems 
> that one scenario - annotation using different tools even for the 
> terminology data category, see the nested "terminology" annotations 
> above - is not resolved by your proposal. You say this would be 
> resolved before ITS2 annotation - but what if the tool providers are 
> not from the same organization?
>
> Mârcis: Our proposal did not allow nested annotations. Nor does the 
> current ITS 2.0 version. Also – this was my question – is there a 
> necessity to produce 2 Terminology annotations or 2 named entity 
> annotations on top of each other. I see that you are saying – Yes, 
> there is.
>
>
>
> Best,
>
> Felix
>
>
>
>
>
>     Thoughts?
>
>     - Felix
>
Received on Tuesday, 29 January 2013 17:24:53 UTC
