- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 29 Jan 2013 19:53:54 +0100
- To: Mârcis Pinnis <marcis.pinnis@Tilde.lv>
- CC: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <51081AC2.5050302@w3.org>
Hi Mârcis, all,

just a small comment.

On 29.01.13 19:26, Mârcis Pinnis wrote:
>
> Hi Felix, all,
>
> I have replied inline (in a color close to cyan).
>
> Best regards,
>
> Mârcis ;o)
>
> From: Felix Sasaki [mailto:fsasaki@w3.org]
> Sent: Tuesday, January 29, 2013 7:24 PM
> To: public-multilingualweb-lt@w3.org
> Subject: Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup
>
> Hi Mârcis, all,
>
> On 29.01.13 15:48, Mârcis Pinnis wrote:
>
> Hi Felix,
>
> My comments are inline.
>
> Best regards,
>
> Mârcis ;o)
>
> From: Felix Sasaki [mailto:fsasaki@w3.org]
> Sent: Tuesday, January 29, 2013 11:27 AM
> To: Mârcis Pinnis
> Cc: public-multilingualweb-lt@w3.org; Artűrs Vasiďevskis
> Subject: Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup
>
> Hi Mârcis, all,
>
> even if this discussion has now continued in a different thread, let me give further feedback here too - it may help to clarify things, and to continue the discussion in general.
>
> On 28.01.13 11:18, Mârcis Pinnis wrote:
>
> Hi Felix, all,
>
> I see that there have been a lot of opinion exchanges on the proposal brought up by Felix.
>
> I have some comments to add. I am now speaking as a data producer and later maybe also a data consumer (and I am not speaking as a linguist! ... that has to be understood as well).
>
> First of all, I would like to ask whether we agreed that ITS 2.0 should be able to represent data in the same structure as TEI, NIF, XCES or other NLP related standards do – that is, as far as I understand, the direction where this discussion is heading. Should ITS 2.0 try to re-invent these data standards? I would be inclined to say – no!
>
> As far as I understand, these standards are not yet implemented in localization tool chains.
However, the "multilayer annotation" proposal brought one feature from these standards into such tool chains: the standoff mechanism. I'd rather see this as a value than a problem: bringing NLP friendly representations into localization workflows. Would you disagree?
>
> Mârcis: From the perspective of adding all kinds of annotation, overlapping, contradicting, hierarchical, it certainly is beneficial (I do agree in this aspect).
>
> Mârcis: From the perspective of implementation:
>
> Mârcis: 1) for consumers you suggest reading only known mark-up. I do agree that if we would care only about one tool then we could ignore the rest. But this asks consumers to know who produced the mark-up. By having a flat level flag the consumers did not have to worry about who produced the annotation (also – human and machine users could apply annotation and have an effect on the data); they just read the annotation and used it as is. This is not possible in the stand-off mechanism – the consumer has to know which producer to trust in order to consume the data; otherwise the consumer has to have a disambiguation module at hand that tries to find some reason in all the annotations.
>
> Mârcis: 2) for producers the stand-off mark-up requires adaptation (more than just adding attributes, but still adaptation), which probably is not a big issue.
>
> Mârcis: We (Tilde) are doing both right now – we consume and we produce Terminology. But ... we could switch to just producing and not consuming (which is the part that worries me more...). So we would not have to deal with the disambiguation of the stand-off mark-up and also which annotator to trust or not.
>
> Secondly, as we are in a last call phase, I understand that such significant change to the ITS 2.0 data categories would rewrite them (and maybe it will get clearer when you read my comments till the end).
I as a data producer now will have to rewrite my parsers and data producing systems just to accommodate the „stand-off” mechanisms, which, from a content provider's and content consumer's perspective, is a diametrically different change from just adding additional independent attributes or changing the names of attributes (which was actually the initial proposal by Tadej and me). I would like for others to understand that this solution asks for re-development rather than simple adjustments.
>
> I agree - this would be quite some work, and we need to justify the benefit clearly.
>
> Mârcis: The change will affect our Showcase the most as right now we rely on inline mark-up. If we won’t have the inline mark-up at the end (or we will have additional stand-off mark-up) then we will have to re-think the Showcase design and the visualisation possibilities in the Showcase.
>
> Is this because of CSS used for visualization? I agree that with standoff markup visualization gets more complicated - but not that much. See the JavaScript bit used for localization quality issue here
> http://www.w3.org/TR/2012/WD-its20-20121206/#EX-locQualityIssue-html5-local-2
> http://www.w3.org/TR/2012/WD-its20-20121206/examples/html5/EX-locQualityIssue-html5-local-2.html
>
> The resolution of the ID is only a few lines of JavaScript code.
>
> Mârcis: Thank you for the hint to the JavaScript code. If we will have an agreement that the stand-off mechanism has to be applied, we will definitely look into this (if our developers won’t already have a solution).
>
> Other comments are inline below...
>
> After reading the comments here is a summary:
>
> In my understanding the proposal complicates data production and consumption significantly as it creates possibilities for a lot of ambiguity, which I guess is the opposite of what initially was meant by the disambiguation data category(!)
and at least in our Use Case it requires revision of parser logics and ITS 2.0 metadata annotation logics.
>
> The proposal basically says: here is a way to represent ambiguity, created by several tools annotating the same document. However, I'd see this as a value, not a problem: with separate "its:textAnalyticsAnnotations" elements, each including its own annotatorsRef, you can clearly identify which tool created what annotation. This may be even clearer than the current annotatorsRef.
>
> Mârcis: I do agree, however see above for my comment related to consumers. For them the consumption is different – they will have to know whom to trust. If you think that consumers have to know who produced the data all the time then it is fine by me... (but it is a change from the Terminology as it is right now).
>
> I got your point about trust. And - trying to bring the discussion back - the initial comment was not about trust or no trust. It was rather about unifying terminology and disambiguation - and by reflecting the levels mentioned in disambiguation in different annotation levels, I tried to find a work-around for this. But that work-around and multilayer annotation are not the main topic.
>
> So, if we drop disambiguation granularity, keep the term yes / no requirement, we may have this representation as a unified approach for both data categories:
>
> <span its-tan-confidence="0.7"
>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>       its-term="no">Dublin</span>
>
> If we define that its-term="yes" triggers its-tan-ident-ref to be interpreted as a reference to a termDB, we would have unified the data categories. Well, I guess I'm trying to use an axe for moving this forward ... but let's see what you think.
>
> Mârcis: 1.
Let’s have a summary on this:
>
> 1.1. We have inline mark-up and stand-off mark-up
>
> a. The stand-off mark-up has no inheritance and no overriding, but works on spans, allowing hierarchical annotation and also conflicting annotations - OK
>
> b. The inline mark-up has no inheritance, which means that basically nested mark-ups are treated differently than in the alternative stand-off mark-up (won’t we create two ways to understand the annotation? Seems like the inline is not compatible with stand-off)

In other data categories with stand-off, we have either stand-off or local. So maybe the incompatibility is not an issue if we do the same here.

> 1.2. We drop the granularity and let the class-ref or the ident-ref convey what the disambiguation categories are (only if there isn’t a term=”yes”) - OK
>
> 1.3. If there is term=”yes”, we ignore the class-ref as that is not necessary, have the confidence optional and have the ident-ref optional. If the ident-ref is used, it points to a term-bank. - OK
>
> 2. There are generally 3 possibilities for its-term as I see it if we do not have conflicting annotations (but I understand that there may be):
>
> 2.1. It is not given at all – that means that we can apply mark-up if we would like to
>
> 2.2. It says its-term=”yes” – we do not apply mark-up, because mark-up already exists saying that a phrase is a term (we might however try referencing to a term base if the reference does not exist and the borders of the spans match)
>
> 2.3. It says its-term=”no” – we do not apply mark-up, because mark-up already exists saying that a phrase is definitely not a term
>
> 3. Keeping in mind that we may have the stand-off principle, this creates the following possibilities for data production (somewhat different to the previous):
>
> 3.1. If there is no mark-up – we apply it if we want to
>
> 3.2.a) If there is mark-up – we ignore it and apply our mark-up if we want to (this way we would create the conflicting mark-ups), or:
>
> 3.2.b) If there is mark-up – we do not apply mark-up as there already is existing mark-up for a given span (this way we would accumulate all stand-off mark-ups and inline mark-ups and mark only the fragments that have no span under a term mark-up)
>
> For consumers – 3.2.a is better if we ask consumers to trust producers, 3.2.b is better if we do not have to express trust to producers.
>
> In a summary of the whole comment: 1) to simply get rid of the „red” 1.1.b. issue above, we would apply only stand-off mark-up ignoring in-line mark-up possibilities, 2) depending on decisions on having consumers to trust producers or not we would produce data according to 2a) or 2b); and 3) we would consume data trusting only our annotation tool if we have to trust producers or consolidate annotations if the trust would not be needed.
>
> This is how we would proceed if this scenario moves forward.

Good to know. Thanks a lot for looking into this in detail! Now looking forward to further feedback.

Best,

Felix

> Best,
>
> Felix
>
> However, I will have a discussion with my colleagues in order to estimate how many changes would be required to our use case from a development perspective.
>
> I also understand that this proposal wants to fuse all types of possible NLP-related text analyses together, but I did not have the feeling that ITS 2.0 should be used as a TEI, XCES, NIF, etc. clone? This is how I see where the changes will lead us.
>
> However, I also do not say that that is a bad thing...
we would definitely make linguists happier, but I as a content provider and later also a consumer would have difficulties working with the data as I would have to agree to accept uncertainty/ambiguity in the ITS 2.0 metadata by default (except external resources as those are defined between consumers/producers and not ITS 2.0).
>
> Best regards,
>
> Mârcis ;o)
>
> From: Felix Sasaki [mailto:fsasaki@w3.org]
> Sent: Sunday, January 27, 2013 9:25 AM
> To: public-multilingualweb-lt@w3.org
> Subject: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup
>
> Hi all,
>
> sorry, this is going to be long ... but please have a look, esp. the implementers (both consumers and producers) of terminology and disambiguation.
>
> in the last 10 1/2 months, since Tadej's presentation at the Dublin workshop, we had a lot of discussions on disambiguation, and sometimes (as now) including terminology. But it seems that we never discussed whether the ITS2 approach of selection (global, local, inheritance, overriding (partial or not)...) is suitable for this type of information.
>
> By "this type" I mean annotation of linguistic information. Most ITS2 and ITS1 data categories are process related (e.g. "Don't translate this ..."), but both terminology and what's now called disambiguation are information that you find in linguistic corpora and processing tools. Now, my point is that in both such natural language processing tool chains and related corpora, a representation of information *inline per document node* is rather the exception. Mostly you have *standoff information*, that is a complete separation of information from actual content - as in NIF.
>
> Mârcis:
>
> Parsing and understanding of the mark-up is the main difference (how overriding and inheritance work) that requires this „stand-off” mechanism for „this type” of annotation. If there would be only flat level annotation, we would not have this discussion at all. Also, “stand-off” is only good if you really have to add a lot of complex data, but here we have to add just a flag or a reference (if put in simple words). In Prague Tadej and I discussed that if hierarchical information is needed, that should be encoded in the external resources.
>
> If I understand correctly, stand-off mark-up has no inheritance and it has no overriding – it describes a span?
>
> Correct.
>
> If so, I assume that with your proposal we are back at requiring hierarchical annotation, overlapping annotation and contradictory annotation, which will allow all kinds of text analysis annotations (without restricted types – term, entity, ontology, lexical, etc.). This will require data consumers to re-think their data consumption strategies as they will have to disambiguate the “disambiguation-style” annotations (which means that at the end we do not help data consumers, but rather make their life more difficult).
>
> As said above: if a consumer doesn't want to deal with several layers of annotations, it can just say: I want to consume the annotations made by Tilde or by JSI. This is guaranteed by the annotatorsRef attribute.
>
> Mârcis: Again, see my comment above about the consumer having to know whom to trust.
>
> The current status quo creates this situation: if Tilde already has annotated a text, and JSI wants to add annotations, and you want to compare them: how to do this? You can say "one creates terminology markup, the other disambiguation markup". But what about even more tools?
>
> Mârcis: I agree, currently you can have only one Terminology annotation tool (the disambiguation is not a nice example), but in the current version we acknowledge that there can be only one Terminology annotator for a single phrase (I am fine with that). I understand that the stand-off is a possible way to solve this issue, but for the previous consumer this will create non-comparable annotations (or he will have to update to consuming the ambiguous annotations or just trust one of the annotations). For future consumers this might as well be acceptable.
>
> In the current ITS 2.0 draft the annotation is flat - it is simple to parse, simple to consume, simple to produce – it is not hierarchical and it does not overlap.
>
> See above - if a consumer does not want to consume relations between annotation tools or levels, it doesn't have to, and annotatorsRef gives you the ability to differentiate the annotations.
>
> Btw., current disambiguation and terminology also don't inherit, see the table at
> http://www.w3.org/TR/2012/WD-its20-20121206/#datacategories-defaults-etc
> that is: the annotations of both data categories don't inherit to nested markup. So we could resolve the issue also via something like this:
> Input before annotation: Dublin
> First annotation: <span its-term="yes">Dublin</span>
> Second annotation: <span its-term="yes"><span its-term="no">Dublin</span></span>
>
> Mârcis: As I understand this disambiguates Terminology mark-up from different producers automatically? This is possible in the current version.
>
> The non-existent inheritance, of course, requires the marked span not to contain other mark-up. This, of course, is a limitation of the current version and I agree, might be solved by the stand-off mechanism.
>
> But that has the annotatorsRef issue if several "term annotation" tools have been used.
>
> Mârcis: In the current version we as a consumer treat all Terminology annotators as equally important, thus the annotatorsRef for us is not necessary. However, for data tracking purposes this might be important and with the stand-off mechanism the annotatorsRef becomes mandatory (at least in our consumption scenario).
>
> From this perspective, the proposed change is a complete overhaul of the 2 data categories into something different.
>
> Also – we do require the flag. That is something that will be heavily complicated with the “stand-off” mechanism (that has to be understood), or won’t be possible at all?!
>
> Setting the type would give you the flag. I know that in the proposal at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0212.html
> Tadej dropped the flag. But we could have instead of fixed values, e.g. "term", a URI. You could then interpret that URI as a term flag, e.g.
> <span tan-type="http://example.com/term">
>
> Mârcis: I agree that it gives term=“yes”, but not term=”no”.
>
> Having a simple attribute inline is the simplest you can achieve. Having a “stand-off” on the other hand is the most complex you can achieve.
>
> And ... if I remember correctly, we did not want to make life difficult for producers/consumers if they did not care about the other data categories?
>
> Correct, but here we have the situation that two data categories might be just too similar for keeping everything as is.
>
> Why is that? In linguistic annotation it is common that you have several layers of information, like our lexical, ontological etc. information. Some of these might be complex in themselves (e.g. named entities), some of these might be related to others (e.g. an ontological concept related to a lexical item).
I won't try to define these layers here - but my point is that due to the complexity of representing such information inline, nearly nobody is trying to represent several layers at the same time inline. The common approach is rather to have a base layer, and then pointers from the various annotation layers.
>
> In a sense you can describe NIF as an approach of taking character offsets as the implicit base layer (implicit because characters don't need explicit anchors). The TEI here
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> provides an example for an offset using words as the base unit, with explicit xml:id attributes.
>
> So far we haven't taken this approach for terminology or disambiguation. This is why we had to come up with 16+ attributes: if you want to do everything "inline", you need to differentiate attribute names and come up with a monster data category. Inline annotations are just not suitable for such information.
>
> Mârcis:
>
> I disagree that 16+ attributes are the difficulty here. The difficulty from the beginning were the questions: 1) how many types of annotation should be supported (we narrowed the list down to 4 – terminology, named entities, ontology concepts, lexical concepts)? 2) should overlapping be supported? 3) should hierarchical annotation be supported? 4) should contradicting annotation be supported?
>
> about 1): no type at all would be one solution, but the term identifier issue would come up. about 2): if a consumer just takes up one annotation, e.g. the output of Tilde's tool, there is no need to process overlap. And we can leave that to consumers IMO. 3): same as 2). 4): same as 2).
>
> Mârcis: 1) I agree that the type itself is important. I know that Tadej said that a Ref URI might have the Type embedded, but for Terminology we do not always have the URI available.
>
> Mârcis: 2-4) I agree, but only if you ask the consumers to know which producer to trust. If that is not an issue, then it is fine by me (it is a compromise as we lose the ability to not have to trust anyone at all)
>
> Also ... data producers would have to worry just about a maximum of 5 attributes simultaneously and they would be able to ignore the rest. For instance, I have no use for the attributes for disambiguation categories.
>
> I think that's the heart of issue-68: there are two quite similar pieces of information, but consumers separate them.
>
> Although I would agree to writing a parser that parses all these attributes (just for compliance with the data category), I would as a consumer consume only the ones related to terminology and I as a producer would produce only those related to terminology. I would neither consume nor produce the disambiguation related attributes.
>
> That wouldn't work if we have one data category: our conformance requirements say: you implement it global or local or both. You then can also decide whether you implement it in HTML or XML or both. But you cannot cherry pick attributes for consumption. We don't say anything wrt production - but our schema helps us to verify that the "right data" has been produced.
>
> Mârcis: I think you are misunderstanding – parsing content for consumption and production is a totally different architectural level than the logic that makes any use of the content. So ... in my understanding we are not failing on conformance. We are if there is a requirement that we have to really consume and really produce the other types of data in the application logic layers. Is this the case?
>
> From that perspective, I disagree to the complexity in the attribute scenario.
>
> I think part of the disagreement comes from the "free spirit" you have as data producer and consumer, see above.
>
> For terminology I require a flagging mechanism (with the possibility to add either a reference, a confidence score, or both).
>
> I do agree that we are limiting the annotation with having separate attributes, but then again ... ITS 2.0 does not have to represent every possible text analysis annotation type. It is supposed to aid in localisation processes and not all text analysis types have a valid use case (or a necessary or even a potentially useful use case) in localisation.
>
> Also ... if we are re-inventing terminology and disambiguation, maybe we should analyse which other data categories fall under the type “text analysis”? Domain is a suitable candidate as well (and if we create a suitable text analysis category, maybe domain analysis can be subcategorized under that as well in order to support automated domain analysis solutions (EuroVoc has an automated domain classifier, for instance))?
>
> Here I would disagree: our domain data category is just for transporting domain information between content and tools, including a potential mapping of domain identifiers in between. The "terminology vs disambiguation" discussion came from the observation that two data categories in ITS2 have a huge overlap. I don't see that situation for domain.
>
> Mârcis: I do not agree that domain identification is in structure different than terminology annotation or named entity recognition or even sentence breaking, but fair enough ... I brought this up to show that in general domain annotation is also annotation... a task equal in structure to term tagging or named entity recognition (usually just in bigger spans – but not necessarily).
>
> With this I would like to emphasize that overgeneralization is not the best approach as we are creating data categories for different consumption scenarios.
>
> But are they so different?
It sounds to me rather that in your scenario, many opportunities are lost because you don't consume disambiguation at all ... so having one umbrella data category might even give you more data consumption opportunities.
>
> Mârcis: :) ... of course, the more annotation, the more possibilities (this is a philosophical truth – I cannot argue). But for producing Terminology in our Use Case we do not require knowledge on the purely semantic lexical, ontological or entity level. Our Use Case uses knowledge from the tools that are applied in the process and they do not ask for information supplied by those data categories (we do require domain, language and others though ... that we, of course, use). The use of Disambiguation data categories would require re-thinking of the modules that do not deal with ITS 2.0 explicitly – the term extraction, term weighing, term retrieval methods which are out-of-scope in this project.
>
> So, the first idea behind the approach below is: if you want to represent just one linguistic layer (or "qualifier" in Christian's mail at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
> ), you use the "tan-type" attribute to differentiate annotations. That leads to the following inline models:
>
> 1) A term has its-tan-type with value "term" and optional its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref.
> Example:
> <span its-tan-type="term"
>       its-tan-ident-ref="http://termdatabase.example.com/entry37"
>       its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>       its-tan-confidence="1.0">Dublin</span>
> Comparison to current ITS1 "Terminology":
> its-tan-type="term" plays the role of term="yes".
> its-tan-info-ref plays the role of termInfoRef.
> its-tan-ident-ref links to a term data base.
> its-tan-confidence provides confidence information.
> (Esp.
at Marcis: I know that "Dublin" is a bad candidate for a term, I'm just trying to exemplify the annotation approach here)
>
> Mârcis:
>
> Also one thing I tried to emphasize at lunchtime in Prague, TermInfoRef is not necessarily an identity reference. It does not always point to something unique (if we understand that a set is not unique). You can have multiple term entries from multiple user collections in a term bank relating to one semantic term. If you do not specify a domain you could end up having a reference that points to totally different (also contrasting) terms or if you do not specify a target language you may end up having multiple entries because most of the collections are bilingual and not multilingual. Why is that so? It is because a term-bank is not a disambiguator – it acts like a search engine (more or less) – the disambiguation for the “external” information (the meaning; the term unithood is defined by the flag term=”yes” itself) has to be done by the consumers (translation engines or human translators). In most cases (as in the biggest term-banks – IATE, ETB) it does not have a hierarchical understanding of terms as some lexical (WordNet, for instance) or ontological resources may have. For MT engines, term=“yes” alone is already valuable information, as that defines the term unithood, which means that the term should be translated as a non-breakable phrase. So ... the MT engine could ignore the TermInfoRef altogether if it does not have a suitable disambiguation module and maybe leave the disambiguation to human post-editors...
>
> So ... “ident” is misleading (at least in the case of Terminology annotation)!
>
> Also important: HOW WOULD YOU REPRESENT term=”no”? This is a very important feature of the flag type annotation.
>
> Would you say: its-tan-type="not-a-term"? That would require data producers to handle higher complexity annotation!
>
> I don't have a clear answer to the above questions - others, feel free to chime in if you do.
>
> Mârcis: This is important to understand. Will this be dropped altogether or will there be an alternative mechanism?
>
> 2) An entity has its-tan-type with value "entity" and optional its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref.
> Example:
> <span its-tan-type="entity"
>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>       its-tan-confidence="0.7">Dublin</span>
>
> So above is only different naming compared to current "Terminology" and Disambiguation. Below is now the standoff approach. The processing expectation for tools *producing the annotation* is like this:
> - If there is no inline annotation, just create it (e.g. 1) or 2))
> - If there is inline annotation, check if there is an id attribute (in HTML) or xml:id (if XML serialization of HTML is used and with lower precedence compared to id). For formats other than HTML, add xml:id if possible or use the id attribute appropriate for that format.
>
> Then, for creating standoff annotations, add an "its:textAnalyticsAnnotations" element to the document, e.g. in HTML "script" if needed.
>
> Let's assume before annotation we have
> <span its-tan-type="entity"
>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>       its-tan-confidence="0.7">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="entity"
>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>       its-tan-confidence="0.7" id="a8">Dublin</span>
> and this:
> <its:textAnalyticsAnnotations>
>   <its:textAnalyticsAnnotation ref="a8" its-tan-type="term"
>       its-tan-ident-ref="http://termdatabase.example.com/entry37"
>       its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>       its-tan-confidence="1.0"/>
> </its:textAnalyticsAnnotations>
>
> Let's now assume that before annotation we have
> <span its-tan-type="term"
>       its-tan-ident-ref="http://termdatabase.example.com/entry37"
>       its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>       its-tan-confidence="1.0">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="term"
>       its-tan-ident-ref="http://termdatabase.example.com/entry37"
>       its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>       its-tan-confidence="1.0" id="a8">Dublin</span>
> and this:
> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
>   <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>       its-tan-confidence="0.7"/>
> </its:textAnalyticsAnnotations>
>
> Now, if several "entity" annotation tools have been used, we could also have
> <its:textAnalyticsAnnotations>
>   <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>       its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
>   <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
>       its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>       its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>       its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
> </its:textAnalyticsAnnotations>
>
> Above approach would also influence the consumption of this data category, and of annotatorsRef:
>
> - A consuming tool goes through the document and gathers all textAnalyticsAnnotations elements
> - It then goes through the document. For each element node
>   * check for existing inline markup. If it's available, add it to the list of annotations for that node. Assume the annotatorsRef value in effect inline (found further up in the document tree) to be responsible for the annotation of that markup.
>   * check the accumulated standoff textAnalyticsAnnotations elements for occurrences of IDs that match the node. If there is such an ID, add the related annotation to the list for the node, including the additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
>
> Mârcis:
>
> Do I understand you correctly that we may end up having contradicting annotations also, for instance term=”yes” and term=”no”? This would require a data consumer to be able to handle a lot of ambiguity in the data.
> Sure. But they could identify the ambiguity with a multilayer
> annotation that clearly identifies the tool used, via annotatorsRef.
> Currently, what would you do with this
> <span its-term="yes"><span its-term="no">screwdriver</span></span>
> How would you resolve the ambiguity here? "Terminology" has no
> inheritance. This makes sense; otherwise, in the following
> <span its-term="yes"><span class="em">screw</span>driver</span>
> the embedded "span" element would itself constitute a term. But that
> leads to this test suite output for
> <span its-term="yes"><span its-term="no">screwdriver</span></span>
> /span[1] term="yes"
> /span[1]/span[1] term="no"
> and both "span" nodes contain the same string "screwdriver". So
> how do you resolve the ambiguity here?
>
> Mârcis: I do not see the issue in the above example. As you said,
> Terminology does not inherit; therefore, the only thing that is
> stated is that "screwdriver" is not a term.
>
> Mârcis: However, one thing I have not understood so far: is there
> a limit on how many annotations can be made by the same
> producer (human or machine)? Even annotatorsRef, in my
> understanding, does not always resolve contradictions. Or ...
> is there a precedence rule if there are equal but contradicting
> stand-off annotations, for instance (I made this up to simplify
> understanding):
>
> <its:textAnalyticsAnnotations>
>   <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.7"
>     annotatorsRef="tan|annotator-1"/>
>   <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Person"
>     its-tan-confidence="0.4"
>     annotatorsRef="tan|annotator-1"/>
>   <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Organisation"
>     its-tan-confidence="0.4"
>     annotatorsRef="tan|annotator-1"/>
> </its:textAnalyticsAnnotations>
>
> Mârcis: Here "Dublin" can be all three (Place, Person,
> Organisation) simultaneously, right?
>
> In summary, this standoff tries to solve several issues:
>
> - avoid the 16+ inline attribute monster data category
>
> Mârcis:
>
> Again, I did not understand why this is worse than having a
> heavy "stand-off" mechanism.
>
> - allow for multiple annotations of the same span, with
>   different tools
>
> Mârcis:
>
> In Prague, Tadej and I discussed whether there is a use
> case for two tools producing contradicting mark-up, and
> we came to the conclusion that neither of us would produce
> such data and that, if such a scenario exists, the content
> producer should fuse (disambiguate) the outputs of the two
> separate tools prior to ITS 2.0 metadata application.
> I am talking about the same type of annotation (for instance, two
> term annotation tools on the same span), not two separate types.
>
> Then my question: does such a scenario exist? Who is
> implementing it?
>
> If both you and Tadej agreed on one data category, everybody
> who wants to use both your tools would implement it. And this has
> the value that people could compare the output of the tools.
>
> Mârcis: So you would ask the consumers to disambiguate or choose
> (in this way they would not use both if both produce
> Terminology), right? If yes, that is totally fine; I just want to
> make sure I understand your idea.
>
> - avoid the ITS1/2 or general inline annotation issues with
>   inheritance and overriding: as with the standoff approach
>   exemplified at
>   http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
>   annotation information is just accumulated for a given base
>   item (in our case, element nodes with an ID).
>
> Mârcis:
>
> So ... in the end, with this method we would allow:
>
> 1) Hierarchical annotation
> 2) Contradicting annotation
> 3) (possibly also) overlapping annotation
>
> Correct.
>
> I'm not yet asking for this change, but I see it as a way
> forward that could make the life of both annotation producers
> (Marcis and Tadej) and consumers (Yves et al.) simpler.
> So I'm eager to hear thoughts on this :)
>
> Mârcis:
>
> As I understand the proposal, it is the complete opposite
> of simple (or of simplifying things as they are right now,
> with Terminology and Disambiguation kept separate). It
> complicates things significantly from the Terminology
> standpoint: I no longer see where term="yes" fits in; we
> have to deal with contradicting annotation (whether to allow
> or prohibit it is now a question for the consumers; as a
> consumer I would ask to prohibit it, as I do not see a use
> case for term="yes" and term="no" at the same time); and,
> what is more, we have to re-implement the parsers so that
> instead of overriding and inheritance they work by
> accumulating information (and this is a complete revision of
> the parser logic for the Terminology data category).
>
> I understand the implementation burden you emphasize, but it
> seems that one scenario is not resolved by your proposal:
> annotation using different tools, even for the terminology data
> category (see the nested "terminology" annotations above). You
> say this would not be implemented before ITS2 annotation, but
> what if the tool providers are not from the same organization?
>
> Mârcis: Our proposal did not allow nested annotations. Nor does
> the current ITS 2.0 version. Also, and this was my question: is
> there a necessity to produce 2 Terminology annotations or 2 named
> entity annotations on top of each other? I see that you are
> saying: yes, there is.
>
> Best,
>
> Felix
>
> Thoughts?
>
> - Felix
Received on Tuesday, 29 January 2013 18:54:26 UTC