- From: Felix Sasaki <fsasaki@w3.org>
- Date: Mon, 28 Jan 2013 08:44:13 +0100
- To: "Lieske, Christian" <christian.lieske@sap.com>
- CC: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <51062C4D.1090101@w3.org>
Hi Christian,

On 28.01.13 08:33, Lieske, Christian wrote:
>
> Hi Felix,
>
> Thanks for all of the work you put into this. Your analysis and
> suggestions have informed my own understanding of the challenge and
> possible solution.
>

Thanks for the nice feedback.

> Although details of your suggestion (e.g. what I would phrase as
> "only generate stand-off if the inline place already has been taken")
> may require additional discussions, it would be great if at least the
> Working Group members would be positive about them.
>

No worries about that - I'm very happy about the feedback received from
Yves and Dave so far. And as said before, I'd encourage everybody from
the implementers' side (consumers + producers of disambig+term) and from
the users / "policy" side to chime in on the discussion. The more
feedback the better :)

Best,

Felix

> Cheers,
>
> Christian
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Sunday, 27 January 2013 08:25
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* issue-68 from an annotation representation point of view,
> with potential implications for annotatorsRef and standoff markup
>
> Hi all,
>
> sorry, this is going to be long ... but please have a look, esp. the
> implementers (both consumers and producers) of terminology and
> disambiguation.
>
> in the last 10 1/2 months, since Tadej's presentation at the Dublin
> workshop, we had a lot of discussions on disambiguation, and sometimes
> (as now) including terminology. But it seems that we never discussed
> whether the ITS2 approach of selection (global, local, inheritance,
> overriding (partial or not)...) is suitable for this type of information.
>
> By "this type" I mean annotation of linguistic information. Most ITS2
> and ITS1 data categories are process related (e.g. "Don't translate
> this ..."), but both terminology and what's now called disambiguation
> are information that you find in linguistic corpora and processing
> tools.
> Now, my point is that in both such natural language
> processing tool chains and related corpora, a representation of
> information *inline per document node* is rather the exception. Mostly
> you have *standoff information*, that is a complete separation of
> information from actual content - as in NIF.
>
> Why is that? In linguistic annotation it is common that you have
> several layers of information, like our lexical, ontological etc.
> information. Some of these might be complex in themselves (e.g. named
> entities), some of these might be related to others (e.g. an
> ontological concept related to a lexical item). I won't try to define
> these layers here - but my point is that due to the complexity of
> representing such information inline, nearly nobody is trying to
> represent several layers at the same time inline. The common approach
> is rather to have a base layer, and then pointers from the various
> annotation layers.
>
> In a sense you can describe NIF as an approach of taking character
> offsets as the implicit base layer (implicit because characters don't
> need explicit anchors). The TEI here
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> provides an example for an offset using words as the base unit, with
> explicit xml:id attributes.
>
> So far we haven't taken this approach for terminology or
> disambiguation. This is why we had to come up with 16+ attributes: if
> you want to do everything "inline", you need to differentiate
> attribute names and come up with a monster data category. Inline
> annotations are just not suitable for such information.
>
> So, the first idea behind the approach below is: if you want to
> represent just one linguistic layer (or "qualifier" in Christian's mail at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
> ), you use the "tan-type" attribute to differentiate annotations.
> That leads to the following inline models:
>
> 1) A term has its-tan-type with value "term" and optional
> its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
> <span its-tan-type="term"
> its-tan-ident-ref="http://termdatabase.example.com/entry37"
> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
> its-tan-confidence="1.0">Dublin</span>
> Comparison to current ITS1 "Terminology":
> its-tan-type="term" plays the role of term="yes". its-tan-info-ref
> plays the role of termInfoRef. its-tan-ident-ref links to a term data
> base. its-tan-confidence provides confidence information.
> (Esp. at Marcis: I know that "Dublin" is a bad candidate for a term,
> I'm just trying to exemplify the annotation approach here)
>
> 2) An entity has its-tan-type with value "entity" and optional
> its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
> <span its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-confidence="0.7">Dublin</span>
>
> So the above is only different naming compared to the current
> "Terminology" and Disambiguation. Below is now the standoff approach.
> The processing expectation for tools *producing the annotation* is
> like this:
> - If there is no inline annotation, just create it (e.g. 1) or 2))
> - If there is inline annotation, check if there is an id attribute (in
> HTML) or xml:id (if XML serialization of HTML is used and with lower
> precedence compared to id). For formats other than HTML, add xml:id if
> possible or use the id attribute appropriate for that format.
>
> Then, for creating standoff annotations, add an
> "its:textAnalyticsAnnotations" element to the document, e.g. in HTML
> "script" if needed.
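The producer-side expectation described in the quoted proposal can be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the proposal: it uses Python's standard xml.etree.ElementTree on an XML-serialized document, the function name annotate and its new_id parameter are hypothetical, the target element is assumed to exist already, and the standoff container is appended directly to the root rather than placed inside an HTML "script" element.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("its", ITS_NS)


def annotate(root, target, tan_attrs, new_id):
    """Attach its-tan-* attributes to `target`: inline if free, else standoff.

    `annotate` and `new_id` are hypothetical names used for this sketch only.
    """
    already_annotated = any(name.startswith("its-tan-") for name in target.attrib)
    if not already_annotated:
        # No inline annotation yet: just create it inline.
        target.attrib.update(tan_attrs)
        return target

    # The inline place is taken: reuse an existing id (id wins over xml:id),
    # otherwise add a new one.
    ref = target.get("id") or target.get(XML_ID)
    if ref is None:
        ref = new_id
        target.set("id", ref)

    # Find or create the standoff container (assumed to be a child of root),
    # then add one annotation element pointing at the target.
    container = root.find(f"{{{ITS_NS}}}textAnalyticsAnnotations")
    if container is None:
        container = ET.SubElement(root, f"{{{ITS_NS}}}textAnalyticsAnnotations")
    return ET.SubElement(
        container,
        f"{{{ITS_NS}}}textAnalyticsAnnotation",
        {"ref": ref, **tan_attrs},
    )
```

A tool would call annotate once per annotation it wants to record; the first annotation for a span lands inline, and any later one for the same span goes into the standoff container.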
>
> Let's assume before annotation we have
> <span its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-confidence="0.7">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-confidence="0.7" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term"
> its-tan-ident-ref="http://termdatabase.example.com/entry37"
> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
> its-tan-confidence="1.0"/>
> </its:textAnalyticsAnnotations>
>
> Let's now assume that before annotation we have
> <span its-tan-type="term"
> its-tan-ident-ref="http://termdatabase.example.com/entry37"
> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
> its-tan-confidence="1.0">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="term"
> its-tan-ident-ref="http://termdatabase.example.com/entry37"
> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
> its-tan-confidence="1.0" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-confidence="0.7"/>
> </its:textAnalyticsAnnotations>
>
> Now, if several "entity" annotation tools have been used, we could
> also have
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
> </its:textAnalyticsAnnotations>
>
> The above approach would also influence the consumption of this data
> category, and of annotatorsRef:
>
> - A consuming tool goes through the document and gathers all
> textAnalyticsAnnotations elements
> - It then goes through the document. For each element node
> * check for existing inline markup. If it's available, add it to the
> list of annotations for that node. Assume the inline version up in the
> document tree of annotatorsRef to be responsible for the annotation of
> that markup.
> * check the accumulated standoff textAnalyticsAnnotations elements for
> occurrences of IDs that match the node. If there is such an ID, add
> the related annotation to the list for the node, including the
> additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
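The two consumption steps listed in the quoted proposal could look roughly like this in code. This is a minimal sketch under assumptions, not a definitive implementation: it uses Python's standard xml.etree.ElementTree on an XML-serialized document, the function names and the SAMPLE document are hypothetical, the inline annotator is read from an inherited its-annotators-ref attribute, and a standoff annotation falls back to the annotatorsRef of its container when it has none of its own.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
CONTAINER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ANNOTATION = f"{{{ITS_NS}}}textAnalyticsAnnotation"

# Hypothetical sample: one inline "term" plus two standoff "entity" annotations.
SAMPLE = """<body xmlns:its="http://www.w3.org/2005/11/its"
      its-annotators-ref="tan|tool-a">
  <p><span its-tan-type="term" its-tan-confidence="1.0" id="a8">Dublin</span></p>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</body>"""


def tan_attributes(element):
    """Return only the its-tan-* attributes of an element."""
    return {k: v for k, v in element.attrib.items() if k.startswith("its-tan-")}


def collect_annotations(root):
    """Return a list of (element, [(annotator, its-tan attributes), ...])."""
    # Step 1: gather all standoff annotations, keyed by the id they reference.
    standoff = {}
    for container in root.iter(CONTAINER):
        for ann in container.findall(ANNOTATION):
            annotator = ann.get("annotatorsRef") or container.get("annotatorsRef")
            standoff.setdefault(ann.get("ref"), []).append(
                (annotator, tan_attributes(ann)))

    # Step 2: walk the content, accumulating inline and standoff annotations.
    result = []

    def walk(element, inherited):
        if element.tag == CONTAINER:
            return  # the standoff container itself is not content
        inherited = element.get("its-annotators-ref", inherited)
        found = []
        inline = tan_attributes(element)
        if inline:
            found.append((inherited, inline))
        element_id = element.get("id") or element.get(XML_ID)
        if element_id is not None:
            found.extend(standoff.get(element_id, []))
        if found:
            result.append((element, found))
        for child in element:
            walk(child, inherited)

    walk(root, None)
    return result
```

Calling collect_annotations(ET.fromstring(SAMPLE)) yields a single entry for the span with id "a8", carrying the inline term annotation followed by the two standoff entity annotations. Nothing is overridden; annotations are simply accumulated per node.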
>
> In summary, this standoff tries to solve several issues:
>
> - avoid the 16+ inline attribute monster data category
> - allow for multiple annotations of the same span, with different tools
> - avoid the ITS1/2 or general inline annotation issues with
> inheritance and overriding - as with the standoff approach
> exemplified at
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> annotation information is just accumulated for a given base item (in
> our case, element nodes with an ID).
>
> I'm not yet asking for this change, but I see it as a way forward that
> could make the life of both annotation producers (Marcis and Tadej)
> and consumers (Yves et al.) simpler. So I'm eager to hear thoughts on
> this :)
>
> Thoughts?
>
> - Felix
>
Received on Monday, 28 January 2013 07:44:41 UTC