Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

From: Tadej Štajner <tadej.stajner@ijs.si>
Date: Mon, 28 Jan 2013 19:08:29 +0100
Message-Id: <AE9D7715-971D-49D4-BCBA-1497AF9319F6@ijs.si>
Cc: Yves Savourel <ysavourel@enlaso.com>, Felix Sasaki <fsasaki@w3.org>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Artūrs Vasiļevskis <arturs.vasilevskis@Tilde.lv>
To: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
Hi all (long e-mail ahead; you can scroll to the TL;DR),
True, the current state is a local optimum that satisfies the requirements. It would still need some polish, better guidance, and stricter definitions, and possibly renaming disambigGranularity back to disambigType.

As an improvement, Felix's proposal makes some sense, since it makes ITS 2.0 capable of proper multi-layer annotation. If having two mechanisms (inline and stand-off annotation) is too complex to implement, it would be an acceptable compromise to keep only the stand-off form and drop the inline one (except for term="yes", maybe), but I'd vote in favor of keeping the inline part.

Also, the ref/id pointing could be expressed the other way around, pointing from the fragment to the annotation. Instead of:
<span id="dublin1">Dublin</span>
...
<its:textAnalysisAnnotation its:tanType="entity" its:tanIdentRef="http://dbpedia.org/resource/Dublin" ref="dublin1" />

I would suggest the same mechanism as in LQI, so that we have some symmetry:

<span its:tanRefs="tan1">Dublin</span>
<its:textAnalysisAnnotations id="tan1">
    <its:textAnalysisAnnotation its:tanType="entity" its:tanIdentRef="http://dbpedia.org/resource/Dublin"/>
</its:textAnalysisAnnotations>
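
To make the inverted addressing concrete, here is a minimal Python sketch of how a consumer might follow its:tanRefs from a fragment to its stand-off annotations. The its:tan* names are the ones proposed in this thread, not part of any published specification; the <doc> wrapper element, the tolerance for a leading "#", and the function name resolve_annotations are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"  # ITS namespace URI

# Hypothetical document: the <doc> wrapper is an assumption so that the
# fragment from this e-mail parses as a single well-formed XML tree.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <span its:tanRefs="tan1">Dublin</span>
  <its:textAnalysisAnnotations id="tan1">
    <its:textAnalysisAnnotation its:tanType="entity"
        its:tanIdentRef="http://dbpedia.org/resource/Dublin"/>
  </its:textAnalysisAnnotations>
</doc>"""


def resolve_annotations(xml_text):
    """Return (text, annotations) pairs for every element carrying its:tanRefs."""
    root = ET.fromstring(xml_text)
    # Index the stand-off containers by their id attribute.
    containers = {
        el.get("id"): el
        for el in root.iter(f"{{{ITS}}}textAnalysisAnnotations")
    }
    results = []
    for el in root.iter():
        ref = el.get(f"{{{ITS}}}tanRefs")
        if ref is None:
            continue
        # Accept both "tan1" and "#tan1" (assumption: either form may occur).
        container = containers.get(ref.lstrip("#"))
        annotations = []
        if container is not None:
            for ann in container.findall(f"{{{ITS}}}textAnalysisAnnotation"):
                annotations.append({
                    "type": ann.get(f"{{{ITS}}}tanType"),
                    "identRef": ann.get(f"{{{ITS}}}tanIdentRef"),
                })
        results.append((el.text, annotations))
    return results
```

With this direction of pointing, the consumer starts from the marked-up text and does one lookup per fragment; several annotations for the same fragment simply become several children of the same container.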

Secondly, here is another alternative (and orthogonal) proposal, repeating what Pablo Mendes already hinted at in August. Remember the question of supporting the distinction between different disambiguation types (entity, lexical concept, ontology concept), which is now encoded in the 'disambigGranularity' attribute (relevant discussion: http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Aug/0322.html).

When trying to merge Terminology and Disambiguation, having that many disambiguation types supported in the same way implies that we end up with 16 or so attributes. After some discussion in Prague, we realized that although we've established that a distinction between those types exists and it is important, we couldn't come up with a use case where having that information would make a difference in the actual workflows. 

Let me clarify: if a consumer component cares about disambiguation, it will try to resolve the disambigIdentRef identifier. By resolving it, the consumer learns what type/level/granularity of disambiguation it is dealing with. By that reasoning, stating this information explicitly is redundant, because resolution already provides it. The question is: is there a use case that justifies keeping 'disambigGranularity'? For instance, operating on the disambiguation values without actually resolving them? Maybe filtering?

So, we'd go from:
<span 
          its-disambig-confidence="0.7"
          its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"  
          its-disambig-ident-ref="http://dbpedia.org/resource/Dublin" 
          its-disambig-granularity="entity">Dublin</span> 
      is the <span 
          its-disambig-source="Wordnet3.0" 
          its-disambig-ident="301467919" 
          its-disambig-granularity="lexical-concept"
          its-disambig-confidence="0.5"
          >capital</span> of Ireland.

to:
<span 
          its-disambig-confidence="0.7"
          its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"  
          its-disambig-ident-ref="http://dbpedia.org/resource/Dublin">Dublin</span> 
      is the <span 
          its-disambig-source="Wordnet3.0" 
          its-disambig-ident="301467919"
          its-disambig-confidence="0.5"
          >capital</span> of Ireland.
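
The filtering case raised above can be sketched as follows. This is only an illustration under stated assumptions: the class GranularityFilter, the function spans_with_granularity, and the wrapping <p> element are hypothetical, and the attribute names are the ones used in the examples in this e-mail. It uses Python's standard html.parser and never resolves any identifier.

```python
from html.parser import HTMLParser

# The two forms from this e-mail, wrapped in a <p> (an assumption).
WITH_GRANULARITY = """<p><span its-disambig-confidence="0.7"
  its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
  its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
  its-disambig-granularity="entity">Dublin</span> is the
  <span its-disambig-source="Wordnet3.0" its-disambig-ident="301467919"
  its-disambig-granularity="lexical-concept"
  its-disambig-confidence="0.5">capital</span> of Ireland.</p>"""

WITHOUT_GRANULARITY = """<p><span its-disambig-confidence="0.7"
  its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
  its-disambig-ident-ref="http://dbpedia.org/resource/Dublin">Dublin</span>
  is the <span its-disambig-source="Wordnet3.0"
  its-disambig-ident="301467919"
  its-disambig-confidence="0.5">capital</span> of Ireland.</p>"""


class GranularityFilter(HTMLParser):
    """Collect the text of spans whose granularity attribute equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._stack = []  # one flag per open <span>: does it match?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            value = dict(attrs).get("its-disambig-granularity")
            self._stack.append(value == self.wanted)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Only text directly inside a matching span is kept.
        if self._stack and self._stack[-1]:
            self.matches.append(data.strip())


def spans_with_granularity(html, wanted):
    parser = GranularityFilter(wanted)
    parser.feed(html)
    parser.close()
    return parser.matches
```

On the first form, filtering for "entity" yields the "Dublin" span and filtering for "lexical-concept" yields "capital". On the second form the same filter finds nothing, so a consumer that wants this distinction has to resolve each identifier (or inspect which attributes are present) instead.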

In this setting, ITS would just operate with references to identifiers and wouldn't care about the type of that relationship. I understand that this loses information and weakens the expressive power, but I'm asking because it might simplify a couple of solutions here. Even though I proposed it initially, I wouldn't push something that hasn't got any consumers behind it (the T in ITS doesn't stand for Tadej... :) ). It would also establish a clearer boundary between what ITS covers and what other formats should cover.

TL;DR
In short, I see a few scenarios that I'd be OK with:
1) If we keep 'granularity':
    1a) We keep granularity in the form of its:tanType and go with Felix's proposal, possibly inverting the addressing so that it works like LQI;
    1b) We keep granularity and keep the currently proposed Disambiguation data model, possibly renaming 'disambigGranularity' back to 'disambigType';
2) If we drop 'granularity', we probably wouldn't need the new its:tan* model, and it would make sense to keep the rest of the Disambiguation data category as-is and describe the three usage scenarios only as best practices. Disambiguation would then serve as a less-specific 'pointer to some meaning identifier' brother to Terminology.

-- Tadej

On 28. 01. 2013 16:42, Mārcis Pinnis wrote:
> Hi Felix, all,
> 
> I also do not have anything against leaving everything as is.
> I however (as I made clear in my previous e-mail) don't think that the stand-off markup is a nice solution.
> 
> Best regards,
> Mārcis ;o)
> 
> -----Original Message-----
> From: Yves Savourel [mailto:ysavourel@enlaso.com] 
> Sent: Monday, January 28, 2013 5:31 PM
> To: 'Felix Sasaki'; Mārcis Pinnis
> Cc: public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
> Subject: RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup
> 
> Hi Felix, all,
> 
>> Just a judgment from my side: I think at the moment we don't have 
>> consensus for
>> 
>> - leaving everything as is (Dave's proposal)
> I don't have anything against leaving things as is.
> There is nothing really broken.
> 
> It's just that having both data categories fused would be a bit nicer. But overall if there is no time to make that work, we can indeed just leave it as it is.
> 
> cheers,
> -yves
> 
> 
Received on Monday, 28 January 2013 18:06:39 UTC
