Re: [xliff] ITS scope with sm/em

From: Felix Sasaki <felix@sasakiatcf.com>
Date: Sun, 12 Oct 2014 15:28:59 +0200
Cc: "Estreen, Fredrik" <Fredrik.Estreen@lionbridge.com>, XLIFF Main List <xliff@lists.oasis-open.org>, public-i18n-its-ig <public-i18n-its-ig@w3.org>
Message-Id: <928581BE-44FA-4AFD-AFEF-805198337A16@sasakiatcf.com>
To: Yves Savourel <ysavourel@enlaso.com>
Hi Yves and all,

On 12.10.2014 at 14:31, Yves Savourel <ysavourel@enlaso.com> wrote:

> Hi Fredrik, all,
> 
>> This can be solved by lowering the <pc> into an <sc/>,<ec/> pair.
> 
> That is a good point for that example, and a solution that should work most of the time.
> 
> But I believe we will have some cases at least of overlapping annotations.
> 
> As an example, below is the result of two text analysis Web services that detected two entities: One "Port Metro Vancouver" and the
> other "City of Vancouver" based on the content "Port Metro of Vancouver City". So we end up with "Vancouver" being shared by the
> two--otherwise distinct--annotation spans. 
> 
> <sm id="m1" type="dbp:entity" ref="http://www.wikidata.org/wiki/Q1187234"/>Port Metro of <sm id="m2" type="oc:entity/City"
> value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver"/>Vancouver<em startRef="m1"/> City<em startRef="m2"/>

It looks like, even without trying to apply ITS information, the above cannot be transformed to hierarchical markup, because there is an overlap: sm "m1" starts, then sm "m2" starts; then "m1" ends, and only then "m2" ends.
If there were a proper nesting like this

<sm id="m1" type="dbp:entity" ref="http://www.wikidata.org/wiki/Q1187234"/>Port Metro of <sm id="m2" type="oc:entity/City"
value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver"/>Vancouver City<em startRef="m2"/><em startRef="m1"/>

one could generate

<mrk id="m1" type="dbp:entity" ref="http://www.wikidata.org/wiki/Q1187234">Port Metro of <mrk id="m2" type="oc:entity/City"
value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver">Vancouver City</mrk></mrk>

These are two nested entities:
Port Metro of Vancouver City
Vancouver City

Since ITS text analysis information does not inherit, the nesting shouldn’t create an issue.

If the annotation tool creates an overlap as in your example, you won’t be able to generate hierarchical markup from it. We pointed that out in the NIF2ITS section here
http://www.w3.org/TR/its20/#nif-backconversion
see case 3. 
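
To make the overlap check concrete, here is a rough Python sketch. It is only an illustration, not part of ITS or XLIFF: the function name find_overlaps and the simplified regular expression are assumptions made here, and real code would use an XML parser. It walks the <sm/> and <em/> markers in order and reports every pair of spans where one is closed while a later-opened one is still open.

```python
import re

# Simplified matcher for <sm id="..." .../> and <em startRef="..."/> markers.
# Assumption: attribute values are double-quoted and contain no ">".
MARKER = re.compile(
    r'<(sm)\b[^>]*?\bid="([^"]+)"[^>]*/>'
    r'|<(em)\b[^>]*?\bstartRef="([^"]+)"[^>]*/>'
)

def find_overlaps(content):
    """Return (closed_id, still_open_id) pairs for overlapping spans."""
    stack = []      # ids of currently open <sm/> markers, innermost last
    overlaps = []
    for match in MARKER.finditer(content):
        if match.group(1):                      # <sm id="..."/>
            stack.append(match.group(2))
            continue
        ref = match.group(4)                    # <em startRef="..."/>
        if ref not in stack:
            continue                            # unmatched end marker, ignored here
        # Every marker opened after `ref` and still open overlaps with it.
        for open_id in stack[stack.index(ref) + 1:]:
            overlaps.append((ref, open_id))
        stack.remove(ref)
    return overlaps
```

An empty result means the markers nest properly and could be rewritten as nested <mrk> elements; a non-empty result corresponds to the overlap case, where at least one annotation has to stay as an <sm/>/<em/> pair.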

> 
> One of the annotations could be set to an <mrk>, but that would leave one as <sm/>/<em/>.
> 
> And the point I was trying to make for Felix is that such an annotation, unlike for a Translate data category for example, cannot be
> decomposed into several <mrk> because the ITS information (here it would be some Text Analysis data) applies only to the complete span,
> not its parts.
> 
> In other words we cannot do:
> 
> <mrk id="m1" type="dbp:entity" ref="http://www.wikidata.org/wiki/Q1187234">Port Metro of <mrk id="m2" type="oc:entity/City"
> value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver">Vancouver</mrk></mrk><mrk id="m2bis" type="oc:entity/City"
> value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver"> City</mrk>
> 
> because "City" should not be associated alone with the ITS data.
> 
> Sure, a tool could detect that two consecutive <mrk> with the same ITS information should be seen as a single one, but that is not
> an ITS processing expectation.


Would it be possible to accommodate this in the global rules file, by having a rule that selects elements based on the same attribute values? Ideally one would repeat m2 in your example and then select all "mrk" elements with the same "id" value, though of course you can't repeat the id value.
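
Outside of the global rules, a tool could do this grouping itself. The following Python sketch is only an illustration under assumptions: the function name group_split_annotations is hypothetical, the content is assumed to be well-formed XML without namespaces, and comparing the "type", "value" and "ref" attributes is just one possible notion of "same ITS information". It is not an ITS processing expectation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def group_split_annotations(xml_text, keys=("type", "value", "ref")):
    """Group <mrk> ids that carry identical values for the given attributes.

    Returns only the groups with more than one member, i.e. candidates
    for being pieces of a single annotation that was split up.
    """
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for mrk in root.iter("mrk"):                 # document order
        signature = tuple(mrk.get(key) for key in keys)
        groups[signature].append(mrk.get("id"))
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}
```

Applied to the split example quoted above (wrapped in a single root element), this would put "m2" and "m2bis" into one group and leave "m1" out, since no other element shares its attribute values.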

Cheers,

Felix

> 
> I'm not sure what transformation would resolve this problem.
> 
> Cheers,
> -ys
> 
> 
Received on Sunday, 12 October 2014 13:29:29 UTC