RE: [xliff] ITS scope with sm/em

From: Yves Savourel <ysavourel@enlaso.com>
Date: Sun, 12 Oct 2014 06:31:46 -0600
To: "'Estreen, Fredrik'" <Fredrik.Estreen@lionbridge.com>, "'Felix Sasaki'" <felix@sasakiatcf.com>
CC: "XLIFF Main List" <xliff@lists.oasis-open.org>, "'public-i18n-its-ig'" <public-i18n-its-ig@w3.org>
Message-ID: <002201cfe618$7cf5c980$76e15c80$@enlaso.com>
Hi Fredrik, all,

> This can be solved by lowering the <pc> into an <sc/>,<ec/> pair.

That is a good point for that example, and a solution that should work most of the time.

But I believe we will have at least some cases of overlapping annotations.

As an example, below is the result of two text analysis Web services that detected two entities: one "Port Metro Vancouver" and the
other "City of Vancouver", both based on the content "Port Metro of Vancouver City". So we end up with "Vancouver" being shared by
the two otherwise distinct annotation spans.

<sm id="m1" type="dbp:entity" ref="http://www.wikidata.org/wiki/Q1187234"/>Port Metro of <sm id="m2" type="oc:entity/City"
value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver"/>Vancouver<em startRef="m1"/> City<em startRef="m2"/>

One of the annotations could be set to an <mrk>, but that would leave one as <sm/>/<em/>.
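
To make the overlap concrete, here is a minimal sketch (not part of any XLIFF or ITS tooling) that treats each annotation as a pair of character offsets into the example content and checks whether two spans cross. The function name `spans_cross` and the offset representation are assumptions made for illustration only.

```python
def spans_cross(a, b):
    """Return True if half-open spans a and b partially overlap.

    Nested spans and disjoint spans return False: both can be written
    with <mrk>. Crossing spans cannot both be well-formed <mrk> elements,
    so at least one has to be written as an <sm/>/<em/> pair.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_start < a_end < b_end or b_start < a_start < b_end < a_end


text = "Port Metro of Vancouver City"
shared = text.index("Vancouver")

# m1 covers "Port Metro of Vancouver", m2 covers "Vancouver City".
m1 = (0, shared + len("Vancouver"))
m2 = (shared, len(text))

print(text[m1[0]:m1[1]])
print(text[m2[0]:m2[1]])
print(spans_cross(m1, m2))
```

With these offsets the two spans share only "Vancouver", so neither can be nested inside the other.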

And the point I was trying to make for Felix is that such an annotation, unlike a Translate data category for example, cannot be
decomposed into several <mrk> because the ITS information (here it would be some Text Analysis data) applies only to the complete
span, not to its parts.

In other words we cannot do:

<mrk id="m1" type="dbp:entity" ref="http://www.wikidata.org/wiki/Q1187234">Port Metro of <mrk id="m2" type="oc:entity/City"
value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver">Vancouver</mrk></mrk><mrk id="m2bis" type="oc:entity/City"
value="City of Vancouver" ref="http://en.wikipedia.org/wiki/Vancouver"> City</mrk>

because "City" should not be associated alone with the ITS data.

Sure, a tool could detect that two consecutive <mrk> with the same ITS information should be seen as a single one, but that is not
an ITS processing expectation.
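
The kind of tool-side detection mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the content is represented as a flat list of (text, ref) runs for a single annotation layer, and `merge_adjacent` is a hypothetical helper, not something defined by ITS or XLIFF.

```python
from itertools import groupby


def merge_adjacent(runs):
    """Merge consecutive (text, ref) runs that carry the same ref.

    Assumption: identical ITS data on adjacent runs means they were one
    annotation that got split. ITS itself does not define this behavior.
    """
    merged = []
    for ref, group in groupby(runs, key=lambda run: run[1]):
        merged.append(("".join(text for text, _ in group), ref))
    return merged


city_ref = "http://en.wikipedia.org/wiki/Vancouver"

# The m2 layer of the split example: m2 and m2bis carry the same data.
runs = [
    ("Port Metro of ", None),
    ("Vancouver", city_ref),
    (" City", city_ref),
]

for text, ref in merge_adjacent(runs):
    print(repr(text), ref)
```

Such a heuristic would also merge two genuinely separate adjacent entities that happen to share the same data, which is one reason it cannot be relied on as general processing behavior.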

I'm not sure what transformation would resolve this problem.

Received on Sunday, 12 October 2014 12:32:15 UTC