Re: ITS rules for XLIFF 2.1 from Sergey Nozhenko on 2016-08-23 (public-i18n-its-ig@w3.org from August 2016)

From: Sergey Nozhenko <sergey.nozhenko@logrusglobal.com>
Date: Tue, 23 Aug 2016 22:50:18 +0300
To: Felix Sasaki <fsasaki@w3.org>
CC: Serge Gladkoff <serge.gladkoff@gmail.com>, "public-i18n-its-ig@w3.org" <public-i18n-its-ig@w3.org>, Renat Bikmatov <renat.bikmatov@logrusglobal.com>
Message-ID: <25b85751-2305-06f0-50c3-c7dc3cee071a@logrusglobal.com>
Hi,

The sm and em elements may be nested in an mrk element and may overlap it. For example:

<xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0" 
srcLang="en" xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
  <file id="f1">
   <unit id="u1">
    <segment>
     <source><mrk id="m1" translate="no" type="term">Text1 <sm id="sm1" 
type="itsm:generic" itsm:taClassRef="http://example/ontology#Thing" 
itsm:taIdentRef="http://example.com/ref"/>Text2.</mrk></source>
    </segment>
    <segment>
     <source>Text4<em startRef="sm1"/> text5.</source>
    </segment>
   </unit>
  </file>
</xliff>
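
A processor could recover the content covered by such an sm/em pair roughly as sketched below. This is only an illustration under my own assumptions, not the XPath from Felix's XSLT implementation: it uses Python's standard xml.etree.ElementTree, the helper names (events, annotated_spans) are made up for this mail, and it assumes that an sm is closed by an em with the matching startRef in the same unit, searching only containers of the same type (all source elements or all target elements).

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"

# Compact copy of the example above: sm starts inside mrk, em sits in the next segment.
SAMPLE = """<xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0"
 srcLang="en" xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
 <file id="f1"><unit id="u1">
  <segment><source><mrk id="m1" translate="no" type="term">Text1 <sm id="sm1"
   type="itsm:generic" itsm:taIdentRef="http://example.com/ref"/>Text2.</mrk></source></segment>
  <segment><source>Text4<em startRef="sm1"/> text5.</source></segment>
 </unit></file>
</xliff>"""


def events(elem):
    """Yield ('sm', id), ('em', startRef) and ('text', str) in document order."""
    name = elem.tag.rsplit("}", 1)[-1]  # local name without the namespace
    if name == "sm":
        yield ("sm", elem.get("id"))
    elif name == "em":
        yield ("em", elem.get("startRef"))
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from events(child)
        if child.tail:
            yield ("text", child.tail)


def annotated_spans(root, container="source"):
    """Map each sm id to the text between that sm and its matching em.

    Hypothetical helper: spans may cross mrk and segment boundaries, but
    only within one unit and one container type.
    """
    spans = {}
    for unit in root.iter(f"{{{XLIFF_NS}}}unit"):
        open_ids = set()  # sm markers whose em has not been seen yet
        for cont in unit.iter(f"{{{XLIFF_NS}}}{container}"):
            for kind, value in events(cont):
                if kind == "sm":
                    open_ids.add(value)
                    spans[value] = ""
                elif kind == "em":
                    open_ids.discard(value)
                else:
                    for sm_id in open_ids:
                        spans[sm_id] += value
    return spans


print(annotated_spans(ET.fromstring(SAMPLE)))  # {'sm1': 'Text2.Text4'}
```

The ITS attributes on sm1 would then apply to "Text2." and "Text4", while mrk inheritance for "Text1 " and "Text2." is left to the normal ITS processing.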

Serge

On 23.08.2016 19:48, Felix Sasaki wrote:
> Apologies for the late reply, Sergey, Serge and all.
>
> The issue is an XLIFF issue related to the annotations mechanism:
> http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#annotations
> If an annotation in XLIFF is represented with sm and em, the 
> application has to find the content relating to the annotation.
>
> I think this is doable, both for general XLIFF annotations (e.g. of 
> terms) and ITS annotations. I updated my implementation with an XPath 
> expression that has a larger search space than the previous one. The 
> new expression searches for the corresponding em tag in the following 
> nodes that have the same parent node type (e.g. all source elements or 
> all target elements).
>
> It seems to work, see the more complex input
> https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/inputfile.xml
> and output
> https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/output-inline-annotation.xml
> and the adapted XPath at
> https://github.com/fsasaki/its20-extractor/commit/62428b4484df7a073be3c2c0033e2a389dc83350
> in tools/datacategories-2-xsl.xsl.
>
> I’m happy to work on this more if you give me more XLIFF annotation 
> samples.
>
> Best,
>
> Felix
>
>
>> On 18.08.2016 at 11:32, Sergey Nozhenko 
>> <sergey.nozhenko@logrusglobal.com 
>> <mailto:sergey.nozhenko@logrusglobal.com>> wrote:
>>
>> How about this:
>>
>> <xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0" srcLang="en" trgLang="ru"
>>   xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
>>   <file id="f1">
>>    <unit id="u1">
>>     <segment>
>>      <source><sm id="sm1" type="itsm:generic" itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
>>       itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>Arizona</source>
>>      <target>Аризона</target>
>>     </segment>
>>     <segment>
>>      <source><em startRef="sm1"/> Yeah!</source>
>>      <target>Да!</target>
>>     </segment>
>>    </unit>
>>   </file>
>> </xliff>
>>
>> Serge
>>
>> *From:*Felix Sasaki <mailto:fsasaki@w3.org>
>> *Sent:*18 August 2016, 8:18
>> *To:*Serge Gladkoff <mailto:serge.gladkoff@gmail.com>
>> *Cc:*public-i18n-its-ig@w3.org 
>> <mailto:public-i18n-its-ig@w3.org>;Renat Bikmatov 
>> <mailto:renat.bikmatov@logrusglobal.com>;Sergey Nozhenko 
>> <mailto:sergey.nozhenko@logrusglobal.com>
>> *Subject:*Re: ITS rules for XLIFF 2.1
>>
>>
>>> On 17.08.2016 at 23:08, Serge Gladkoff 
>>> <serge.gladkoff@gmail.com <mailto:serge.gladkoff@gmail.com>> wrote:
>>>
>>> Hello Felix,
>>> I am sorry to say this, but our developers believe that this is a 
>>> clear case where ITS has hit rock bottom, so to speak.
>>> The function of the <sm>/<em> tags is to mark up areas which cannot 
>>> be annotated by one tag because this would result in an invalid XML 
>>> file. This happens when the markup conflicts with other tags, 
>>> for example, with segmentation.
>>> In such cases inheritance does not work because the beginning of the 
>>> unit may find itself inside one tag and the end inside another, 
>>> and even on different levels.
>>
>> Indeed - that was exactly my point.
>>
>>> How could one describe the distribution of ITS tags in such cases?
>>
>> By keeping your ITS processor (including inheritance behavior) as is, 
>> and then specifying additional processing for sm, as defined below. My 
>> main point was that this does not change the behavior of a conformant 
>> ITS processor. It is *additional* behavior.
>>
>>> Indeed, it is far from clear.
>>> I wouldn't call this “a small burden”.
>>
>> I implemented this as an additional behavior of my ITS processor. See
>> https://github.com/fsasaki/its20-extractor/commit/4816b29f8b7010f307c5dad98b1ab4aa92c4ae70
>> for the changes to datacategories-2-xsl.xsl. The change was 4 lines of 
>> code. I am happy to look at your code with your developers, if that 
>> helps to lower the burden.
>>
>> Best,
>>
>> Felix
>>
>>> Regards,
>>> Serge
>>> *From:*Felix Sasaki [mailto:fsasaki@w3.org]
>>> *Sent:*Tuesday, August 16, 2016 7:20 PM
>>> *To:*public-i18n-its-ig@w3.org <mailto:public-i18n-its-ig@w3.org>
>>> *Subject:*ITS rules for XLIFF 2.1
>>> Hi all,
>>> in the OASIS TC, the support of ITS in XLIFF 2.1 is currently being 
>>> discussed.
>>> As part of the discussion, an ITS rules file is being developed. The file 
>>> should allow general ITS processors to work with XLIFF 2.X 
>>> documents. There is one issue: XLIFF has the elements "sm" and "em", 
>>> which are empty markers. (ITS or any other) information then relates 
>>> to the content between the start and end marker.
>>> Below is a mail I had sent to the XLIFF list to find a workaround. 
>>> This would put a (small) burden on ITS processors, to deal with the 
>>> sm / em elements. See below, I tried this with my general XSLT 
>>> implementation. What do people think about this, esp. implementers?
>>> Best,
>>> Felix
>>>
>>>
>>> Begin forwarded message:
>>> *From: *Felix Sasaki <felix@sasakiatcf.com <mailto:felix@sasakiatcf.com>>
>>> *Subject: Implementation of XLIFF 2.1 - ITS module*
>>> *Date: *12 August 2016 at 11:51:14 CEST
>>> *To: *XLIFF Main List <xliff@lists.oasis-open.org 
>>> <mailto:xliff@lists.oasis-open.org>>
>>> Hi all,
>>> I started an ITS module implementation relying on my generic ITS 
>>> processor. See the processed files here
>>> https://github.com/fsasaki/its20-extractor/tree/master/sample/xliff21sample
>>> external-rules.xml contains the rules, currently only for text 
>>> analytics. inputfile.xml is an XLIFF 2.1 input file, currently with 
>>> ITS Text Analytics information. The output is as a list of XPath 
>>> expressions in nodelist-with-its-information.xml and as inline 
>>> annotations in output-inline-annotation.xml
>>> The output shows one issue which we had discussed before, see below, 
>>> taken from output-inline-annotation.xml
>>> <source>
>>>                 <itsAnn xmlns=""/>
>>>                 <sm id="sm1"
>>>                     type="itsm:generic"
>>>                     itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
>>>                     itsm:taIdentRef="http://dbpedia.org/resource/Arizona">
>>>                    <itsAnn xmlns="">
>>>                       <elem>
>>>                          <taClassRefPointer xmlns:xlf2="urn:oasis:names:tc:xliff:document:2.0"
>>>                                             xmlns:its="http://www.w3.org/2005/11/its"
>>>                                             xmlns:datc="http://example.com/datacats"
>>>                                             itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"/>
>>>                          <taIdentRefPointer xmlns:xlf2="urn:oasis:names:tc:xliff:document:2.0"
>>>                                             xmlns:its="http://www.w3.org/2005/11/its"
>>>                                             xmlns:datc="http://example.com/datacats"
>>>                                             itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>
>>>                       </elem>
>>>                    </itsAnn>
>>>                 </sm>Arizona<em startRef="sm1">
>>>                    <itsAnn xmlns=""/>
>>>                 </em>
>>>              </source>
>>> With the ITS rules file, "sm" is annotated to have the text 
>>> analytics information. But it is actually the content between sm and 
>>> em that should be annotated. I don't know how to resolve this. Maybe 
>>> we should add to the ITS module a constraint that extends general 
>>> ITS processors: if the selected element is XLIFF sm, apply the ITS 
>>> information to the next em which corresponds to sm, via the startRef 
>>> attribute. This would be a small burden on the ITS processors, but 
>>> would greatly simplify the creation of the ITS/XLIFF rules file.
>>> Thoughts?
>>> Best,
>>> Felix
>
Received on Tuesday, 23 August 2016 19:51:43 UTC