Re: ITS rules for XLIFF 2.1 from Sergey Nozhenko on 2016-08-31 (public-i18n-its-ig@w3.org from August 2016)

From: Sergey Nozhenko <sergey.nozhenko@logrusglobal.com>
Date: Wed, 31 Aug 2016 17:11:23 +0300
To: Felix Sasaki <fsasaki@w3.org>
CC: Serge Gladkoff <serge.gladkoff@gmail.com>, "public-i18n-its-ig@w3.org" <public-i18n-its-ig@w3.org>, Renat Bikmatov <renat.bikmatov@logrusglobal.com>
Message-ID: <3890f8ab-76fb-d90f-75fe-446b95321642@logrusglobal.com>
On 31.08.2016 15:38, Felix Sasaki wrote:
>> As to your implementation, note that the <source> and <target>
>> elements may appear not only in <segment>, but also in
>> <ignorable> elements.
>>
>
> good point - do you have some example file(s)?
Not at hand, but judging from the XLIFF 2.0 specification, <ignorable> 
is equivalent to <segment> as a parent of the <source> and <target> 
elements, and thus can be an ancestor of their inline content, 
including <sm> and <em>. Expressions that reference <segment> elements 
should therefore include <ignorable> as an alternative.
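
To illustrate the point, here is a minimal sketch, assuming Python's 
standard xml.etree.ElementTree. The function name content_holders and 
the sample unit are hypothetical; they are not taken from the 
implementation discussed in this thread. It collects <source>/<target> 
from both <segment> and <ignorable>, in document order:

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"

# Hypothetical sample: an annotation that starts in one segment, crosses
# an <ignorable>, and ends in the next segment.
SAMPLE = f"""<xliff xmlns="{XLF}" version="2.0" srcLang="en">
 <file id="f1">
  <unit id="u1">
   <segment><source>First<sm id="sm1" type="term"/> sentence.</source></segment>
   <ignorable><source> </source></ignorable>
   <segment><source>Second<em startRef="sm1"/> sentence.</source></segment>
  </unit>
 </file>
</xliff>"""


def content_holders(unit):
    """Return <source>/<target> children of <segment> AND <ignorable>."""
    parents = (f"{{{XLF}}}segment", f"{{{XLF}}}ignorable")
    holders = (f"{{{XLF}}}source", f"{{{XLF}}}target")
    found = []
    for child in unit:
        if child.tag in parents:
            found.extend(e for e in child if e.tag in holders)
    return found


root = ET.fromstring(SAMPLE)
unit = root.find(f".//{{{XLF}}}unit")

# A selector that only looks under <segment> misses the <ignorable> content.
segment_only = unit.findall(f"{{{XLF}}}segment/{{{XLF}}}source")
sources = content_holders(unit)
print(len(segment_only), len(sources))
```

With the sample above, the segment-only selector finds two <source> 
elements, while the helper finds three.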
>>
>> I've read old threads on the problem in the OASIS archive and 
>> learned that it has already been discussed in some detail.
>>
>
> Do you have some pointers to the discussions? It would be good to see 
> XLIFF solutions to how to process annotations in general (not only 
> XLIFF ones). I asked a related question on the XLIFF list but got no 
> reply so far.
I mean the discussion on this issue, you probably remember it: 
https://lists.oasis-open.org/archives/xliff/201410/msg00015.html
>>
>> Actually, as long as the ITS module in XLIFF is just a set of 
>> annotation attributes that follows the general XLIFF annotation 
>> rules and has nothing to do with the approach of ITS for general 
>> XML, except for the meaning of the data categories, it is OK with 
>> me. Deviating from that line, however, complicates matters. It is 
>> not clear why there should be any ITS-specific external rules, 
>> selectors and so on. For example, splitting and combining segments 
>> is quite a usual user option in translation systems. In the case of 
>> XLIFF, it changes the underlying XML structure. Suppose a selector 
>> becomes invalid due to such changes. A requirement to keep the 
>> selectors valid would make the general operation of splitting and 
>> combining segments dependent on knowledge of the specifics of an 
>> optional module, which I believe is not particularly good.
>>
>
> Interesting - so you are basically saying that one should not process 
> XLIFF that contains ITS markup with an ITS processor. The idea of the 
> rules was to make such processing possible for an ITS processor. But 
> maybe the feedback is that this should not be tried and that we should 
> only push for ITS processing by XLIFF processors (which know how to 
> deal with annotations etc. anyway)?
This depends on what "ITS processor" means. If it is a processor in 
terms of ITS for general XML, then definitely one should not. ITS for 
XML/HTML5 is designed for annotating documents that are themselves to 
be translated. But XLIFF is not itself a document to be translated; it 
is a container for data extracted from other documents. Since ITS data 
obviously refers to parts of the original documents (and not, say, to 
the values of attributes of XLIFF elements), it should use the XLIFF 
structure, not the general XML one. If references to XLIFF elements 
are necessary, native XLIFF references with the corresponding 
constraints should be used, not XPath expressions.
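
As a rough illustration of resolving an annotation through a native 
XLIFF reference (the startRef attribute of <em>) rather than through 
an XPath selector, here is a sketch, again assuming Python's standard 
xml.etree.ElementTree. The function name annotated_spans is 
hypothetical, and the input is the Arizona sample quoted further down 
in this thread. It walks the unit in document order and collects the 
text between each <sm> and its matching <em>, regardless of how the 
content is segmented:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0"
  srcLang="en" trgLang="ru" xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
 <file id="f1">
  <unit id="u1">
   <segment>
    <source><sm id="sm1" type="itsm:generic"
      itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
      itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>Arizona</source>
    <target>Аризона</target>
   </segment>
   <segment>
    <source><em startRef="sm1"/> Yeah!</source>
    <target>Да!</target>
   </segment>
  </unit>
 </file>
</xliff>"""


def local(elem):
    """Local element name without the '{namespace}' prefix."""
    return elem.tag.split("}")[-1]


def annotated_spans(unit, holder="source"):
    """Map each sm id to the text found between that sm and its em."""
    open_spans = {}  # sm id -> text pieces collected so far
    done = {}        # sm id -> text covered by the annotation

    def feed(text):
        if text:
            for pieces in open_spans.values():
                pieces.append(text)

    def walk(elem):
        feed(elem.text)
        for child in elem:
            if local(child) == "sm":
                open_spans[child.get("id")] = []
            elif local(child) == "em":
                pieces = open_spans.pop(child.get("startRef"), None)
                if pieces is not None:
                    done[child.get("startRef")] = "".join(pieces)
            else:
                walk(child)  # e.g. <mrk> or <pc> with nested content
            feed(child.tail)

    # Segment boundaries do not matter: spans stay open across them.
    for part in unit:
        if local(part) in ("segment", "ignorable"):
            for child in part:
                if local(child) == holder:
                    walk(child)
    return done


root = ET.fromstring(SAMPLE)
unit = root.find(".//{urn:oasis:names:tc:xliff:document:2.0}unit")
print(annotated_spans(unit))  # {'sm1': 'Arizona'}
```

Because the pairing relies only on id/startRef within the unit, 
splitting or combining segments does not invalidate it.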
>
> Best,
>
> Felix
>
>> Serge
>>   
>> On 31.08.2016 10:53, Felix Sasaki wrote:
>>> Hi Sergey and all,
>>>
>>> do you have any feedback on my last mail, see below? I would like to 
>>> bring your feedback to the OASIS XLIFF TC and see what they think.
>>>
>>> Thanks,
>>>
>>> Felix
>>>
>>>> On 24.08.2016 at 16:27, Felix Sasaki <fsasaki@w3.org> wrote:
>>>>
>>>> Thanks, Sergey. I added you <sergey.nozhenko@logrusglobal.com>
>>>> to the list; you can now post directly.
>>>>
>>>> Again, I think this is a general XLIFF 2.x problem, to be addressed 
>>>> by all XLIFF 2.x implementations. Overlap is not an issue here, 
>>>> since we want a sequence of nodes, not an XML tree. I changed my 
>>>> implementation (see the links below):
>>>> input, now including your example
>>>> https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/inputfile.xml
>>>> and new output
>>>> https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/output-inline-annotation.xml
>>>> and the adapted XPath at
>>>> https://github.com/fsasaki/its20-extractor/commit/7fd5f04868aad6f788fd5c4850fa26ad996cee7f
>>>> in tools/datacategories-2-xsl.xsl.
>>>>
>>>> Let me know what you think.
>>>>
>>>> Best,
>>>>
>>>> Felix
>>>>
>>>>> On 23.08.2016 at 21:50, Sergey Nozhenko 
>>>>> <sergey.nozhenko@logrusglobal.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> sm and em elements may be nested in mrk and overlap it. For example:
>>>>>
>>>>> <xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0" 
>>>>> srcLang="en" xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
>>>>>  <file id="f1">
>>>>>   <unit id="u1">
>>>>>    <segment>
>>>>>     <source><mrk id="m1" translate="no" type="term">Text1 <sm 
>>>>> id="sm1" type="itsm:generic" 
>>>>> itsm:taClassRef="http://example/ontology#Thing" 
>>>>> itsm:taIdentRef="http://example.com/ref"/>Text2.</mrk></source>
>>>>>    </segment>
>>>>>    <segment>
>>>>>     <source>Text4<em startRef="sm1"/> text5.</source>
>>>>>    </segment>
>>>>>   </unit>
>>>>>  </file>
>>>>> </xliff>
>>>>>
>>>>> Serge
>>>>>
>>>>> On 23.08.2016 19:48, Felix Sasaki wrote:
>>>>>> Apologies for the late reply, Sergey, Serge and all.
>>>>>>
>>>>>> The issue is an XLIFF issue related to the annotations mechanism:
>>>>>> http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#annotations
>>>>>> If an annotation in XLIFF is represented with sm and em, the 
>>>>>> application has to find the content relating to the annotation.
>>>>>>
>>>>>> I think this is doable, both for general XLIFF annotations (e.g. 
>>>>>> of terms) and ITS annotations. I updated my implementation with 
>>>>>> an XPath expression that has a larger search space than the 
>>>>>> previous one. The new expression searches for the corresponding 
>>>>>> em tag in the following nodes that have the same parent node type 
>>>>>> (e.g. all source elements or all target elements).
>>>>>>
>>>>>> It seems to work, see the more complex input
>>>>>> https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/inputfile.xml
>>>>>> and output
>>>>>> https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/output-inline-annotation.xml
>>>>>> and the adapted XPath at
>>>>>> https://github.com/fsasaki/its20-extractor/commit/62428b4484df7a073be3c2c0033e2a389dc83350
>>>>>> in tools/datacategories-2-xsl.xsl.
>>>>>>
>>>>>> I’m happy to work on this more if you give me more XLIFF 
>>>>>> annotation samples.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Felix
>>>>>>
>>>>>>
>>>>>>> On 18.08.2016 at 11:32, Sergey Nozhenko 
>>>>>>> <sergey.nozhenko@logrusglobal.com> wrote:
>>>>>>>
>>>>>>> How about this:
>>>>>>>
>>>>>>> <xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0" srcLang="en" trgLang="ru"
>>>>>>>   xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
>>>>>>>   <file id="f1">
>>>>>>>    <unit id="u1">
>>>>>>>     <segment>
>>>>>>>      <source><sm id="sm1" type="itsm:generic" itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
>>>>>>>       itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>Arizona</source>
>>>>>>>      <target>Аризона</target>
>>>>>>>     </segment>
>>>>>>>     <segment>
>>>>>>>      <source><em startRef="sm1"/> Yeah!</source>
>>>>>>>      <target>Да!</target>
>>>>>>>     </segment>
>>>>>>>    </unit>
>>>>>>>   </file>
>>>>>>> </xliff>
>>>>>>>
>>>>>>> Serge
>>>>>>>
>>>>>>> *From:* Felix Sasaki <fsasaki@w3.org>
>>>>>>> *Sent:* 18 August 2016, 8:18
>>>>>>> *To:* Serge Gladkoff <serge.gladkoff@gmail.com>
>>>>>>> *Cc:* public-i18n-its-ig@w3.org; Renat Bikmatov 
>>>>>>> <renat.bikmatov@logrusglobal.com>; Sergey Nozhenko 
>>>>>>> <sergey.nozhenko@logrusglobal.com>
>>>>>>> *Subject:* Re: ITS rules for XLIFF 2.1
>>>>>>>
>>>>>>>
>>>>>>>> On 17.08.2016 at 23:08, Serge Gladkoff 
>>>>>>>> <serge.gladkoff@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello Felix,
>>>>>>>> I am sorry to say this, but our developers believe that this is 
>>>>>>>> a clear case where ITS hits rock bottom, so to speak.
>>>>>>>> The function of the <sm>/<em> tags is to mark up areas that 
>>>>>>>> cannot be annotated with a single tag because this would result 
>>>>>>>> in an invalid XML file. This happens when the markup conflicts 
>>>>>>>> with other tags, for example with segmentation.
>>>>>>>> In such cases inheritance does not work, because the beginning 
>>>>>>>> of the annotated span may be inside one tag and the end inside 
>>>>>>>> another, possibly even at a different level.
>>>>>>>
>>>>>>> Indeed - that was exactly my point.
>>>>>>>
>>>>>>>> How could one describe the distribution of ITS tags in such cases?
>>>>>>>
>>>>>>> By keeping your ITS processor (including inheritance behavior) 
>>>>>>> as is, and then specifying additional processing for sm, as 
>>>>>>> defined below. My main point was that this does not change the 
>>>>>>> behavior of a conformant ITS processor. It is *additional* behavior.
>>>>>>>
>>>>>>>> Indeed, it is far from clear.
>>>>>>>> I wouldn't call this “a small burden”.
>>>>>>>
>>>>>>> I implemented this as an additional behavior of my ITS 
>>>>>>> processor. See
>>>>>>> https://github.com/fsasaki/its20-extractor/commit/4816b29f8b7010f307c5dad98b1ab4aa92c4ae70
>>>>>>> the changes to datacategories-2-xsl.xsl. The change was 4 
>>>>>>> lines of code. I am happy to look at your code with your 
>>>>>>> developers, if that helps to lower the burden.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Felix
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Serge
>>>>>>>> *From:* Felix Sasaki <fsasaki@w3.org>
>>>>>>>> *Sent:* Tuesday, August 16, 2016 7:20 PM
>>>>>>>> *To:* public-i18n-its-ig@w3.org
>>>>>>>> *Subject:* ITS rules for XLIFF 2.1
>>>>>>>> Hi all,
>>>>>>>> in the OASIS TC, the support of ITS in XLIFF 2.1 is currently 
>>>>>>>> being discussed.
>>>>>>>> As part of the discussion, an ITS rules file is being developed. 
>>>>>>>> The file should allow general ITS processors to work with XLIFF 
>>>>>>>> 2.x documents. There is one issue: XLIFF has the elements "sm" 
>>>>>>>> and "em", which are empty markers. (ITS or any other) 
>>>>>>>> information then relates to the content between the start and 
>>>>>>>> end marker.
>>>>>>>> Below is a mail I sent to the XLIFF list to find a workaround. 
>>>>>>>> This would put a (small) burden on ITS processors, to deal with 
>>>>>>>> the sm / em elements. See below; I tried this with my general 
>>>>>>>> XSLT implementation. What do people think about this, esp. 
>>>>>>>> implementers?
>>>>>>>> Best,
>>>>>>>> Felix
>>>>>>>>
>>>>>>>>
>>>>>>>> Begin forwarded message:
>>>>>>>> *From:* Felix Sasaki <felix@sasakiatcf.com>
>>>>>>>> *Subject: Implementation of XLIFF 2.1 - ITS module*
>>>>>>>> *Date:* 12 August 2016 at 11:51:14 CEST
>>>>>>>> *To:* XLIFF Main List <xliff@lists.oasis-open.org>
>>>>>>>> Hi all,
>>>>>>>> I started an ITS module implementation relying on my generic 
>>>>>>>> ITS processor. See the processed files here
>>>>>>>> https://github.com/fsasaki/its20-extractor/tree/master/sample/xliff21sample
>>>>>>>> external-rules.xml contains the rules, currently only for text 
>>>>>>>> analytics. inputfile.xml is an XLIFF 2.1 input file, currently 
>>>>>>>> with ITS Text Analytics information. The output is as a list of 
>>>>>>>> XPath expressions in nodelist-with-its-information.xml and as 
>>>>>>>> inline annotations in output-inline-annotation.xml
>>>>>>>> The output shows one issue which we had discussed before, see 
>>>>>>>> below, taken from output-inline-annotation.xml
>>>>>>>> <source>
>>>>>>>>                 <itsAnn xmlns=""/>
>>>>>>>>                 <sm id="sm1"
>>>>>>>>                     type="itsm:generic"
>>>>>>>>                     itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
>>>>>>>>                     itsm:taIdentRef="http://dbpedia.org/resource/Arizona">
>>>>>>>>                    <itsAnn xmlns="">
>>>>>>>>                       <elem>
>>>>>>>>                          <taClassRefPointer xmlns:xlf2="urn:oasis:names:tc:xliff:document:2.0"
>>>>>>>>                                             xmlns:its="http://www.w3.org/2005/11/its"
>>>>>>>>                                             xmlns:datc="http://example.com/datacats"
>>>>>>>>                                             itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"/>
>>>>>>>>                          <taIdentRefPointer xmlns:xlf2="urn:oasis:names:tc:xliff:document:2.0"
>>>>>>>>                                             xmlns:its="http://www.w3.org/2005/11/its"
>>>>>>>>                                             xmlns:datc="http://example.com/datacats"
>>>>>>>>                                             itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>
>>>>>>>>                       </elem>
>>>>>>>>                    </itsAnn>
>>>>>>>>                 </sm>Arizona<em startRef="sm1">
>>>>>>>>                    <itsAnn xmlns=""/>
>>>>>>>>                 </em>
>>>>>>>>              </source>
>>>>>>>> With the ITS rules file, "sm" is annotated as having the text 
>>>>>>>> analytics information. But it is actually the content between 
>>>>>>>> sm and em that should be annotated. I don't know how to resolve 
>>>>>>>> this. Maybe we should add to the ITS module a constraint that 
>>>>>>>> extends general ITS processors: if the selected element is an 
>>>>>>>> XLIFF sm, apply the ITS information up to the next em that 
>>>>>>>> corresponds to the sm via the startRef attribute. This would be 
>>>>>>>> a small burden on ITS processors, but would greatly simplify 
>>>>>>>> the creation of the ITS/XLIFF rules file.
>>>>>>>> Thoughts?
>>>>>>>> Best,
>>>>>>>> Felix
Received on Wednesday, 31 August 2016 14:12:19 UTC