RE: ITS rules for XLIFF 2.1 from Yves Savourel on 2016-09-04 (public-i18n-its-ig@w3.org from September 2016)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Sun, 4 Sep 2016 08:08:34 -0600
To: "'Sergey Nozhenko'" <sergey.nozhenko@logrusglobal.com>, "'Felix Sasaki'" <fsasaki@w3.org>
CC: "'Serge Gladkoff'" <serge.gladkoff@gmail.com>, <public-i18n-its-ig@w3.org>, "'Renat Bikmatov'" <renat.bikmatov@logrusglobal.com>
Message-ID: <001701d206b5$d32e9ba0$798bd2e0$@enlaso.com>
Hi all,

 

As much as I agree with Serge that it’s very unlikely someone would want to process ITS data in an XLIFF file with an ITS-only processor, the possibility exists. For example, David mentioned the case of someone gathering ITS data from various sources (e.g. localization quality issues) who may have only an ITS-only tool to do this. In theory it should be possible.

 

The sm/em elements are a difficult problem for ITS 2.0 processors to resolve, as Felix’s experiments demonstrated. But not all files have sm/em, and some of those that do may be converted to mrk syntax.

 

So, I think having the rules file is still useful.

 

I just think we should not spend more time on trying to resolve the sm/em special case; and a note in the rules file explaining its limitation should be good enough.

 

As for a longer term view: XLIFF’s sm/em may not be the only use case where ITS 2.0 has trouble. Like Serge noted, that may need to be addressed at the ITS level, some day.

 

Cheers,

-yves

 

From: Sergey Nozhenko [mailto:sergey.nozhenko@logrusglobal.com] 
Sent: Sunday, September 4, 2016 7:26 AM
To: Felix Sasaki <fsasaki@w3.org>; Yves Savourel <ysavourel@enlaso.com>
Cc: Serge Gladkoff <serge.gladkoff@gmail.com>; public-i18n-its-ig@w3.org; Renat Bikmatov <renat.bikmatov@logrusglobal.com>
Subject: RE: ITS rules for XLIFF 2.1

 

Why in the world anyone would even want to process XLIFF with an ITS processor, I cannot understand. XLIFF does not follow the ITS specification for XML files, as is clearly shown by the fact that the processor cannot process it correctly without an auxiliary rules file and hacking of the code. Well, let’s suppose there is a good reason. Then the question arises: is the ability to process XLIFF files a unique feature of one particular ITS processor, or is it something that all ITS processors are required to implement? In the latter case, it should be described formally in the ITS specification, so that the developers of other ITS processors can implement it without having to dig into the XSLT-based example.

 

Serge

 

From: Felix Sasaki <fsasaki@w3.org>
Sent: September 2, 2016, 7:37
To: Yves Savourel <ysavourel@enlaso.com>
Cc: Sergey Nozhenko <sergey.nozhenko@logrusglobal.com>; Serge Gladkoff <serge.gladkoff@gmail.com>; public-i18n-its-ig@w3.org; Renat Bikmatov <renat.bikmatov@logrusglobal.com>
Subject: Re: ITS rules for XLIFF 2.1

 

 

Hi Yves, 

 

On 01.09.2016 at 13:59, Yves Savourel <ysavourel@enlaso.com> wrote:

 

I see. Thanks for clarifying.

 

In my opinion if it’s doable with XPath2 then it’s enough to provide those rules.

 

 

Looking into this further, I think it is not doable with XPath 2.0.

 

You would need XPath 2.0 expressions that select an "sm" element and then all subsequent nodes up to the "em" element whose reference matches the "sm" element's "id" attribute. That can be done with an XPath 2.0 user-defined function or an XPath 2.0 "for" expression.
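
The selection described above (everything from an "sm" up to its matching "em") can be sketched outside XPath. The following is only a minimal Python illustration using the standard library, not the XPath 2.0 expression itself and not the code of any implementation mentioned in this thread. The helper names walk and annotated_text and the cut-down sample unit are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"  # XLIFF 2.x core namespace

def walk(elem):
    """Yield ("start", element) and ("text", string) events in document order."""
    yield ("start", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)

def annotated_text(unit, sm_id, container="source"):
    """Return the text between <sm id=sm_id> and its matching <em>, or None."""
    parts, inside = [], False
    # Scan every <source> (or <target>) of the unit, in document order.
    for content in unit.iter(f"{{{XLF}}}{container}"):
        for kind, value in walk(content):
            if kind == "text":
                if inside:
                    parts.append(value)
            elif value.tag == f"{{{XLF}}}sm" and value.get("id") == sm_id:
                inside = True
            elif value.tag == f"{{{XLF}}}em" and value.get("startRef") == sm_id:
                return "".join(parts)
    return None  # no matching <em> found in this unit

# Hypothetical sample: the marked span starts inside <mrk> and ends in the next segment.
SAMPLE = """<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="u1">
 <segment><source><mrk id="m1" type="term">Text1 <sm id="sm1"/>Text2.</mrk></source></segment>
 <segment><source>Text4<em startRef="sm1"/> text5.</source></segment>
</unit>"""

print(annotated_text(ET.fromstring(SAMPLE), "sm1"))  # prints Text2.Text4
```

The sketch treats the content as a flat sequence of text nodes rather than a tree, which is why the overlap with "mrk" and the segment boundary do not get in the way.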

 

Now, the issue is that ITS selectors allow only relative or absolute path expressions, that is, no functions or "for" expressions. I tried the XPath 2.0 expression with my XSLT-based implementation and it breaks, because that implementation can only process relative or absolute path expressions (as part of XSLT template "match" attributes). If I needed to change the implementation to process XPath 2.0 functions or "for" expressions, it would be a complete re-implementation.

 

So for me it is the other way round: the burden was smaller to implement the additional behavior on top of the ITS behavior. But I agree with your point that this is not a good approach: the result is a processor which is not an ITS processor and also not an XLIFF processor - something fuzzy in-between. We don’t want that.

 

So one could now

 

1) drop the idea of an ITS rules file and say: you can process ITS in XLIFF only with an XLIFF processor

2) have an ITS rules file that covers the XLIFF "mrk" case and say we cannot cover "sm" / "em"

3) have instead (or in addition to 2)) some Schematron checks in the advanced validation to ensure that sm is always followed by em. But maybe this is already the case and we don’t need this for the ITS case.

4) anything else?
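
The pairing check mentioned in option 3 can be sketched as follows. This is a Python stand-in written only for illustration, not a Schematron rule and not part of any XLIFF validation artifact. The function name unmatched_markers and both sample units are assumptions.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"  # XLIFF 2.x core namespace

def unmatched_markers(unit, container="source"):
    """Return (ids of <sm> with no later <em>, startRefs of <em> with no earlier <sm>)."""
    open_ids, orphan_refs = [], []
    for content in unit.iter(f"{{{XLF}}}{container}"):
        for node in content.iter():  # document order within one <source>/<target>
            if node.tag == f"{{{XLF}}}sm":
                open_ids.append(node.get("id"))
            elif node.tag == f"{{{XLF}}}em":
                ref = node.get("startRef")
                if ref in open_ids:
                    open_ids.remove(ref)
                else:
                    orphan_refs.append(ref)
    return open_ids, orphan_refs

# Hypothetical samples: GOOD closes sm1 in the next segment, BAD never closes it.
GOOD = """<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="u1">
 <segment><source><sm id="sm1"/>Arizona</source></segment>
 <segment><source><em startRef="sm1"/> Yeah!</source></segment>
</unit>"""
BAD = """<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="u2">
 <segment><source><sm id="sm1"/>Arizona</source></segment>
 <segment><source> Yeah!</source></segment>
</unit>"""

print(unmatched_markers(ET.fromstring(GOOD)))
print(unmatched_markers(ET.fromstring(BAD)))
```

The check runs per unit and per container ("source" or "target"), on the assumption that a start marker and its end marker stay within the same unit.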

 

 

Best,

 

Felix





 

As a developer, if I had to update an ITS processor to allow support for XLIFF sm/em, I’d prefer to try upgrading the XPath support for the processor to XPath2 rather than try to code some special behavior for sm/em.

 

Cheers,

-ys

 

 

From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Thursday, September 1, 2016 5:36 AM
To: Yves Savourel <ysavourel@enlaso.com>
Cc: Sergey Nozhenko <sergey.nozhenko@logrusglobal.com>; Serge Gladkoff <serge.gladkoff@gmail.com>; public-i18n-its-ig@w3.org; Renat Bikmatov <renat.bikmatov@logrusglobal.com>
Subject: Re: ITS rules for XLIFF 2.1

 

Hi Yves,

 

On 01.09.2016 at 13:03, Yves Savourel <ysavourel@enlaso.com> wrote:

 

Hi Felix,

 

I believe you are referring to this:

 

*  Maybe we should add to the ITS module the constraint that extends general ITS processors: if the selected element is XLIFF sm, apply the ITS information to the next em which corresponds to sm, via the startRef attribute. This would be a small burden on the ITS processors, but would greatly simplify the creation of the ITS/XLIFF rules file.

 

I’m not sure I understand it.

An XLIFF+ITS processor knows about sm/em and already works that way. So I don’t think we need to add such information in the ITS module as it’s not ITS specific.

 

Did you mean “should add to the ITS processor the constraint…” 

 

Yes, I meant this constraint and I was wondering if we should try to express this with the ITS rules file. That would mean that the rules file needs XPath 2.0. Alternatively, we could keep the rules file simple and XPath 1.0 based (see below), and describe the constraint separately. The W3C ITS group could publish a small, informative document, "Processing XLIFF 2.x with ITS processors", that contains the above paragraph and the warnings about this processing gathered in this thread (e.g. "best thing is to use an XLIFF processor ….").

 

Best,

 

Felix

 

(Not the “ITS module”).

 

Cheers,

-ys

 

 

From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Thursday, September 1, 2016 12:44 AM
To: Yves Savourel <ysavourel@enlaso.com>
Cc: Sergey Nozhenko <sergey.nozhenko@logrusglobal.com>; Serge Gladkoff <serge.gladkoff@gmail.com>; public-i18n-its-ig@w3.org; Renat Bikmatov <renat.bikmatov@logrusglobal.com>
Subject: Re: ITS rules for XLIFF 2.1

 

Thanks for the feedback, Yves and Sergey. I agree with what you said. On 

 

But we should not have anything in the ITS module that would exist solely for helping an ITS-only processor to work.

 

XLIFF implementers should worry just about implementing the ITS module.

 

I agree. I just don’t know yet what to conclude with regards to the rules file. If we keep it simple like it is currently, see below:

 


<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0"
    xmlns:xlf2="urn:oasis:names:tc:xliff:document:2.0" queryLanguage="xpath" xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
    <its:textAnalysisRule selector="//xlf2:mrk[@type='itsm:generic' and (@itsm:taClassRef or @itsm:taIdentRef)]" taClassRefPointer="@itsm:taClassRef" taIdentRefPointer="@itsm:taIdentRef"/>
    <its:textAnalysisRule selector="//xlf2:sm[@type='itsm:generic' and (@itsm:taClassRef or @itsm:taIdentRef)]" taClassRefPointer="@itsm:taClassRef" taIdentRefPointer="@itsm:taIdentRef"/>
</its:rules>

it needs the additional machinery discussed in this thread. If we make it more complex (the XPath 2.0 stuff), it will fulfill the 

"A nice ready-to-use reference that an ITS-only processor can use out of the box (as long as it supports XPath 2)"

requirement, but only for ITS processors that support XPath 2.0. Not sure how to move forward.
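
To make the limitation of the simple rules file concrete, here is a minimal Python sketch that mimics the two selectors of the rules file with plain standard-library filtering. It is not an ITS processor and not the XSLT-based implementation. The function name select_text_analysis and the sample unit are assumptions for the example.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"  # XLIFF 2.x core namespace
ITSM = "urn:oasis:names:tc:xliff:itsm:2.1"     # ITS module namespace

def select_text_analysis(root):
    """Mimic the two textAnalysisRule selectors: mrk and sm carrying TA attributes."""
    hits = []
    for name in ("mrk", "sm"):
        for node in root.iter(f"{{{XLF}}}{name}"):
            ta_class = node.get(f"{{{ITSM}}}taClassRef")
            ta_ident = node.get(f"{{{ITSM}}}taIdentRef")
            if node.get("type") == "itsm:generic" and (ta_class or ta_ident):
                # Record the text content of the selected node itself.
                selected_text = "".join(node.itertext())
                hits.append((name, node.get("id"), ta_class, ta_ident, selected_text))
    return hits

# Hypothetical sample: the annotation is carried by an sm/em pair across two segments.
SAMPLE = """<unit xmlns="urn:oasis:names:tc:xliff:document:2.0"
      xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1" id="u1">
 <segment><source><sm id="sm1" type="itsm:generic"
   itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
   itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>Arizona</source></segment>
 <segment><source><em startRef="sm1"/> Yeah!</source></segment>
</unit>"""

for hit in select_text_analysis(ET.fromstring(SAMPLE)):
    print(hit)
```

For the "sm" hit the selected text is the empty string, because the selector matches the empty marker itself and not the content up to the matching "em". For "mrk" the same filter would return the enclosed text.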

 

Best,

 

Felix

 

On 31.08.2016 at 16:18, Yves Savourel <ysavourel@enlaso.com> wrote:

 

Hi all,

 

I think Serge’s feedback is a good illustration of why we need to make sure we keep very separate the description of anything related to processing XLIFF with a pure ITS processor from the ITS Module itself.

 

The first choice of processing an XLIFF file should always be to use an XLIFF processor, including for the ITS module.

To me the ITS rules file for XLIFF and any guideline on how to deal with <sm>/<em> is just supplemental information that offers two things:

-   A way of “validating” (in the large sense of “confirming”) that the module makes sense and works.

-   A nice ready-to-use reference that an ITS-only processor can use out of the box (as long as it supports XPath 2)

 

But we should not have anything in the ITS module that would exist solely for helping an ITS-only processor to work.

 

XLIFF implementers should worry just about implementing the ITS module.

 

Cheers,

-yves

 

 

From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Wednesday, August 31, 2016 6:39 AM
To: Sergey Nozhenko <sergey.nozhenko@logrusglobal.com>
Cc: Serge Gladkoff <serge.gladkoff@gmail.com>; public-i18n-its-ig@w3.org; Renat Bikmatov <renat.bikmatov@logrusglobal.com>
Subject: Re: ITS rules for XLIFF 2.1

 

Dear Sergey,

 

On 31.08.2016 at 14:19, Sergey Nozhenko <sergey.nozhenko@logrusglobal.com> wrote:

 

I apologize for the late reply.

 

 

NP at all.




As to your implementation, note that the <source> and <target> elements may appear not only in the <segment>, but also in the <ignorable> elements.

 

 

good point - do you have some example file(s)?




I've read old threads on the problem in the OASIS archive and learned that it was already discussed in some detail. 

 

Do you have some pointers to the discussions? It would be good to see XLIFF solutions to how to process annotations in general (not only XLIFF ones). I asked a related question on the XLIFF list but got no reply so far.




Actually, as long as the ITS module in XLIFF is just a set of annotation attributes that follows the general XLIFF annotation rules and has nothing to do with the ITS approach for general XML, except for the meaning of the data categories, it is OK with me. Deviating from that line, however, complicates the matter. It is not clear why there should be any specific external ITS rules, selectors and so on. For example, splitting/combining segments is quite a usual user option in translation systems. In the case of XLIFF, it leads to changes in the underlying XML structure. Suppose there is a selector that becomes invalid due to such changes. A requirement to keep the selectors valid would make the general operation of splitting/combining segments dependent on knowledge of the specifics of an optional module, which I believe is not particularly good.
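
A small illustration of the selector concern, as a Python sketch under stated assumptions: the positional selector, the two sample units and the function name selected_text are all hypothetical and are not taken from any rules file discussed in this thread.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}
SELECTOR = "x:segment[2]/x:source"  # hypothetical selector that depends on segment position

# The same unit before and after a user splits the first segment in two.
BEFORE = """<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="u1">
 <segment><source>First sentence. Second sentence. </source></segment>
 <segment><source>Third sentence.</source></segment>
</unit>"""
AFTER = """<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="u1">
 <segment><source>First sentence. </source></segment>
 <segment><source>Second sentence. </source></segment>
 <segment><source>Third sentence.</source></segment>
</unit>"""

def selected_text(unit_xml):
    """Return the stripped text of the node the selector picks, or None if nothing matches."""
    node = ET.fromstring(unit_xml).find(SELECTOR, NS)
    return node.text.strip() if node is not None else None

print(selected_text(BEFORE))  # Third sentence.
print(selected_text(AFTER))   # Second sentence.
```

The selector still matches after the split, but it now points at different content, so whatever information it carried is silently attached to the wrong text.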

 

Interesting - so you are basically saying that one should not process XLIFF that contains ITS markup with an ITS processor. The idea of the rules was to make such processing possible for an ITS processor. But maybe the feedback is that this should not be tried and that we should only push for ITS processing by XLIFF processors (which know how to deal with annotations etc. anyway)?

 

Best,

 

Felix






Serge
 

On 31.08.2016 10:53, Felix Sasaki wrote:

Hi Sergey and all, 

 

do you have any feedback on my last mail, see below? I would like to bring your feedback to the OASIS XLIFF TC and see what they think.

 

Thanks,

 

Felix

 

On 24.08.2016 at 16:27, Felix Sasaki <fsasaki@w3.org> wrote:

 

Thanks, Sergey. I added you 

<sergey.nozhenko@logrusglobal.com>

to the list, you can now post directly. 

 

Again I think this is a general XLIFF 2.x problem, to be addressed by all XLIFF 2.x implementations. With overlap this is no issue since we want a sequence of nodes, not an XML tree. I changed my implementation (see the links below).

 

input, now including your example

https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/inputfile.xml
and new output
https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/output-inline-annotation.xml
and the adapted XPath at
https://github.com/fsasaki/its20-extractor/commit/7fd5f04868aad6f788fd5c4850fa26ad996cee7f

in tools/datacategories-2-xsl.xsl.

 

Let me know what you think.

 

Best,

 

Felix

 

On 23.08.2016 at 21:50, Sergey Nozhenko <sergey.nozhenko@logrusglobal.com> wrote:

 

Hi,

sm and em elements may be nested in mrk and overlap it. For example:

<xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0" srcLang="en" xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
 <file id="f1">
  <unit id="u1">
   <segment>
    <source><mrk id="m1" translate="no" type="term">Text1 <sm id="sm1" type="itsm:generic" itsm:taClassRef="http://example/ontology#Thing" itsm:taIdentRef="http://example.com/ref"/>Text2.</mrk></source>
   </segment>
   <segment>
    <source>Text4<em startRef="sm1"/> text5.</source>
   </segment>
  </unit>
 </file>
</xliff>

Serge
 

On 23.08.2016 19:48, Felix Sasaki wrote:

Apologies for the late reply, Sergey, Serge and all.  

 

The issue is an XLIFF issue related to the annotations mechanism

http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#annotations

if an annotation in XLIFF is represented with sm and em, the application has to find the content relating to the annotation.

 

I think this is doable, both for general XLIFF annotations (e.g. of terms) and ITS annotations. I updated my implementation with an XPath expression that has a larger search space than the previous one. The new expression searches for the corresponding em tag in the following nodes that have the same parent node type (e.g. all source elements or all target elements).

 

It seems to work, see the more complex input

https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/inputfile.xml

and output

https://github.com/fsasaki/its20-extractor/blob/master/sample/xliff21sample/output-inline-annotation.xml

and the adapted XPath at

https://github.com/fsasaki/its20-extractor/commit/62428b4484df7a073be3c2c0033e2a389dc83350

in tools/datacategories-2-xsl.xsl.

 

I’m happy to work on this more if you give me more XLIFF annotation samples.

 

Best,

 

Felix

 

 

On 18.08.2016 at 11:32, Sergey Nozhenko <sergey.nozhenko@logrusglobal.com> wrote:

 

How about this:

 

<xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0" srcLang="en" trgLang="ru"
 xmlns:itsm="urn:oasis:names:tc:xliff:itsm:2.1">
 <file id="f1">
  <unit id="u1">
   <segment>
    <source><sm id="sm1" type="itsm:generic" itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
     itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>Arizona</source>
    <target>Аризона</target>
   </segment>
   <segment>
    <source><em startRef="sm1"/> Yeah!</source>
    <target>Да!</target>
   </segment>
  </unit>
 </file>

</xliff>

 

Serge

 

From: Felix Sasaki <fsasaki@w3.org>
Sent: August 18, 2016, 8:18
To: Serge Gladkoff <serge.gladkoff@gmail.com>
Cc: public-i18n-its-ig@w3.org; Renat Bikmatov <renat.bikmatov@logrusglobal.com>; Sergey Nozhenko <sergey.nozhenko@logrusglobal.com>
Subject: Re: ITS rules for XLIFF 2.1

 

 

On 17.08.2016 at 23:08, Serge Gladkoff <serge.gladkoff@gmail.com> wrote:

 

Hello Felix,

 

I am sorry to say this but our developers believe that this is a clear case where ITS hit rock-bottom, so to speak.

 

The function of <sm>/<em> tags is to mark up areas which cannot be annotated by one tag because this would result in an invalid XML file. This happens when the markup conflicts with other tags, for example with segmentation. 

 

In such cases inheritance does not work because the beginning of the unit may find itself inside one tag, and the end – inside another, and even on different levels.

 

Indeed - that was exactly my point. 






 

How could one describe the distribution of ITS tags in such cases?

 

By keeping your ITS processor (including inheritance behavior) as is, and then specifying additional processing for sm, as defined below. My main point was that this does not change the behavior of a conformant ITS processor. It is *additional* behavior. 






Indeed, it is far from clear.

 

I wouldn't call this “a small burden”.

 

I implemented this as an additional behavior of my ITS processor. See 

https://github.com/fsasaki/its20-extractor/commit/4816b29f8b7010f307c5dad98b1ab4aa92c4ae70

the changes to datacategories-2-xsl.xsl. The changes were 4 lines of code. I am happy to look at your code with your developers, if that helps to lower the burden.

 

Best,

 

Felix






 

Regards,

Serge

 

 

 

From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Tuesday, August 16, 2016 7:20 PM
To: public-i18n-its-ig@w3.org
Subject: ITS rules for XLIFF 2.1

 

Hi all,

 

in the OASIS TC, currently the support of ITS in XLIFF 2.1 is being discussed.

 

As part of the discussion an ITS rules file is being developed. The file should allow general ITS processors to work with XLIFF 2.x documents. There is one issue: XLIFF has the elements "sm" and "em", which are empty markers. (ITS or any other) information then relates to the content between the start and end marker.

 

Below is a mail I had sent to the XLIFF list to find a workaround. This would put a (small) burden on ITS processors, to deal with the sm / em elements. See below; I tried this with my general XSLT implementation. What do people think about this, esp. implementers?

 

Best,

 

Felix 

 

 

 

Begin forwarded message:

 

From: Felix Sasaki <felix@sasakiatcf.com>

Subject: Implementation of XLIFF 2.1 - ITS module

Date: August 12, 2016 at 11:51:14 CEST

To: XLIFF Main List <xliff@lists.oasis-open.org>

 

Hi all,

 

I started an ITS module implementation relying on my generic ITS processor. See the processed files here

https://github.com/fsasaki/its20-extractor/tree/master/sample/xliff21sample

external-rules.xml contains the rules, currently only for text analytics. inputfile.xml is an XLIFF 2.1 input file, currently with ITS Text Analytics information. The output is as a list of XPath expressions in nodelist-with-its-information.xml and as inline annotations in output-inline-annotation.xml

 

The output shows one issue which we had discussed before, see below, taken from output-inline-annotation.xml

 

<source>
               <itsAnn xmlns=""/>
               <sm id="sm1"
                   type="itsm:generic"
                   itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"
                   itsm:taIdentRef="http://dbpedia.org/resource/Arizona">
                  <itsAnn xmlns="">
                     <elem>
                        <taClassRefPointer xmlns:xlf2="urn:oasis:names:tc:xliff:document:2.0"
                                           xmlns:its="http://www.w3.org/2005/11/its"
                                           xmlns:datc="http://example.com/datacats"
                                           itsm:taClassRef="http://nerd.eurecom.fr/ontology#Place"/>
                        <taIdentRefPointer xmlns:xlf2="urn:oasis:names:tc:xliff:document:2.0"
                                           xmlns:its="http://www.w3.org/2005/11/its"
                                           xmlns:datc="http://example.com/datacats"
                                           itsm:taIdentRef="http://dbpedia.org/resource/Arizona"/>
                     </elem>
                  </itsAnn>
               </sm>Arizona<em startRef="sm1">
                  <itsAnn xmlns=""/>
               </em>
            </source>

 

With the ITS rules file, "sm" is annotated to have the text analytics information. But it is actually the content between sm and em that should be annotated. I don’t know how to resolve this. Maybe we should add to the ITS module the constraint that extends general ITS processors: if the selected element is XLIFF sm, apply the ITS information to the next em which corresponds to sm, via the startRef attribute. This would be a small burden on the ITS processors, but would greatly simplify the creation of the ITS/XLIFF rules file. 

 

Thoughts?

 

Best,

 

Felix
Received on Sunday, 4 September 2016 14:09:10 UTC