RE: [xliff] ITS scope with sm/em from Estreen, Fredrik on 2014-10-09 (public-i18n-its-ig@w3.org from October 2014)

From: Estreen, Fredrik <Fredrik.Estreen@lionbridge.com>
Date: Thu, 9 Oct 2014 22:28:18 +0000
To: Yves Savourel <ysavourel@enlaso.com>, 'Felix Sasaki' <fsasaki@w3.org>
CC: XLIFF Main List <xliff@lists.oasis-open.org>, 'public-i18n-its-ig' <public-i18n-its-ig@w3.org>
Message-ID: <DE61C801D4C11842BC3FF36757FF3FC501A0DE3CF8@BIL-EXC11-01.corpnet.liox.org>
I left that complication out of my initial post; I rarely think in terms of <pc></pc>. This can be solved by lowering the <pc> into an <sc/>/<ec/> pair. Personally I'm leaning towards supporting only the latter form internally and transforming to and from <pc> at the edge, as I must support the non-spanning form anyway. So we could modify the algorithm to do that.
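
A minimal sketch of that lowering, in Python, for illustration only. It assumes single-quoted attributes as in the examples in this thread, handles only the id attribute of <pc>, and uses a hypothetical helper name (lower_pc). A real implementation would more likely work on a parsed tree and would also have to map the remaining <pc> attributes.

```python
import re

# Matches either an opening <pc id='...'> (group 1 = id) or a closing </pc>.
# Assumption: single-quoted attributes and no attribute other than id.
PC_TAG = re.compile(r"<pc\s+id='([^']*)'\s*>|</pc>")


def lower_pc(content: str) -> str:
    """Rewrite each <pc id='x'>...</pc> as <sc id='x'/>...<ec startRef='x'/>."""
    out = []
    open_ids = []  # ids of the <pc> elements that are currently open
    pos = 0
    for m in PC_TAG.finditer(content):
        out.append(content[pos:m.start()])
        if m.group(1) is None:
            # Closing tag: it pairs with the most recently opened <pc>.
            out.append(f"<ec startRef='{open_ids.pop()}'/>")
        else:
            open_ids.append(m.group(1))
            out.append(f"<sc id='{m.group(1)}'/>")
        pos = m.end()
    out.append(content[pos:])
    return "".join(out)


print(lower_pc("French <pc id='2'>Canadian hockey</pc>."))
```

Once every code is in the empty-element form, <sc/>/<ec/> and <sm/>/<em/> can all be treated as flat start/end tokens by the same algorithm.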

Is it even possible to have overlapping ITS markup in a well-formed XML document? If it is, we would use the same mechanism in the transform algorithm. If not, it would be ill-formed to do it and should be reported as an error or discarded, depending on context. In any case the behavior should be defined in the XLIFF mapping module.
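
Whichever behavior is chosen, a tool first has to detect the overlap. A rough Python sketch of such a check on <sm/>/<em/> pairs follows. It assumes single-quoted attributes and that both markers of a pair are in the content being checked; the function name crossing_spans is hypothetical and not part of any XLIFF or ITS API.

```python
import re

# Group 1 is set for <sm id='...'/>, group 2 for <em startRef='...'/>.
MARKER = re.compile(r"<sm\s+id='([^']*)'[^>]*?/>|<em\s+startRef='([^']*)'\s*/>")


def crossing_spans(content: str) -> list:
    """Return (id_a, id_b) for each pair of annotations that cross,
    i.e. b starts inside a but ends after a has ended."""
    open_at = {}  # id -> ordinal position of its <sm/>
    spans = []    # (id, start ordinal, end ordinal)
    for i, m in enumerate(MARKER.finditer(content)):
        if m.group(1) is not None:
            open_at[m.group(1)] = i
        elif m.group(2) in open_at:
            spans.append((m.group(2), open_at.pop(m.group(2)), i))
    return [
        (a[0], b[0])
        for a in spans
        for b in spans
        if a[1] < b[1] < a[2] < b[2]
    ]


sample = "<sm id='1'/>a<sm id='2'/>b<em startRef='1'/>c<em startRef='2'/>"
print(crossing_spans(sample))
```

Properly nested annotations give an empty list, so the caller can report an error or discard a marker only when the list is non-empty.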

Regards,
Fredrik Estreen
________________________________________
From: xliff@lists.oasis-open.org [xliff@lists.oasis-open.org] on behalf of Yves Savourel [ysavourel@enlaso.com]
Sent: Thursday, October 09, 2014 9:16 PM
To: 'Felix Sasaki'
Cc: XLIFF Main List; 'public-i18n-its-ig'
Subject: RE: [xliff] ITS scope with sm/em

> Sorry, I don't get this.
> Do you have small examples (e.g. one for translate, one for text analysis)
> of the difference?

My understanding is that to work with an ITS processor we would change each span marked by <sm/>/<em/> to a set of <mrk>/</mrk>. At
least this is what I read from your algorithm (not from Fredrik's option).

For example:

<sm id='1' translate='no'/>French <pc id='2'>Canadian<em startRef='1'/> hockey</pc>.

Would be changed to:

<mrk id='1' translate='no'>French </mrk><pc id='2'><mrk id='m1' translate='no'>Canadian</mrk> hockey</pc>.

This would get you the same properties as the original for each character of the content.
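
A minimal Python sketch of that splitting, for illustration only. It assumes single-quoted attributes as in the example above, and it reuses the original id for the first <mrk> and generated ids (m1, m2, ...) for the following ones, as in the example; the function name split_span is hypothetical.

```python
import re

TAG = re.compile(r"(<[^>]+>)")  # split content into tags and text runs


def split_span(content: str, sm_id: str) -> str:
    """Replace one <sm/>/<em/> pair by a <mrk> around each text run it covers."""
    sm_re = re.compile(rf"<sm\s+id='{re.escape(sm_id)}'([^>]*?)/>")
    em = f"<em startRef='{sm_id}'/>"
    out = []
    attrs = ""      # attributes copied from <sm/> onto every <mrk>
    inside = False  # True between the <sm/> and its <em/>
    count = 0       # number of <mrk> elements created so far
    for tok in TAG.split(content):
        if not tok:
            continue
        m = sm_re.fullmatch(tok)
        if m:
            inside, attrs = True, m.group(1).rstrip()
        elif tok == em:
            inside = False
        elif inside and not tok.startswith("<"):
            mrk_id = sm_id if count == 0 else f"m{count}"
            count += 1
            out.append(f"<mrk id='{mrk_id}'{attrs}>{tok}</mrk>")
        else:
            out.append(tok)  # other tags, and text outside the span
    return "".join(out)


src = "<sm id='1' translate='no'/>French <pc id='2'>Canadian<em startRef='1'/> hockey</pc>."
print(split_span(src, "1"))
```

Each text run gets its own <mrk>, which is only safe for data categories whose value applies character by character.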

But you can't do this for a data category where the content is related in a meaningful way to the attributes of the data category.
For example:

<sm id='1' type='term' ref='http://en.wikipedia.org/wiki/Qu%C3%A9b%C3%A9cois'/>French <pc id='2'>Canadian<em startRef='1'/> hockey</pc>.

Cannot be split into two instances:

<mrk id='1' type='term' ref='http://en.wikipedia.org/wiki/Qu%C3%A9b%C3%A9cois'>French </mrk><pc id='2'><mrk id='3' type='term'
ref='http://en.wikipedia.org/wiki/Qu%C3%A9b%C3%A9cois'>Canadian</mrk> hockey</pc>.

That is: the term is not "French " or "Canadian"; it is "French Canadian".

But maybe I didn't get your algorithm correctly and it doesn't result in multiple <mrk>/</mrk> for a single <sm/>/<em/>.

Looking at Fredrik's note: there are various ways to reduce the <sm/>/<em/> in favor of well-formed <mrk> elements (joining all
segments, prioritizing well-formedness of annotations over inline codes, etc.) but I'm not sure it can be an absolute solution: the
markup may originate from the XLIFF tool and care little about well-formedness, and one can always have overlapping annotations.

But maybe this is a restriction we can live with.

-ys



-----Original Message-----
From: Felix Sasaki [mailto:felix@sasakiatcf.com]
Sent: Thursday, October 9, 2014 12:20 PM
To: Yves Savourel
Cc: XLIFF Main List; public-i18n-its-ig
Subject: Re: [xliff] ITS scope with sm/em


Am 09.10.2014 um 14:18 schrieb Yves Savourel <ysavourel@enlaso.com>:

> Yes, something like the MT Confidence value is different, but that
> conversion can be described in the mapping itself (if I recall correctly). So an ITS processor has nothing 'special' to do: it
> just applies the rules.
>
> I suppose we could have additional pre-processing steps for a case
> like <sm>/<em>. But that means you can't really use a 'pure' ITS processor to look at an XLIFF file because it would not know how
> to do the pre-processing.
> But that is probably acceptable, especially if we provide generic ways to do the transformation.
>
> This said, I'm not 100% sure you can transform <sm>/<em> into
> <mrk>/</mrk> for all data categories: it would be OK for things like translate, domain, etc. But info like Terminology, Text
> Analysis, and LQI makes sense only when set on a single piece of content.

Sorry, I don't get this. Do you have small examples (e.g. one for translate, one for text analysis) of the difference?

- Felix

>
> -ys
>
>
> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org]
> On Behalf Of Felix Sasaki
> Sent: Thursday, October 9, 2014 5:53 AM
> To: Yves Savourel
> Cc: Dr. David Filip; Estreen, Fredrik; XLIFF Main List;
> public-i18n-its-ig
> Subject: Re: [xliff] ITS scope with sm/em
>
> Hi Yves, all,
>
> Understood. Though: aren't there also other parts of the XLIFF/ITS
> mapping that require special handling from an ITS 1.0 or 2.0 processor?
> E.g. MT confidence
> https://www.w3.org/International/its/wiki/XLIFF_2.0_Mapping#MT_Confidence_.28.3D.3D.3D.3D.3D.3D.3D.3D.3D.3DTO_REVIEW.29
> which requires a computation of values. In the other thread you just listed the types of processors:
>
> "- An XLIFF Extractor aware of both ITS and the ITS module for any data coming from the original source document.
> - An XLIFF Modifier aware of the ITS Module for data generated during the life time of the XLIFF document.
> - An XLIFF Merger aware of both the ITS Module and the ITS syntax if any of that data is merged back into the translated
> document."
>
> Couldn't we require in the mapping specification that, before a
> general ITS processor uses XLIFF+ITS content, it has to do the preprocessing described in this thread, the one for MT confidence,
> etc.?
>
> Cheers,
>
> Felix
>
> Am 09.10.2014 um 13:33 schrieb Yves Savourel <ysavourel@enlaso.com>:
>
>> Hi all,
>>
>> Thanks for the input Fredrik and Felix.
>>
>> I'm not worried about the XLIFF implementation of those cases: We
>> have had working code for those for a long time (a good use
>> case is mrk with translate='yes|no').
>>
>> I was thinking more about the ITS aspect of it.
>>
>> From an ITS viewpoint, in something like this: <sm id='1' itx:domain='travel'/>...<em startRef='1'/>
>> the scope of the domain is an empty content (the content of <sm/>). There is nothing in ITS that allows using distinct
>> elements to annotate a span.
>>
>> Because, while on the XLIFF side the processing expectation is to
>> treat the content between a given <sm/> and its corresponding
>> <em/> as a span, on the ITS side there is no semantics for such a construct.
>>
>> Cheers,
>> -ys
>>
>>
>> From: Dr. David Filip [mailto:David.Filip@ul.ie]
>> Sent: Thursday, October 9, 2014 5:08 AM
>> To: Felix Sasaki
>> Cc: Estreen, Fredrik; Yves Savourel; XLIFF Main List;
>> public-i18n-its-ig
>> Subject: Re: [xliff] ITS scope with sm/em
>>
>> Felix, I like the algorithmic approach that is open to different implementations.
>>
>> After all ITS is a set of abstract categories that should not be restricted to hierarchical structured formats.
>>
>> Now to your proposed algorithm.
>>
>> Unlike native codes, annotations MUST have the opening and closing tag in the same unit.
>> So you will always be creating <mrk> nodes from <sm/> tags if you consider the whole <unit> content, which is the point.
>>
>> Cheers
>> dF
>>
>>
>> Dr. David Filip
>> =======================
>> OASIS XLIFF TC Secretary, Editor, and Liaison Officer LRC | CNGL |
>> CSIS University of Limerick, Ireland
>> telephone: +353-6120-2781
>> cellphone: +353-86-0222-158
>> facsimile: +353-6120-2734
>> http://www.cngl.ie/profile/?i=452
>> mailto: david.filip@ul.ie
>>
>> On Thu, Oct 9, 2014 at 3:49 AM, Felix Sasaki <felix@sasakiatcf.com> wrote:
>> I agree with Fredrik. Processing of overlapping hierarchies is a task
>> that cannot be solved in general and discarding
>> non-hierarchical structures is a good strategy for XML / HTML content.
>>
>>
>> If people don't want to specify an XSLT conversion we could also define the conversion process in an algorithmic way like this:
>>
>> 0) Set current content to the whole content to be processed.
>> 1) Is there an s tag in current content?
>>       Then output the text before the s tag and do 2),
>>       else just output all text in current content.
>> 2) Does the s tag have an e tag with the corresponding id?
>>       Then create a mrk node,
>>       set the content between s and e as the new current content
>>       and do 1),
>>       else discard s and go to 1).
>> 3) Output the rest of the text.
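
The steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: it works on a string rather than a parsed tree, expects single-quoted attributes with id first on <sm/>, and the function name sm_to_mrk is hypothetical.

```python
import re

# An "s tag": <sm id='...' [other attributes]/>
SM = re.compile(r"<sm\s+id='([^']*)'([^>]*?)/>")


def sm_to_mrk(content: str) -> str:
    out = []
    while True:
        m = SM.search(content)                # step 1: is there an s tag?
        if m is None:
            out.append(content)               # no: output all remaining text
            return "".join(out)
        out.append(content[:m.start()])       # yes: output the text before it
        sm_id, attrs = m.group(1), m.group(2).rstrip()
        rest = content[m.end():]
        em = f"<em startRef='{sm_id}'/>"
        end = rest.find(em)                   # step 2: matching e tag?
        if end == -1:
            content = rest                    # no: discard the s tag, go to 1
            continue
        inner = sm_to_mrk(rest[:end])         # content between s and e
        out.append(f"<mrk id='{sm_id}'{attrs}>{inner}</mrk>")
        content = rest[end + len(em):]        # step 3: carry on with the rest


text = "a <sm id='1' translate='no'/>b <sm id='2' type='term'/>c<em startRef='2'/> d<em startRef='1'/> e"
print(sm_to_mrk(text))
```

With crossing annotations the inner s tag finds no e tag inside the outer span and is discarded, as the algorithm says. Its e tag is left in the output by this sketch, so a caller would still need to strip or report it.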
>>
>> and say: you can implement this as XSLT (example given) or in
>> different programming languages. That would have the benefit of keeping
>> the door open to future non-XML, API-focused XLIFF.
>>
>> - Felix
>>
>> Am 08.10.2014 um 18:41 schrieb Estreen, Fredrik <Fredrik.Estreen@lionbridge.com>:
>>
>>> Hi Yves,
>>>
>>>> Hi all,
>>>>
>>>> Looking at the ITS mapping: In many cases we put the ITS information
>>>> on a marker (<mrk> element).
>>>>
>>>> But such an element can be represented by <sm/>...<em/> when it's
>>>> overlapping another element.
>>>> In that case the normal ITS scope mechanism can't work because it
>>>> applies to the empty content of <sm/>, not the content between
>>>> <sm/> and the corresponding <em/>.
>>>>
>>>> We can have provision for this in the XLIFF module. But I'm not
>>>> sure it's doable in the ITS rules, especially with inheritance when
>>>> there are nested annotations.
>>>
>>> This is an interesting problem and I doubt it is solvable in a
>>> general way without additional steps. It might be solvable when
>>> the <sm/> and <em/> are in the same segment, but I doubt it is in the case where they start and end in different segments (i.e.
>>> different sibling trees).
>>>
>>> One potentially workable solution would be to apply an XSLT transform on the XLIFF that merges all segments in each unit,
>>> discards any non-ITS-carrying marker (to reduce the risk of overlapping
>>> markers) and finally normalizes the remaining markers to the
>>> <mrk></mrk> spanning form. Since ITS information will likely be coming
>>> from and going to an XML source, there should not be any overlapping
>>> markers at that stage as they would be difficult to represent in the source format. It is not guaranteed, but we could declare that
>>> ill-formed. ITS global rules could then be evaluated against the transformed version. Admittedly not the most beautiful solution, but
>>> I think it could work.
>>>
>>>> I vaguely recall that such a topic was discussed at some point in the ITS-WG, no?
>>>> Does anyone recall the outcome?
>>>>
>>>> Cheers,
>>>> -ys
>>>
>>> Regards,
>>> Fredrik Estreen
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe from this mail list, you must leave the OASIS TC
>>> that generates this mail.  Follow this link to all your TCs in OASIS
>>> at:
>>> https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
>>>
>>
>>
>>
>>
>>
>
>
>
>
>


Received on Thursday, 9 October 2014 22:29:03 UTC