Re: [xliff] ITS: Preserve space and Language Information

From: Dr. David Filip <David.Filip@ul.ie>
Date: Thu, 23 Oct 2014 19:37:05 +0100
Message-ID: <CANw5LKmHmYf+DgbGbLdjb7n9yvuJM8mOXjYvbwyNO1jZMbTh0Q@mail.gmail.com>
To: Yves Savourel <ysavourel@enlaso.com>, XLIFF Main List <xliff@lists.oasis-open.org>
CC: public-i18n-its-ig <public-i18n-its-ig@w3.org>
Thanks, Yves, comments inline.

On Thu, Oct 23, 2014 at 3:13 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

> Hi David, all,
>
>
>
> While in some cases (like multiple spaces between sentences) using
> <ignorable> with xml:space could be a solution, that can’t solve all use
> cases, and, as pointed out, that will cause trouble when re-segmenting.
>
>
>
> The other solution (using inline codes to store spans of white-spaces)
> looks like asking for trouble: the main reason for such a complicated option
> would be that xml:space can’t be set on <mrk>. It would also not solve
> the xml:lang case. In general we do not want to encourage using more inline
> codes.
>
>
>
> I think the simplest and most comprehensive solution is to have its:space
> and its:lang defined and behaving just like xml:space and xml:lang, but
> with the sm-specific scope. That doesn’t preclude anyone from using the other
> options if they really want to go down that road.
>
I am not strongly opposed to defining its:space and its:lang if that indeed
proves to be the best and simplest solution. I am, however, far from
convinced that it is.
It would be ironic, as they would be introduced for ITS, where
non-well-formed spans are currently not an option, to cater for
non-well-formed span transformations between <mrk> and <sm/>/<em/>.
While this solution looks like the most systematic one, I doubt that it is
the simplest.
The more ITS data categories use potentially non-well-formed spans with
<sm/>/<em/>, the more likely it is that the ITS data won't make it
through the roundtrip, because the equivalence reduction to <mrk> variants
will be less likely to succeed.
If you are restricted to using the needed xml namespace attributes on
static structural elements down to <unit>, AND dynamically on <source> and
<target>, you have one layer of ITS markup with guaranteed well-formedness,
so that is two data categories fewer to worry about when making the
<sm/>/<em/> to <mrk> transforms.
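
To illustrate what I mean by a span that cannot be reduced to <mrk>, here
is a minimal sketch. The ids, the annotation type, and the text are made up
for illustration and are not taken from the draft. The annotation opens in
one segment and closes in the next, so it can only be written with
<sm/>/<em/>:

```xml
<unit id="u1">
  <segment id="s1">
    <source>First sentence. <sm id="m1" type="generic"/>A marked span starts here. </source>
  </segment>
  <segment id="s2">
    <source>It ends here.<em startRef="m1"/> Last sentence.</source>
  </segment>
</unit>
```

Only if the two segments happen to be joined can the same span be rewritten
in the well-formed variant:

```xml
<unit id="u1">
  <segment id="s1">
    <source>First sentence. <mrk id="m1" type="generic">A marked span starts here. It ends here.</mrk> Last sentence.</source>
  </segment>
</unit>
```

Attributes kept on <unit>, <source>, and <target> never get into the first
situation, because those elements always have a matching end tag.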

In the case of terminology, we did say that all terminology is encoded
inline, even though it may apparently exist on structural elements in
various source formats. We said that the use case where the whole element
is terminology is not statistically significant enough to warrant different
handling.

The situation here is opposite but analogous. IMHO and AFAIK, whitespace
handling and language information are inherently structural characteristics
when encoding natural language text, and we actually do NOT inhibit the
expressivity of XLIFF by not introducing the truly inline variants that
could possibly be transformed into <sm/>/<em/> pairs.

If you indeed have to introduce a different language or a different sort of
whitespace handling at the sub-unit level, I don't think that separating
such a portion as its own segment or ignorable is unwarranted. If you want a
guaranteed roundtrip for such a construction, you can protect it with the
canResegment flag set to "no", which again seems warranted for such a
special case.
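
For instance, something along these lines. This is a sketch only: the ids
and the text are invented for illustration, and I only show the whitespace
case here.

```xml
<unit id="u1">
  <segment id="s1">
    <source>Type the code exactly as shown: </source>
  </segment>
  <!-- Own segment with its own whitespace handling; canResegment="no"
       keeps Segmentation Modification from merging or splitting it. -->
  <segment id="s2" canResegment="no">
    <source xml:space="preserve">A1   B2   C3</source>
  </segment>
</unit>
```

The xml:space value sits on <source>, which is always well formed, so no
annotation markers are involved at all.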

While I see that the introduction of its:space and its:lang looks
systematic, and I am fairly confident that it is doable, I do think that
such a solution is overkill that brings more complexity than is actually
warranted by any real-life case where you'd need this type of metadata
truly inline.

When you need an example or password field or array, or whatever with
different whitespace handling, it hardly seems unwarranted to extract it as
a different unit or at least a different segment.

Similarly, if you are using examples, poems, or quotations in a different
language, these seem inherently structurally different from the normal text
flow in the main source language.
Even if you are using one-word examples tightly mixed within the source
language, it seems plausible to set them as separate segments that can be
handled, e.g., by different services/translators. Again, I do not see a
significant use case for introducing the full-blown <mrk> <-> <sm/>/<em/>
machinery for this metadata, which is actually inherently structural.

I should like to challenge people on both mailing lists (Felix?, Fredrik?)
to come up with valid and frequent use cases where structural extraction
seems inadequate.


>
> It simply means that if you want to handle Preserve Space or Language
> Information at the inline level, you have to support that part of the ITS
> module (which is really not complicated when you already have to handle
> xml:space and xml:lang for the Core).
>
I do not understand this reasoning. Based on Core, you only need to support
xml namespace attributes through simple inheritance and do not need to
worry about analogous semantics on non-well-formed spans. So the
introduction of those new attributes on annotation markers actually does
bring a whole new complexity.
Do you remember how complicated it is to determine translatability across
non-well-formed spans and across segments? I think there is value in
avoiding this complexity for xml:space and xml:lang.
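
What Core asks for is no more than the usual XML scoping, roughly as in
this sketch. The file and unit ids and the content are made up; the
comments mark which value is in effect.

```xml
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0"
       srcLang="en" trgLang="fr">
  <file id="f1" xml:space="default">
    <!-- No xml:space here: "default" is inherited from <file>. -->
    <unit id="u1">
      <segment>
        <source>Ordinary running text.</source>
      </segment>
    </unit>
    <!-- Explicit value on <unit>: "preserve" applies to everything below. -->
    <unit id="u2" xml:space="preserve">
      <segment>
        <source>Line 1
Line 2</source>
      </segment>
    </unit>
  </file>
</xliff>
```

The value in effect at any point follows from the nearest ancestor that
sets it, so nothing has to be tracked across segment boundaries.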

> That means one cannot guarantee those features will be preserved by
> Core-only processors. But it’s already the case in 2.0.
>
Do you mean that xml namespace is also allowed on structural extension
points? I think it was a bad decision and I was trying to sway it. Anyway,
now we are not talking about extensibility at higher structural levels but
about introducing a new inline complexity through a fully protected module,
a wholly different issue. You are trying to introduce a non-XML-like
behavior for two xml namespace attributes (of course their counterparts in
the module namespace, but anyway), which I'd argue don't really need it, as
we would have a hard time thinking of valid use cases where the use of
Preserve Space or Language Information actually is not structural.

>
>
> Cheers,
>
> -yves
>
>
>
>
>
> *From:* Dr. David Filip [mailto:David.Filip@ul.ie]
> *Sent:* Thursday, October 23, 2014 7:04 AM
> *To:* Yves Savourel
> *Cc:* XLIFF Main List; public-i18n-its-ig
> *Subject:* Re: [xliff] ITS: Preserve space and Language Information
>
>
>
> Thanks, Yves,
>
>
>
> I was thinking about two possible solutions.
>
> One of them would be as you propose to introduce its attributes that could
> work with empty markers as span delimiters.
>
>
>
> Another way would be to use the fact that the two relevant XML namespace
> attributes are still available on <source> and <target>
>
> Not sure if this is an omission, probably not as we have PR for
> resegmentation accounting for that.
>
>
>
> This would be somewhat restrictive but would have the advantage that the
> related mark up would be always well formed
>
>
>
> I tried to write up such a restrictive solution for Preserve Space in the
> current working draft.
>
> It also notes that you can use originalData to preserve whitespace.
>
>
>
> I copy paste it here:
>
>
>
> Preserve Space
>
> Indicates how to handle whitespace in a given content portion. See [ITS]
> Preserve Space for details.
>
> Structural Elements
>
>  Whitespace handling at the structural level is indicated with xml:space
> in XLIFF Core and extensions:
>
> Extraction of preserved whitespace at the structural level
>
> Original:
>
>
>
> <listing xml:space='preserve'>Line 1
>
> Line 2</listing>
>
>
>
> Extraction:
>
>
>
> <unit id='1' xml:space='preserve'>
>
>  <segment>
>
>   <source>Line 1
>
> Line 2</source>
>
>  </segment>
>
> </unit>
>
>
>
>
>
> Inline Elements
>
>  It is not possible to use [XML namespace] on XLIFF inline elements. It is
> advised that mixed Preserve Space behavior is NOT used inline in source
> formats. In case of extraction of source format inline elements with mixed
> Preserve Space behavior, it is advised to extract all discernible portions
> with uniform whitespace handling into different <unit> elements that can
> have their whitespace handling set independently.
>
> Whitespace handling can also be set independently for text segments and
> ignorable text portions within an Extracted unit and for the source and
> target language within the same <segment> or <ignorable> element using the
> optional xml:space attribute on the <source> and <target> elements.
> However, mixed whitespace handling behavior is not likely to survive
> Segmentation Modification. So this method is not advised unless the
> <segment> elements are protected by the canResegment flag value set to or
> inherited as no.
>
> Preserved whitespace can also be extracted as original data stored
> outside of the translatable content at the unit level and referenced from
> placeholder codes. It is important to note that the value of the xml:space
> attribute is restricted to preserve on the <data> element.
>
> Extraction of preserved whitespaces as referenced original data
>
> Original:
>
>
>
>  <p>
>
>    <span xml:space='preserve'>Item 1      Item 2      Item n+1
>
>    </span> are all used to build Item n+2.
>
>  </p>
>
>
>
> Extraction:
>
>
>
> <unit id='1'>
>
>   <originalData>
>
>     <data id="d1">&lt;span xml:space='preserve'></data>
>
>     <data id="d2">&lt;/span></data>
>
>     <data id="d3">      </data>
>
>     <data id="d4">
>
>     </data>
>
>   </originalData>
>
>   <segment>
>
>     <source><pc id="1" dataRefStart="d1" dataRefEnd="d2">Item 1<ph id="2"
> dataRef="d3"/>Item 2<ph id="3" dataRef="d3"/>Item n+1<ph id="4"
> dataRef="d4"/></pc> are all used to build Item n+2.</source>
>
>   </segment>
>
> </unit>
>
>
>
>
>
> Not sure really which solution is better, but I'd say we should explore
> both..
>
>
>
> Cheers
>
> dF
>
>
> Dr. David Filip
>
> =======================
>
> OASIS XLIFF TC Secretary, Editor, and Liaison Officer
>
> LRC | CNGL | CSIS
>
> University of Limerick, Ireland
>
> telephone: +353-6120-2781
>
> *cellphone: +353-86-0222-158*
>
> facsimile: +353-6120-2734
>
> http://www.cngl.ie/profile/?i=452
>
> mailto: david.filip@ul.ie
>
>
>
> On Thu, Oct 23, 2014 at 1:41 PM, Yves Savourel <ysavourel@enlaso.com>
> wrote:
>
> Hi all,
>
> It seems to me that we don't have a good solution for the inline cases of
> the Preserve Space and Language Information data
> categories.
>
> In the original draft mapping we used xml:space and xml:lang on <mrk>.
> But, as David pointed out, this can't work because these attributes are
> not allowed on <mrk>/<sm>.
> I believe we did this because of <sm>: both xml:lang and xml:space scopes
> would apply to an empty element.
>
> But we cannot have no inline solution for those two data categories.
> So it seems they would fall into the class of the data categories only
> partially supported directly by the core, and we need
> ITS-module attributes to handle them inline. Something like this: <mrk
> id='1' type="its:any" its:space="preserve" its:lang="iu">.
>
> Cheers,
> -yves
>
>
>
>
>
>
>
>
Received on Thursday, 23 October 2014 18:38:20 UTC
