Re: targetPointer Requirement update from Felix Sasaki on 2012-05-08 (public-multilingualweb-lt@w3.org from May 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 8 May 2012 08:39:44 +0200
To: David Lewis <dave.lewis@cs.tcd.ie>
Cc: Yves Savourel <ysavourel@enlaso.com>, public-multilingualweb-lt@w3.org
Message-ID: <CAL58czrOWQPEdz5KpF4yDssHsCQSZo9g8g=h8b0yYRie5t5DmQ@mail.gmail.com>
Hi Dave,

this thread has some interesting general points. You have many areas where
one can choose: should you use ITS metadata to generate XLIFF (or TMX) out
of a given piece of content, or should you process the format "as is". For
example, you can use the "translate" data category both for generating
XLIFF and for processing "as is". I assume that e.g. the online MT services
that process HTML5 "translate" do not generate XLIFF, but process the HTML5
as is (though I'm not sure).

Another benefit of staying in the source content format without any
extraction process is that you can make use of tooling that is available
for that format. In the case of the Web or deep web XML formats like
DocBook or DITA, or e.g. the (XHTML5 based) ePub3, that tooling is more and
more aware of other metadata like "dir" or "ruby" (currently being
discussed for HTML5), or even vertical layout features. I doubt that one
would try to re-implement that functionality in XLIFF or other formats that
are focusing on an extraction + re-insertion scenario.

Best,

Felix

2012/5/8 David Lewis <dave.lewis@cs.tcd.ie>

> Hi Yves, (and Chase),
> Thanks, that's clearer for me now. It seemed to me from your previous post
> that XLIFF and TMX were the principal multilingual formats behind the use
> case, whereas in fact it is the need for tools to handle a wide _variety_
> of multilingual formats that offers the benefit in this use case.
>
> This still leaves the more philosophical question of whether we should be
> encouraging the proliferation of multilingual file formats by making them
> easier to handle?
>
> cheers,
> Dave
>
>
>
> On 07/05/2012 20:51, Yves Savourel wrote:
>
>> Hi Dave,
>>
>>> Where there is already an element structure in
>>> the host document that indicates source and target
>>> content, what is the use case where the implementer
>>> wouldn't read the relevant XLIFF or TMX schema
>>> document to figure out how to parse this themselves?
>>>
>> When the implementer wants to develop a generic tool that relies on ITS, and
>> only ITS, to access the documents it processes. That tool does not want to
>> know anything about XLIFF or TMX specifics other than the information it
>> gets through the ITS rules.
>>
>>
>>> This seems simpler than defining a new standard tag
>>> in ITS to essentially explain the schema of XLIFF
>>> and TMX.
>>>
>> It's simpler only if you develop just for XLIFF or just for TMX. If you
>> target "any XML format", targetPointer is not only simpler, it is the only
>> way to go. If you have the proper ITS rule, you don't need to know each
>> format you are working with. You can make your tool generic, and it can
>> even work for formats that do not exist yet.
>>
>> Let's start with the translate rule:
>>
>> A given XML tool that implements ITS should be able to learn from the ITS
>> rules (and only from them) what part of the text of an XML format ABC is
>> translatable or not. It shouldn't need to know anything about the format
>> ABC.
>>
>> I assume we all are in agreement with that statement. If not, we need to
>> stop here and debate that specific point, because I think it's one of the
>> foundations of ITS.
>>
>> Assuming we agree on that... Now, among the various XML formats, some of
>> them do store the same text in several languages. XLIFF and TMX are two
>> examples of such formats. But you have other cases: translation formats
>> like TS, some CMS exports (e.g. Vignette), some types of resource files,
>> etc.
>>
>> With those types of formats, a given tool may need to know not only where
>> the translatable text is, but also where the translated version of the same
>> text resides in relation to the source. The targetPointer feature would
>> allow that.
>>
>>
>>
>>> Is there some class of usage of XLIFF and TMX
>>> that makes the interpretation of their source-target
>>> binding difficult to parse directly in practice?
>>>
>> The idea is that the tool does not necessarily have to know about XLIFF,
>> TMX, etc. It can work in an abstract way by understanding the ITS rules.
>>
>> Sure, if the type of work you want to do is complex, it may make sense to
>> actually use a true XLIFF or TMX parser. But we shouldn't assume it's
>> always the case. You can do plenty of things generically. Look at what
>> applications such as ITS-Tool or Rainbow can do with XML documents they
>> know only through their ITS rules.
>>
>>
>>> Also, considering non-translation use cases such as
>>> semantic tagging or parallel text extraction, it doesn't
>>> seem likely that you'd do these without needing either
>>> to write to the file or understand say the distinction
>>> between translation and an alt-trans - in which case
>>> you'd need a working understanding of XLIFF/TMX anyway.
>>>
>> Parallel text is actually a good example: Imagine you write some
>> XSLT-based tool that can take the source and target entries of an XLIFF file
>> and create one plain text file for the source entries and one for the
>> target entries (a bit like the two parallel files needed to train Moses).
>>
>> You can do it by hard-coding which XLIFF element stores the source and
>> which one stores the target. ...Or you can use the ITS translateRule with
>> its handy targetPointer information to write a generic tool that will work
>> not only on XLIFF, but also TMX, TS, and any other XML formats for which
>> you can define a targetPointer.
>>
>> Cheers,
>> -yves
>>
>>
>>
>
>
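The generic parallel-text extraction that Yves describes in the quoted
message can be sketched roughly as follows. This is only an illustration
under several assumptions: the rule is written as plain Python values (a
selector and a targetPointer) instead of being read from an ITS rules file,
the two sample documents and the names parallel_text and resolve_pointer are
made up, and the pointer is limited to leading "../" steps followed by a
child path, because Python's ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Two made-up bilingual formats; the tool below knows neither of them.
XLIFF_LIKE = """<file>
  <unit><source>Hello</source><target>Bonjour</target></unit>
  <unit><source>Goodbye</source><target>Au revoir</target></unit>
</file>"""
CUSTOM = "<msgs><msg><en>Yes</en><fr>Oui</fr></msg></msgs>"

# Hypothetical rules: selector finds the source text, targetPointer is a
# path, relative to each selected node, that leads to its translation.
RULES = {
    XLIFF_LIKE: {"selector": ".//source", "targetPointer": "../target"},
    CUSTOM: {"selector": ".//en", "targetPointer": "../fr"},
}


def resolve_pointer(node, pointer, parents):
    """Resolve a pointer made of leading '../' steps plus a child path."""
    while pointer.startswith("../"):
        node = parents.get(node)
        if node is None:  # pointer climbs above the document root
            return None
        pointer = pointer[3:]
    return node.find(pointer)


def parallel_text(xml_text, selector, target_pointer):
    """Return (source, target) text pairs using only the rule's two paths."""
    root = ET.fromstring(xml_text)
    # ElementTree elements do not know their parent, so build a lookup.
    parents = {child: parent for parent in root.iter() for child in parent}
    pairs = []
    for source in root.findall(selector):
        target = resolve_pointer(source, target_pointer, parents)
        if target is not None:
            pairs.append(("".join(source.itertext()),
                          "".join(target.itertext())))
    return pairs


for document, rule in RULES.items():
    for src, tgt in parallel_text(document, rule["selector"],
                                  rule["targetPointer"]):
        print(src, "\t", tgt)
```

The function never mentions XLIFF or TMX element names; only the rule
changes from one format to the next. A real implementation would need a full
XPath engine to evaluate arbitrary pointers.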


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Tuesday, 8 May 2012 06:40:12 UTC