Re: targetPointer Requirement update from Felix Sasaki on 2012-05-08 (public-multilingualweb-lt@w3.org from May 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 8 May 2012 22:41:36 +0200
To: Dave Lewis <dave.lewis@cs.tcd.ie>
Cc: Yves Savourel <ysavourel@enlaso.com>, public-multilingualweb-lt@w3.org
Message-ID: <CAL58czrQZCSU7cDVPJsA85zy8kw=oAqe6Be+y2c4nQGDjotfRw@mail.gmail.com>
Hi Dave,

2012/5/8 Dave Lewis <dave.lewis@cs.tcd.ie>

>  Hi Felix,
> Yes, I think we probably need to flag more clearly in the requirements
> document some of the assumptions that motivate some of the different data
> categories. In the targetPointer case there is a clear desire to support
> processing ITS and only ITS mark-up, but as you indicate, this may not be
> the assumption in all cases, especially for data categories being used
> further upstream in CMS or content editors, where a tool's knowledge of the
> host document schema will be more likely.
>
> We should aim therefore to at least try and provide some clear classes of
> scenarios to help make these assumptions a bit more explicit. As Pedro
> pointed out in the last call, the scenario we describe at the link below
> misses the whole class of real-time translations:
>
> http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#Support_an_End-to-End_Use_Case
>
> The current text is more of an automation of the traditional localization
> flow, and therefore does not highlight the differing requirements and
> assumptions of the 'real-time' translation use scenario.
>
> Davidf, Arle and I are currently looking at refining the process values we
> have from Pedro with the ones Arle defined in the Google spreadsheet, so I
> suggest for this upcoming release we use these process values aligned to
> the following two scenario descriptions:
> i) as per
> http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#Support_an_End-to-End_Use_Case
> with LT integration into a more 'traditional' localization workflow
> extended to include authoring and publishing, requiring XLIFF roundtrip in
> the middle
> ii) a realtime translation workflow, where content is put on a cache (I
> prefer perhaps a term like 'staging server' to avoid confusion with 'web
> cache')  from where it is subject to more automated text enrichment and MT
> without professional, LSP-based review, though perhaps with the opportunity
> for content management stakeholders to provide feedback on quality. This
> could be wrapped up with the processes to recover parallel text for MT
> retraining.
>
> While there should also be any number of combinations of elements from these
> process flows, these two examples will cover most of the bases I think.
>

Here I disagree: there might be scenarios without translation and MT
training at all. E.g. the ITS 1.0 data categories Directionality and Ruby
are examples of these, but also automatic named entity recognition - not
necessarily used for translation in scenario i) or ii) at all, but e.g. as
a basis for automatic summarization or other LT processes. Also, there
might be an editor that picks up metadata to ease the translation process -
the ITS 1.0 Translate data category is e.g. realized in SDL Trados (IIRC).
Again this doesn't fit into type i) or ii).

The above situation is not bad IMO - take the BCP 47 language identifiers as
an example: these are used for a huge variety of scenarios (search, rendering
(font selection), basis for locale identifiers, spelling correction etc.).
Users are not confused by this variety, since the identifier itself is very
clearly defined.

So rather than trying to group everything under two (IMO very
localization-specific) scenarios, I would propose to mention in the
introduction that we
want to cover a variety of scenarios - like we say in the charter: "...
producers of content, localization workers, language technology experts,
browser vendors, tool makers and users.". I will make some detailed
comments in the requirements document tomorrow along these lines.

I am also wondering whether we can re-use some of the usage scenarios
described in the ITS requirements document:
http://www.w3.org/TR/itsreq/#scenarios
That document talks about XML, but besides that the approach is rather
general.

Best,

Felix


>
> cheers,
> Dave
>
>
>
>
> On 08/05/2012 07:39, Felix Sasaki wrote:
>
> Hi Dave,
>
>  this thread has some interesting general points. You have many areas
> where one can choose: should you use ITS metadata to generate XLIFF (or
> TMX) out of a given piece of content, or should you process the format "as
> is". For example, you can use the "translate" data category both for
> generating XLIFF and for processing "as is". I assume that e.g. the online
> MT services that process HTML5 "translate" do not generate XLIFF, but
> process the HTML5 as is (though I'm not sure).
>
>  Another benefit for staying in the source content format without any
> extraction process is that you can make use of tooling that is available
> for that format. In the case of the Web or deep web XML formats like
> DocBook or DITA, or e.g. the (XHTML5-based) ePub3, that tooling is more and
> more aware of other metadata like "dir" or "ruby" (currently being
> discussed for HTML5), or even vertical layout features. I doubt that one
> would try to re-implement that functionality in XLIFF or other formats that
> are focusing on an extraction + re-insertion scenario.
>
>  Best,
>
>  Felix
>
> 2012/5/8 David Lewis <dave.lewis@cs.tcd.ie>
>
>> Hi Yves, (and Chase),
>> Thanks, that's clearer for me now. It seemed to me from your previous
>> post that XLIFF and TMX were the principal multilingual formats behind the
>> use case,  whereas in fact it is the need for tools to handle a wide
>> _variety_ of multilingual formats that offers the benefit in this use case.
>>
>> This still leaves the more philosophical question of whether we should be
>> encouraging the proliferation of multilingual file formats by making them
>> easier to handle?
>>
>> cheers,
>> Dave
>>
>>
>>
>> On 07/05/2012 20:51, Yves Savourel wrote:
>>
>>> Hi Dave,
>>>
>>>  Where there is already an element structure in
>>>> the host document that indicates source and target
>>>> content, what is the use case where the implementer
>>>> wouldn't read the relevant XLIFF or TMX schema
>>>> document to figure out how to parse this themselves.
>>>>
>>> When the implementer wants to develop a generic tool that relies on ITS,
>>> and only ITS, to access the documents it processes. That tool does not want
>>> to know anything about XLIFF or TMX specifics other than the information it
>>> gets through the ITS rules.
>>>
>>>
>>>  This seems simpler than defining a new standard tag
>>>> in ITS to essentially explain the schema of XLIFF
>>>> and TMX.
>>>>
>>> It's simpler only if you develop just for XLIFF or just for TMX. If you
>>> target "any XML format", targetPointer is not only simpler, it is the only
>>> way to go. If you have the proper ITS rule, you don't need to know each
>>> format you are working with. You can make your tool generic, and even have
>>> it work for formats that do not exist yet.
>>>
>>> Let's start with the translate rule:
>>>
>>> A given XML tool that implements ITS should be able to learn from the
>>> ITS rules (and only from them) what part of the text of an XML format ABC
>>> is translatable or not. It shouldn't need to know anything about the format
>>> ABC.
>>>
>>> I assume we all are in agreement with that statement. If not, we need to
>>> stop here and debate that specific point, because I think it's one of the
>>> foundations of ITS.
>>>
>>> Assuming we agree on that... Now, among the various XML formats, some of
>>> them do store the same text in several languages. XLIFF and TMX are two
>>> examples of such formats. But you have other cases: translation formats
>>> like TS, some CMS exports (e.g. Vignette), some types of resource files,
>>> etc.
>>>
>>> With those types of formats, a given tool may need to know not only where
>>> the translatable text is, but also where the translated version of the same
>>> text resides in relation to the source. The targetPointer feature would
>>> allow that.
>>>
>>>
>>>
>>>  Is there some class of usage of XLIFF and TMX
>>>> that makes the interpretation of their source-target
>>>> binding difficult to parse directly in practice?
>>>>
>>> The idea is that the tool does not necessarily have to know about XLIFF,
>>> TMX, etc. It can work in an abstract way by understanding the ITS rules.
>>>
>>> Sure, if the type of work you want to do is complex, it may make sense
>>> to actually use a true XLIFF or TMX parser. But we shouldn't assume it's
>>> always the case. You can do plenty of things generically. Look at what
>>> applications such as ITS-Tool or Rainbow can do with XML documents they
>>> know only through their ITS rules.
>>>
>>>
>>>  Also, considering non-translation use cases such as
>>>> semantic tagging or parallel text extraction, it doesn't
>>>> seem likely that you'd do these without needing either
>>>> to write to the file or understand say the distinction
>>>> between translation and an alt-trans - in which case
>>>> you'd need a working understanding of XLIFF/TMX anyway.
>>>>
>>> Parallel text is actually a good example: Imagine you write some
>>> XSLT-based tool that can take the source and target entries of an XLIFF file
>>> and create one plain text file for the source entries and one for the
>>> target entries (a bit like the two parallel files needed to train Moses).
>>>
>>> You can do it by hard-coding which XLIFF element stores the source and
>>> which one stores the target. ...Or you can use the ITS translateRule with
>>> its handy targetPointer information to write a generic tool that will work
>>> not only on XLIFF, but also TMX, TS, and any other XML formats for which
>>> you can define a targetPointer.
>>>
>>> Cheers,
>>> -yves
>>>
>>>
>>>
>>
>>
>
>
>  --
> Felix Sasaki
> DFKI / W3C Fellow
>
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Tuesday, 8 May 2012 20:42:03 UTC
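
Illustrative sketch (not part of the original message): Yves's parallel-text example quoted above can be approximated in a few lines. This is only a rough sketch under several assumptions: it uses Python rather than XSLT, the two sample formats and the function names are made up for illustration, the "rule" is passed in as a plain selector plus a relative target pointer rather than read from real ITS markup, and only the small XPath subset supported by ElementTree (plus leading "../" steps) is handled. The point it tries to show is that the extraction code never names the source or target elements itself.

```python
import xml.etree.ElementTree as ET


def resolve_pointer(node, pointer, parents):
    """Resolve a relative pointer such as '../target' starting from node.

    Only leading '../' steps followed by a simple child path are handled.
    """
    while pointer.startswith("../"):
        node = parents[node]
        pointer = pointer[3:]
    return node.find(pointer)


def extract_parallel(xml_text, selector, target_pointer):
    """Return (source, target) text pairs for every selected node that has a target."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent links, so build a child -> parent map once.
    parents = {child: parent for parent in root.iter() for child in parent}
    pairs = []
    for src in root.findall(selector):
        tgt = resolve_pointer(src, target_pointer, parents)
        if tgt is not None:
            pairs.append(("".join(src.itertext()), "".join(tgt.itertext())))
    return pairs


# Two made-up bilingual formats; neither is real XLIFF or TMX.
SAMPLE_A = (
    "<file>"
    "<unit><source>Hello</source><target>Bonjour</target></unit>"
    "<unit><source>Bye</source></unit>"
    "</file>"
)
SAMPLE_B = "<res><item><en>Yes</en><fr>Oui</fr></item></res>"

# The same code handles both formats; only the rule values differ.
pairs_a = extract_parallel(SAMPLE_A, ".//unit/source", "../target")
pairs_b = extract_parallel(SAMPLE_B, ".//item/en", "../fr")
```

Writing the first and second members of each pair to two separate plain-text files would give the kind of parallel files mentioned in the thread. A real implementation would need full XPath and namespace support, which this sketch leaves out.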