- From: Dave Lewis <dave.lewis@cs.tcd.ie>
- Date: Tue, 08 May 2012 12:49:05 +0100
- To: Felix Sasaki <fsasaki@w3.org>
- CC: Yves Savourel <ysavourel@enlaso.com>, public-multilingualweb-lt@w3.org
- Message-ID: <4FA90831.2020204@cs.tcd.ie>
Hi Felix,

Yes, I think we probably need to flag more clearly in the requirements document some of the assumptions that motivate the different data categories. In the targetPointer case there is a clear desire to support processing of ITS and only ITS mark-up, but as you indicate, this may not be the assumption in all cases, especially for data categories used further upstream in CMS or content editors, where a tool's knowledge of the host document schema is more likely. We should therefore aim to provide some clear classes of scenarios that make these assumptions more explicit.

As Pedro pointed out in the last call, the scenario we describe at http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#Support_an_End-to-End_Use_Case misses the whole class of real-time translation. The current text is more of an automation of the traditional localization flow, and therefore does not highlight the differing requirements and assumptions of the 'real-time' translation use scenario.

DavidF, Arle and I are currently looking at refining the process values we have from Pedro with the ones Arle defined in the Google spreadsheet, so I suggest that for this upcoming release we use those process values aligned to the following two scenario descriptions:

i) as per http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#Support_an_End-to-End_Use_Case, with LT integration into a more 'traditional' localization workflow, extended to include authoring and publishing and requiring an XLIFF round trip in the middle;

ii) a real-time translation workflow, where content is put on a cache (I would prefer a term like 'staging server' to avoid confusion with 'web cache'), from where it is subject to more automated text enrichment and MT without professional, LSP-based review, though perhaps with the opportunity for content management stakeholders to provide feedback on quality. This could be wrapped up with the processes to recover parallel text for MT retraining.

While there could also be any number of combinations of elements from these two process flows, I think these two examples will cover most of the bases.

cheers,
Dave

On 08/05/2012 07:39, Felix Sasaki wrote:
> Hi Dave,
>
> This thread has some interesting general points. There are many areas
> where one can choose: should you use ITS metadata to generate XLIFF
> (or TMX) out of a given piece of content, or should you process the
> format "as is"? For example, you can use the "translate" data category
> both for generating XLIFF and for processing "as is". I assume that
> e.g. the online MT services that process HTML5 "translate" do not
> generate XLIFF, but process the HTML5 as is (though I'm not sure).
>
> Another benefit of staying in the source content format, without any
> extraction process, is that you can make use of the tooling that is
> available for that format. In the case of the Web, or deep-web XML
> formats like DocBook or DITA, or e.g. the (XHTML5-based) ePub3, that
> tooling is more and more aware of other metadata like "dir" or "ruby"
> (currently being discussed for HTML5), or even of vertical layout
> features. I doubt that one would try to re-implement that
> functionality in XLIFF or other formats that focus on an
> extraction + re-insertion scenario.
>
> Best,
>
> Felix
>
> 2012/5/8 David Lewis <dave.lewis@cs.tcd.ie <mailto:dave.lewis@cs.tcd.ie>>
>
> > Hi Yves, (and Chase),
> > Thanks, that's clearer for me now.
> > It seemed to me from your previous post that XLIFF and TMX were the
> > principal multilingual formats behind the use case, whereas in fact
> > it is the need for tools to handle a wide _variety_ of multilingual
> > formats that offers the benefit in this use case.
> >
> > This still leaves the more philosophical question of whether we
> > should be encouraging the proliferation of multilingual file formats
> > by making them easier to handle.
> >
> > cheers,
> > Dave
> >
> > On 07/05/2012 20:51, Yves Savourel wrote:
> > > > Where there is already an element structure in the host document
> > > > that indicates source and target content, what is the use case
> > > > where the implementer wouldn't read the relevant XLIFF or TMX
> > > > schema document to figure out how to parse this themselves?
> > >
> > > When the implementer wants to develop a generic tool that relies
> > > on ITS, and only ITS, to access the documents it processes. That
> > > tool does not want to know anything about XLIFF or TMX specifics
> > > other than the information it gets through the ITS rules.
> > >
> > > > This seems simpler than defining a new standard tag in ITS to
> > > > essentially explain the schema of XLIFF and TMX.
> > >
> > > It's simpler only if you develop just for XLIFF or just for TMX.
> > > If you target "any XML format", targetPointer is not only simpler,
> > > it is the only way to go. If you have the proper ITS rule, you
> > > don't need to know each format you are working with. You can make
> > > your tool generic, and it will even work for formats that do not
> > > exist yet.
> > >
> > > Let's start with the translate rule:
> > >
> > > A given XML tool that implements ITS should be able to learn from
> > > the ITS rules (and only from them) what part of the text of an XML
> > > format ABC is translatable or not. It shouldn't need to know
> > > anything about the format ABC.
> > >
> > > I assume we are all in agreement with that statement. If not, we
> > > need to stop here and debate that specific point, because I think
> > > it's one of the foundations of ITS.
> > >
> > > Assuming we agree on that... Now, among the various XML formats,
> > > some do store the same text in several languages. XLIFF and TMX
> > > are two examples of such formats. But you have other cases:
> > > translation formats like TS, some CMS exports (e.g. Vignette),
> > > some types of resource files, etc.
> > >
> > > With those types of formats, a given tool may need to know not
> > > only where the translatable text is, but also where the translated
> > > version of the same text resides in relation to the source. The
> > > targetPointer feature would allow that.
> > >
> > > > Is there some class of usage of XLIFF and TMX that makes the
> > > > interpretation of their source-target binding difficult to parse
> > > > directly in practice?
> > >
> > > The idea is that the tool does not necessarily have to know about
> > > XLIFF, TMX, etc. It can work in an abstract way by understanding
> > > the ITS rules.
> > >
> > > Sure, if the type of work you want to do is complex, it may make
> > > sense to actually use a true XLIFF or TMX parser. But we shouldn't
> > > assume that is always the case. You can do plenty of things
> > > generically. Look at what applications such as ITS-Tool or Rainbow
> > > can do with XML documents they know only through their ITS rules.
> > >
> > > > Also, considering non-translation use cases such as semantic
> > > > tagging or parallel text extraction, it doesn't seem likely that
> > > > you'd do these without needing either to write to the file or to
> > > > understand, say, the distinction between a translation and an
> > > > alt-trans - in which case you'd need a working understanding of
> > > > XLIFF/TMX anyway.
> > > Parallel text is actually a good example: imagine you write some
> > > XSLT-based tool that can take the source and target entries of an
> > > XLIFF file and create one plain text file for the source entries
> > > and one for the target entries (a bit like the two parallel files
> > > needed to train Moses).
> > >
> > > You can do it by hard-coding which XLIFF element stores the source
> > > and which one stores the target. ...Or you can use the ITS
> > > translateRule with its handy targetPointer information to write a
> > > generic tool that will work not only on XLIFF, but also on TMX,
> > > TS, and any other XML format for which you can define a
> > > targetPointer.
> > >
> > > Cheers,
> > > -yves
>
> --
> Felix Sasaki
> DFKI / W3C Fellow
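
As a concrete illustration of Felix's point above about processing HTML5 "as is": the translate attribute is native to HTML5, so an online MT service can honour translatability directly in the page, with no extraction to XLIFF at all. A minimal sketch, written as XHTML, with invented page content:

```xml
<!-- Sketch: an MT service that honours HTML5 "translate" can translate
     this paragraph in place, skipping the marked span, with no XLIFF
     extraction step. The sentence itself is invented for the example. -->
<p xmlns="http://www.w3.org/1999/xhtml">
  Press <span translate="no">SubmitOrder</span> to confirm your purchase.
</p>
```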
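To make the targetPointer discussion above concrete, here is a minimal sketch of what such rules might look like, written as the thread describes it (a targetPointer attribute carried on the translateRule). The exact syntax was still under discussion in the working group at the time, so the shape of the rule, and the English/French pair in the TMX rule, are illustrative assumptions; in the eventual ITS 2.0 Recommendation this surfaced as a separate its:targetPointerRule element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: rule syntax follows the proposal as described in this
     thread, not a final specification. -->
<its:rules version="2.0"
           xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:x="urn:oasis:names:tc:xliff:document:1.2">

  <!-- XLIFF 1.2: translatable text sits in <source>; the translation
       of that same text lives in the sibling <target>. -->
  <its:translateRule selector="//x:source" translate="yes"
                     targetPointer="../x:target"/>

  <!-- TMX 1.4 (elements are in no namespace): assuming an en -> fr
       language pair, purely for illustration. -->
  <its:translateRule selector="//tu/tuv[@xml:lang='en']/seg"
                     translate="yes"
                     targetPointer="../../tuv[@xml:lang='fr']/seg"/>

</its:rules>
```

A tool that reads only these rules can locate source/target pairs in either format without any XLIFF- or TMX-specific code.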
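And for the parallel-text extraction Yves describes, here is a minimal sketch of the hard-coded variant, assuming XLIFF 1.2 input; the rules above are exactly what would let a generic, ITS-driven tool replace the element names baked in here. Run it twice, once per side, to get the two Moses-style plain-text files:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hard-coded XLIFF 1.2 extractor (sketch): emits one line of text per
     trans-unit, taken from <source> or <target> depending on $side. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:x="urn:oasis:names:tc:xliff:document:1.2">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:param name="side" select="'source'"/>

  <xsl:template match="/">
    <xsl:for-each select="//x:trans-unit">
      <xsl:choose>
        <xsl:when test="$side = 'target'">
          <xsl:value-of select="normalize-space(x:target)"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="normalize-space(x:source)"/>
        </xsl:otherwise>
      </xsl:choose>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For example, with xsltproc (file names here are hypothetical): `xsltproc --stringparam side source extract.xsl file.xlf > corpus.en`, and the same command with `side` set to `target` for the other file.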
Received on Tuesday, 8 May 2012 11:41:47 UTC