RE: Datacat for inline element from Yves Savourel on 2006-01-30 (public-i18n-its@w3.org from January to March 2006)

From: Yves Savourel <ysavourel@translate.com>
Date: Mon, 30 Jan 2006 10:12:16 -0700
To: <public-i18n-its@w3.org>
Message-ID: <001a01c625c0$51f07720$8f05a8c0@Breizh>

Hi Martin, Richard, Felix, Andrzej, and all,

>> "There should be a means of indicating whether an element is 
>> equivalent or not to a unit that will be used for automated 
>> translation processing. Some elements may contain other elements which 
>> are translation units in their own right."
>> 
> Caution: That section was not thought out thoroughly at the time, 
> as indicated by the note in the text.  
> ...
> I guess one requirement for translation is for translation tools to be 
> able to work out which elements should not constitute segment 
> delimiters. This applies in HTML to inline elements such as <em>, 
> <strong>, <span>, etc.
> Note that this is meta information, and I'm wondering (without a great 
> deal of thought on the matter) whether this is really ITS related or 
> actually pure localization properties stuff.
>
> Another potential requirement is perhaps for an element that indicates 
> translation segment boundaries when they may not otherwise be apparent.
> For example, something like XHTML 2's line element, which replaces the 
> <br> element in HTML, or something that can be used to indicate 
> sentence-like units within markup, that can be used for detailed 
> segmentation prior to source matching (eg. translation memory). 
> This is a very different requirement, and at a very different level, 
> than the previous one.

I agree that it is important that we clearly have an i18n need for inline/block identification if we are to provide an ITS data
category for it.

It seems that a great many linguistic tasks other than localization require some form of segmentation, for example machine
translation and text mining. Currently this is handled by having such processes know the semantics of the tags (like an MT engine
working on an HTML document), but with a generic way of providing basic inline/block indicators they would be able to work on XML
documents whose semantics they do not know.
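
As a rough illustration of what such a generic indicator would allow, here is a minimal Python sketch. It assumes the set of
inline element names is simply handed to the tool; the function name extract_segments and the sample vocabulary (doc, para, ui)
are hypothetical and not part of any ITS draft. Elements outside the set are treated as block boundaries, and nested blocks inside
an inline element are not handled.

```python
import xml.etree.ElementTree as ET


def _flush(buf, out):
    """Emit the buffered text as one unit (whitespace-normalized), then reset."""
    text = " ".join("".join(buf).split())
    if text:
        out.append(text)
    buf.clear()


def extract_segments(elem, inline, out):
    """Collect text units under elem.

    Elements whose tag is in `inline` are folded into the surrounding unit;
    every other element is treated as a block and starts a new unit.
    """
    buf = [elem.text or ""]
    for child in elem:
        if child.tag in inline:
            # Inline: keep its text inside the current unit.
            buf.append("".join(child.itertext()))
        else:
            # Block: close the current unit and process the child separately.
            _flush(buf, out)
            extract_segments(child, inline, out)
        buf.append(child.tail or "")
    _flush(buf, out)
    return out


# Hypothetical vocabulary the tool knows nothing about.
SAMPLE = "<doc><para>Click <ui>Save</ui> to finish.</para><para>Done.</para></doc>"

root = ET.fromstring(SAMPLE)
with_indicator = extract_segments(root, {"ui"}, [])
without_indicator = extract_segments(root, set(), [])
```

With the indicator, the first paragraph stays a single unit; without it, the tool has to assume every element is a boundary and
the sentence is cut into three pieces.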

To some degree this is indeed related to presentation, but just as for ruby or bidi, maybe this needs to be specified outside
presentation-level information. I think linguistic processing and rendering are separate aspects.

Looking at Richard's two requirements, I wonder if there is a way to reduce the need for the data category to #2. I assume #1 can be
inferred from the DTD, the XML Schema or the RELAX NG schema: any element allowed inside an element that also allows text content
would be inline. Then, within these inline elements, we would need to identify a) the segment breakers, b) the sub-flow containers.
... Just thinking.
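
To make the inference for #1 concrete, here is a small Python sketch for the DTD case only. It is an assumption-laden
illustration, not a proposal: the function name infer_inline and the sample DTD are made up, the declarations are scanned with a
regular expression rather than a real DTD parser, and parameter entities, ANY content and comments are ignored.

```python
import re

# Matches <!ELEMENT name content-model>; content models never contain '>'.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s+(.*?)>", re.DOTALL)


def infer_inline(dtd_text):
    """Return the names of elements that may appear mixed with text.

    An element is assumed to be inline when it is listed in a mixed
    content model, i.e. one of the form (#PCDATA | a | b)*.
    """
    inline = set()
    for _name, model in ELEMENT_DECL.findall(dtd_text):
        model = model.strip()
        if model.startswith("(") and "#PCDATA" in model:
            names = re.findall(r"[\w.:-]+", model)
            inline.update(n for n in names if n != "PCDATA")
    return inline


# Hypothetical DTD used only for this example.
SAMPLE_DTD = """
<!ELEMENT doc  (para+)>
<!ELEMENT para (#PCDATA | em | ui)*>
<!ELEMENT em   (#PCDATA)>
<!ELEMENT ui   (#PCDATA)>
"""

inline_names = infer_inline(SAMPLE_DTD)
```

Here em and ui come out as inline because para allows them next to text, while doc and para do not. What the schema cannot tell us
is which of those inline elements break a segment or hold a sub-flow, which is the part that would still need a data category.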

-yves

Received on Monday, 30 January 2006 17:12:30 UTC