[Bug 2878] Segmentation data category from bugzilla@wiggum.w3.org on 2006-03-10 (public-i18n-its@w3.org from January to March 2006)

From: <bugzilla@wiggum.w3.org>
Date: Fri, 10 Mar 2006 22:26:12 +0000
To: public-i18n-its@w3.org
CC:
Message-Id: <E1FHq3o-0005AT-Hz@wiggum.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2878





------- Comment #3 from ysavourel@translate.com  2006-03-10 22:26 -------
This is a note from Andrzej and Yves:

We discussed the "segmentation/inliness" topic today and we came up with a
proposal for it. Here it is:


===1: The name

Since we wanted to avoid 'inline' for its other meanings in the domain of
representation/rendering and 'segment' for its meaning in localization, we came
up with "Elements within text" as the name for this data category.

===2: The aim

The aim of this data category is to identify the elements that are within text
content and do not contain a text node that belongs to a different text unit.

Knowing these elements allows linguistics-related tools to break down the text
of the document into meaningful text units. Neither schema information nor
programmatic methods can detect all cases of such elements.

For example, in the following code:

<p><b>Palouse</b> horses<fn callout="#">A Palouse horse is 
the same as an <b>Appaloosa</b>.</fn> have spotted coats.</p>

The element <b> is the only one to be defined as "within text".

In the following OpenDocument code:

<text:p text:style-name="Standard">
 Palouse horses
 <text:note text:id="ftn1" text:note-class="footnote">
  <text:note-citation>1</text:note-citation>
  <text:note-body>
   <text:p text:style-name="Footnote"> A Palouse horse is the same as 
an Appaloosa.</text:p>
  </text:note-body>
 </text:note>
 have spotted coats.</text:p>

None of the elements is to be defined as "within text".

The processing expectation for this data category is to break down the text of
a document into separate text units where: a) any element identified as 'within
text' remains with its enclosing text, and b) any other element is removed or
left in the form of a placeholder.
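
As a rough illustration of this expectation (a sketch only, not part of the
proposal), the following Python applies it to the <p>/<fn> example above. The
function names, the "{fn}" placeholder form, and the choice to drop attributes
of 'within text' elements are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<p><b>Palouse</b> horses<fn callout="#">A Palouse horse is '
    'the same as an <b>Appaloosa</b>.</fn> have spotted coats.</p>'
)

def content_of(elem, within_text, units):
    """Return elem's content as a string; nested units are added to `units`."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag in within_text:
            # a) a 'within text' element stays with its enclosing text
            #    (attributes are dropped here to keep the sketch short)
            inner = content_of(child, within_text, units)
            parts.append(f"<{child.tag}>{inner}</{child.tag}>")
        else:
            # b) any other element becomes a placeholder and starts
            #    a text unit of its own
            parts.append("{%s}" % child.tag)
            extract_units(child, within_text, units)
        parts.append(child.tail or "")
    return "".join(parts)

def extract_units(elem, within_text, units=None):
    """Return the text units found under elem, enclosing unit first."""
    if units is None:
        units = []
    slot = len(units)
    units.append(None)  # reserve a slot so the outer unit precedes nested ones
    text = content_of(elem, within_text, units)
    units[slot] = " ".join(text.split())  # normalize whitespace
    return units

units = extract_units(ET.fromstring(SAMPLE), {"b"})
for unit in units:
    print(unit)
```

With only <b> declared as 'within text', the paragraph and the footnote come
out as two separate units, and the footnote is represented by a placeholder in
the paragraph's unit.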

===3: ITS Markup

We came up with two different possible solutions to code this information in
ITS: One using XPath expression, the other using a list of element names.

With XPath:

<its:documentRules>
 <its:withinTextRule its:selector="//em" its:withinText="yes" />
 <its:withinTextRule its:selector="//strong" its:withinText="yes" />
...
</its:documentRules>

With list:

<its:documentRules>
 <its:withinTextRule its:list="em strong..." its:withinText="yes" /> 
</its:documentRules>


-- Yves favors using the list (but could live with the selector):

Using XPath would force us (at least in DOM) to decorate the document in order
to know whether an element is "within text" or not when traversing the document
tree. There is no easy or inexpensive way to know whether a given element
matches an XPath expression when accessing the tree directly. Since we have not
been able so far to come up with cases where an element would be "within text"
or not depending on its context, using XPath does not seem as justified here
as it is in other data categories.
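
To make the difference concrete, here is a small Python sketch (illustrative
only; the sample document and the function names are made up for the example).
With a name list, each node can be tested on its own during traversal. With
selectors, the expressions are evaluated over the whole document first and the
matches are remembered, which is the "decoration" step mentioned above.
ElementTree supports only a subset of XPath, so //em is written as .//em.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc><p>Some <em>stressed</em> and <strong>strong</strong> text.</p></doc>"
)

# List approach: the rule is a set of names, checked directly at each node.
names = set("em strong".split())

def within_text_by_list(elem):
    return elem.tag in names

# Selector approach: evaluate every expression up front and record the
# matching nodes, so that a later traversal can look them up.
selectors = [".//em", ".//strong"]
decorated = set()
for selector in selectors:
    decorated.update(doc.findall(selector))

def within_text_by_selector(elem):
    return elem in decorated

# For context-free rules like these, both approaches agree on every node.
results = [
    (elem.tag, within_text_by_list(elem), within_text_by_selector(elem))
    for elem in doc.iter()
]
print(results)
```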

-- Andrzej prefers XPath:

It provides more control and might well be required in certain conditions. One
can imagine that there could well be situations where an element is 'within
text' in one context, and not in another, so XPath provides the maximum
flexibility.
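
A purely hypothetical case of this kind, sketched in Python: the <ref> and
<index> elements below are invented for the example and are not taken from a
real format. A <ref> inside a paragraph is part of the running text, while a
<ref> inside an index is an entry of its own. A name list containing "ref"
could not tell the two apart, whereas a selector can.

```python
import xml.etree.ElementTree as ET

# Invented vocabulary, for illustration only.
doc = ET.fromstring(
    "<doc>"
    "<p>See <ref>chapter 2</ref> for details.</p>"
    "<index><ref>Appaloosa</ref><ref>Palouse</ref></index>"
    "</doc>"
)

# Only <ref> elements that are children of <p> are 'within text'.
within_text = set(doc.findall(".//p/ref"))

flags = [(ref.text, ref in within_text) for ref in doc.iter("ref")]
print(flags)
```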

-Andrzej and Yves
Received on Friday, 10 March 2006 22:46:20 UTC