[ESW Wiki] Update of "its0601ReqInlineElements" by YvesSavourel

From: <w3t-archive+esw-wiki@w3.org>
Date: Mon, 13 Feb 2006 13:39:03 -0000
To: w3t-archive+esw-wiki@w3.org
Message-ID: <20060213133903.14749.41484@localhost.localdomain>
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "ESW Wiki" for change notification.

The following page has been changed by YvesSavourel:
http://esw.w3.org/topic/its0601ReqInlineElements


------------------------------------------------------------------------------
  i.e., please focus on technical content rather than wordsmithing at this stage.
  
  
- = Segmentation Hints =
+ = Segmentation =
- 
- [R025] Methods, independent of the semantics of the elements, must exist to provide hints on how to break down document content into meaningful runs of text.
- 
- Many applications that process content for linguistic-related tasks need to be able to perform basic segmentation. They need to be able to do this without knowing the semantics of the elements. The elements marking up the document content should provide generic clues to help such a process.
- 
- In this requirement a 'text run' is defined as the longest collection of sequentially traversable nodes that, if you remove the tags, has a continuous linguistic meaning. (For example, a paragraph with two consecutive sentences is a single text run, but two sentences with one embedded in the other constitute two text runs.) '''[YS- Not sure about this definition]'''
- 
- From this viewpoint, one can distinguish several types of element:
- 
-  * Type 1: Elements that do not contain direct text nodes. For example: <table> in XHTML.
- 
-  * Type 2: Elements that contain mixed nodes belonging to one or more text runs. For example: <p> in XHTML.
- 
-  * Type 3: Elements that contain mixed nodes belonging to a single text run. For example: <img> in XHTML 2, or <image> in DITA. (Note: <img> XHTML 2 is different from <img> of XHTML 1.1).
- 
-  * Type 4: Elements that contain mixed nodes belonging to an element of type 2 or 3. For example: <strong> or <span> in XHTML.
- 
-  * '''[And possibly]''' Type 5: Empty elements belonging to an element of type 2, 3, or 4 and indicating a strong possibility of sub-segmentation. For example: <br/> in XHTML. '''[But I'm not sure this belongs to ITS, because: a) it affects sub-segmentation, not the text runs, b) such elements would probably be considered bad practice.]'''
- 
- A processor should be able to know from a method or infer from the content to which category or categories each element belongs.
- 
- 
- '''[YS- Previous text is below just in case we need it back]'''
- ----------
- 
- = Identifying inline elements =
- 
- Initial input: [http://people.w3.org/rishida/localizable-dtds/#inline-elements]
  
  == Summary ==
  
+ [R025] Methods, independent of the semantics of the elements, must exist to provide hints on how to break down document content into meaningful runs of text.
- [R025] Methods must exist to allow the distinction between block and inline elements.
- 
- '''[YS]- Andrzej, because we want to move forward quickly on this topic, and we were not sure if you would have the time to work on it, I've taken the action item to get it started. Obviously, feel free to edit it as needed.'''
  
  
  == Challenges ==
  
- Most applications preparing data for linguistic-related processes need to be able to make the distinction between elements that associate properties with spans of text content (e.g. formatting properties), and elements that structure the content.
+ Many applications that process content for linguistic-related tasks need to be able to perform basic segmentation. They need to be able to do this without knowing the semantics of the elements. The elements marking up the document content should provide generic clues to help such a process.
  
- Some of the reasons such distinction is often necessary are the following.
+ Two types of information are needed:
  
-  * Segmentation
+ 1- A way to distinguish elements that may hold text content from elements that never have text content.
  
-  The application of linguistic processes to a document is greatly enhanced by the ability to segment the document's content into sentences. Such segmentation has to be driven, in part, by knowledge of whether elements are block or inline. For example, given the following content:
+ Example: The element <p> may hold text:
  
+ {{{<p>
+  <b>This is bold.</b>
+  <i>This is italic.</i>
+ </p>}}}
-  {{{<section><title><kw>select</kw> Element</title>
- <p><a>The main problems are:</a></p><ul><li>
- <p>users may not have the fonts needed to display the text and graphics
- cannot be used</p></li><li><p>it is hard to find a <kw>label</kw> for 
- the list that is not language-specific</p>
- </li><li><p>users cannot see or access the links <em>straight away</em></p>
- </li></ul></section>}}}
  
-  A processor without specific semantic knowledge about the tags or the text "sees" the content like this:
+ Example: The element <ul> should not hold text:
  
+ {{{<ul>
+  <li>This is the first item.</li>
+  <li>This is the second item.</li>
-  {{{<x>
-  <x>
-   <x>
-    zzzzzz
-   </x>
-   Zzzzzzz
-  </x>
-  <x>
-   <x>
-    Zzz zzzz zzzzzzzz zzz:
-   </x>
-  </x>
-  <x> 
-   <x>
-    <x>
-     zzzzz zzz zzz zzzz zzz zzzzz zzzzzz zz zzzzzzz zzz zzzz zzz zzzzzzzz zzzzzz zz zzzz
-    </x>
-   </x> 
-   <x>
-    <x>
-     zz zz zzzz zz zzzz zz 
-     <x>
-      zzzzz
-     </x>
-      zzz zzz zzzz zzzz zz zzz zzzzzzzz-zzzzzzzz
-    </x>
-   </x> 
-   <x>
-    <x>
-     zzzzz zzzzzz zzz zz zzzzzz zzz zzzzz 
-     <x>
-      zzzzzzzz zzzz
-     </x>
-    </x>
-   </x> 
-  </x>
- </x>}}}
+ </ul>}}}
- 
-  A process with simple knowledge of whether elements are inline or block, by contrast, "sees" the content like this:
- 
-  {{{...
-  <B><I>zzzzzz</I> Zzzzzzz</B>
-  <B><I>Zzz zzzz zzzzzzzz zzz:</I></B>
-  ...
-  <B>zzzzz zzz zzz zzzz zzz zzzzz zzzzzz zz zzzzzzz zzz zzzz zzz zzzzzzzz zzzzzz zz zzzz</B>
-  ...
-  <B>zz zz zzzz zz zzzz zz <I>zzzzz</I> zzz zzz zzzz zzzz zz zzz zzzzzzzz-zzzzzzzz</B>
-  ...
-  <B>zzzzz zzzzzz zzz zz zzzzzz zzz zzzzz <I>zzzzzzzz zzzz</I></B>
-  ...}}}
- 
-  This latter view of the content gives a much better chance of successfully performing linguistic tasks such as machine translation, terminology extraction, translation memory matching, spell-checking, or grammar verification.
- 
-  * Modification
- 
-  During linguistic-related processes, inline elements are needed along with the text they mark up so they can be:
- 
-   * modified
-   * deleted
-   * moved around
-   * used as anchor for text alignment
-   * used in text comparison
-   * used to help identify parts of speech
- 
-  Block elements, on the other hand, rarely need to be passed on to the linguistic process.
- 
- Inferring whether an element is inline or block using only the document context is not always enough. For example in the following code:
-  
- {{{<title><em>Special text</em></title>
- <li><p>Less special text</p></li>}}}
- 
- without semantic knowledge of the tags, there is no programmatic way of guessing that <p> is block and <em> is inline.
  
  
- Provisions may also be needed to address the cases where a block element may have the characteristics of an inline element (or conversely). For example, in the following code:
+ 2- A way to distinguish independent text content that is nested within other content from text content that is part of its parent element's content.
  
+ Example: The text in <fn> is distinct from the text of <p>
- {{{<para>Palouse horses<footnote>A Palouse horse is the same as 
- an Appaloosa.</footnote> have spotted coats.</para>}}}
  
- The content of the <footnote> element should be treated as a separate block of text from the content of <para>.
+ {{{<p>Palouse horses<fn callout="#">A Palouse horse is 
+ the same as an Appaloosa.</fn> have spotted coats.</p>}}}
  
+ These are two distinct text runs:
+ 
+  * Palouse horses have spotted coats.
+  * A Palouse horse is the same as an Appaloosa.
+ 
+ Example: The text in <term> is part of the text of <p>
+ 
+ {{{<p><term>Palouse horses</term>
+ have spotted coats.</p>}}}
+ 
+ This is one text run:
+ 
+  * Palouse horses have spotted coats.
+ 
+ A processor should be able to obtain such information from a method, or infer it from the context.
+ 
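As an illustration only (not part of the wiki page or of any ITS mechanism), the two Palouse examples above can be reproduced with a minimal Python sketch. It assumes the processor has been told, by some method, which elements hold independent nested text; the function name `text_runs` and the `nested` parameter are hypothetical, and collapsing whitespace when joining a run is also an assumption made here.

```python
import xml.etree.ElementTree as ET


def text_runs(root, nested=frozenset({"fn"})):
    """Return the text runs found under root, outermost run first.

    `nested` is a hypothetical rule set naming the elements whose
    content is an independent text run (like <fn> in the example).
    Every other child element is treated as part of its parent's run.
    """
    runs = []

    def start_run(elem):
        # Reserve a slot first so a parent run precedes the runs nested in it.
        slot = len(runs)
        runs.append("")
        parts = []
        gather(elem, parts)
        # Collapse line breaks and repeated spaces (an assumption of this sketch).
        runs[slot] = " ".join("".join(parts).split())

    def gather(elem, parts):
        parts.append(elem.text or "")
        for child in elem:
            if child.tag in nested:
                start_run(child)      # independent content: its own run
            else:
                gather(child, parts)  # inline content: same run as the parent
            # Text after the child always belongs to the enclosing run.
            parts.append(child.tail or "")

    start_run(root)
    return [run for run in runs if run]


footnote = ET.fromstring(
    '<p>Palouse horses<fn callout="#">A Palouse horse is \n'
    'the same as an Appaloosa.</fn> have spotted coats.</p>'
)
term = ET.fromstring(
    '<p><term>Palouse horses</term>\nhave spotted coats.</p>'
)

for run in text_runs(footnote):
    print("*", run)
for run in text_runs(term):
    print("*", run)
```

With the default rule set, the first document yields two runs and the second yields one, matching the lists above. Passing an empty `nested` set shows what a processor without that information would do: the footnote text is merged into the middle of the paragraph's sentence.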
Received on Monday, 13 February 2006 22:32:36 UTC
