
ISSUE 12: segmentation marker

From: Yves Savourel <ysavourel@enlaso.com>
Date: Fri, 1 Jun 2012 10:57:37 +0200
To: <public-multilingualweb-lt@w3.org>
Message-ID: <assp.04998f0c70.assp.0499faa413.004f01cd3fd4$9850ef30$c8f2cd90$@com>
Hi all,

I took the action item to summarize the XLIFF TC's discussion of segmentation markers, so that everyone in the MLW-LT WG has some general background.

The TC has come up with a representation for segments in XLIFF 2.0. One item remains open, however: whether to provide extra information on inline codes in unsegmented entries, in order to help segmentation engines.

For example, in the following:

<source>Some text<ph id='1'/>Some other text</source>

The inline code <ph/> could represent an HTML <BR> element. If that information were available in a standardized way, a tool could use it to segment the text.
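
To make the idea concrete, here is a minimal Python sketch of how a tool might split the example above once it knows which inline codes act as breaks. The function name split_on_breaking_codes and the breaking_ids set are hypothetical: XLIFF does not define how this information would be carried, which is exactly the open item.

```python
import xml.etree.ElementTree as ET


def split_on_breaking_codes(source_xml, breaking_ids):
    """Split the content of a <source> element at <ph/> codes whose id
    is listed in breaking_ids (an assumed, out-of-band set of ids)."""
    source = ET.fromstring(source_xml)
    segments = []
    current = source.text or ""
    for child in source:
        if child.tag == "ph" and child.get("id") in breaking_ids:
            # The code is known to be a break: close the current segment.
            segments.append(current)
            current = ""
        else:
            # Any other inline code stays in the segment as a placeholder.
            current += "{%s}" % child.get("id")
        current += child.tail or ""
    segments.append(current)
    # Drop empty segments (for example, two breaks in a row).
    return [s for s in segments if s]


example = "<source>Some text<ph id='1'/>Some other text</source>"
print(split_on_breaking_codes(example, {"1"}))
print(split_on_breaking_codes(example, set()))
```

With id 1 flagged as a break the entry yields two segments; without that information the code is just an opaque placeholder and the entry stays in one piece.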

There has also been some discussion of the proposal by ULI (the Unicode Localization Interoperability technical committee) for two special Unicode characters: SENTENCE JOINER and SENTENCE NON JOINER.

You can see a summary of the proposal here:
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012May/0081.html

The proposal has received mixed feedback from the UTC and the W3C i18n core WG, and is currently suspended.

How does this relate to ITS 2.0?

There have been some discussions about providing segmentation-related information that consumer tools could use, for example a (temporary?) <span> element of some sort.

Another possibility is a data category that indicates whether an inline code should be treated as a segment breaker. For example, the <br/> element in HTML, or a <break/> element in some XML format, could be associated with a rule indicating that it should be treated as a potential segment break indicator.
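
As a rough illustration of the rule-based option, the Python sketch below treats a set of element names as segment breakers while flattening a fragment of content. The breaker_tags set stands in for whatever rule such a data category would provide; it is an assumption for illustration, not ITS 2.0 syntax, and segment_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def segment_text(xml_fragment, breaker_tags):
    """Flatten the text of an XML fragment, starting a new segment at
    every element whose name is in breaker_tags (assumed rule input)."""
    root = ET.fromstring(xml_fragment)
    segments = [""]

    def walk(elem):
        segments[-1] += elem.text or ""
        for child in elem:
            if child.tag in breaker_tags:
                # Breaker element: start a new segment. Any content of
                # the breaker itself is ignored in this sketch.
                segments.append("")
            else:
                # Ordinary inline element: its text stays in the segment.
                walk(child)
            segments[-1] += child.tail or ""

    walk(root)
    return [s.strip() for s in segments if s.strip()]


fragment = "<p>First line<br/>Second <b>bold</b> line<break/>Third</p>"
print(segment_text(fragment, {"br", "break"}))
```

Here <b> is not listed as a breaker, so its text stays inside the second segment, while <br/> and <break/> each start a new one.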


I hope this helps.
Cheers,
-yves
Received on Friday, 1 June 2012 08:58:03 UTC
