mlw-lt-track-ISSUE-12 (Segmentation Markup): Create a (Sentence) Segmentation Markup System compatible with the proposed Unicode segmentation characters [MLW-LT Requirements Document]

http://www.w3.org/International/multilingualweb/lt/track/issues/12

Raised by: Arle Lommel
On product: MLW-LT Requirements Document

The Unicode Technical Committee (UTC) is considering a proposal to encode two plain-text characters that would improve the output of UAX #29-based sentence segmentation by letting text carry override characters that correct the default results. For example, given the following string:

“Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment.”

UAX #29 would incorrectly treat it as four segments:

1. Mrs.
2. Smith and Mr.
3. Jones ate lunch at Mme.
4. Flaubert’s apartment.
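To make the failure concrete, here is a minimal sketch in Python. It is not a UAX #29 implementation; it assumes a single simplified rule (break after a full stop that is followed by whitespace and an uppercase letter), which is enough to reproduce the four segments above. The name naive_segments is hypothetical.

```python
import re

# Simplified stand-in for the UAX #29 sentence-break rules (an assumption,
# not the real algorithm): break after a full stop that is followed by
# whitespace and an uppercase ASCII letter.
_BREAK = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def naive_segments(text):
    """Split text into segments using the simplified rule above."""
    return [part for part in _BREAK.split(text) if part]


example = "Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment."
for number, segment in enumerate(naive_segments(example), 1):
    print(f"{number}. {segment}")
```

Every abbreviation that ends in a full stop and precedes a capitalized word triggers a break, which is the behavior the proposed characters are meant to correct.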

The introduction of a SENTENCE JOINER (SJ) and a corresponding SENTENCE NON-JOINER (SNJ) would allow processes to explicitly override UAX #29 behavior, e.g., by rendering the example as:

“Mrs.<SJ> Smith and Mr.<SJ> Jones ate lunch at Mme.<SJ> Flaubert’s apartment.”

Here each SENTENCE JOINER (<SJ>) overrides the default UAX #29 rule, so the whole string is treated as a single segment.
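A rough sketch of how such an override could work, again in Python. The proposed characters are not encoded, so this assumes two private-use code points (U+E000 and U+E001) as hypothetical stand-ins, and it reuses a simplified break rule rather than the real UAX #29 rules. Treating the non-joiner as a forced break is also my assumption.

```python
import re

# Hypothetical stand-ins: the proposed characters have no code points yet,
# so private-use code points are used purely for illustration.
SJ = "\uE000"   # SENTENCE JOINER: suppress a break the default rule would make
SNJ = "\uE001"  # SENTENCE NON-JOINER: assumed here to force a break

# Break at an explicit SNJ, or (simplified default rule, not real UAX #29)
# after a full stop followed by whitespace and an uppercase letter.
# An SJ placed right after the full stop sits between "." and the whitespace,
# so the default rule no longer matches there.
_BREAK = re.compile(rf"{SNJ}\s*|(?<=\.)\s+(?=[A-Z])")


def segments(text):
    """Split text into segments, honoring the override characters."""
    return [part.replace(SJ, "") for part in _BREAK.split(text) if part]


joined = f"Mrs.{SJ} Smith and Mr.{SJ} Jones ate lunch at Mme.{SJ} Flaubert's apartment."
print(segments(joined))

forced = f"It ended at 5 p.m.{SNJ} nobody noticed."
print(segments(forced))
```

With the joiners in place the example comes back as one segment, and the non-joiner produces a break where the default rule would not have made one.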

The UTC requested that, if this proposal moves forward, “somebody” also work on a standard markup-oriented equivalent.

I see two ways to handle this request (interpreting it pretty broadly):

==1. Directly compatible with the character model==

Add two empty elements, e.g., <sj/> and <snj/>, that can substitute directly for the proposed characters.

Pros: directly equivalent to UTC proposal. Light-weight and minimally intrusive in the document structure. Can be added only where needed, making implementation simple.

Cons: limited utility for addressing individual segments (elements are not text-containing nodes); embeds process-dependent information in the document (i.e., the segmentation characters are useful only in a UAX #29-compliant process and segmentation thus requires continual reprocessing).
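As an illustration of the "direct substitution" idea, the following Python sketch maps the empty elements to characters and back. The element names come from the example above; the private-use code points and the function names are hypothetical placeholders, since the characters themselves are not yet encoded.

```python
import re

# Hypothetical stand-ins for the proposed (not yet encoded) characters.
SJ = "\uE000"
SNJ = "\uE001"

# Matches <sj/> and <snj/>, allowing optional whitespace before the slash.
_TAGS = re.compile(r"<(sj|snj)\s*/>", re.IGNORECASE)


def markup_to_chars(text):
    """Replace <sj/> and <snj/> elements with the stand-in characters."""
    return _TAGS.sub(lambda m: SJ if m.group(1).lower() == "sj" else SNJ, text)


def chars_to_markup(text):
    """Replace the stand-in characters with <sj/> and <snj/> elements."""
    return text.replace(SJ, "<sj/>").replace(SNJ, "<snj/>")


marked_up = "Mrs.<sj/> Smith and Mr.<sj/> Jones ate lunch."
plain = markup_to_chars(marked_up)
print(chars_to_markup(plain) == marked_up)
```

Because the mapping is one-to-one, the markup carries no more information than the characters would, which is both the appeal and the limitation of this option.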


==2. Segments as text nodes.==

Add a new non-empty element, <segment>. For example, in HTML5:

<p><segment>Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment. </segment><segment>They had filet of herring and boiled potatoes with a cream sauce.</segment></p>

Pros: allows referencing of contents directly. Equivalent to <span> but with defined semantics. Definite boundaries to segments that do not require reprocessing. No dependence on UAX #29. More flexible than option 1.

Cons: not equivalent to the UTC proposal. Would really be best implemented as an element in HTML5, which will be tougher to get adopted than an attribute. Heavier than the alternative, and it requires explicit marking of all boundaries because it cannot fall back on UAX #29; it is an all-or-nothing model.

Note: This could be implemented using <span> plus attributes:

<p><span type="segment">Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment. </span><span type="segment">They had filet of herring and boiled potatoes with a cream sauce.</span></p>
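To show what "addressable" segments buy us, here is a minimal Python sketch that pulls the segments out of either form using the standard library's html.parser. The class and function names are hypothetical, and the sketch assumes segments are not nested inside one another.

```python
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect the text of <segment> and <span type="segment"> elements."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._open = []    # stack of (tag, is_segment) for open elements
        self._buf = None   # text pieces of the segment being read, if any

    def handle_starttag(self, tag, attrs):
        is_segment = tag == "segment" or (
            tag == "span" and dict(attrs).get("type") == "segment"
        )
        self._open.append((tag, is_segment))
        if is_segment:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self._open]:
            return  # stray end tag: ignore it
        # Pop back to the matching start tag (skips unclosed void elements).
        while self._open:
            open_tag, is_segment = self._open.pop()
            if open_tag == tag:
                if is_segment and self._buf is not None:
                    self.segments.append("".join(self._buf).strip())
                    self._buf = None
                break


def extract_segments(html):
    """Return the text of each marked segment, in document order."""
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


doc = (
    '<p><span type="segment">Mrs. Smith and Mr. Jones ate lunch. </span>'
    "<segment>They had filet of herring.</segment></p>"
)
print(extract_segments(doc))
```

No segmentation rules are consulted at all here: the boundaries are whatever the markup says they are, which is the point of option 2.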



I do not have a sense as to which is better. These really serve different needs. Option 1 provides a fix for UAX #29 and matches the proposal there. Option 2 is better where segments need to be addressable in the document. Perhaps using <span> elements already solves the problem of addressability, in which case option 1 may be attractive.

We need to discuss this topic further to determine what the use requirements are.

Received on Wednesday, 9 May 2012 12:50:33 UTC