[ACTION-499]: Check normalization in XLIFF2 from Yves Savourel on 2013-04-24 (public-multilingualweb-lt@w3.org from April 2013)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Wed, 24 Apr 2013 09:07:42 -0600
To: <public-multilingualweb-lt@w3.org>
Message-ID: <000101ce40fd$783f4900$68bddb00$@com>

I had ACTION-499: Check normalization in XLIFF2 and see whether it relates to ITS2:

In XLIFF 2.0 there are currently three places where Unicode normalization is mentioned:

=== 1) Content Comparison (in the core)

This section discusses how to compare content (for example, what is considered the 'same' content).
Unicode normalization is part of the more general normalization described here:

This specification defines two types of content equality:
• Equality type A: Two contents are equal if their normalized forms are equal.
• Equality type B: Two contents are equal if, in their normalized forms and with all inline code markers
replaced by the value of their equiv attributes, the resulting strings are equal.
A content is normalized when:
• The text nodes are in Unicode Normalized Form C defined in the Unicode Annex #15: Unicode
Normalization Forms [LDML].
• All annotation markers are removed.
• All pairs of <sc> and <ec> elements that can be converted into a <pc> element, are converted.
• All adjacent text nodes are merged into a single text node.
• For all the text nodes with the white space property set to default, all adjacent white spaces are
collapsed into a single space.
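
To make the text-level part of this concrete, here is a minimal sketch in Python. It assumes the content is plain text only, so the steps about annotation markers, <sc>/<ec> pairs and merging text nodes are not modelled, and it assumes "white space" means the XML white space characters (space, tab, CR, LF). The function names are hypothetical and not taken from XLIFF.

```python
import re
import unicodedata


def normalize_text(text: str, collapse_whitespace: bool = True) -> str:
    """NFC-normalize a text node and optionally collapse white space runs."""
    text = unicodedata.normalize("NFC", text)
    if collapse_whitespace:
        # Assumption: only the XML white space characters are collapsed.
        text = re.sub(r"[ \t\r\n]+", " ", text)
    return text


def equal_type_a(a: str, b: str) -> bool:
    """Equality type A, restricted to plain text without inline markup."""
    return normalize_text(a) == normalize_text(b)


# Precomposed e-acute (U+00E9) and two spaces between the words.
composed = "caf\u00e9  au lait"
# 'e' followed by COMBINING ACUTE ACCENT (U+0301), and a line break.
decomposed = "cafe\u0301 au\nlait"

print(composed == decomposed)              # raw comparison of the two strings
print(equal_type_a(composed, decomposed))  # comparison after normalization
```

The two strings differ code point by code point, but compare as equal once both are brought to NFC and white space is collapsed.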

=== 2) Size Restriction Module

This module provides ways to encode size restrictions.
It has a <normalization> element with two attributes describing how to normalize the content to process:

- general: This attribute specifies the normalization to apply for general size restrictions. Only the normalization forms C and D as specified by the Unicode Consortium are supported, see Unicode Standard Annex #15 [http://unicode.org/reports/tr15/].

- storage: This attribute specifies the normalization to apply for storage size restrictions. Only the normalization forms C and D as specified by the Unicode Consortium are supported, see Unicode Standard Annex #15.

Both attributes can have the following values:

none = No additional normalization should be done, content should be used as represented in the document. It is possible that other agents have already done some type of normalization when modifying content. This means that this setting could give different results depending on what agent(s) are used to perform a specific action on the document.

nfc = Normalization Form C should be used

nfd = Normalization Form D should be used

Both attributes have the default value 'none'.
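
The choice of value matters because the same visible text has different sizes in the two forms. A minimal sketch, assuming size is counted either in code points (when no encoding is given) or in bytes of a given encoding; how a real size restriction profile counts is not covered here, and measured_size is a hypothetical helper:

```python
import unicodedata


def measured_size(text, normalization="none", encoding=None):
    """Return the size of text after applying 'none', 'nfc' or 'nfd'.

    Assumption: size is the number of code points when encoding is None,
    otherwise the number of bytes in that encoding.
    """
    if normalization not in ("none", "nfc", "nfd"):
        raise ValueError("unsupported normalization: " + normalization)
    if normalization != "none":
        text = unicodedata.normalize(normalization.upper(), text)
    if encoding is None:
        return len(text)
    return len(text.encode(encoding))


text = "\u00e9"  # precomposed e-acute

print(measured_size(text, "none"))            # 1 code point
print(measured_size(text, "nfd"))             # 2 code points: 'e' + U+0301
print(measured_size(text, "nfc", "utf-8"))    # 2 bytes
print(measured_size(text, "nfd", "utf-8"))    # 3 bytes
```

With 'none', the result depends on whichever form the content happens to be in, which is the variability the description of that value warns about.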

=== 3) Validation Module

This module provides rules to perform various types of verification on the content.
The <rule> element has an optional normalization attribute:

- normalization specifies the normalization type to apply when validating a rule. Only the normalization forms C and D as specified by the Unicode Consortium are supported, see Unicode Standard Annex #15.

The values are:

none = No normalization should be done.

nfc = Normalization Form C must be used.

nfd = Normalization Form D must be used.
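
As an illustration of why a rule would carry this attribute, here is a minimal sketch of a hypothetical "text must be present" check. The function names and the rule itself are assumptions made for the example, not the module's actual rule set; the point is only that both the content and the expected text go through the same normalization before being compared.

```python
import unicodedata


def apply_normalization(text: str, normalization: str) -> str:
    """Apply 'none', 'nfc' or 'nfd' to text."""
    if normalization == "none":
        return text
    if normalization in ("nfc", "nfd"):
        return unicodedata.normalize(normalization.upper(), text)
    raise ValueError("unsupported normalization: " + normalization)


def rule_text_is_present(content: str, expected: str,
                         normalization: str = "none") -> bool:
    """Hypothetical rule: expected must occur somewhere in content."""
    return (apply_normalization(expected, normalization)
            in apply_normalization(content, normalization))


# Content stored with combining marks (decomposed).
content = "Cre\u0300me bru\u0302le\u0301e"
# Expected term typed with precomposed characters.
expected = "br\u00fbl\u00e9e"

for mode in ("none", "nfc", "nfd"):
    print(mode, rule_text_is_present(content, expected, mode))
```

With 'none' the check fails even though the two strings look identical on screen; with 'nfc' or 'nfd' it passes.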

=== Relation to ITS2

I'm not familiar enough with normalization forms to know if the ITS recommendation to use a Normalizing Transcoder would have an effect on the possible requirement of XLIFF tools to convert to NFD. That's the only point that I've seen as a potential issue.
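
One way to probe this is to check whether converting to NFC first (which is what a normalizing transcoder is assumed to do here) changes the result of a later conversion to NFD. The sketch below only tests a few sample strings and is not a statement about what ITS or XLIFF tools actually do; the sample list and the function name are illustrative.

```python
import unicodedata


def nfd_after_nfc_matches(text: str) -> bool:
    """Check that NFD(NFC(text)) equals NFD(text) for one string."""
    via_nfc = unicodedata.normalize(
        "NFD", unicodedata.normalize("NFC", text))
    direct = unicodedata.normalize("NFD", text)
    return via_nfc == direct


samples = [
    "e\u0301",   # 'e' + COMBINING ACUTE ACCENT
    "\u00e9",    # precomposed e-acute
    "\u212b",    # ANGSTROM SIGN, canonically equivalent to U+00C5
    "\uac00",    # a precomposed Hangul syllable
]

for s in samples:
    print(ascii(s), nfd_after_nfc_matches(s))
```

For these samples the NFD result is the same whether or not the text went through NFC first. What is not preserved is the original code point sequence: U+212B, for example, comes back from NFC as U+00C5, so a tool using 'none' would see different content after transcoding.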

-ys

Received on Wednesday, 24 April 2013 15:08:16 UTC