Re: Translation Memory (TM) and text-transform from Chris Lilley on 2003-10-22 (www-international@w3.org from October to December 2003)

From: Chris Lilley <chris@w3.org>
Date: Wed, 22 Oct 2003 16:30:10 +0200
To: "Richard Ishida" <ishida@w3.org>
Cc: www-international@w3.org, www-style@w3.org
Message-ID: <1585009573.20031022163010@w3.org>
On Wednesday, October 22, 2003, 2:58:51 PM, Richard wrote:


RI> See below a transcript of a mail exchange between myself and François
RI> Richard (top to bottom order).


RI> Francois wrote:
RI> I have been looking around for more info on the CSS 'text-transform',
RI> its purpose and  usage. I have the feeling that it might make the
RI> processing of text more complex since it actually transforms characters.

It doesn't transform characters; it changes only how they are
rendered, and is thus designed to make text processing in general
(including use of TM) *more* efficient.

Consider a page style where the major title is in all capitals, first
level subheadings have initial caps, and body text is lower case
except for required capitalisation.

The straightforward, but wrong, way to do this is to change the
characters:

<major-title>THE EFFECT OF CHARACTER MANIPULATION ON TRANSLATION
MEMORY</major-title>
<subhead>The Effect of Character Manipulation on Translation
Memory</subhead>
<para>Manipulation of characters can have a negative impact on the
efficiency of Translation Memory, in the same way that multiple URIs
for the same resource have a negative effect on Web proxy cache
efficiency ...</para>

Additional variations are possible if some sections (eg, the first two
lines of the first paragraph after a subhead) are in small caps,
depending on whether your smallcaps font puts those glyphs on upper
case, lower case, or - as is usual - both cases (in which case the
FolLoWing tEXt wOUld disPLaY just fine).


The correct way to do this is to separate the stylability (and
restylability) of the text from the content of the text:

<major-title>The effect of character manipulation on Translation
Memory</major-title>
<subhead>The effect of character manipulation on Translation
Memory</subhead>
<para>Manipulation of characters can have a negative impact on the
efficiency of Translation Memory, in the same way that multiple URIs
for the same resource have a negative effect on Web proxy cache
efficiency ...</para>

This will, with two lines of CSS, display identically to the first
example. However, by using a consistent capitalisation throughout the
text, the efficiency of Translation Memory is improved. Restylability
(once the designers decide in two years' time that capitalized headings
are *so* 2003) is also enhanced, as the new style requires a one-line
change in site.css rather than multiple line changes in all of the
content.
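
As a minimal sketch of what those two lines might look like (assuming
the element names from the example above are styled directly with type
selectors, and that the body text needs no rule of its own):

```css
/* Major title: rendered in all capitals; the source text is unchanged */
major-title { text-transform: uppercase }
/* First-level subheading: first letter of each word rendered in upper case */
subhead     { text-transform: capitalize }
```

Note that 'capitalize' uppercases the first letter of every word, so
short words such as "of" and "on" would also get initial caps; the
match with the hand-capitalised subheading is close rather than exact.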

As with all styling (eg, relative and absolute positioning) it's also
possible to make egregious hacks with it, but the intended usage helps,
rather than hinders, translation. So yes, it's possible to have
rAnsOm nOTe cAPiTaliZatIon and then rely on CSS to regularize the
capitalisation, thus totally messing up the Translation Memory; this
does not seem to be at all common, and would be bad practice.

So on balance, text-transform helps much more than it hinders.



RI> Richard's postscript:
RI> François and Yves are expressing concerns that I'm sure will be shared
RI> by a large number of localization folks out there.  I think it is
RI> important to state things clearly in the CSS spec -
RI> http://www.w3.org/TR/CSS21/text.html#propdef-text-transform should
RI> contain a paragraph that clearly spells out that this is only 'smoke and
RI> mirrors'.  That it should not be relied upon to 'make the text look
RI> right', only to apply an alternative styling effect that may not be
RI> desirable or applicable for all languages (eg. German or Turkish).

I agree with this good practice note and support its inclusion, plus a
good practice and bad practice example.

I will also add the examples and discussion from this thread to the
TAG finding on the separation of content and presentation.
Translatability forms a part of this separation that is not often
addressed. Translatability is affected both by the contamination of
content with styling, as above, and by the contamination of styling
with content (especially in XSLT templates, for example).

RI> I also suspect that TM tools might work better if they used case
RI> independent (and even Unicode normalised) matching - possibly comparing
RI> case as a second level differentiator where appropriate (like a sorting
RI> algorithm).  (If you want to respond to this para, maybe just reply to
RI> www-international).

Certainly, Translation Memory is aided by early and consistent
Unicode normalisation. Within an organisation, if a pass is made over
legacy content to remove muddled styling and make the content well
formed, then also normalizing it before committing the revised
files back into the repository would yield benefits in TM efficiency
by reducing false negatives.

-- 
 Chris                            mailto:chris@w3.org
Received on Wednesday, 22 October 2003 11:07:45 UTC