Very, very, very nice! from Felix Sasaki on 2005-10-07 (public-i18n-its@w3.org from October to December 2005)

From: Felix Sasaki <fsasaki@w3.org>
Date: Fri, 07 Oct 2005 11:29:16 +0900
To: "public-i18n-its@w3.org" <public-i18n-its@w3.org>
Message-ID: <op.sx87u2bbx1753t@ibm-60d333fc0ec.mag.keio.ac.jp>
Dear all,

I'm very sorry that this message comes so late, but I'm very happy that I  
can write it. We got some very detailed feedback from Andres Vega who is  
working for Tektrans about the ITS requirements working draft. I would  
like to give him feedback about our opinion, so could we talk about this  
at the next teleconf? Everybody, please read this and be prepared to talk  
about it.

Best,

Felix

------- Forwarded message -------
From: "Andres Vega" <av@tektrans.com>
To: "Felix Sasaki" <fsasaki@w3.org>
Cc:
Subject: RE: Comments on the I18N ITS Requirements Working Draft
Date: Sat, 01 Oct 2005 04:27:30 +0900

Hello Felix

Here are my comments regarding the proposal at  
http://www.w3.org/TR/2005/WD-itsreq-20050805/

Usage Scenarios
2.1 Content Authoring
For this case I would recommend the use of a tag attribute (e.g. LOCALIZE)
that could be applied at any level, very much like the LANG attribute. It  
could default to LOCALIZE=YES thus being omitted most of the time. To mark  
a specific section as not to be changed it should have the specific value  
LOCALIZE=NO. Any other value will provide localization specific  
information. The attribute would be inserted at two different stages:  
during content authoring (probably only the LOCALIZE=NO value most of the  
time) and at the I18N or L10N stage (when the informative values are more  
likely to be added).
The attribute should be read by any localization tool so as to block any  
section marked as NO and to allow localization for any other value, while  
displaying it as an informative reference to the translator.
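
A minimal sketch of how a localization tool might resolve the attribute, assuming it is inherited down the element tree the way LANG is and defaults to "yes". The lowercase attribute name `localize` follows the examples later in this message; the element names (`doc`, `section`, `para`) and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DEFAULT = "yes"  # assumed default when no ancestor carries the attribute


def effective_localize(root):
    """Map each element to its inherited `localize` value.

    An element without its own attribute takes the value of the nearest
    ancestor that has one, falling back to DEFAULT at the root.
    """
    result = {}

    def walk(elem, inherited):
        value = elem.get("localize", inherited)
        result[elem] = value
        for child in elem:
            walk(child, value)

    walk(root, DEFAULT)
    return result


def is_locked(value):
    """Only the explicit value "no" blocks localization."""
    return value.lower() == "no"


doc = ET.fromstring(
    '<doc>'
    '<para id="p1">Translate me.</para>'
    '<section localize="no">'
    '<para id="p2">Keep as is.</para>'
    '<para id="p3" localize="UI label, keep short">Open</para>'
    '</section>'
    '</doc>'
)

values = effective_localize(doc)
for elem in doc.iter("para"):
    value = values[elem]
    state = "locked" if is_locked(value) else "open"
    print(elem.get("id"), state, "-", value)
```

Any value other than "no" leaves the element open for localization and can be shown to the translator as a note, which is how `p3` is treated here even though its parent section is locked.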

An issue arises with attribute values that themselves contain text that
should be marked as localizable or not (analogous to the HTML image ALT
attribute). These cases would probably still need to be treated
differently (e.g. through a schema or templates).

2.2 Terminology
In this case a tag could possibly be defined to enclose the term (e.g.
<Term>XXX</Term>). Attributes could be used to link the term with an
external source (a glossary or terminology database) that would provide
all the term-specific information needed. During authoring that
information may or may not be updated; in the latter case both terms and
glossaries could be semi-automatically updated by the terminology owner,
prior to document localization.
Another approach could be to make use of the LOCALIZE attribute. This would
be combined with the use of ID attributes, and would allow marking any  
element as a term without excess marking. See comments on 3.7 further  
below.

2.3 Software development
A set of tag attributes seems appropriate in this case, such as <Span  
SizeLimit=15 SizeUnits=Bytes/Characters/Pixels...
The encoding could possibly be addressed separately, by using a tag  
attribute (ENCODING or maybe CHARSET) probably at document level.

Example 1 would appear as:
<string id="s123" SizeLimit="15"  
SizeUnits="Characters">Printing...</string>
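
A tool could check a translation against these attributes roughly as follows. This is only a sketch: the function names are hypothetical, "Characters" is assumed as the default unit, and "Pixels" is left unsupported because it would need font metrics.

```python
import xml.etree.ElementTree as ET


def measured_size(text, units, encoding="utf-8"):
    """Measure text in the unit named by SizeUnits."""
    units = units.lower()
    if units == "characters":
        return len(text)  # Unicode code points, not grapheme clusters
    if units == "bytes":
        return len(text.encode(encoding))
    raise ValueError(f"unsupported SizeUnits: {units}")


def fits(elem, translation, encoding="utf-8"):
    """True if the translation respects the element's size limit."""
    limit = elem.get("SizeLimit")
    if limit is None:
        return True  # no constraint declared
    units = elem.get("SizeUnits", "Characters")  # assumed default
    return measured_size(translation, units, encoding) <= int(limit)


source = ET.fromstring(
    '<string id="s123" SizeLimit="15" SizeUnits="Characters">Printing...</string>'
)

print(fits(source, "Drucken..."))              # 10 characters
print(fits(source, "Impression en cours..."))  # 22 characters
```

The byte count depends on the encoding, which is one reason the encoding would need to be declared somewhere, as suggested above.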

...

3.2.1 Challenges

Example 5 would imply very good I18N by integrating software and  
documentation to use the same localization resource bundles. While this is  
probably the best scenario, it is not the most likely one. I would
consider Example 4 the one more likely to be needed. Following the  
LOCALIZE attribute terminology it would appear as:

The Java statement <code><span  
localize="no">System.out.println("</span>Hello world!<span  
localize="no">");</span></code> prints the text...
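
A sketch of how a tool might pull only the translatable text out of this markup, assuming the sentence is wrapped in a hypothetical <p> element and that the attribute is inherited by child elements. The function name is an assumption as well.

```python
import xml.etree.ElementTree as ET


def translatable_runs(elem, inherited="yes"):
    """Yield text runs that are not inside a localize="no" scope."""
    value = elem.get("localize", inherited)
    open_here = value.lower() != "no"
    if open_here and elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from translatable_runs(child, value)
        # A child's tail text belongs to this element's scope, not the child's.
        if open_here and child.tail and child.tail.strip():
            yield child.tail.strip()


para = ET.fromstring(
    '<p>The Java statement <code><span localize="no">System.out.println("</span>'
    'Hello world!<span localize="no">");</span></code> prints the text...</p>'
)

print(list(translatable_runs(para)))
```

The two locked spans are skipped, while "Hello world!" is still offered for translation because it sits between them in the scope of <code>, which is not locked.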

...

3.4 Unique Identifier

I may be a bit Trados-biased about this section, as that is the tool we
use most often.

It is true that TM techniques lacked context orientation in the past, but
now they provide some contextual techniques (e.g. Xtranslate) that take
into account not only the specific sentence to be translated but also the
previous and following sentences.

Other tools, such as Content Management Systems, allow storing information  
in small elements that can be identified and reused from one document to  
another. These systems might be combined with the use of an ID attribute to
allow for easy reuse of localized content. However, one issue that often
appears with a CMS is that either the content is split into very small
units in order to allow more reuse (which increases the complexity of
administering the CMS), or it is defined using bigger units, which has the
added problem that markup is more likely to appear inside a unit and may
need to differ between output formats. If such is the case, those
differences may cause change analysis tools to be unable to recognize the  
units as equivalent, further reducing the reusability of the localized  
content.

Nevertheless, the possibility to define a unique identifier to any item  
opens many other possibilities and is in itself advisable. (For example to  
identify terms as suggested above)

3.5 Handling of Entities.

From my experience, it is best not to use entities (or variables in other
contexts) that are smaller than a sentence and bigger than a character
unit. For the reasons you already point out, it is very likely that the
documentation author does not foresee syntactic or gender/number/case
considerations of languages other than the one the documentation
is written in. The use of sentence-sized entities, on the other hand, is
recommended, especially if they can be linked to software resources.

3.6 Identifying Language/Locale

Not much to add here. Maybe there should be separate identifiers for
Language/Locale and Script, as this could avoid diachronic issues
(languages that have changed the script in which they are written recently
enough for electronic documents to exist in both; scripts coexisting for
the same language and locale, as in the Azerbaijan sample mentioned in 3.9, ...).


3.7 Identifying terms

As stated above (2.2) I agree with the need to link terms to a Terminology  
Database that provides for most of the required attributes. Term  
identification could be done at the Authoring stage, thus defining the  
terms that will populate/update the TD; at a later stage terminologists  
could develop the needed content for each specific term.

Term specification could make use of the LOCALIZE attribute, along with  
the ID attribute. This would imply that every term would have to be  
localized (which is not necessarily a bad approach, as this would give its  
localization control to the terminologist). This would also allow marking  
any element as a term without excess marking. If more than one Terminology  
Database is needed, the values of the LOCALIZE attribute could be changed  
accordingly.

Regarding indexing, index entries would probably need their own separate
treatment (e.g. an <Index> identifier). If the index entry is itself a
term, then format and sorting specifics could be addressed by a
combination of the default LANG attribute of the section and
two INDEX-specific attributes indicating display and phonetics. For example,
<index id="jk07" localize="term" indexlevel="Sorting:index"
sortstr="sorting:index">Index sorting</index>
would both define the index entry to be displayed as:

Sorting,
   index

And sorted using the "sorting:index" string (or any other phonetic
string); it would also identify "Index sorting" as a term to be stored in
the TD with a unique id ("jk07"). At the same time it would be implicit
that the term is translatable content.
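
A sketch of how such an entry might be read by an indexing tool. It assumes ":" separates the nesting levels in indexlevel and that localize="term" is the value that flags a term, as in the sample; the function names and the returned dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_index_entry(elem):
    """Split an <index> element into display levels, sort key and term."""
    term = (elem.text or "").strip()
    # "Sorting:index" -> ["Sorting", "index"]; fall back to the term itself.
    levels = elem.get("indexlevel", term).split(":")
    sort_key = elem.get("sortstr", term).lower()
    return {
        "id": elem.get("id"),
        "term": term,
        "levels": levels,
        "sort_key": sort_key,
        "is_term": elem.get("localize") == "term",
    }


def render(levels, indent="   "):
    """Render nested levels one per line, with a trailing comma on parents."""
    lines = []
    for depth, level in enumerate(levels):
        suffix = "," if depth < len(levels) - 1 else ""
        lines.append(indent * depth + level + suffix)
    return "\n".join(lines)


entry = parse_index_entry(ET.fromstring(
    '<index id="jk07" localize="term" indexlevel="Sorting:index" '
    'sortstr="sorting:index">Index sorting</index>'
))

print(render(entry["levels"]))
print(entry["sort_key"], entry["id"], entry["is_term"])
```

The sort key is kept separate from the displayed text, so a phonetic string could be substituted for languages where display order and sort order differ.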

3.8 Purpose Specification/Mapping

This specification seems a bit ambitious to me. Although I see its  
application, I also see the complexity of mapping all source specific  
attributes. Whenever possible I would rather make use of attributes that  
can have local specific values that can be defaulted to a generic value
(as with the LOCALIZE attribute).

The mapping technique could make good use of this and also allow for  
introducing or updating markup at a later stage away from authoring.

3.9 Cultural aspects

Regarding orthography I would make use of a SCRIPT attribute (possibly
defaulting to the most widespread script if missing).

Regarding other cultural, dialectal or stylistic variations I would
recommend making use of the LOCALIZE attribute at a document or paragraph
level.

3.11 Bidirectional text support.
This is fairly standard already. A SCRIPT attribute might interfere
with it, or it may be complementary; I should think more about it.

3.12 Translatability

I think this would be covered by a LOCALIZE attribute.
Rather than allow other tags to carry implicit information on
translatability, I would prefer to postprocess already authored documents
at the I18N or L10N stage, adding the appropriate LOCALIZE attributes where
needed. This also applies to 3.14 Limited impact.
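
Such a post-processing pass might look like the sketch below. The rule set (`LOCKED_TAGS`), the element names and the function name are hypothetical; a real pass would take its rules from whoever prepares the document for localization.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: element names a localization engineer decides to lock.
LOCKED_TAGS = {"code", "filename"}


def add_localize(root, locked_tags=LOCKED_TAGS):
    """Add localize="no" to listed elements that lack an explicit value."""
    changed = 0
    for elem in root.iter():
        if elem.tag in locked_tags and "localize" not in elem.attrib:
            elem.set("localize", "no")
            changed += 1
    return changed


doc = ET.fromstring(
    '<doc><p>Run <code>make install</code> and edit '
    '<filename localize="yes">README</filename>.</p></doc>'
)

count = add_localize(doc)
print(count)
print(ET.tostring(doc, encoding="unicode"))
```

Values set explicitly during authoring are left alone, so the pass only fills in what the author did not mark.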

I hope some of these suggestions are of help.
I would appreciate your comments.

Best regards.
Andrés Vega Muñoz
Localisation Engineer
Tek Translation International
Tel:  + 34 91 414 4434
Fax: + 34 91 414 4444
OneWorld Localization Center
www.tektrans.com

-----Original Message-----
From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: 26 September 2005 05:24
To: Andres Vega
Cc: Richard Ishida
Subject: Comments on the I18N ITS Requirements Working Draft

Dear Andres,

This is Felix Sasaki from the i18n activity of W3C [1]. We met at the  
Unicode conference in Florida. I hope you had a safe trip back and are
doing well.
At the conference you showed some interest in the work of the ITS Working  
Group, after the presentation from Richard and me. I was wondering if you  
had time to take a look at the working draft on the topic [2] which our  
working group published in August. Any comment or suggestion from you
would be very welcome.
Looking forward to hearing from you & with best regards,

Felix Sasaki

[1] http://www.w3.org/International/
[2] http://www.w3.org/TR/2005/WD-itsreq-20050805/
Received on Friday, 7 October 2005 02:29:27 UTC