Tim's comments on Requirements for localisable DTD design

Hi Folks,

I promised this at the first call, so here goes : my thoughts on
the draft document of the requirements for localisable DTD
design that Richard and Yves put together (I'm referring to the
7th July Working Draft) at :

http://people.w3.org/rishida/localizable-dtds/

General comments - 
I think this document is spot-on in most cases. I think it captures
the problems that are frequently encountered by folks trying to
translate XML documents, and its adoption would certainly help
there.

I'm guessing that the target audience for this document is people
either writing XML DTDs, or teams producing content. In particular
for the latter group, it's important that any requirements that
we put on them should be as easy to implement as possible, and
shouldn't put too much of a strain on any decent authoring tool.
(Oh, just re-read the document - you mention this already,
excellent)

I'll go through the sections I have thoughts about, and will refer
to their section number and title below :

2.2 Direct identification of content that should not be localised

Could I expand on this and ask for :

 * identification of content that should not be word-counted
 * identification of content that should not be segmented

- in our generic XML, HTML and SGML Docbook XLIFF filters, we've
needed these extra identifiers in order to spot particular 
types of text that could appear within translatable sections, but 
which could trip up either segmentation or wordcounting algorithms.

Such text should still be translated, just with care, or perhaps
special skills :

eg. 
<para>This is a <filename>com.sun.java.foo</filename> java package.</para>
<para>
<programlisting>
public class Tim {
  public static void main (String[] args){
    System.out.println("Hello World!");
  }
}
</programlisting>
</para>

 - we'd want to wordcount 5 words (7 if we had a means to <span> 
around "Hello World!" - but we don't (yet!)) but specifically,
protect the contents of the programlisting and the filename
from being passed through a simple segmentation algorithm, 
which (assuming sentence-level segmentation) could create some
pretty weird segments otherwise.
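
As a rough illustration of the word-count side of this, here's a minimal
Python sketch (not how our filters actually do it). The PROTECTED set and
the count_words helper are hypothetical names made up for the example; it
simply skips the content of protected elements while still counting the
text around them :

```python
import xml.etree.ElementTree as ET

# Hypothetical list of elements whose content is shielded from word counting
PROTECTED = {"filename", "programlisting"}


def count_words(elem):
    """Count whitespace-separated words, skipping protected elements."""
    if elem.tag in PROTECTED:
        return 0
    total = len((elem.text or "").split())
    for child in elem:
        total += count_words(child)
        # Tail text follows the child but belongs to this element
        total += len((child.tail or "").split())
    return total


# The example from this mail, wrapped in a made-up <doc> root element
SAMPLE = """<doc>
<para>This is a <filename>com.sun.java.foo</filename> java package.</para>
<para>
<programlisting>
public class Tim {
  public static void main (String[] args){
    System.out.println("Hello World!");
  }
}
</programlisting>
</para>
</doc>"""

print(count_words(ET.fromstring(SAMPLE)))
```

On the example above this gives the 5 words mentioned. A segmenter could
consult the same set before trying to split sentences.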


2.7 Emphasis & document conventions

I get the intention of this, but I think it's important not to
make XML-ITS into a DTP-type application - I don't think we
should specify these under the ITS namespace (where do you
stop ? eg. We provide <importance>, <irony>, <sarcasm>, but not
<mildannoyance>, <subtlehumour>, etc.)

Along with the strict tag-set in ITS, are we planning on having
a "Suggestions to DTD authors", so that things that don't directly
fall inside the tag-set can still be mentioned for consideration ?


2.11 Declaring the language of the content

Multi-lingual XML documents are usually a pain in the neck for
translators to deal with : typically they'd have to split a document
up into mono-lingual bites, translate each section (presumably by
several different translators) and then recombine the document.
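
To give an idea of the splitting step, here's a small Python sketch that
buckets text by its xml:lang value, inheriting the language from the parent
element when none is declared. The split_by_language name, the "und"
default and the sample document are all assumptions for illustration; a
real tool would also need to keep enough structure to recombine things :

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the predefined XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def split_by_language(root, default_lang="und"):
    """Group text chunks by the xml:lang in effect where they appear."""
    buckets = defaultdict(list)

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        text = (elem.text or "").strip()
        if text:
            buckets[lang].append(text)
        for child in elem:
            walk(child, lang)
            # Tail text sits after the child, in this element's language
            tail = (child.tail or "").strip()
            if tail:
                buckets[lang].append(tail)

    walk(root, default_lang)
    return dict(buckets)


# Made-up multilingual document
SAMPLE = """<doc xml:lang="en">
<para>Hello</para>
<para xml:lang="fr">Bonjour</para>
<para xml:lang="de">Hallo</para>
</doc>"""

print(split_by_language(ET.fromstring(SAMPLE)))
```

Even in this toy form, each bucket then has to go to a different
translator and be merged back afterwards, which is the overhead I'm
complaining about.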

At Sun, where possible, we always try to keep language resources
separate, so the user can easily install a new language package at
runtime : multilingual resource files make this extremely complex
for installer programs...

Now, of course, it's good to have a way of marking up multi-lingual
documents in terms of providing some way in which you can display
the separate elements. I guess I'm suggesting "if you're going to be
publishing XML documents, and the source text is mostly in one language,
then please don't combine the translations in with the source document".

I've come across XML documents where we have to add elements in order
to provide translations, and it's a real pain.


2.12 Describing other cultural aspects of the content

Similar comments to the above : I'm not against multi-lingual
documents, just so long as they're done properly (and when necessary)
but in general, I'd really ask people if they really need all
translations in a single file.


2.13 Citations

Is this outside the scope of XML-ITS ? Shouldn't entity declarations
in the XML document be used to do this ?


2.14 References to UI messages in Documentation

Yep, by all means provide clues that such a string might be a UI
message (eg. Docbook's <computeroutput> or <screen> tags do this at the
moment - clues which we use in our TM system to aid segmentation)...
but I don't know if I'd call-out message strings directly -- the XML
document would be illegible without the message resource file being
available. It might be better to just mark up the section, and let
the TM system fill in the translation.


2.16 Infinite Naming Scheme

Yay !! (I've seen this problem in the wild too)


2.17 Allowed Characters

How can you enforce this ? Isn't it up to the content authoring
tool to do this job ? Are there hooks that already provide such
functionality ? (eg. notes to translators imploring them to use
only ASCII, or strings of a certain length ?)
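
For what it's worth, if the constraint were at least declared somewhere, a
translation tool could check it after the fact rather than relying on notes.
A minimal Python sketch, where check_translation and its ascii_only and
max_length parameters are purely hypothetical (nothing in the draft defines
them) :

```python
def check_translation(text, ascii_only=False, max_length=None):
    """Return a list of constraint violations for a translated string."""
    problems = []
    if ascii_only and not text.isascii():
        problems.append("contains non-ASCII characters")
    if max_length is not None and len(text) > max_length:
        problems.append(f"longer than {max_length} characters")
    return problems


# An empty list means the string passed both checks
print(check_translation("Cancel", ascii_only=True, max_length=10))
```

That only reports problems though, it doesn't stop anyone typing them in,
so the authoring tool question still stands.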


2.18 Term identification

Yep, this is good : who decides what a Term is though ? This
sounds like a bit of repetition though - wasn't 


2.24 Support for localisable resource data

Do you mean the stuff that the Mozilla folks have done with
the way they translate .dtd files for the UI ?


Anyway, that's all for now - sorry for going on so long :-)

	cheers,
			tim

-- 
Tim Foster - Tools Engineer, Software Globalisation
http://sunweb.ireland/~timf http://blogs.sun.com/timf
http://www.netsoc.ucd.ie/~timf

Received on Tuesday, 22 February 2005 16:00:28 UTC