Best use of variants

I think the best way to use variants is to limit them to
mechanically generated renditions of some other preexisting 
variant of the document. For example, the original document 
might be a word processing document, and another variant of 
it could be a PostScript rendition of the document. For an 
image, a second variant could be a postage stamp (thumbnail) 
version of the image, or a PDF version of the image.

I don't think that a hand translated version of a document
should normally be stored as a variant. In general, I think 
it is better to store hand translated versions as separate 
documents, for the following reason:

Consider the properties of the document. Properties like
title, author, and keywords are generally string properties
that should be translated along with the content when the 
document is hand translated. Often, the character set should 
be different in the translated version as well. 

For example, let's say you have an English document in ASCII
and you translate it to Japanese using an Asian character 
set. People fluent in multiple languages are relatively 
rare. Most of us have one language we prefer to operate in, 
so if you are a system administrator designing the schema 
for a document collection, you should accommodate that case 
well. Therefore, you want to design the schema such that 
English speakers can find the English version of the document 
by its title, author, and keywords in English in ASCII, 
and a Japanese speaker can find the document by those 
attributes in Japanese in an Asian character set. This means 
you need an English title attribute for the document and a
Japanese title attribute for the document if you store
the hand translated version of the document as just a
variant of the original document.

But that approach doesn't scale. Suppose you could 
potentially also have a French, Norwegian, German, or "any
language on Earth" translation of the document. Having an 
English_title, a Japanese_title, a French_title, etc. for 
the document is not very appealing. (For example, if you 
stored the properties in certain RDBMSs and defined title, 
author, and keywords attributes for every language on Earth, 
you would exceed the maximum number of columns allowed for 
a table.)
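
To make the column arithmetic concrete, here is a rough 
sketch in Python. Both constants are illustrative assumptions 
of mine: the language count is only an order of magnitude, 
and the column limit is hypothetical rather than taken from 
any particular RDBMS.

```python
# Rough arithmetic for the per-language-attribute schema.
# Both constants below are illustrative assumptions.
TRANSLATED_ATTRIBUTES = ["title", "author", "keywords"]
ASSUMED_LANGUAGE_COUNT = 7000   # rough order of magnitude, not a precise figure
ASSUMED_MAX_COLUMNS = 1000      # hypothetical per-table column limit


def columns_needed(attributes, language_count):
    """One column per (attribute, language) pair."""
    return len(attributes) * language_count


needed = columns_needed(TRANSLATED_ATTRIBUTES, ASSUMED_LANGUAGE_COUNT)
exceeds_limit = needed > ASSUMED_MAX_COLUMNS
print(needed, exceeds_limit)
```

The exact numbers matter less than the shape: the column 
count grows with the number of languages, while the limit 
on a table does not.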

The leading alternative is to simply have one title
attribute, e.g., Title, and store each translation of the
document as a separate document. Each separate version of
the document would have the title in the appropriate 
language in the appropriate character set. Then, all users 
could query the Title attribute in their preferred language
in their preferred character set. (Let's assume that the
collection uses UCS or Unicode as the underlying character 
set, so that it can deal with virtually any character set in
a uniform manner.) Then everyone's query would find a copy 
of the document they can read, if one exists.
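
As a minimal sketch of this alternative, assuming an 
in-memory collection: the Document class, the find_by_title 
helper, the language tag, and the sample records are all 
made up for illustration and do not come from any real 
system. Python strings are Unicode, which stands in here for 
a collection built on UCS or Unicode.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """One stored document; each hand translation is its own Document."""
    doc_id: int
    title: str       # single Title attribute, in the document's own language
    author: str
    keywords: list
    language: str    # hypothetical tag, e.g. "en" or "ja"


# Made-up sample records: an English document and its hand translation,
# stored as two separate documents rather than as variants of one.
collection = [
    Document(1, "Annual Report", "A. Smith", ["finance", "report"], "en"),
    Document(2, "年次報告書", "A. スミス", ["財務", "報告書"], "ja"),
]


def find_by_title(docs, query):
    """Return documents whose Title contains the query text."""
    return [d for d in docs if query in d.title]


english_hits = find_by_title(collection, "Report")
japanese_hits = find_by_title(collection, "報告")
print([d.doc_id for d in english_hits])
print([d.doc_id for d in japanese_hits])
```

Both users query the same Title attribute, and each query 
matches only the copy written in that user's language. No 
per-language attribute is needed, so adding a French or 
Norwegian translation means adding a document, not changing 
the schema.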

(Of course, associating hand translated versions of a
document with each other is often desirable, but doing that 
is a separate issue. Multiple alternatives exist.)

So, I think we should refrain from using hand translations 
of documents as an example of variants. While we can't 
prevent people from storing hand translations as variants, 
I feel that using them as an example would encourage bad 
practice. I think we should use better chosen examples,
based on mechanically generated renditions, and design for 
that case.


Alan Babich

Received on Wednesday, 29 July 1998 18:46:51 UTC