Re: character encoding assumptions and approaches

On Fri, Mar 08, 2002 at 10:54:35AM -0000, Matthew Dovey wrote:
> However, coming back to the issue in hand, as I see it, we have two
> choices: 
>  
> We do a proper sound engineering job on this issue...
> Or
> We do a quick fix...

I lean slightly to the second. That is, come up with a single mindset
and push it through - not to solve everything, but to solve one area
that is pretty common.

For example, add an options bit to mean 'Unicode' where strings are
encoded in UTF-8.
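
To make that concrete, here is a rough sketch of how such a bit might
be checked during Init. The bit position, the function names, and the
ASCII fallback are all assumptions made up for illustration; nothing
like this has been assigned or agreed.

```python
# Hypothetical option bit meaning "strings are Unicode, encoded in UTF-8".
# The position is invented for this sketch; the spec assigns no such bit.
UNICODE_OPTION_BIT = 1 << 0


def negotiate_unicode(client_options: int, server_options: int) -> bool:
    """Unicode mode is on only if both sides set the bit."""
    return bool(client_options & server_options & UNICODE_OPTION_BIT)


def encode_string(text: str, unicode_negotiated: bool) -> bytes:
    """Encode a protocol string for the wire."""
    if unicode_negotiated:
        return text.encode("utf-8")
    # Assumption: without the bit, only plain ASCII is safe to send.
    # Non-ASCII text raises UnicodeEncodeError rather than being guessed at.
    return text.encode("ascii")
```

The point of the fallback is that without the bit the encoding is
simply unknown, so the sketch refuses to guess.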

However, for this to succeed, I would like to see for every string in
the Z39.50 spec a definition of whether that string is Unicode or not.
I would make everything Unicode whenever possible: result set names,
database names, etc. The only exceptions I see are things like MARC
records carried in an external.

There is huge scope for arguing here. What I think it would benefit
from is someone with sufficient arrogance (and I mean this in a positive
way here) to come up with a complete proposal for every OCTET STRING,
InternationalString, GeneralString etc in the printed Z39.50 spec
(ie, including all the record syntaxes and other externals published
in the spec itself) and say if they should be UTF-8 or not. And then
be willing to defend that position for a period of time. I think every
string needs to be explicitly defined as UTF-8 or not, to avoid any
possible ambiguity.

The problem is the potential for long debates. So step one to me
seems to be coming up with an agreed interpretation of the options bit.
I think a meaning such as 'Everything possible is UTF-8 encoded'
could be agreed to. Then the only argument is what bits *cannot*
be validly and safely interpreted as UTF-8 (eg: MARC records in
externals are probably unsafe). This would hopefully reduce the
number of arguments (eg: why make result set names Unicode? Answer:
because everything is Unicode unless there is a reason for it not
to be Unicode.)
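
The "UTF-8 unless listed otherwise" rule can be sketched as a default
with a short exception list. The field names below are hypothetical
labels invented for this example, not identifiers from the spec, and
the exception list is only a placeholder for whatever would actually
be agreed.

```python
# Hypothetical exception list: fields that must NOT be read as UTF-8.
# Everything absent from this set is treated as UTF-8 by default.
NOT_UTF8 = {"marc_record_external"}


def is_utf8(field_name: str) -> bool:
    """Default rule: a field is UTF-8 unless it is explicitly excepted."""
    return field_name not in NOT_UTF8


def decode_field(field_name: str, raw: bytes):
    """Return text for UTF-8 fields, untouched bytes for exceptions."""
    if is_utf8(field_name):
        return raw.decode("utf-8")
    return raw
```

With this shape, arguing about a field means arguing about whether it
belongs in the exception list, and nothing else.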

Alan

Received on Monday, 11 March 2002 18:35:51 UTC