Re: A truly multilingual WWW

jag@scndprsn.Eng.Sun.COM (James Gosling) writes:

>Your paper deals mostly with transport issues (things that http
>could help with), but there are a pile of issues that are related:

Undoubtedly. I don't claim to be an expert in this field
either. Rather, I am just basing things on my practical experience. 

>4. Data entry issues?

Yes. I'd forgotten about those ;-) Seriously though, these are
primarily outside the scope of the paper, except to note that the
proposed method for data exchange should also be used for data a
client generates, i.e. all servers *should* be able to accept data in
UTF-[78] and UCS-2.
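
A minimal sketch of what accepting client data in those encodings could
look like, in Python purely for illustration. The function name, the
charset labels and the table are assumptions made for this example, not
something the paper specifies, and Python's "utf-16-be" codec is used as
a stand-in for UCS-2 (the two agree for characters in the Basic
Multilingual Plane):

```python
# Hypothetical sketch: decode client-submitted data from any of the
# proposed encodings into one internal form (a Python str).
# "utf-16-be" stands in for UCS-2; they agree for BMP characters.
ACCEPTED = {
    "utf-7": "utf-7",
    "utf-8": "utf-8",
    "ucs-2": "utf-16-be",
}


def decode_submission(body: bytes, charset: str) -> str:
    """Decode bytes sent by a client, given the charset label it declared."""
    codec = ACCEPTED.get(charset.lower())
    if codec is None:
        raise ValueError(f"unsupported charset: {charset}")
    return body.decode(codec)
```

Whatever the client sent, the rest of the server then only ever sees
decoded characters.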

>Some are a real quagmire, like "I expect a date to be entered here
>and its value returned to me using the ISO conventions".

For this, other work will need to be done, but this is outside the
scope of my paper.

>Other things get represented in a document besides characters.  This is
>almost certainly outside of the realm of http, but might fit in with
>html-42.0.  For example, dates and measures.  A hypothetical document
>might contain:

True. Again, my proposal is only about establishing a multilingual
document exchange foundation. If the different systems cannot
interoperate at the lowest level, the higher level doesn't stand a
chance :-)

>in http.  For simplicity I'd recommend a really limited set of
>allowed encodings: ISO-8859-1 and UTF8.

I thought about this. UTF-7 and UCS-2 are trivial to parse (and, in
most cases, to convert to), and the mail-safety of UTF-7 and the
compactness of UCS-2 for Asian languages offer enough benefit that
they should also be included.
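
The trade-off can be checked with a short Python sketch. The sample
strings and the function name are assumptions chosen for illustration,
and "utf-16-be" again stands in for UCS-2 (identical for Basic
Multilingual Plane characters, with no byte-order mark):

```python
# Compare the encoded size of the same text in the candidate encodings.
# "utf-16-be" stands in for UCS-2 (same bytes for BMP characters).
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed for text in each encoding."""
    return {
        "utf-7": len(text.encode("utf-7")),
        "utf-8": len(text.encode("utf-8")),
        "ucs-2": len(text.encode("utf-16-be")),
    }


japanese = "日本語のテキスト"  # arbitrary sample text
english = "plain text"        # arbitrary sample text

print(encoded_sizes(japanese))
print(encoded_sizes(english))
```

For the Japanese sample, UCS-2 needs two bytes per character against
three in UTF-8; for the ASCII sample the relation is reversed, with
UTF-8 at one byte per character. The UTF-7 output stays within 7-bit
ASCII in both cases, which is what makes it mail-safe.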

>>    However, high-level tag use (eg. defining them in a DTD) fails for
>>    the following reasons:
>>     1. It is not transparent. The application processing the data stream
>>        must be able to parse the tags, even if it can not do anything
>>        with them. This necessarily complicates the parser.
>>     2. There are probably a huge number of presentation hints that could
>>        be used, and the list is dynamic as societal trends tend to alter
>>        languages. Good examples can be found by comparing almost any
>>        current written form of a language to that used 100 years ago.
>>        Some languages have even changed dramatically in the last 50
>>        years.
>
>These problems affect even low-level tags such as those you proposed.
>This whole area should be left to standards above http.

To a degree, you are correct. However, most (and especially SGML)
parsing takes place at an abstract level somewhat above the actual
character codes. The choice between low-level and high-level tags
comes down to either complicating the upper layers, and having to
modify them at the slightest change, or complicating the lower layers
and having the higher layers work regardless (because they do not
*require* that the presentational hints even be present).

High level tags (or attributes) *might* be useful for things like
deciding data and measurement formats, but I am not convinced of that
yet. 

The presentation hints I talk about are primarily useful for glyph
image disambiguation, not data display format decisions. Note that in
Method 3, "tag" does not mean "SGML tag".

>> Method 1: Code-based presentation hints
>The big problem with the use of the private use area in this way is that
>it is "syntax without semantics".  These numbers are meaningless unless there
>is some mechanism for defining how they should be interpreted.  Something
>higher-level is required if, for example, a document using one of these
>extended characters is ever to be displayed.

I think you misunderstand what I mean. In this method, the private use
area codes do *not* represent characters, but rather presentation hint
data. As such, simple-minded applications can safely ignore this data
entirely (i.e. throw it away at the lowest level). The results will be
legible, and the higher-level parser will work with or without the
hints.
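
A minimal sketch of such a simple-minded application, in Python for
illustration only. It assumes the hints live in the Unicode Private Use
Area U+E000..U+F8FF; the function name and the particular hint value
U+E001 are made up for this example and carry no defined meaning:

```python
# Hypothetical sketch: treat Private Use Area code points as presentation
# hints and discard them before the text reaches the higher-level parser.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def strip_hints(text: str) -> str:
    """Return text with every Private Use Area code point removed."""
    return "".join(ch for ch in text if not PUA_FIRST <= ord(ch) <= PUA_LAST)


# U+E001 is an arbitrary, made-up hint placed in front of a character.
hinted = "\ue001直"
plain = strip_hints(hinted)
```

After stripping, only the ordinary characters remain, so a parser that
knows nothing about the hints sees exactly the text it would have seen
had they never been sent.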

Albert Lunde writes:

>At what point do char set issues get handed off to the HTML standard?
>
>It seems like some of the alternative glyph rendering issues
>could get mixed in with things like font changes and presentation
>control (which are possible, though controversial, issues for HTML).

In my opinion, HTML should not be concerned with character sets at
all. If the parser assigns a character's class based on Unicode, and
the lowest level of the application passes Unicode characters up to
the parser proper, then the parser is totally divorced from data
storage format. The ERCS proposal goes into more detail, and proposes
basically the above.
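
A rough Python sketch of that separation, not taken from the ERCS
proposal itself. The function names and the coarse class labels are
assumptions for illustration; the classes are derived from the Unicode
general category via the standard unicodedata module:

```python
import unicodedata


def char_class(ch: str) -> str:
    """Coarse character class based on the Unicode general category."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category.startswith("N"):
        return "number"
    if category.startswith("Z") or ch in "\t\n\r":
        return "space"
    return "other"


def classify(raw: bytes, codec: str) -> list:
    """Lowest level: decode bytes. Parser level: classify characters."""
    text = raw.decode(codec)  # the only step that knows the storage format
    return [(ch, char_class(ch)) for ch in text]


sample = "é 7<"
from_latin1 = classify(sample.encode("latin-1"), "latin-1")
from_utf8 = classify(sample.encode("utf-8"), "utf-8")
```

Both calls yield the same list of (character, class) pairs, because the
classification step never sees the bytes.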

>Of course text/html is not the only text/* type we might have to
>transport....

Surely most parsers for other textual data types need not grovel
around at the data storage format level? Most can surely be abstracted
enough that they can deal with a character as an atom of information
rather than as a sequence of bytes?

Received on Wednesday, 28 December 1994 11:04:32 UTC