- From: Gavin Nicol <gtn@ebt.com>
- Date: Wed, 28 Dec 1994 09:51:08 -0500
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
jag@scndprsn.Eng.Sun.COM (James Gosling) writes:
>Your paper deals mostly with transport issues (things that http
>could help with), but there are a pile of issues that are related:

Undoubtedly. I don't claim to be an expert in this field either.
Rather, I am just basing things on my practical experience.

>4. Data entry issues?

Yes. I'd forgotten about those ;-) Seriously though, these are
primarily outside the scope of the paper, except to note that the
proposed method for data exchange should also be used for data a
client generates, i.e. all servers *should* be able to accept data in
UTF-7, UTF-8, and UCS-2.

>Some are a real quagmire, like "I expect a date to be entered here
>and its value returned to me using the ISO conventions".

Other work will need to be done for this, but it is outside the scope
of my paper.

>Other things get represented in a document besides characters. This is
>almost certainly outside of the realm of http, but might fit in with
>html-42.0. For example, dates and measures. A hypothetical document
>might contain:

True. Again, my proposal is only about establishing a multilingual
document exchange foundation. If the different systems cannot
interoperate at the lowest level, the higher level doesn't stand a
chance :-)

>in http. For simplicity I'd recommend a really limited set of
>allowed encodings: ISO-8859-1 and UTF8.

I thought about this. I think that UTF-7 and UCS-2 are so trivial to
parse (and to convert to in most cases) that the mail-safety of UTF-7
and the compactness of UCS-2 for Asian languages offer enough benefit
that they should also be included.

>> However, high-level tag use (eg. defining them in a DTD) fails for
>> the following reasons:
>>   1. It is not transparent. The application processing the data stream
>>      must be able to parse the tags, even if it can not do anything
>>      with them. This necessarily complicates the parser.
>>   2. There are probably a huge number of presentation hints that could
>>      be used, and the list is dynamic as societal trends tend to alter
>>      languages. Good examples can be found by comparing almost any
>>      current written form of a language to that used 100 years ago.
>>      Some languages have even changed dramatically in the last 50
>>      years.
>
>These problems affect even low-level tags such as those you proposed.
>This whole area should be left to standards above http.

To a degree, you are correct. However, most parsing (and especially
SGML parsing) takes place at an abstract level somewhat above the
actual character codes. The choice between low-level and high-level
is a choice between complicating the upper layers, and having to
modify them at the slightest change, or complicating the lower layers,
and having the higher layers work irrespective (because they do not
*require* that the presentational hints even be present).

High level tags (or attributes) *might* be useful for things like
deciding data and measurement formats, but I am not convinced of that
yet. The presentation hints I talk about are primarily useful for
glyph image disambiguation, not data display format decisions.

Note that in Method 3, "tag" does not mean "SGML tag".

>> Method 1: Code-based presentation hints
>The big problem with the use of the private use area in this way is that
>it is "syntax without semantics". These numbers are meaningless unless there
>is some mechanism for defining how they should be interpreted. Something
>higher-level is required if, for example, a document using one of these
>extended characters is ever to be displayed.

I think you misunderstand what I mean. In this method, the private
use area codes do *not* represent characters, but rather presentation
hint data. As such, simple-minded applications can safely ignore this
data entirely (i.e. throw it away at the lowest level). The results
will be legible, and the higher level parser will work with or
without the hints.

Albert Lunde writes:
>At what point do char set issues get handed off to the HTML standard?
>
>It seems like some of the alternative glyph rendering issues
>could get mixed in with things like font changes and presentation
>control (which are possible, though controversial, issues for HTML).

In my opinion, HTML should not be concerned with character sets at
all. If the parser assigns a character's class based on Unicode, and
the lowest level of the application passes Unicode characters up to
the parser proper, then the parser is totally divorced from the data
storage format. The ERCS proposal goes into more detail, and proposes
basically the above.

>Of course text/html is not the only text/* type we might have to
>transport....

Surely most parsers for other textual data types need not grovel
around at the data storage format level? Most can surely be
abstracted enough that they can deal with a character as an atom of
information rather than as a sequence of bytes?
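
To make the layering concrete, here is a rough sketch (in Python, purely
for illustration) of a lowest layer that decodes one of the four
encodings discussed above into Unicode characters and throws away
private use area codes before anything reaches the parser. The names
CODECS, decode_to_characters, strip_presentation_hints and
read_characters are all hypothetical, the UCS-2 handling assumes
big-endian data with no byte order mark, and the private use range
U+E000 to U+F8FF is an assumption about which codes carry hints. It is
not taken from the paper.

```python
# Hypothetical mapping from charset labels to Python codec names.
# Assumption: UCS-2 is treated as big-endian UTF-16 without a BOM,
# which agrees with UCS-2 for all non-surrogate code units.
CODECS = {
    "ISO-8859-1": "latin-1",
    "UTF-7": "utf-7",
    "UTF-8": "utf-8",
    "UCS-2": "utf-16-be",
}

# Assumed private use range whose codes carry presentation hints.
PUA_START, PUA_END = 0xE000, 0xF8FF


def decode_to_characters(data: bytes, charset: str) -> str:
    """Lowest layer: turn stored bytes into Unicode characters."""
    return data.decode(CODECS[charset.upper()])


def strip_presentation_hints(text: str) -> str:
    """Simple-minded application: discard private use codes entirely."""
    return "".join(ch for ch in text if not PUA_START <= ord(ch) <= PUA_END)


def read_characters(data: bytes, charset: str) -> str:
    """What the parser proper receives: characters only, no storage
    format details and no presentation hints."""
    return strip_presentation_hints(decode_to_characters(data, charset))
```

A parser built on read_characters never sees bytes or hint codes, so
the same parser works whether or not the hints are present in the
stream.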
Received on Wednesday, 28 December 1994 11:04:32 UTC