Re: Future of HTML

Jukka Korpela (jkorpela@cc.hut.fi)
Thu, 26 Mar 1998 16:22:37 +0200 (EET)


Date: Thu, 26 Mar 1998 16:22:37 +0200 (EET)
From: Jukka Korpela <jkorpela@cc.hut.fi>
To: david_richmond@nl.compuware.com
cc: www-html@w3.org
In-Reply-To: <9802268909.AA890946600@ccm2smtp.nl.compuware.com>
Message-ID: <Pine.OSF.3.96.980326154304.17555A-100000@beta.hut.fi>
Subject: Re: Future of HTML

On Thu, 26 Mar 1998 david_richmond@nl.compuware.com wrote:

>      I would like to see a formal HTML way of formatting data-type values, 
>      such as dates and numbers.

This sounds presentation-oriented, and the current trend is to drop
such features from HTML itself. I suggest that you rephrase your
suggestions in terms of structural markup (which _may_ involve
information used by browsers to select suitable presentation).

For example, you might suggest a text-level (phrase) element for
marking up some text as consisting of data of some specific kind,
such as a date notation. You would then need to explain why
such markup would be useful.

I think dates are not very interesting in this way, as I'll try to
explain. But there might be other reasons for introducing
"data type markup", as I'll propose.

>      The raw value would be specified using USA 
>      conventions,

Why? This is the World Wide Web. And there is an international
convention for date notations, developed for the presentation of
dates in international contexts. See
http://www.roguewave.com/resources/exchange/iso8601.html
for a description of the standard, ISO 8601:1988. It has been
criticized as too strange for normal people to read, but
this counterargument does not apply to things like notations
in HTML source. This is the approach adopted in
http://www.w3.org/TR/REC-html40/types.html#h-6.11

>      but can be reformatted to the user agent's conventions. 
>      For example, an american date of 3/26/98 would be shown in a European 
>      user agent as 26/3/98 or even as 26 March 1998.

Don't you find this confusing? If I were reading a document in English
on a browser configured for the Finnish locale, I would probably see
the date as 26. maaliskuuta 1998. The formatting should take place
according to the language of the _document_. Thus, the date can
be written that way in the first place. (Quite apart from this,
I think it might be useful to switch to the 1998-03-26 style
universally.)

>      As for the best way of doing this I am not sure, but adding a 
>      <DATATYPE> tag would be one way.

Well, that is a rather long name. The element for "data type markup" could be
called just DATA. For example,
  <DATA TYPE="date">1998-03-26</DATA>
could just indicate that 1998-03-26 is specifically a date. This
might be marginally useful in the sense that various checkers could
validate the format of dates against some syntax, preferably
some ISO 8601:1988 based syntax in this case. A style sheet might
suggest some particular font face for dates, for example. But for
reasons explained above, it is highly questionable whether the
_essentials_ of presentation should be affected.
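
As a rough illustration of the "checker" idea above, here is a minimal
sketch in Python. It assumes the hypothetical DATA element with a
TYPE="date" attribute suggested in this message (it is not part of any
HTML specification), and it accepts only the YYYY-MM-DD calendar form
of ISO 8601, which is a simplifying assumption. The names is_iso_date,
DateChecker and find_invalid_dates are invented for the example.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Only the extended calendar form YYYY-MM-DD is accepted here (an assumption).
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def is_iso_date(text):
    """Return True if text is a valid calendar date written as YYYY-MM-DD."""
    m = ISO_DATE.fullmatch(text.strip())
    if m is None:
        return False
    try:
        # date() rejects impossible dates such as 1998-02-30.
        date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:
        return False
    return True


class DateChecker(HTMLParser):
    """Collects the content of <DATA TYPE="date"> elements that fail the check."""

    def __init__(self):
        super().__init__()
        self.in_date = False
        self.buffer = []
        self.invalid = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        type_value = dict(attrs).get("type") or ""
        if tag == "data" and type_value.lower() == "date":
            self.in_date = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_date:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "data" and self.in_date:
            text = "".join(self.buffer)
            if not is_iso_date(text):
                self.invalid.append(text)
            self.in_date = False


def find_invalid_dates(html):
    """Return the contents of all date-typed DATA elements that are not valid."""
    checker = DateChecker()
    checker.feed(html)
    checker.close()
    return checker.invalid
```

Such a checker only reports on the notation; it does not touch the
presentation, which is in line with the point made above.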

Similarly, something like

>         <DATATYPE type=number>1000.00</DATATYPE>

should not affect the use of decimal point versus decimal comma,
for example, but it might affect the font face and size, if the
browser so decides.

Much more importantly, such markup could be essential in
_translation_. I once translated some texts on HTML 3.2 using
BabelFish. I noticed that the translation program was too clever:
it noticed the 3.2 and converted this number according to the
conventions of the target language, making it 3,2 for languages
which use decimal comma! Thus, an author might wish to assist
translation software to make it realize that 3.2 is a code-like
notation, not a number with a decimal point in it.

On the other hand, something like <DATA type=realnumber> vs.
<DATA type=versionnumber> would make things rather awkward.
Perhaps a better solution would be an element (nestable with
other text level markup of course) for simply indicating that
its content is to remain invariant in translations. This would
allow us to use, say, the name of an HTML element or a C language
keyword in running English text so that a translation program had
a chance of realizing that those names, although found in English
glossaries, are not to be translated. Someone might say that the
technically simplest way to achieve this would be to introduce a specific
language name (LANG attribute value), such as "none" (with the
meaning 'no human language'), e.g.
   The <SPAN LANG="none">TITLE</SPAN> element...
but the problem is that the HTML element name TITLE _is_ an English
word in the sense that it is _pronounced_ as an English word.
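
To make the "invariant in translation" idea more concrete, here is a
minimal sketch in Python of how translation software might honour such
markup. The element name INVARIANT is purely hypothetical (this message
does not propose a name), and naive_localize is only a stand-in for an
over-eager translator that rewrites decimal points as decimal commas;
it is not how BabelFish or any real program works.

```python
import re

# Hypothetical element name, chosen only for this illustration.
PROTECTED = re.compile(r"<INVARIANT>.*?</INVARIANT>", re.IGNORECASE | re.DOTALL)
PLACEHOLDER = re.compile(r"\x00(\d+)\x00")


def naive_localize(text):
    """Stand-in translator: turns every decimal point into a decimal comma."""
    return re.sub(r"(\d)\.(\d)", r"\1,\2", text)


def translate_protected(text, translate):
    """Run translate() on text while leaving INVARIANT elements untouched."""
    saved = []

    def stash(match):
        # Replace the protected span with a numbered placeholder.
        saved.append(match.group(0))
        return "\x00%d\x00" % (len(saved) - 1)

    masked = PROTECTED.sub(stash, text)
    translated = translate(masked)
    # Put the original spans back in place of the placeholders.
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], translated)
```

With this, translate_protected("HTML <INVARIANT>3.2</INVARIANT> costs
1000.50", naive_localize) keeps the 3.2 as it is while the price is
still converted. The placeholder trick assumes that the translator
passes unknown control characters through unchanged, which a real
program might not do.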

>      In the case of INPUT elements simple datatype validation could also be 
>      performed by the user agent on any values modified by the user.

This is an interesting question but quite distinct from marking up
the document content. It relates to specifying the allowable type of
_user input_. The current trend seems to be to handle such checks on
the server side, possibly with client-side checking (using JavaScript)
before submission. The situation is unsatisfactory, and it would
be a cleaner solution to allow the HTML code to specify the expected
input format. The problem is that although simple checks (such
as data being numerical) would be easy, authors need all kinds of
checks, and any method which provides the necessary universality
would inevitably have the power of a simple programming language at least.
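
To show how far the "simple checks" go, here is a minimal sketch in
Python of declarative input checking. The type names "number" and
"date", the function names, and the idea of passing the declared types
as a dictionary are all assumptions made for the example; nothing like
this exists in HTML as it stands.

```python
import re
from datetime import date

NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def is_number(value):
    """True for an optionally signed decimal number such as 12 or -3.5."""
    return NUMBER.fullmatch(value) is not None


def is_date(value):
    """True for a valid calendar date written as YYYY-MM-DD."""
    m = ISO_DATE.fullmatch(value)
    if m is None:
        return False
    try:
        date(*map(int, m.groups()))
    except ValueError:
        return False
    return True


# Hypothetical type names that an author might declare for input fields.
VALIDATORS = {"number": is_number, "date": is_date}


def check_form(declared_types, values):
    """Return the names of fields whose values do not match the declared type."""
    bad = []
    for name, type_name in declared_types.items():
        validator = VALIDATORS.get(type_name)
        # Fields with an unknown type are not checked at all.
        if validator is not None and not validator(values.get(name, "")):
            bad.append(name)
    return bad
```

A fixed table of types like this handles the easy cases only. A check
such as "the end date must not precede the start date" already needs
expressions over several fields, which is where the programming
language creeps in.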

Yucca, http://www.hut.fi/u/jkorpela/