Re: Text that's not in any language from Gary Adams - Sun Microsystems Labs BOS on 1997-01-08 (www-international@w3.org from January to March 1997)

From: Gary Adams - Sun Microsystems Labs BOS <gra@zeppo.East.Sun.COM>
Date: Wed, 8 Jan 1997 13:33:26 -0500 (EST)
To: www-international@www10.w3.org, bert@www10.w3.org
Message-ID: <libSDtMail.9701081333.23904.gra@zeppo/zeppo>

> From: Bert Bos <bert@www10.w3.org>
> Subject: Text that's not in any language
> Date: Wed, 08 Jan 1997 18:51:56 +0100
> 
> RFC 2070 (html-i18n) says that the LANG attribute is only for natural
> languages, not for computer languages, but recently I've started
> wondering why.

What is the purpose of LANG in HTML? Fundamentally, the text needs to
carry more semantic information so that applications can do their own
application-specific processing on it. For example, multilingual
documents may need alternate rendering operations; speech-rendered
documents might select alternate voices; BIDI languages need alternate
display operations; indexing needs to know the source language; etc.

For XML documents, a much richer set of tags might accompany the text to 
highlight proper names, dates, currency, whatever.

> 
> It may happen in a text that there is a word or phrase that is not in
> any human language, such as the name of somebody, or some code.

I believe the NLP community would refer to your list of examples as the
"non-grammatical tokens" in the text. A text-to-speech system benefits
a lot from simple markup like <EMAIL>java@www.sun.com</EMAIL> to 'render'
the text as "java", "at", "w", "w", "w", "dot", "sun", "dot", "com".
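
A rough sketch of that rendering step, in Python. The function name
speak_email and the rule that only "www" is spelled out letter by letter
are my own assumptions for illustration; a real speech system would have
its own pronunciation rules.

```python
import re

def speak_email(address):
    """Return the spoken tokens for an e-mail address (illustrative only)."""
    tokens = []
    # Split on "@" and ".", keeping the separators so they can be voiced.
    for part in re.split(r"([@.])", address):
        if part == "@":
            tokens.append("at")
        elif part == ".":
            tokens.append("dot")
        elif part.lower() == "www":
            # Assumption: "www" is read one letter at a time.
            tokens.extend(part.lower())
        elif part:
            tokens.append(part)
    return tokens

print(speak_email("java@www.sun.com"))
```

Without the <EMAIL> markup the application would first have to guess
that the string is an address at all, which is the harder part.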


> 
> HTML has some mark-up for the computer code: it can be put inside
> <CODE>, but there is no element for the name of a person.

There's a slippery slope here about how much semantic information you
allow in markup. I'd love for the text to be marked up down to the
disambiguated senses of words for better client-side machine translation,
but that would not be a generally useful construct for most applications.

> 
> Maybe LANG should be extended to cover
> 
>   - computer languages (Pascal, C, HTML, CSS,...)
>   - proper names (language "none"?)
>   - "unknown" and "any" languages
> 

Perhaps if we understood how this level of markup would be used we'd
have a better sense of its mapping onto the LANG attribute. E.g.,
why not '<SPAN CLASS="name,proper,company">Sun Microsystems</SPAN>'
and '<SPAN CLASS="name,proper,company,abbrev">IBM</SPAN>'
and '<SPAN CLASS="name,proper,company,stock">APPL</SPAN>'?
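
To make the "how would it be used" question concrete, here is a minimal
Python sketch of an application picking company names out of CLASS
markup like the above. The class ClassedSpans and the function
company_names are hypothetical names of mine, the comma-separated CLASS
values follow the examples above rather than any spec, and nested SPANs
are not handled.

```python
from html.parser import HTMLParser

class ClassedSpans(HTMLParser):
    """Collect (text, classes) for each non-nested SPAN element."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._open = None  # (classes, text pieces) of the SPAN being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "span":
            value = dict(attrs).get("class") or ""
            classes = [c.strip() for c in value.split(",") if c.strip()]
            self._open = (classes, [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._open is not None:
            classes, pieces = self._open
            # Collapse runs of whitespace in the element text.
            text = " ".join("".join(pieces).split())
            self.spans.append((text, classes))
            self._open = None

def company_names(html):
    """Return the text of every SPAN whose CLASS list includes 'company'."""
    parser = ClassedSpans()
    parser.feed(html)
    parser.close()
    return [text for text, classes in parser.spans if "company" in classes]
```

An indexer or a speech renderer could apply the same kind of filter to
"abbrev" or "stock" without LANG being involved at all.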

> The last two would be useful, resp., for a text that is in some
> language, but the author doesn't know which, and for a text that is the
> same in every language. An example would be the SI units mm, s, etc.

<CODE CLASS="C++"> .....</CODE>

> 
> Comments?
> 
> 
> Bert
> 

gra

$.02

Received on Wednesday, 8 January 1997 13:36:57 UTC