- From: Gary Adams - Sun Microsystems Labs BOS <gra@zeppo.East.Sun.COM>
- Date: Wed, 8 Jan 1997 13:33:26 -0500 (EST)
- To: www-international@www10.w3.org, bert@www10.w3.org
> From: Bert Bos <bert@www10.w3.org>
> Subject: Text that's not in any language
> Date: Wed, 08 Jan 1997 18:51:56 +0100
>
> RFC 2070 (html-i18n) says that the LANG attribute is only for natural
> languages, not for computer languages, but recently I've started
> wondering why. What is the purpose of LANG in HTML?

Fundamentally, there is a need for more semantic information in the
text in order to perform application-specific processing on it. E.g.,
in multilingual documents alternate rendering operations may be needed:
for speech-rendered documents alternate voices might be selected
appropriately; for BIDI languages alternate display operations are
needed; for indexing operations the source language is important; etc.
For XML documents, a much richer set of tags might accompany the text
to highlight proper names, dates, currency, whatever.

>
> It may happen in a text that there is a word or phrase that is not in
> any human language, such as the name of somebody, or some code.

I believe the NLP community would refer to your list of examples as the
"non-grammatical tokens" in the text. A text-to-speech system benefits
a lot from simple markup like <EMAIL>java@www.sun.com</EMAIL> to
'render' the text as "java", "at", "w", "w", "w", "dot", "sun", "dot",
"com".

>
> HTML has some mark-up for the computer code: it can be put inside
> <CODE>, but there is no element for the name of a person.

There's a slippery slope here about how much semantic information you
allow in markup. I'd love for the text to be marked up down to the
disambiguated senses of words for better client-side machine
translation, but that would not be a generally useful construct for
most applications.

>
> Maybe LANG should be extended to cover
>
> - computer languages (Pascal, C, HTML, CSS,...)
> - proper names (language "none"?)
> - "unknown" and "any" languages
>

Perhaps if we understood how this level of markup would be used, we'd
have a better sense of its mapping onto the LANG attribute. E.g., why
not '<SPAN CLASS="name,proper,company">Sun Microsystems</SPAN>' and
'<SPAN CLASS="name,proper,company,abbrev">IBM</SPAN>' and
'<SPAN CLASS="name,proper,company,stock">APPL</SPAN>'?

> The last two would be useful, resp., for a text that is in some
> language, but the author doesn't know which, and for a text that is the
> same in every language. An example would be the SI units mm, s, etc.

<CODE CLASS="C++"> .....</CODE>

>
> Comments?
>
>
> Bert
>

gra $.02
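
The <EMAIL> example in the message above can be sketched in a few lines
of Python. This is only an illustration of how a speech renderer might
use such markup, not a description of any real text-to-speech system:
the function names, the SPOKEN_SYMBOLS table, and the SPELLED_OUT set
are all assumptions made up for this sketch.

```python
import re

# Hypothetical spoken forms for address punctuation (assumption).
SPOKEN_SYMBOLS = {"@": "at", ".": "dot"}

# Hypothetical set of labels to spell out letter by letter (assumption).
SPELLED_OUT = {"www"}

# Matches the illustrative <EMAIL>...</EMAIL> markup from the message.
EMAIL_TAG = re.compile(r"<EMAIL>(.*?)</EMAIL>")


def speak_email(address):
    """Return the tokens a speech renderer might say for an address."""
    tokens = []
    # Split on '@' and '.', keeping each separator as its own piece.
    for piece in re.split(r"([@.])", address):
        if not piece:
            continue
        if piece in SPOKEN_SYMBOLS:
            tokens.append(SPOKEN_SYMBOLS[piece])
        elif piece.lower() in SPELLED_OUT:
            tokens.extend(piece)  # one token per letter
        else:
            tokens.append(piece)
    return tokens


def render_marked_up(text):
    """Find the first <EMAIL> element in text and return its spoken tokens."""
    match = EMAIL_TAG.search(text)
    if match is None:
        return []
    return speak_email(match.group(1))


if __name__ == "__main__":
    print(render_marked_up("<EMAIL>java@www.sun.com</EMAIL>"))
```

Without the markup, the renderer would have to guess that the string is
an address at all; with it, the rule for reading the token aloud can be
chosen directly.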
Received on Wednesday, 8 January 1997 13:36:57 UTC