Re: HTML markup for language identification?

On Wed, 11 Jun 1997, Jason White wrote:

> it is necessary to be able to ascertain the language in which a
> document is written. It might well be argued that such
> functionality can be achieved within the HTML user agent by means
> of dictionaries, perhaps combined with a grammatical analysis of
> the text. However, in the case of multilingual documents, it may
> be more difficult to determine which words are intended to be
> written in which language, especially if there is some
> correspondence in spelling. How reliable are software-based
> "language identification techniques?

My understanding this that techniques for morphological
analysis are really quite good, but are better suited
to supporting authoring, where the authoring tool can
take some of the grind out of entering the markup. 

HTML Cougar basically subsumes RFC 2070 including the LANG
attribute which is intended for this purpose. The following
is extracted from the spec:

The lang attribute's value is a language code that identifies a
natural language spoken, written, or otherwise conveyed by human
beings for communication of information to other human beings.
Computer languages are explicitly excluded from language codes.

The syntax and registry of HTML language codes is the same as that
defined by [RFC1766]. Language codes consist of a primary code and a
possibly empty series of subcodes:

        language-code  = primary-code *( "-" subcode )
        primary-code   = 1*8ALPHA
        subcode        = 1*8ALPHA

EXAMPLE:

       "en": English 
       "en-US": the U.S. version of English. 
       "en-cockney": the Cockney version of English. 
       "i-cherokee": the Cherokee language spoken by
                     some Native Americans. 

Two-letter primary codes are reserved for [ISO639] language
abbreviations.  Three-letter primary codes are reserved for
"Ethnologue"  abbreviations ([ETHNO]) (in in addition to the
requirements of [RFC1766]). Any two-letter primary code is a
[ISO3166] country code..

Regards,

Dave Raggett <dsr@w3.org>

tel +44 122 578 2521 url http://www.w3.org/People/Raggett
World Wide Web Consortium (on assignment from HP Labs)

Received on Wednesday, 11 June 1997 08:18:32 UTC