RE: Why should I use the 'lang' attribute? from Addison Phillips [wM] on 2004-06-14 (public-i18n-geo@w3.org from June 2004)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Mon, 14 Jun 2004 16:55:50 -0700
To: "Deborah Cawkwell" <deborah.cawkwell@bbc.co.uk>, <public-i18n-geo@w3.org>
Message-ID: <PNEHIBAMBMLHDMJDDFLHKEBGIDAA.aphillips@webmethods.com>
Hi Deborah,

This is wonderful to see. I have placed some comments interlinearly below, written stream-of-(un)consciousness. Hope these are at least marginally helpful.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -----Original Message-----
> From: public-i18n-geo-request@w3.org [mailto:public-i18n-geo-request@w3.org] On Behalf Of Deborah Cawkwell
> Sent: 14 June 2004 16:11
> To: public-i18n-geo@w3.org
> Subject: FW: Why should I use the 'lang' attribute?
> 
> 
> I'm re-sending because I got an error for this message ('no content'!).
> 
>  -----Original Message----- 
>  From: Deborah Cawkwell 
>  Sent: Mon 14/06/2004 23:20 
>  To: GEO 
>  Cc: 
>  Subject: FAQ: Why should I use the 'lang' attribute?
>  
>  
> 
>  Hi All
> 
>  In feedback about my last attempt at this FAQ, the group 
> suggest I make a strong initial argument. I hope I've done that 
> (but I'm sure there will be some input - which of course I welcome).
> 
>  I have to admit that I've not really worked (more) on the 
> 'applications' part... So I welcome any text fragments that I 
> could incorporate tomorrow night (from 19:00 BST/GMT+0100), so 
> that on our Wednesday teleconference, we might get somewhere near 
> publishing this. 
> 
>  Best regards to all & thanks (to any contributors)
> 
>  Deborah
> 
>  -------------------------------------
> 
>  QUESTION
> 
>  Why should I use the 'lang' attribute?
> 
>  
>  ANSWER
> 
>  Overview
> 
>  The 'lang' attribute contains information about the 
> 'natural' language of content. 

I would quote 'natural language' together as a phrase, since you are about to define the term.
> 
>  A 'natural' ('human') language is a language with which 

Perhaps: "A 'natural language' (sometimes called a 'human language')..."

> people communicate with one another such as Arabic or Brazilian 
> Portuguese. This stands in comparison to an 'artificial' 
> language, such as C or Perl, with which people communicate with machines.
> 
>  It is useful to identify the language of content and to 
> make that language information 'semantically' available, so that 

I would not quote "semantically", since the phrase is self-supporting.

> it can serve people's needs better. For example, when searching 
> for information, it is useful to narrow that search to the 
> languages that the searcher can understand. In addition, it may 
> be desirable to display different natural languages in ways known 
> by users of those languages, for example, quotation marks have 
> different written representations in different languages.
> 
>  The 'lang' attribute serves to uniquely identify the 
> 'language of content'. Other means of identifying that language 
> of content, such as 'character encoding', do not uniquely 

Don't quote 'character encoding'. Also: I think you really ought to omit character encoding as an example in this sentence. As you go on to point out, character encodings are orthogonal to language.

> identify the natural language and may change over time. 
> Currently, natural language could be identified by 'character 
> encoding'. However, that character encoding does not uniquely 

I would say "Currently, some applications attempt to identify natural language based on the character encoding used by the data."

> identify a natural language. One character encoding can be used 
> for multiple natural languages, eg, Latin 1 (iso-8859-1) can 
> encode both French and English. In addition, the character 

after English: "(as well as a great many other languages)."

> encoding can vary over a single language, eg, Arabic could be 
> encoded with 'windows-1256' or 'iso-8859-6' or 'utf-8' (or 
> another Unicode encoding). 

There is no reason why each language must have exactly one encoding for encodings to identify languages. However, each encoding *must* map to exactly one language for that to work... and in general that is not the case. At which point you must use a heuristic, aka "guessing"...
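To make the point concrete, here is a small Python sketch. The sample strings and variable names are my own assumptions, not part of the draft. It shows that one encoding can carry several languages and that one language fits several encodings, so the encoding label alone cannot tell you the language:

```python
# One encoding, several languages: Latin-1 can carry both of these strings,
# so the label "iso-8859-1" says nothing about which language was written.
samples = {
    "en": "The cafe is open",
    "fr": "Le caf\u00e9 est ferm\u00e9",
}
latin1_ok = {lang: text.encode("iso-8859-1") for lang, text in samples.items()}

# One language, several encodings: the same Arabic word ("marhaban", hello),
# written with escapes to avoid bidi display problems in the source.
arabic = "\u0645\u0631\u062d\u0628\u0627"
encodings = ["windows-1256", "iso-8859-6", "utf-8"]
arabic_bytes = {enc: arabic.encode(enc) for enc in encodings}

for enc, raw in arabic_bytes.items():
    # Each encoding yields different bytes but decodes to the identical text.
    assert raw.decode(enc) == arabic
    print(enc, raw.hex())
```

Going the other way, from bytes plus an encoding label back to a language, has no unique answer, which is why explicit language markup is needed.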

> 
>  Unicode - which can encode all languages - is likely to 
> become the dominant encoding form, because it can resolve many 

s/encoding form/character set/

UTF-8 and UTF-16 are (some of the) encoding forms of Unicode. Cf. http://www.w3.org/TR/2004/WD-charmod-20040225/#sec-Digital

> problems.  Therefore, character encoding will cease to have any 
> use at all for identifying natural language(s) of web content. An 

It has a use now? :-) Perhaps "As Unicode encodings become more prevalent, legacy character encodings will become even less valuable as a means of identifying the natural language of content."

Also: you might want to note that the default encodings for XML are Unicode encodings...

> additional problem is that character encoding may be specified in 
> different places: in the http header and/or in a metatag, where 

"META tag"  or "meta tag" (with a space between)?

> that encoding relates to the whole page (forms can be an exception). 

Forms are not an exception in HTML. They are rendered in the page encoding. They can (sometimes) be SUBMITTED in another encoding (http://www.w3.org/TR/html401/interact/forms.html#h-17.3), but this is not widely used (or implemented). 

> 
>  The more pages that are correctly marked up with 
> appropriate semantic language information, the more applications 
> will emerge to harness it, to deliver information relevant to 
> people in the languages they understand.

This seems like the heart of the topic. Might I suggest something like:

    Applications exist that can use natural language metadata about content to deliver the most relevant information to users based on their language preferences. The more content that is tagged, and tagged correctly, the more useful and pervasive such applications will become. Metadata that indicates content language can assist many audiences for a particular page or section of a page. For example, authoring tools can supply appropriate spelling and grammar checking based on the language of a segment. Translation tools can use the tags to help recognize sections of text in a particular language. Search engines can group or filter results based on the user's preferences. And user-agents can (and do) use the content language to select language-appropriate fonts, which improves the overall user experience of the page.
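
For the search case, a toy filter might look like the following. It is only a sketch in Python: the page list, the `matches` and `filter_by_language` helpers, and the choice of prefix matching in the style of RFC 3066 language ranges are my assumptions, not a description of how any real search engine works.

```python
def matches(language_range, tag):
    """Return True if tag falls within language_range (case-insensitive).

    A range matches a tag when they are equal, or when the range is a
    prefix of the tag and the next character in the tag is "-".
    The range "*" matches every tag.
    """
    rng, tag = language_range.lower(), tag.lower()
    return rng == "*" or tag == rng or tag.startswith(rng + "-")


def filter_by_language(pages, preferred_ranges):
    """Keep the pages whose declared language matches any preferred range."""
    return [
        page for page in pages
        if any(matches(rng, page["lang"]) for rng in preferred_ranges)
    ]


# Hypothetical index entries: each records the page's declared lang value.
pages = [
    {"url": "/news/en", "lang": "en-GB"},
    {"url": "/news/pt", "lang": "pt-BR"},
    {"url": "/news/ar", "lang": "ar"},
    {"url": "/news/fr", "lang": "fr"},
]

results = filter_by_language(pages, ["pt", "ar"])
print([page["url"] for page in results])
```

A filter like this is only as good as the declared values, which is the argument for tagging content, and tagging it correctly.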
> 
>  
>  Implementation
> 
>  The 'lang' attribute can be applied to the HTML container 
> of the whole web page, ie, the HTML element, or to individual 
> HTML elements (span, div, td, p, etc) when the language varies 
> from that specified as the 'primary' language. [What will happen 
> when people use multiple languages as a matter of course - with 
> Unicode, I think this is inevitable?] 

Lots of markup or lots of "blanket" markup that is wrong for some segment of the text in the document. Note, though, that the majority of documents are written in a single language because few people read two languages at the same time.
> 
>  The 'lang' attribute of an HTML element is specified 
> slightly differently in HTML and XML, eg: 
> 
>  <html lang="en" xml:lang="en"...
> 
>  lang='en' = HTML markup
>  xml:lang='en' = X(HT)ML markup
> 
>  When using XHTML both syntaxes should be used.
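
It may help to show how a processor could work out the language in effect for each element. The following is a rough Python sketch using the standard library XML parser. The sample document and the `effective_languages` helper are hypothetical, and the rule of preferring xml:lang over lang follows my reading of XHTML 1.0.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is bound to this namespace by definition.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical XHTML sample: the root declares English with both attributes,
# and one span overrides it with French.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <body>
    <p>She said <span lang="fr" xml:lang="fr">bonjour</span> and left.</p>
  </body>
</html>"""


def effective_languages(element, inherited=None, out=None):
    """Walk the tree and record (local tag name, language in effect)."""
    if out is None:
        out = []
    # Prefer xml:lang, fall back to lang, else inherit from the parent.
    lang = element.get(XML_LANG) or element.get("lang") or inherited
    out.append((element.tag.split("}")[-1], lang))
    for child in element:
        effective_languages(child, lang, out)
    return out


langs = effective_languages(ET.fromstring(DOC))
for name, lang in langs:
    print(name, lang)
```

The sketch treats an empty attribute value as absent, which is a simplification. The point is that a declaration on the html element covers everything beneath it until a nested element overrides it.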
> 
>  
>  Application
> 
>  Accessibility
>  The 'lang' attribute assists speech synthesizers and 
> Braille translators; it is required by the W3C Web Accessibility 
> Initiative (WAI) and enforced by governmental policies in some 
> countries, eg, the UK's Disability Discrimination Act [other 
> countries? contact WAI? and/or specifically request this 
> information from users - useful way to get people more involved?] 
> 
>  Page rendering
>  CSS2 uses the 'lang' attribute powerfully as a pseudo 
> class.  (http://www.w3.org/International/questions/qa-css-lang.html).
>  Unfortunately it doesn't work in IE yet. [Clarify scope - 
> changes with versions and operating systems - in order to keep 
> FAQs up-to-date - refer to tests & results from tests]  But the 
> concept of language specific styling is a very powerful one. 
> [Need to add some examples.]
> 
>  Search
>  A common use for meta is to specify keywords that a search 
> engine may use to improve the quality of search results. When 
> several meta elements provide language-dependent information 
> about a document, search engines may filter on the xml:lang 
> attribute to display search results using the language 
> preferences of the user. 
(http://www.w3.org/TR/2002/WD-xhtml2-20020805/mod-meta.html)
 XML
 The 'xml:lang' attribute is the standard way to identify language information in XML. [Information about tasks]
 cf Google

 Processing
 eg XSLT

 
 USEFUL LINKS

 FAQ: HTTP and meta language information - http://www.w3.org/International/questions/qa-http-and-lang.html
 [Will check following - from previous]
 HTML 4.01 Specification W3C Recommendation 24 December 1999: http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1.3.
 XHTML 2.0 W3C Working Draft 5 August 2002 http://www.w3.org/TR/2002/WD-xhtml2-20020805/mod-meta.html
 Web Accessibility Initiative: lang attribute - http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign
 Tutorial: Language markup in XHTML and CSS (DRAFT): http://www.w3.org/International/tutorials/tutorial-lang.html
 Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content 1.0 - http://www.w3.org/International/geo/html-tech/tech-lang.html
 FAQ: Styling using the lang attribute: http://www.w3.org/International/questions/qa-css-lang.html
 FAQ: Two-letter or three-letter language codes: http://www.w3.org/International/questions/qa-lang-2or3.html
 From the usability perspective: http://diveintoaccessibility.org/day_7_identifying_your_language.html
 An interesting view on Google usage across cultures:
 http://www.google.com/press/zeitgeist2003.html
 http://www.google.com/press/zeitgeist.html
 


Received on Monday, 14 June 2004 19:57:25 UTC