
RE: Why should I use the 'lang' attribute?

From: Richard Ishida <ishida@w3.org>
Date: Wed, 16 Jun 2004 13:28:02 +0100
To: "'Deborah Cawkwell'" <deborah.cawkwell@bbc.co.uk>, "'GEO'" <public-i18n-geo@w3.org>
Message-Id: <20040616122801.B2AF04EECC@homer.w3.org>

Hi Deborah,

I agree with all of Addison's excellent notes.

My notes below...


============
Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 
 
 

> -----Original Message-----
> From: public-i18n-geo-request@w3.org 
> [mailto:public-i18n-geo-request@w3.org] On Behalf Of Deborah Cawkwell
> Sent: 14 June 2004 23:21
> To: GEO
> Subject: FAQ: Why should I use the 'lang' attribute?
> 
> Hi All
> 
> In feedback about my last attempt at this FAQ, the group 
> suggested I make a strong initial argument. I hope I've done 
> that (but I'm sure there will be some input - which of course 
> I welcome).
> 
> I have to admit that I've not really worked (more) on the 
> 'applications' part... So I welcome any text fragments that I 
> could incorporate tomorrow night (from 19:00 BST/GMT+0100), 
> so that on our Wednesday teleconference, we might get 
> somewhere near publishing this. 
> 
> Best regards to all & thanks (to any contributors)
> 
> Deborah
> 
> -------------------------------------
> 
> QUESTION
> 
> Why should I use the 'lang' attribute?

I feel like we should limit this to '... in HTML', or widen it to wording that would include xml:lang.  Dunno.  At least we should say very early that xml:lang is relevant too.


> 
> 
> ANSWER
> 
> Overview
> 
> The 'lang' attribute contains information about the 'natural' 
> language of content. 

Mention xml:lang here.


> 
> A 'natural' ('human') language is a language with which 
> people communicate with one another such as Arabic or 
> Brazilian Portuguese. This stands in comparison to an 
> 'artificial' language, such as C or Perl, with which people 
> communicate with machines.
> 
> It is useful to identify the language of content and to make 
> that language information 'semantically' available, so that 
> it can serve people's needs better. For example, when 
> searching for information, it is useful to narrow that search 
> to the languages that the searcher can understand. In 
> addition, it may be desirable to display different natural 
> languages in ways known by users of those languages, for 
> example, quotation marks have different written 
> representations in different languages.
> 

There are some issues with the next two paragraphs that I think Addison described well.  I also see the stuff related to character encoding as somewhat tangential to the main argument, so I think it should either appear under a subheading, or possibly even as a note in the margin.


> The 'lang' attribute serves to uniquely identify the 
> 'language of content'. 

> Other means of identifying that 
> language of content, such as 'character encoding', do not 
> uniquely identify the natural language and may change over 
> time. Currently, natural language could be identified by 
> 'character encoding'. However, that character encoding does 
> not uniquely identify a natural language. One character 
> encoding can be used for multiple natural languages, eg, 
> Latin 1 (iso-8859-1) can encode both French and English. In 
> addition, the character encoding can vary over a single 
> language, eg, Arabic could be encoded with 'windows-1256' or 
> 'iso-8859-6' or 'utf-8' (or another Unicode encoding). 
> 
> Unicode - which can encode all languages - is likely to 
> become the dominant encoding form, because it can resolve 
> many problems.  Therefore, character encoding will cease to 
> have any use at all for identifying natural language(s) of 
> web content. An additional problem is that character encoding 
> may be specified in different places: in the http header 
> and/or in a metatag, where that encoding relates to the whole 
> page (forms can be an exception). 

I really like Addison's proposed text for the next paragraph, but his text sums up much of what you have in the answer so far (excluding the character related stuff).

> 
> The more pages that are correctly marked up with appropriate 
> semantic language information, the more applications will 
> emerge to harness it, to deliver information relevant to 
> people in the languages they understand.
> 
> 
> Implementation

> 
> The 'lang' attribute can be applied to the HTML container of 
> the whole web page, ie, the HTML element, or to individual 
> HTML elements (span, div, td, p, etc) when the language 
> varies from that specified as the 'primary' language. 

I think this is useful information, but your next question worries me...

> [What 
> will happen when people use multiple languages as a matter of 
> course - with Unicode, I think this is inevitable?] 

Nothing changes.  Unicode documents have a primary language just like documents in any other encoding.  I think you are letting yourself be confused by the idea that character encodings express language. They don't.
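
To make the point concrete, here is a minimal Python sketch (not part of the proposed FAQ text). It uses the encodings named in the draft and some made-up sample strings to show that one encoding can carry several languages and that one language can be carried by several encodings. The helper name encodings_that_fit is hypothetical.

```python
def encodings_that_fit(text, candidates):
    """Return the candidate encodings that can represent every character in text."""
    fits = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding lacks at least one character of the text
        fits.append(name)
    return fits

# One encoding, two languages: both sample sentences fit in iso-8859-1.
# (The sentences are illustrative only.)
latin1_samples = {
    "fr": "Où est la bibliothèque ?",
    "en": "Where is the library?",
}
latin1_fits = {
    lang: encodings_that_fit(text, ["iso-8859-1"])
    for lang, text in latin1_samples.items()
}

# One language, three encodings: the same Arabic word fits in all of them.
arabic_fits = encodings_that_fit(
    "مرحبا", ["windows-1256", "iso-8859-6", "utf-8"]
)

print(latin1_fits)
print(arabic_fits)
```

Knowing the encoding therefore says nothing reliable about the language; only an explicit declaration such as lang or xml:lang does.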


> 
> The 'lang' attribute of an HTML element is specified slightly 
> differently in HTML and XML, eg: 
> 
> <html lang="en" xml:lang="en"...
> 
> lang='en' = HTML markup
> xml:lang='en' = X(HT)ML markup
> 
> When using XHTML both syntaxes should be used.


The implementation detail is a little more complex than you describe it here because lang should not be used in XHTML 1.1 (nor in XML). I would prefer to see a pointer to the language declaration tutorial and the relevant techniques doc, rather than an attempt to re-state how to do it.  This FAQ is about *why* one should do it, not how.

As I said before, I think one should allude to the fact that xml:lang may be required in addition to / in place of lang. But I think we should do so in the very first para of the answer.
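
For readers who want to see how the two attributes interact, the following is a rough Python sketch of how a processor might work out the language in effect for each element of an XHTML 1.0 document. It assumes that xml:lang takes precedence over lang when both are present and that an element with neither inherits from its parent. The function name languages_in_effect and the sample markup are invented for illustration; the tutorial and techniques document remain the place for authoritative guidance.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages_in_effect(element, inherited=None):
    """Yield (local tag name, language) for an element and its descendants."""
    # Assumption: xml:lang wins over lang; with neither, inherit from the parent.
    lang = element.get(XML_LANG) or element.get("lang") or inherited
    yield element.tag.split("}")[-1], lang
    for child in element:
        yield from languages_in_effect(child, lang)

# Hypothetical XHTML 1.0 fragment: English by default, one French paragraph.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
    "<body><p>Hello</p>"
    '<p xml:lang="fr" lang="fr">Bonjour</p>'
    "</body></html>"
)

resolved = list(languages_in_effect(ET.fromstring(SAMPLE)))
print(resolved)
```

The sketch only reads declarations; it does not validate that the language tags themselves are well formed.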


> 
> 
> Application

Applications?


> 
> Accessibility
> The 'lang' attribute assists speech synthesizers and Braille 
> translators; it is required by the W3C Web Accessibility 
> Initiative (WAI) and enforced by governmental policies in some 
> countries, eg, UK - Disability Discrimination Act (UK) 
> [other 
> countries? contact WAI? and/or specifically request this 
> information from users - useful way to get people more involved?]

You could look through http://www.w3.org/WAI/Policy/ but I think your example of the UK is sufficient to make your point.


> 
> Page rendering
> CSS2 uses the 'lang' attribute powerfully as a pseudo class.  
> (http://www.w3.org/International/questions/qa-css-lang.html).
> Unfortunately it doesn't work in IE yet. [Clarify scope - 
> changes with versions and operating systems - in order to 
> keep FAQs up-to-date - refer to tests & results from tests]  
> But the concept of language specific styling is a very 
> powerful one. [Need to add some examples.]
> 
> Search
> A common use for meta 

I think you are talking about the meta element specifically, rather than meta information in general, so you should say so.

> is to specify keywords that a search 
> engine may use to improve the quality of search results. When 
> several meta elements provide language-dependent information 
> about a document, search engines may filter on the xml:lang 
> attribute to display search results using the language 
> preferences of the user. 
> (http://www.w3.org/TR/2002/WD-xhtml2-20020805/mod-meta.html)

Language information expressed with the lang attribute might also be useful for searching.  I don't know much about this area, but folks from the information science community at the Unicode conference seemed to be requesting that the lang (and xml:lang) attributes be fully deployed to help their searches.
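
As a rough illustration of how such a search might use the attribute, the Python sketch below reads the lang value declared on the html element and keeps only the pages that match a requested language. Everything here is an assumption for the sake of the example: the page names and contents are made up, the helper names are hypothetical, and the prefix-matching rule (so that "pt" also matches "pt-BR") is one plausible choice, not a description of how any real search engine works.

```python
from html.parser import HTMLParser

class RootLangParser(HTMLParser):
    """Record the lang attribute of the first html start tag seen."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")

def primary_language(html_text):
    """Return the language declared on the html element, or None if absent."""
    parser = RootLangParser()
    parser.feed(html_text)
    return parser.lang

def matches(lang, wanted):
    """True if lang equals wanted or is a more specific tag such as wanted-XX."""
    if lang is None:
        return False
    lang, wanted = lang.lower(), wanted.lower()
    return lang == wanted or lang.startswith(wanted + "-")

# Hypothetical pages: only the first declares a Portuguese language tag.
pages = {
    "a.html": '<html lang="pt-BR"><body><p>Bom dia</p></body></html>',
    "b.html": '<html lang="en"><body><p>Good morning</p></body></html>',
    "c.html": "<html><body><p>No declaration</p></body></html>",
}

hits = [name for name, text in pages.items() if matches(primary_language(text), "pt")]
print(hits)
```

Note that the page without any declaration cannot be matched at all, which is the practical argument for deploying the attribute widely.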



> XML
> The 'xml:lang' attribute is the standard way to identify 
> language information in XML. [Information about tasks] cf Google

Not sure how this is relevant here.


> 
> Processing
> eg XSLT

You could also mention that this is/will be useful for spellchecking documents during authoring.


> 
> 
> USEFUL LINKS
> 
> FAQ: HTTP and meta language information - 
> http://www.w3.org/International/questions/qa-http-and-lang.html
> [Will check following - from previous]
> HTML 4.01 Specification W3C Recommendation 24 December 1999: 
> http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1.3.
> XHTML 2.0 W3C Working Draft 5 August 2002 
> http://www.w3.org/TR/2002/WD-xhtml2-20020805/mod-meta.html
> Web Accessibility Initiative: lang attribute - 
> http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign
> Tutorial: Language markup in XHTML and CSS (DRAFT): 
> http://www.w3.org/International/tutorials/tutorial-lang.html
> Authoring Techniques for XHTML & HTML Internationalization: 
> Specifying the language of content 1.0 - 
> http://www.w3.org/International/geo/html-tech/tech-lang.html
> FAQ: Styling using the lang attribute: 
> http://www.w3.org/International/questions/qa-css-lang.html
> FAQ: Two-letter or three-letter language codes: 
> http://www.w3.org/International/questions/qa-lang-2or3.html
> From the usability perspective: 
> http://diveintoaccessibility.org/day_7_identifying_your_language.html
> An interesting view on Google usage across cultures:
> http://www.google.com/press/zeitgeist2003.html
> http://www.google.com/press/zeitgeist.html



Note that I've started to link to the topic index for general pointers to additional information on a topic. See for example the right hand column of http://www.w3.org/International/questions/qa-lang-priorities.html

Of course you can still link to specific articles of particular interest.


Hope that helps.
RI


Received on Wednesday, 16 June 2004 08:28:09 GMT
