Multi-Lingual Pages

USE CASES

The microformats community has already collected a bunch of use cases for multi-lingual HTML documents [1].

My own use case is that on legislation.gov.uk we have items of legislation that are published in Welsh and English, and we want to be able to distinguish between the Welsh and English titles and descriptions when we list them.

On the consumer side, I imagine that those consumers who gather data across the web need to be able to distinguish between information about the same entity provided in different languages.

DISCUSSION

HTML has the lang attribute to indicate the language of a particular part of a document, which is reflected in the lang property within the DOM.
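
As a rough illustration of how a consumer working outside a browser might recover the same information, here is a minimal sketch using Python's standard html.parser. The class name LangTracker and the helper tagged_text are hypothetical, and the sketch only handles the lang attribute (not xml:lang or a Content-Language header):

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LangTracker(HTMLParser):
    """Records each text node together with the language in effect for it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, language in effect inside that element)
        self.texts = []   # (language, text) pairs in document order

    def current_lang(self):
        return self.stack[-1][1] if self.stack else ""

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        lang = dict(attrs).get("lang")
        # An element without its own lang attribute inherits from its parent.
        inherited = lang if lang is not None else self.current_lang()
        self.stack.append((tag, inherited))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating omitted end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.texts.append((self.current_lang(), data.strip()))


def tagged_text(html):
    """Return (language, text) pairs for every non-blank text node."""
    parser = LangTracker()
    parser.feed(html)
    parser.close()
    return parser.texts
```

Given a fragment with an English div containing one untagged paragraph and one paragraph marked lang="cy", tagged_text reports the first paragraph as "en" (inherited) and the second as "cy".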

# Microformats #

Microformat processors could theoretically pick up on the language of a value when mapping into other formats. For example, hCalendar [2] processors could use the HTML language to provide a value for the LANGUAGE parameter in iCalendar [3]; hCard [4] processors could do the same when mapping to vCard [5]. However, I can't see anything on the wiki that specifies this. There's also nothing about language support in microformats-2 [6].
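
To make the hCalendar case concrete, here is a sketch of what such a mapping might emit. No microformats specification defines this behaviour, so the function ical_text_property is purely hypothetical; it only shows the shape of an iCalendar content line carrying a LANGUAGE parameter, and it omits the line folding that real iCalendar output needs:

```python
def ical_text_property(name, value, lang=None):
    """Format one iCalendar content line, adding LANGUAGE when it is known."""
    # Escape the characters that iCalendar reserves inside TEXT values.
    escaped = (value.replace("\\", "\\\\")
                    .replace(";", "\\;")
                    .replace(",", "\\,")
                    .replace("\n", "\\n"))
    # The parameter is optional, so leave it out when no language was found.
    param = f";LANGUAGE={lang}" if lang else ""
    return f"{name}{param}:{escaped}"
```

A processor that found a summary inside an element with lang="cy" could then call ical_text_property("SUMMARY", text, "cy") and produce a line beginning SUMMARY;LANGUAGE=cy:.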

Does anyone here know anything more about microformat language support? Would someone volunteer to investigate?

# RDFa #

RDFa processors use the lang attribute when generating RDF [7], which supports language-tagged plain literals [8].
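
For readers less familiar with RDF, this sketch shows what a language-tagged plain literal looks like once serialised as N-Triples. The function names and the example.org subject are made up for illustration; a real RDFa processor does far more than this:

```python
def ntriples_literal(text, lang=None):
    """Serialise a plain literal, appending @lang when a language is known."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    literal = f'"{escaped}"'
    # RDF Concepts describes language tags as normalised to lowercase.
    return f"{literal}@{lang.lower()}" if lang else literal


def triple(subject, predicate, text, lang=None):
    """Build one N-Triples statement whose object is a plain literal."""
    return f"<{subject}> <{predicate}> {ntriples_literal(text, lang)} ."
```

Two titles for the same subject, one tagged en and one tagged cy, become two separate triples that differ only in the literal, which is what lets a consumer tell them apart.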

# Microdata #

The microdata data model [9] doesn't support language-tagged values, and neither does microdata+json [10], but the lang DOM property is accessible through the API.

I think it's probably worth raising this as a bug report on microdata. Could someone with experience of raising bug reports on HTML5 put together some wording?

Assuming nothing changes, I can see a couple of possible workarounds here:

  * using different properties for values in different languages
  * using values that are items with 'value' and 'lang' properties

but neither of these reuses the HTML lang attribute, which is the natural way for users to indicate language.
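
To show what the two workarounds might look like to a consumer, here is a sketch of the extracted data as Python dictionaries, loosely following the shape of microdata+json. The property names (title-en, title-cy, title) and the helper functions are invented for illustration; nothing here is a proposed vocabulary:

```python
# Workaround 1: a separate property per language (property names are made up).
per_language_properties = {
    "properties": {
        "title-en": ["Act"],
        "title-cy": ["Deddf"],
    }
}


# Workaround 2: each value is itself an item with 'value' and 'lang' properties.
def tagged_value(value, lang):
    return {"properties": {"value": [value], "lang": [lang]}}


nested_items = {
    "properties": {
        "title": [tagged_value("Act", "en"), tagged_value("Deddf", "cy")],
    }
}


def values_by_language(item, prop):
    """Collect {lang: value} from workaround-2 style nested items."""
    result = {}
    for entry in item["properties"].get(prop, []):
        props = entry["properties"]
        result[props["lang"][0]] = props["value"][0]
    return result
```

The first form needs the consumer to know every language-specific property name in advance; the second keeps one property but pushes the language into the data, where the author has to repeat what lang already says in the markup.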

Is it worth us pushing these as best practices for people using microdata? If we were to, I think we should propose the creation of a standard language-tagged-value type in a common namespace, and push consumers to recognise it. We should probably see what emerges from a bug report on microdata before spending time on this.

PROPOSED GUIDELINES

1. Use the HTML lang attribute to indicate the language of different parts of the page

2. If you are publishing pages that contain multiple languages, use RDFa [or microformats; pending input on microformats support] to mark up your data

3. If you are consuming information from a set of pages that use different languages, ensure your data model includes language tagging and that your processor uses the lang DOM property when interpreting values
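
As an illustration of guideline 3, here is a minimal sketch of a consumer-side data model that keeps a language tag with every value and picks the best one for a reader. TaggedValue and best_match are hypothetical names, and the matching is a simple prefix comparison rather than a full implementation of language-range matching:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedValue:
    text: str
    lang: str = ""   # empty string means the language is unknown


def best_match(values, preferred):
    """Return the first value whose language matches the preference order."""
    for want in preferred:
        want = want.lower()
        for v in values:
            have = v.lang.lower()
            # "en" should match "en" and regional variants such as "en-GB".
            if have == want or have.startswith(want + "-"):
                return v
    # Nothing matched: fall back to the first value, if there is one.
    return values[0] if values else None
```

With titles tagged en-GB and cy, a reader who prefers Welsh and then English gets the Welsh title, and a reader who only lists a language that is not present falls back to whichever value came first.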


[1]:  http://microformats.org/wiki/multilingual-brainstorming
[2]:  http://microformats.org/wiki/hcalendar
[3]:  http://tools.ietf.org/html/rfc5545#section-3.2.10
[4]:  http://microformats.org/wiki/hcard
[5]:  http://tools.ietf.org/html/rfc6350#section-5.1
[6]:  http://microformats.org/wiki/microformats-2
[7]:  http://www.w3.org/TR/rdfa-core/#T-current-language
[8]:  http://www.w3.org/TR/rdf-concepts/#dfn-plain-literal
[9]:  http://dev.w3.org/html5/md/#the-microdata-model
[10]: http://dev.w3.org/html5/md/#json
-- 
Jeni Tennison
http://www.jenitennison.com

Received on Sunday, 2 October 2011 10:20:57 UTC