Primary Language in HTML, XHTML and XML

This document is online in PDF at
http://europa.eu.int/comm/translation/engineering/primary_language_en.pdf

Regards
Tomas

-------------------------------------------------------------------
Primary Language in HTML, XHTML and XML
M.T. Carrasco Benitez
European Commission
March 2005. Version 2.0

* Status of this document
This document is feedback on the Authoring Techniques for XHTML &
HTML Internationalization: Specifying the language of content 1.0 [AT],
and it builds on Primary Language in HTML [PLH].

This document contains proposals; i.e., it is not a recommendation and
must not be followed to implement systems. If this document is cited,
it must be referred to as work in progress.

The latest version of this document is at:
 
http://europa.eu.int/comm/translation/engineering/primary_language_en.pdf

The previous version is at [PRE].

* Abstract
Primary language is the natural language in which a document is
written. From here on the term is used in the singular, but it must be
read as primary language(s). The same criteria decide the primary
language of traditional paper documents and of electronic documents.
Essentially, if the bulk of a document is in English, the document is
considered to be in English, even if a few passages are in other
languages.

There are also documents that are in multiple languages. These
documents would have multiple primary languages. For example, the main
page of the server Europa [EU] is in twenty languages.

Specifying the primary language in documents is very useful. Because
they produce a large number of multilingual documents, the European
Institutions and Bodies are very interested in this subject. The
recommendation for primary language should be as simple as possible.

* Principles
• The document [AT] should also address XML [XML].
• The document [AT] should also address file naming.
• The primary language must be normalized; i.e., specified only once.

* External and internal specification
The primary language could be specified externally (“… outside the
document …” in section 3.1) and internally. For example, externally
with the HTTP header field Content-Language; internally with the lang
attribute.

The filename should be considered acceptable to specify the primary
language. file is a registered URI scheme [US]. Examples:

myfile.en.html
myfile_en.html

Conventions for specifying the primary language in the filename are a
major issue of great practical relevance.
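
As an illustration of why a convention matters, the following is a
minimal sketch of how a tool might read the language from the two
filename forms shown above. The function name, the regular expression
and the restriction to two-letter codes are assumptions made for this
example; they are not part of [AT] or of any existing convention.

```python
import re
from typing import Optional

# Hypothetical pattern: a base name, then "." or "_", then a two-letter
# language code, then the extension (myfile.en.html, myfile_en.html).
_LANG_IN_FILENAME = re.compile(r"^.+[._]([A-Za-z]{2})\.[A-Za-z0-9]+$")


def language_from_filename(filename: str) -> Optional[str]:
    """Return the lowercase language code embedded in filename, or None."""
    match = _LANG_IN_FILENAME.match(filename)
    return match.group(1).lower() if match else None


for name in ("myfile.en.html", "myfile_en.html", "myfile.html"):
    print(name, language_from_filename(name))
```

Without an agreed convention such a pattern is ambiguous: a name such as
my_id.html would be read as Indonesian (id), although the author may
have meant something else.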

* Inheritance
The external specification of the primary language must be at the top
of the tree. The proposed tree is as follows:

1. External specification; e.g., HTTP header field Content-Language or
filename.
2. <meta http-equiv="Content-Language" …>
3. <html lang= …>
4. Other attributes down the tree.
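
The proposed tree can be read as a lookup in which the declaration
nearest to an element wins and the levels above it act as defaults. The
sketch below assumes that each level carries at most one language and
that the levels are passed in the order of the list above; the function
name is hypothetical.

```python
from typing import Optional, Sequence


def effective_language(declarations: Sequence[Optional[str]]) -> Optional[str]:
    """Resolve the language that applies to one element.

    declarations is ordered from the top of the tree down to the element:
    [external, meta http-equiv, html lang, ..., the element's own lang].
    None marks a level where nothing is specified. The nearest (last)
    declaration wins; if no level specifies anything, the result is None.
    """
    for value in reversed(declarations):
        if value:
            return value
    return None


# Only the HTTP header is set: every element inherits "en".
print(effective_language(["en", None, None, None]))
# A paragraph overrides the inherited value with "fr".
print(effective_language(["en", None, None, "fr"]))
```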

* Primary language and text-processing language
Single primary language.- It is the default text-processing language.
It can be overridden down the tree. One should avoid re-specifying the
primary language and the text-processing language unnecessarily. For
example, if English (only one language) is specified with
  meta http-equiv="Content-Language"

one should not specify English again with the html lang attribute.

Multiple primary languages.- The text-processing language is considered
undefined. It can be specified down the tree. For example, if English
and French are specified (multiple primary languages) with
  meta http-equiv="Content-Language"

the text-processing language is undefined unless one specifies it
down the tree, such as with
 <p lang= …>.
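
The two cases above can be sketched as follows. The function names are
hypothetical, and the sketch assumes that the Content-Language value is
a comma-separated list of language tags, as in the HTTP header field.

```python
from typing import List, Optional


def parse_content_language(value: str) -> List[str]:
    """Split a Content-Language value such as "en, fr" into language tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def default_text_processing_language(value: str) -> Optional[str]:
    """Return the default text-processing language, or None if undefined.

    One primary language: it is the default text-processing language.
    Several primary languages (or none): the default is undefined and has
    to be specified further down the tree.
    """
    tags = parse_content_language(value)
    return tags[0] if len(tags) == 1 else None


print(default_text_processing_language("en"))
print(default_text_processing_language("en, fr"))
```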

* Data normalization
The language should be specified only once. In particular,  the
following double declaration should be avoided:

<html lang="en" xml:lang="en">
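
A checker for this rule could look like the sketch below. It assumes
that the input is well-formed XHTML, so that a standard XML parser can
read it; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def double_declarations(xhtml: str) -> List[str]:
    """Return the tag names of elements carrying both lang and xml:lang."""
    root = ET.fromstring(xhtml)
    return [
        element.tag
        for element in root.iter()
        if "lang" in element.attrib and XML_LANG in element.attrib
    ]


sample = '<html lang="en" xml:lang="en"><body><p xml:lang="fr">x</p></body></html>'
print(double_declarations(sample))
```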

* XHTML
For XHTML, one attribute must be sufficient, though having both
attributes should also be valid. Section 4 of the document [AT] states
that: “One method is to use the lang and xml:lang attributes …”. It
should be “and/or”; i.e., the double declaration with lang and xml:lang
should not be mandatory.

If one wants a double declaration for lang, the same principle would
have to be applied to xml:id [XMLID]:

 <p id="foo" xml:id="id">

* The title element in (X)HTML documents with multiple primary languages
The title element must be either:
• Language neutral text
• Texts in all the primary languages

It is proposed to have a language neutral title. If this is not
possible, it is proposed not to include it or to have an empty title.

Example of language neutral text title (Europa is the name of the
server):

 <title>Europa</title>

Example: texts in multiple languages in the title without language
marking (poor and to be avoided):

<title>Gateway to the European Union. El portal de la Unión
Europea</title>

Example: texts in multiple languages in the title with language
marking:

 <title>
  <foo lang="en">Gateway to the European Union</foo>
  <foo lang="es">El portal de la Unión Europea</foo>
 </title>

At present, this is not possible, because the title element cannot
contain other elements. The elements div and span could be considered
for foo.

* XML
Multiple primary languages should be allowed in xml:lang. Example:

<?xml version="1.0" ?>
<doc xml:lang="en,es">
  <text xml:lang="en">Gateway to the European Union</text>
  <text xml:lang="es">El portal de la Unión Europea</text>
</doc>

Nothing has to be changed in XML itself; at most a clarification is
needed. Section 2.12 of [XML] states: “The values [plural] of the
attribute are language identifiers…”. Multiple values already work with
well-formed documents; for valid documents, the DTD could allow
multiple values.
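
The point about well-formed documents can be illustrated with a
standard parser, which accepts any attribute value for xml:lang. The
comma-separated list is the proposal of this document, not established
XML practice, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<?xml version="1.0" ?>
<doc xml:lang="en,es">
  <text xml:lang="en">Gateway to the European Union</text>
  <text xml:lang="es">El portal de la Unión Europea</text>
</doc>"""


def primary_languages(xml_text: str) -> List[str]:
    """Read the proposed comma-separated xml:lang list from the root element."""
    root = ET.fromstring(xml_text)
    value = root.get(XML_LANG, "")
    return [tag.strip() for tag in value.split(",") if tag.strip()]


print(primary_languages(SAMPLE))
```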

* Metadata
The primary language must not be repeated in other metadata systems.

Example of one primary language with the Dublin Core [DC]:
<html>
  <head>
    <meta http-equiv="Content-Language" content="en">
    <!--
    this element is virtually here
    <meta name="dc.language" content="en" />
    -->
    <meta name="dc.creator" content="M.T. Carrasco Benitez" />
    <title>European Union</title>
  </head>
  <body>
    <p>Gateway to the European Union</p>
  </body>
</html>

In the Dublin Core, the element language can contain only one language.
So, one needs to agree on the meaning of the meta element with the
attribute http-equiv. For example:

  <meta http-equiv="Content-Language" content="en,es">

The following cannot be assumed:
  <meta name="dc.language" content="en,es" />

There should be a unified approach to the overlapping systems of
metadata, but this is considered out of the scope of this document. For
example, HTML has the element title and the Dublin Core also has an
element title.

Bad example:

<html>
  <head>
    <meta http-equiv="Content-Language" content="en">
    <meta name="dc.creator" content="M.T. Carrasco Benitez" />
    <meta name="dc.title" content="European Union" />
    <title>European Union</title>
  </head>
  <body>
    <p>Gateway to the European Union</p>
  </body>
</html>

A better example:

<html>
  <head>
    <meta http-equiv="Content-Language" content="en">
    <meta name="dc.creator" content="M.T. Carrasco Benitez" />
    <!--
    this element is virtually here
    <meta name="dc.title" content="European Union" />
    -->
    -->
    <title>European Union</title>
  </head>
  <body>
    <p>Gateway to the European Union</p>
  </body>
</html>
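
The difference between the bad and the better example can be checked
mechanically. The sketch below is one possible check, not a proposed
tool: it reports Dublin Core meta elements that repeat information
already given by the title element or by Content-Language. The class
and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _HeadScanner(HTMLParser):
    """Collect meta name/http-equiv values and the text of the title."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta":
            key = attributes.get("name") or attributes.get("http-equiv")
            if key:
                self.meta[key.lower()] = attributes.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def repeated_metadata(html: str) -> List[str]:
    """Return the Dublin Core names that repeat title or Content-Language."""
    scanner = _HeadScanner()
    scanner.feed(html)
    scanner.close()
    repeated = []
    if "dc.title" in scanner.meta and scanner.title.strip():
        repeated.append("dc.title")
    if "dc.language" in scanner.meta and "content-language" in scanner.meta:
        repeated.append("dc.language")
    return repeated
```

Elements that are only "virtually here" inside a comment are not
reported, because the parser hands comments over separately and the
scanner ignores them.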

* References
AT
Authoring Techniques for XHTML & HTML Internationalization: Specifying
the language of content 1.0
W3C Working Draft 24 February 2005
Richard Ishida
http://www.w3.org/TR/2005/WD-i18n-html-tech-lang-20050224

DC
Information and documentation - Dublin Core metadata element set
Draft International Standard
http://www.niso.org/international/SC4/n515.pdf

EU
Europa
Gateway to the European Union
http://europa.eu.int

HTML 
HTML 4.01 Specification
W3C Recommendation
Dave Raggett, Arnaud Le Hors, Ian Jacobs
http://www.w3.org/TR/html401

PLH
Primary Language in HTML
World Wide Web Consortium Note 13-March-1998
M.T. Carrasco Benitez
http://www.w3.org/TR/1998/NOTE-html-lan-19980313.html

PRE
Primary Language in HTML, XHTML and XML
Version 1 of the present document. October 2004.
European Commission
M.T. Carrasco Benitez
http://europa.eu.int/comm/translation/engineering/primary_language-1_en.pdf

TIL
Tags for the Identification of Languages
Request for Comments (RFC)
H. Alvestrand
http://www.ietf.org/rfc/rfc3066.txt


US
Uniform Resource Identifier (URI) SCHEMES
Official IANA Registry of URI Schemes
http://www.iana.org/assignments/uri-schemes

XHTML
XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition)
W3C Recommendation
W3C HTML Working Group
http://www.w3.org/TR/xhtml1

XML
Extensible Markup Language (XML) 1.0 (Third Edition)
W3C Recommendation
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, François
Yergeau
http://www.w3.org/TR/REC-xml

XMLID
xml:id Version 1.0
W3C Working Draft 7 April 2004
Jonathan Marsh, Daniel Veillard
http://www.w3.org/TR/2004/WD-xml-id-20040407

Author
Manuel Tomas CARRASCO BENITEZ
European Commission
L-2920 Luxembourg
Telephone: +352 4301 36943



Received on Wednesday, 23 March 2005 15:23:06 UTC