AW: Re: HTML etc. and ISO 639 language codes

From: Christian Galinski <christian.galinski@chello.at>
Date: Wed, 4 Sep 2019 21:23:56 +0200
To: <janina@rednote.net>, <public-apa@w3.org>
Cc: "'klaus.miesenberger'" <klaus.miesenberger@jku.at>, "'Fourney, David'" <david.fourney@usask.ca>, <hoeckner@hilfsgemeinschaft.at>, <shadi@w3.org>, <alejandro.moledo@edf-feph.org>, <lisa.seeman@zoho.com>, "'Kasinskaite, Irmgarda'" <I.Kasinskaite@unesco.org>, <drude@xs4all.nl>, <stevelee@w3.org>, 'FERRES Mercè' <FERRES@iso.org>
Message-ID: <008401d56356$4bc310f0$e34932d0$@chello.at>
Hi, Janina,

 

Thank you for your positive reply. I am sorry that I cannot attend the TPAC
meeting in person, unless it is possible to attend by teleconference.

This would also be the ideal way to participate for David Fourney, who could
represent ISO/IEC-JTC 1/SC 35 in this matter. 

 

Please be so kind as to put the issue of language identifiers/codes for sign
languages, explained below, on the agenda of the upcoming TPAC meeting in
Japan and discuss how it could be solved, duly taking into account that
language codes increasingly have to be combined with other coding schemes
for a variety of purposes.

 

Below please find a summary of the discussion concerning (1) alpha-2 vs.
alpha-3 language identifiers for sign languages in video programs and apps
and (2) the combination of codes to further specify the language used, its
regional or other variety, and the script in which a written file is
rendered.

 

Technically speaking, there may be more complexity or deeper issues behind
the questions raised. There may also be new needs for coordination. We are
looking forward to your comments. If there is a slot for discussing these
issues at the TPAC meeting, David Fourney and I could join by calling in.

 

Best regards

Christian 

 

 

1 Background:

The issue at hand is a technical problem that occurs when language
identifiers have to be assigned to sign languages but the code length of the
identifier is limited to alpha-2: ISO 639-1:2002 “Codes for the
representation of names of languages – Part 1: Alpha-2 code” does not
provide identifiers for sign languages. Estimates of the number of sign
languages range from more than 300 to as many as 500. About 150 are assigned
3-letter language identifiers in ISO 639-3 “Codes for the representation of
names of languages – Part 3: Alpha-3 code for comprehensive coverage of
languages”. In this connection, David Fourney also pointed out that 2019 is
the UN's International Year of Indigenous Languages, and that sign languages
exist in some indigenous language communities.

‘Sign languages’ differ from ‘signed languages’ insofar as they are the main
language in which Deaf and Hard of Hearing persons express themselves, and
they largely differ from the language spoken/written by the language
community in which the respective Deaf and Hard of Hearing persons live. By
contrast, ‘signed language’ is a language modality that largely represents
the spoken or written form of a language (e.g. “Signed Exact English”). Any
language can thus be signed in this way, and this can be identified by
adding the identifier “sgn” to the respective language identifier.

 

2 Request to W3C/TPAC:

The issue was raised at the 2018 meeting of ISO/IEC-JTC 1/SC 35 “User
interfaces” in Okayama, where I reported on the standardizing activities of
ISO/TC 37 “Language and terminology” relating to language coding. David
Fourney made TC 37 aware of the fact that there is a “deficiency” in the ISO
639 series when it comes to the coding of sign languages in video
technology. The issue was taken up in a coordinated way by two WGs in ISO/TC
37 working on the fundamental terminology of language coding and language
varieties. Out of the discussions emerged the clarification of the
above-mentioned distinction between ‘sign language’ and ‘signed language’.
The WGs formulated a request to ISO/IEC-JTC 1/SC 35 to clarify the matter
and formulate a recommendation to ISO/IEC-JTC 1/SC 35. At its last meeting
in Shanghai on 2 August, ISO/IEC-JTC 1/SC 35 unanimously approved

Resolution 2019-69: Requests that Alpha-3 codes be used and recommended 

ISO/IEC JTC1/SC35 

*	recognizes that the application of the 2-letter (alpha-2) code today
is not sufficient for use in programs and apps related to user interfaces
which is particularly detrimental when needed for identifying individual
languages (including individual sign languages) in user interfaces.  
*	resolves to recommend the use of 3-letter codes for language
identification, wherever they can be applied 
*	requests its chair to contact W3C to ask that they recommend the use
of 3-letter identifiers for the names of languages wherever used according
to:

*	ISO 639-2 "Codes for the representation of names of languages - Part
2: Alpha-3 code" and 
*	ISO 639-3 "Codes for the representation of names of languages - Part
3: Alpha-3 code for comprehensive coverage of languages" (which includes
additional languages beyond those in ISO 639-2) 

These can be recommended either in addition to or in replacement for the
2-letter language identifiers as defined in ISO 639-1 "Codes for the
representation of names of languages - Part 1: Alpha-2 code". 

 

Here is the issue as explained by David Fourney:

The technical issue lies primarily with the HTML5 <video> element and how it
supports the HTML lang attribute.

A <video> allows for one or more <source> files (which can be audio and/or
video tracks) as well as one or more <track> files (for subtitles, captions,
transcripts, etc.).

As a developer, I want to specify the language of the captions, audio, and
video so I can meet WCAG's SCs. (WCAG SC 3.1.1 and SC 3.1.2 require the
specification of the language of content.)

HTML allows the specification of the language of content on pretty much any
element using HTML5's lang attribute. This means that I can specify the
language of a caption file, an audio track, or (presumably) a video track.

As a user, if my media player supports it, I can select an audio track in
one language (e.g., French) and a caption track in another (e.g.,
Norwegian). Theoretically, I can also select a video track in whatever
language I want.

That's where the problem lies. If the audio is embedded in the video file,
then obviously the language of the video is the language of the audio. This
can be any spoken language. Typically, this is indicated with a
two-character code. (This is also true with audio sources and captioning.)

Many languages do NOT have a two-character code. (Many, many languages face
this issue. The SIL code tables provide a list of languages that have one
or both types of codes: https://iso639-3.sil.org/code_tables/639/data)

But, what if there is no audio in the video? What if the language of the
video is in fact a visual language? What if it is a sign language? 

I should be able to specify the language of the content (e.g., lang="ase").
Since no sign languages have a two-character code, this must be a
three-character code.
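
To make the markup in question concrete, the following is a minimal sketch
in Python that lists the language declared on each media-related element of
a page fragment. The fragment and its file names are hypothetical and only
serve as an illustration; srclang is the attribute HTML defines for the
language of a <track>. The sketch uses the standard library's html.parser
and says nothing about whether a browser or media player acts on the values.

```python
from html.parser import HTMLParser

# Hypothetical markup: the file names are placeholders, not a real page.
SAMPLE = """
<video lang="ase" controls>
  <source src="signed.mp4" type="video/mp4">
  <track kind="captions" src="captions.no.vtt" srclang="no">
  <track kind="subtitles" src="subtitles.fr.vtt" srclang="fr">
</video>
"""


class LangCollector(HTMLParser):
    """Collect (element, attribute, value) for lang and srclang attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        for name, value in attrs:
            if name in ("lang", "srclang"):
                self.found.append((tag, name, value))


def collect_languages(markup):
    """Return every language declaration found in the given markup."""
    parser = LangCollector()
    parser.feed(markup)
    parser.close()
    return parser.found
```

Calling collect_languages(SAMPLE) reports the three-character code on the
<video> element next to the two-character codes on the caption tracks.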

 

3 Combinations of codes:

Increasingly, a higher degree of granularity is becoming necessary for
identifying not only languages and their regional varieties, but also other
dimensions of language variation, such as a speaker’s language register or
communication anomaly. So far the ISO 639 series deals with combinations of
language identifiers with the country (or major subdivision) codes according
to the ISO 3166 series and the script codes according to ISO 15924.

 

Here again is David Fourney’s explanation:

With respect to the size of the string used to fully specify languages, I
recommend looking at IETF's BCP47 (https://tools.ietf.org/html/bcp47). BCP47
is the document HTML seems to rely upon as well.

W3C could ask the authors of BCP47 to require a new minimum string size (if
it is not already large enough) and recommend the expected use of
separators. I suggest using a string larger than 12 characters to
future-proof this decision.

I recommend W3C provide examples in all of their discussions on the use of
the lang attribute. These examples should all start with the 3-character
code as their base. All examples using the 2-character code should be
updated.

With respect to scripts, as I recall, HTML relies entirely on the
specification of the character set. Typically, this is now set to Unicode
which is thought to provide the necessary characters to write in various
languages. As I understand the situation (and I could be wrong), authors do
not have the ability to specify the script of their content.

You are correct that it would be exceedingly useful to be able to
deliberately specify a script (rather than a character set). I envisioned
this when I wrote ISO/IEC 24756:2009 and, to a lesser extent, ISO/IEC
20071-23. For example, in languages that have more than one script, it would
be useful for users to be able to specify that they want captions in one
preferred script (e.g., a user might want Russian captions to be presented
in Roman script rather than Cyrillic).
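
On the Russian example: BCP 47 language tags can carry a four-letter ISO
15924 script subtag after the language subtag. The following is a minimal
sketch in Python that splits a tag into language, script and region. It
assumes a deliberately simplified shape (language, optional script, optional
region) and leaves out the rest of the BCP 47 grammar; the names TAG_SHAPE
and split_tag are made up for this illustration.

```python
import re

# Simplified shape check: language[-script][-region] only.
# Assumption: extended language subtags, variants, extensions and
# private-use subtags from BCP 47 are deliberately left out.
TAG_SHAPE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"             # ISO 639 code, 2 or 3 letters
    r"(?:-(?P<script>[A-Za-z]{4}))?"           # ISO 15924 script, 4 letters
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"  # 2 letters or 3 digits
)


def split_tag(tag):
    """Return the subtags of a tag that fits the simplified shape, else None."""
    match = TAG_SHAPE.fullmatch(tag)
    return match.groupdict() if match else None
```

Under these assumptions, split_tag("ru-Latn") separates the language ru from
the script Latn (Russian in Latin script, as opposed to ru-Cyrl). The sketch
checks only the shape of a tag, not whether its subtags are registered or
whether a user agent does anything with them.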

 

 

-----Original Message-----

From: Janina Sajka <janina@rednote.net> 

Sent: Thursday, 29 August 2019 18:17

To: lisa.seeman <lisa.seeman@zoho.com>

Cc: christian.galinski@chello.at; W3C WAI Accessible Platform Architectures
<public-apa@w3.org>

Subject: Re: Language codes and iso639 series

 

Hi, Lisa, Christian, All:

 

It's unclear to me what kind of assistance you're seeking, and specifically
what agendum we might propose for a joint meeting during TPAC. Christian,
are you planning to attend TPAC? It would be helpful, as I don't see us
effectively carrying your concerns second hand.

 

I'm aware, at least to a degree, of ISO and IETF standardization on language
coding to include support for specifying sign language usage,[1] but those
are not activities directly in W3C's I18N remit,[2] though working in
coordination with those groups clearly is.

 

Is there a W3C i18n document Christian is looking to affect? Or perhaps
you're proposing something W3C might publish? APA would clearly be
interested, but the specifics just aren't in your email so I'm left
guessing.

 

We were certainly aware of the multiplicity of sign languages when we
created our "Media Accessibility User Requirements (MAUR)"[3] document
during the process of defining HTML 5.0, and I believe HTML 5 supports that
well for alternative media. But, I don't think we've done anything
specifically beyond that activity in this space.

 

PS: Any news on standardizing lang codes for AAC?

 

Please feel free to say more. I'd like to be helpful if I can.

 

Best,

 

Janina

 

[1] https://www.evertype.com/standards/iso639/sgn.html

[2] https://www.w3.org/i18n

[3] http://www.w3.org/TR/media-accessibility-reqs/

 

Lisa Seeman writes:

> Hi Janina

> Christian, who is cc'd, is working on improving language code support so

> that it works for sign language and the combinations, for example English

> sign language with a Canadian dialect.

> 

> Can we bring this up at TPAC with internationalisation?

> 

> All the best

> 

> Lisa Seeman


 

-----Original Message-----
From: Fourney, David <david.fourney@usask.ca> 
Sent: Monday, 19 August 2019 13:20
To: christian.galinski@chello.at <christian.galinski@chello.at>
Cc: klaus.miesenberger <klaus.miesenberger@jku.at>
Subject: Re: Re: HTML etc. and ISO 639-1 2-letter code

 

Hi Christian,

 

With respect to the size of the string used to fully specify languages, I
recommend looking at IETF's BCP47

https://tools.ietf.org/html/bcp47

 

BCP47 is the document HTML seems to rely upon as well.

 

W3C could ask the authors of BCP47 to require a new minimum string size (if
it is not already large enough) and recommend the expected use of
separators. I suggest using a string larger than 12 characters to
future-proof this decision.

 

I recommend W3C provide examples in all of their discussions on the use of
the lang attribute. These examples should all start with the 3-character
code as their base. All examples using the 2-character code should be
updated.

 

With respect to scripts, as I recall, HTML relies entirely on the
specification of the character set. Typically, this is now set to Unicode
which is thought to provide the necessary characters to write in various
languages. As I understand the situation (and I could be wrong), authors do
not have the ability to specify the script of their content.

 

You are correct that it would be exceedingly useful to be able to
deliberately specify a script (rather than a character set). I envisioned
this when I wrote ISO/IEC 24756:2009 and, to a lesser extent, ISO/IEC
20071-23. For example, in languages that have more than one script, it would
be useful for users to be able to specify that they want captions in one
preferred script (e.g., a user might want Russian captions to be presented
in Roman script rather than Cyrillic).

 

Finally, on the choice of codes: I strongly recommend that ISO and W3C make
an explicit recommendation on exactly which code set to use. The existence
of multiple 3-character sets will add to the problem rather than solve
anything. ISO will need to unify this work to help ease the confusion.

 

David.

 

________________________________________

From: christian.galinski@chello.at <christian.galinski@chello.at>

Sent: Monday, August 19, 2019 3:06 AM

To: Fourney, David

Cc: klaus.miesenberger

Subject: Fwd: Re: HTML etc. and ISO 639-1 2-letter code

 

Hi David,

Many thanks for this excellent clarification!

 

The recommendation to use only the 3-letter code for languages is obviously
just one step towards handling language codes in various combinations with
other codes and thus indicating language varieties to some extent. At
present language varieties can only be indicated in a rudimentary form.
ISO/TR 21636 "Indication and description of language varieties" will pave
the way for a much more detailed coding of varieties in the future.

 

At present we have at our disposal for coding languages (disregarding the
2-letter code according to ISO 639-1):

- 3-letter language codes (all lowercase) according to ISO 639-2 and 639-3

- 3-letter codes for countries and their subdivisions (all uppercase)
according to ISO 3166-1 and 3166-2

  (I think we should recommend the use of the 3-letter code here as well)

- 4-letter codes for scripts and script variants (first letter capitalized)

With 10 characters (12 if separators are added) we can thus cope with a lot
of variation, under given limitations.

 

In the case of sign languages (being true sign languages, i.e. mother
tongues for the Deaf and Hard-of-Hearing) we have at our disposal:

- 3-letter language codes (all lowercase) according to ISO 639-3

  (to be extended to include further sign languages)

- 3-letter codes for countries and their subdivisions (all uppercase)
according to ISO 3166-1 and 3166-2

With 6 characters (7 if separators are added) we can thus cope with some
variation, under given limitations.

 

In the case of the language variety "signed language" (e.g. Signed Exact
English) we have at our disposal:

- "sgn" as indicator for "signed language"

- 3-letter language codes (all lowercase) according to ISO 639-2 and 639-3

- 3-letter codes for countries and their subdivisions (all uppercase)
according to ISO 3166-1 and 3166-2

With 9 characters (11 if separators are added) we can cope with a lot of
variation, under given limitations. For example, sgn-eng-AUS would refer to
the Australian variety of Signed Exact English.
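
The three character counts above can be checked with a small sketch in
Python. It follows the combination scheme described in this message (3-letter
language, 3-letter country, 4-letter script, hyphen as separator). Apart from
sgn-eng-AUS, the component values are chosen purely for illustration, and the
function name is made up.

```python
SEPARATOR = "-"


def combined_length(parts, with_separators=True):
    """Count the characters in a combined code built from the given parts."""
    if with_separators:
        return len(SEPARATOR.join(parts))
    return sum(len(part) for part in parts)


# Component values are illustrative; only sgn-eng-AUS appears in the message.
EXAMPLES = {
    "language + country + script": ["eng", "CAN", "Latn"],
    "sign language + country": ["ase", "USA"],
    "signed language": ["sgn", "eng", "AUS"],
}

for label, parts in EXAMPLES.items():
    print(label, SEPARATOR.join(parts),
          combined_length(parts, with_separators=False),
          combined_length(parts))
```

For the three cases this gives 10 and 12, 6 and 7, and 9 and 11 characters
without and with separators.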

 

Would this mean that we should recommend, under given circumstances and as
a step in the direction of further necessary varieties in the future, a
minimum of 12 characters (incl. separators) for coding languages (incl. sign
languages and signed language)? Is this realistic, and if so, is it
sufficient?

 

Best regards

Christian

 

 

 

 

 

> ---------- Original Message ----------

> From: "Fourney, David" <david.fourney@usask.ca>

> To: christian.galinski@chello.at <christian.galinski@chello.at>

> Cc: "klaus.miesenberger" <klaus.miesenberger@jku.at>, hoeckner

> <hoeckner@hilfsgemeinschaft.at>

> Date: 17 August 2019 at 02:00

> Subject: Re: HTML etc. and ISO 639-1 2-letter code

> 

> Hi Christian,

> 

> To answer your specific question: There is no connection to CSS.

> Cascading Style Sheets are used only for the styling and presentation 

> of content. For example, I would use CSS to indicate the font I want, 

> whether to make the text bold, and where to put it on the screen. CSS 

> is not for specifying languages, this is the role of HTML.

> 

> The technical issue lies primarily with the HTML5 <video> element and 

> how it supports the HTML lang attribute.

> 

> A <video> allows for one or more <source> files (which can be audio 

> and/or video tracks) as well as one or more <track> files (for 

> subtitles, captions, transcripts, etc.).

> 

> As a developer, I want to specify the language of the captions, audio, 

> and video so I can meet WCAG's SCs. (WCAG SC 3.1.1 and SC 3.1.2 

> require the specification of the language of content.)

> 

> HTML allows the specification of the language of content on pretty 

> much any element using HTML5's lang attribute. This means that I can 

> specify the language of a caption file, an audio track, or 

> (presumably) a video track.

> 

> As a user, if my media player supports it, I can select an audio track 

> in one language (e.g., French) and a caption track in another (e.g., 

> Norwegian). Theoretically, I can also select a video track in whatever 

> language I want.

> 

> That's where the problem lies. If the audio is embedded in the video 

> file, then obviously the language of the video is the language of the 

> audio. This can be any spoken language. Typically, this is indicated 

> with a two-character code. (This is also true with audio sources and

> captioning.)

> 

> Many languages do NOT have a two-character code. (Many, many languages 

> face this issue. The SIL code tables provide a list of languages that 

> have one or both types of codes:

> https://iso639-3.sil.org/code_tables/639/data)

> 

> (A reminder that 2019 is the UN's International Year of Indigenous

> Languages.)

> 

> But, what if there is no audio in the video? What if the language of 

> the video is in fact a visual language? What if it is a sign language? 

> I should be able to specify the language of the content (e.g., 

> lang="ase"). Since no sign languages have a two-character code, this 

> must be a three-character code.

> 

> So the first issue is: "Can I do this?"

> 

>  From reading the HTML 5.2 and some IETF specifications, I MIGHT be 

> able to use a three-character code, but it's not very clear IF I CAN. 

> The specification appears to allow a code of 6 to 8 characters in length.

> This suggests a combination of language and region codes, including 

> hyphens, might fit a three-character language code plus a 

> two-character region code, but not much else.

> 

> Resources on this include IETF's BCP47

> https://tools.ietf.org/html/bcp47

> and the HTML5.2 specification

> https://www.w3.org/TR/html52/dom.html#the-lang-and-xmllang-attributes

> 

> The living specification discusses this at 

> https://html.spec.whatwg.org/#the-lang-and-xml:lang-attributes

> 

> The second issue is: "Will it work?"

> 

> If a browser sees a three-character language code, will it know what 

> to do with it? What about a media player? What about a screen reader?

> 

> It's all well and good that I can specify my language, but not if it is 

> not supported (i.e., my user agent won't be able to handle it).

> 

> Setting aside <video>, I would also point out that this second issue 

> applies to the browser in general. Is there full support for 

> specifying the language of a document using a three-character code 

> (e.g., <html lang="eng"> vs. <html lang="en">)?

> 

> 

> As I mentioned in Ottawa, what we need the W3C to do is:

> 1. Confirm how large a language code can be used within the HTML lang 

> attribute and determine if this length is large enough given the 

> three-character codes of ISO 639-2 and the various region and script 

> codes that can be appended to it.

> 

> 2. Confirm that user agents are required to support long language 

> codes (via the lang attribute), not just the two-character codes that 

> are specified in ISO 639-1. This is important because, if the HTML 

> specifications allow for rather long codes but the user agents do not, 

> then using a long code will not work.

> 

> To my mind, there should be no issue because it is just a language 

> indication code. Most of the time user agents should just accept any 

> code and do nothing further with it.

> 

> This issue was the source of my concern only because you mentioned the 

> demand to freeze ISO 639-1 from 20+ years ago. The freeze request 

> suggests to me that user agents only support a small number of codes 

> and intend to act in some way on these codes.

> 

> 3. Confirm that the lang attribute (of any length) can be used on any 

> HTML element in a meaningful way, including the specification of the 

> language of a video track (e.g., <source src="movie.mp4"

> type='video/mp4' lang='ase'>).

> 

> Ultimately, the need is to determine if user agents support 

> three-character codes so that the specification of a video or a 

> document in a language that only has a three-character code will 

> actually work. I would expect someone at W3C will know what support is (or

> is not) available.

> 

> 

> I hope that this explanation helps you. Please let me know if you have 

> any questions.

> 

> Thanks,

> David.

> 

> 

> On 2019-08-15 12:21 p.m., christian.galinski@chello.at wrote:

> > Hi, David,

> >

> > How are you doing?

> >

> > Further to our recent discussions I would like to ask you to clarify 

> > one more technical question: concerning the use of the alpha-2 code 

> > (acc. to ISO 639-1?) in HTML and/or XHTML and/or HTML5 which you 

> > mentioned is hindering certain functions/features necessary for the 

> > Deaf and hard of hearing. Is there a connection to CSS?

> >

> > Could you please elaborate a bit on this technical question?

> >

> > If there is an issue, how should it be presented to W3C/TPAC?

> >

> > Best regards

> >

> > Christian

> >

> > p.s.

> >

 
Received on Thursday, 5 September 2019 01:14:49 UTC
