ISSUE-295 (Multiple languages): Allow xml:lang to be set anywhere needed within IMSC documents [IMSC]

http://www.w3.org/AudioVideo/TT/tracker/issues/295

Raised by: Nigel Megitt
On product: IMSC

Concerns have been expressed within the EBU XMLSubtitles group regarding the restriction of xml:lang to a single value at the top level of IMSC documents, and the restriction of supported Unicode code points dependent on the value of xml:lang, on the basis that this usage is contrary to the accepted use of the Unicode standard.

The IMSC document [1] states in Appendix A. Recommended Unicode Code Points per Language that “The following table specifies the [UNICODE] code points that SHOULD be used in a document's textual content if xml:lang is present (Primary language subtag is as defined in IETF RFC 5646).”
 
[1] https://dvcs.w3.org/hg/ttml/raw-file/tip/ttml-ww-profiles/ttml-ww-profiles.html#recommended-unicode-code-points-per-language

IMSC allows only a single xml:lang value, so all content of an IMSC-compliant document is labelled as belonging to a single language. Further, the implication of the Appendix A heading is that only the code points identified for a specific xml:lang value should be used in a document carrying that value. This effectively recommends that no ‘foreign language’ phrase appear in an IMSC document. It could be argued that a foreign phrase would not appear in a subtitle document, since everything would be translated. This is, however, incorrect. For example, subtitles for a travel program are quite likely to contain foreign phrases. In addition, many languages use ‘loan words’ and phrases which, when correctly represented, should retain their proper accents and presentation. More importantly, in many countries captions are expected to be verbatim. The scope of the IMSC document expressly includes captions, so it is difficult to understand how verbatim speech might be conveyed in an IMSC document if the speaker being captioned chooses to use foreign words or phrases.
 
Of course, were xml:lang permitted on elements within the IMSC document, with different values on different elements so that languages are correctly identified, the above recommendations would have less impact.

Removing the xml:lang restriction would also support the use of IMSC subtitles as a source of 'spoken subtitles' in which the distributed subtitles are rendered as speech, which may require the use of language-dependent speech synthesis models.
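
To illustrate what per-element language labelling would make possible, the following is a minimal sketch, not taken from the IMSC draft. It assumes a hypothetical TTML fragment in which a span carries its own xml:lang, and a hypothetical helper text_runs that resolves the effective language of each run of text by inheriting xml:lang from the nearest ancestor. A renderer or a 'spoken subtitles' engine could use such runs to choose fonts or a speech synthesis model per language.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment: the document language is English, one phrase is French.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p>She smiled and said <span xml:lang="fr">à bientôt</span>.</p>
    </div>
  </body>
</tt>"""


def text_runs(element, inherited=""):
    """Yield (language, text) pairs in document order.

    An element without its own xml:lang inherits the value of its
    nearest ancestor that has one.
    """
    lang = element.get(XML_LANG, inherited)
    if element.text and element.text.strip():
        yield lang, element.text.strip()
    for child in element:
        yield from text_runs(child, lang)
        # Text following a child belongs to the parent, so it takes the parent's language.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()


runs = list(text_runs(ET.fromstring(SAMPLE)))
for lang, text in runs:
    print(lang, repr(text))
```

With a single document-wide xml:lang, the French phrase in this sketch would be labelled as English, which is the situation the paragraphs above object to.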
 
It is apparent that the intention of the IMSC document is to simplify the implementation requirements, essentially permitting and encouraging implementations to support only perceived specific regional requirements. However, this perspective is fundamentally flawed, as it does not accept the multi-cultural nature of the world. For example, in the USA, the main languages spoken, according to the American Community Survey 2009 (and endorsed by the United States Census Bureau), are:
 
English - 229 million
Spanish - 35 million
Chinese languages - 2.6 million + (mostly Cantonese speakers, with a growing group of Mandarin speakers)
Tagalog - 1.5 million + (Most Filipinos may also know other Philippine dialects)
French - 1.3 million
Vietnamese - 1.3 million
German - 1.1 million (High German) + German dialects
Korean - 1.0 million
Russian - 881,000
Arabic - 845,000
Italian - 754,000
Portuguese - 731,000
French Creole - 659,000
Polish - 594,000
Hindi - 561,000
… followed by many more languages with fewer than 500,000 speakers per language
 
Other countries and regions have even more diverse demographics (e.g. Europe, SE Asia).
 
EBU XML Subtitles group does not consider it appropriate for a W3C specification to make a recommendation that is so limited in acceptance of Internationalisation principles or to encourage the development of client implementations that are unable to support the real broader internationalisation needs of the viewing audience.
 
Further, Table B “Typical Practice for Subtitles per Region (Informative)” in the IMSC document [1] perpetuates this ‘nationalistic and parochial’ viewpoint by implying that only certain subtitle languages are typically used in certain regions. This is clearly also flawed. For example I can quite readily purchase a single (region 2) DVD in the UK that has the following subtitle languages available: English, French, Italian, Spanish, Danish, Dutch, Finnish, Icelandic, Norwegian, Portuguese, Swedish and Arabic.
 
A final concern is that, whilst the tables in Appendix A of the IMSC specification may seem to have some ‘single point of reference’ utility, there is a very real potential for a loss of synchronisation between IMSC and the real owner of this intellectual space, the Unicode Consortium. The Unicode characters used by specific languages are already clearly identified by the Unicode Consortium in the CLDR project. This should be the reference for such information, should it be needed.
 
We propose that the IMSC document should be amended to remove the implication that limited character sets in client implementations are acceptable or recommended, that the tables should be removed, and that due reference should be made to the appropriate authority (Unicode CLDR).

Received on Thursday, 7 November 2013 13:34:56 UTC