- From: Andrew Cunningham <andj.cunningham@gmail.com>
- Date: Mon, 9 Jan 2017 13:59:39 +1100
- To: WAI Interest Group <w3c-wai-ig@w3.org>
Hi everyone,

At the moment I am doing some work for a Victorian Government agency. The focus is on web internationalisation, specifically the integration of government information translated into community languages (the languages spoken and read by migrant and refugee communities, whose members may have limited fluency in English).

Two common community languages used by our state government are Burmese and Sgaw Karen. Content can be found in HTML or PDF files. The Unicode Consortium's Myanmar Scripts and Languages FAQ [1] provides some background information, although the FAQ understates the complexity of the problem.

Most Burmese translations are not provided in Unicode. Burmese is usually provided as a file using the Zawgyi pseudo-Unicode (or ad hoc) encoding. Sgaw Karen is usually provided in a number of 8-bit legacy encodings, or occasionally using a pseudo-Unicode (ad hoc) font. Web content in a pseudo-Unicode encoding identifies itself as UTF-8, while web content in an 8-bit legacy encoding declares itself as ISO/IEC 8859-1 or Windows-1252.

From an internationalisation perspective, things are quite simple: the content should be in Unicode. Although I can just come out and say that, I also need to be able to justify the internationalisation decisions with reference to accessibility considerations. Web accessibility is important for state government agencies and departments; they aim to meet WCAG 2.0 AA requirements.

My reading of the WCAG 2.0 recommendation is that the encoding issues for Burmese and Sgaw Karen directly impact principles 3 and 4, and that non-Unicode content would be considered inaccessible. But there are no specific guidelines that are relevant, and nothing for documents to comply with, other than a generic principle? Would this be correct? In HTML it makes sense to require Unicode for Burmese and Sgaw Karen content, but there is no explicit accessibility requirement to do so. (A rough sketch of the kind of detection and conversion check I have in mind, and of a corresponding audit for PDFs, is in the postscript below.)

PDFs are a more complex problem. The ToUnicode mapping in Burmese and Sgaw Karen PDF files that use pseudo-Unicode or legacy fonts is essentially useless, since such fonts work by deliberately mis-declaring the glyph-to-codepoint correspondences. Using Unicode alone also doesn't get us all the way to an accessible document for these languages, since the PDF specification cannot handle all aspects of resolving glyphs to codepoints for the font technologies employed for complex scripts (writing systems).

The way around this may be to make use of the ActualText attribute. In PDF Techniques for WCAG 2.0 [2], PDF7 explicitly refers to the use of ActualText via OCR where the PDF contains images of text rather than a text layer. But I assume that ActualText would also be the appropriate way forward in a pseudo-Unicode or legacy font scenario?

Use of unsupported legacy encodings within PDF files has been fairly common for languages written in complex scripts, due to historical limitations in the ability of typesetting applications to handle the OpenType features required by complex-script languages. So it is a wider problem than just the two languages I have been discussing.

Do my assumptions sound reasonable from an accessibility perspective? Or are there alternative approaches from an accessibility perspective that you think I may have overlooked? Or have I totally lost the plot?

Feedback and input welcome.

Andrew

[1] http://www.unicode.org/faq/myanmar.html
[2] https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf.html

Andrew Cunningham
andj.cunningham@gmail.com
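P.S. For what it's worth, here is a minimal sketch of the kind of detection and conversion check I have in mind for the HTML side. It assumes Google's myanmar-tools detector and the ICU "Zawgyi-my" transliterator are available; those tool choices are my own assumption, not anything mandated by WCAG.

    # Sketch: flag Burmese text that is probably Zawgyi-encoded and convert it
    # to standard Unicode. Assumes the `myanmar-tools` and `PyICU` packages.
    from myanmar_tools import ZawgyiDetector
    from icu import Transliterator

    detector = ZawgyiDetector()
    # ICU (version 63 and later) ships a Zawgyi -> standard Unicode transliterator.
    converter = Transliterator.createInstance('Zawgyi-my')

    def to_unicode_burmese(text: str, threshold: float = 0.95) -> str:
        # get_zawgyi_probability() returns a probability in [0, 1], or -inf
        # when the text contains no Myanmar-range codepoints at all.
        score = detector.get_zawgyi_probability(text)
        if score > threshold:
            return converter.transliterate(text)
        return text

The same check could sit in a publishing workflow to stop Zawgyi-encoded pages being published with a UTF-8 declaration in the first place.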
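A similar check could be used to audit the existing PDFs by running the extracted text layer through the same detector: if the ToUnicode mapping is mis-declared, extraction yields either Zawgyi-like codepoints or text with no Myanmar codepoints at all, rather than standard Unicode Burmese. A sketch, assuming pypdf for the extraction (again, just an assumption about tooling):

    # Sketch: extract a PDF's text layer and score it, to flag files whose
    # ToUnicode mappings do not yield standard Unicode Burmese.
    from pypdf import PdfReader
    from myanmar_tools import ZawgyiDetector

    detector = ZawgyiDetector()

    def audit_pdf_text_layer(path: str) -> float:
        reader = PdfReader(path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        # A high score suggests the text layer is effectively Zawgyi; a score
        # of -inf means no Myanmar codepoints were extracted at all, which for
        # a supposedly Burmese document also points to a broken text layer.
        return detector.get_zawgyi_probability(text)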
Received on Monday, 9 January 2017 03:00:13 UTC