- From: Andrew Cunningham <andj.cunningham@gmail.com>
- Date: Mon, 9 Jan 2017 13:59:39 +1100
- To: WAI Interest Group <w3c-wai-ig@w3.org>
Hi everyone,

At the moment I am doing some work for a Victorian Government agency. The focus is on web internationalisation, specifically the integration of government information translated into community languages (the languages spoken and read by migrant and refugee communities, whose members may have limited fluency in English).

Two common community languages used by our state government are Burmese and Sgaw Karen. Content can be found in HTML or PDF files. The Unicode Consortium's Myanmar Scripts and Languages FAQ [1] provides some background information, although the FAQ understates the complexity of the problem.

Most Burmese translations are not provided in Unicode. Burmese is usually provided as a file using the Zawgyi pseudo-Unicode (or ad hoc) encoding. Sgaw Karen is usually provided in a number of 8-bit legacy encodings, or occasionally using a pseudo-Unicode (ad hoc) font. Web content in a pseudo-Unicode encoding identifies itself as UTF-8, while web content in an 8-bit legacy encoding declares itself as ISO/IEC 8859-1 or Windows-1252.

From an internationalisation perspective, things are quite simple: the content should be in Unicode. Although I can just come out and say that, I also need to be able to justify the internationalisation decisions with reference to accessibility considerations. Web accessibility is important for state government agencies and departments; they aim to meet WCAG 2.0 AA requirements.

My reading of the WCAG 2.0 recommendation is that the encoding issues for Burmese and Sgaw Karen directly impact principles 3 and 4, and that non-Unicode content would be considered inaccessible. But there are no specific guidelines that are relevant, and nothing for documents to comply with, other than a generic principle? Would this be correct? In HTML it makes sense to require Unicode for Burmese and Sgaw Karen content, but there is no explicit accessibility requirement to do so. (A rough sketch of the kind of detection and conversion check I have in mind, and of a corresponding audit for PDFs, is in the postscript below.)

PDFs are a more complex problem. The ToUnicode mapping in Burmese and Sgaw Karen PDF files that use pseudo-Unicode or legacy fonts is essentially useless, since such fonts work by deliberately mis-declaring the glyph-to-codepoint correspondences. Using Unicode alone also doesn't get us all the way to an accessible document for these languages, since the PDF specification cannot handle all aspects of resolving glyphs to codepoints for the font technologies employed for complex scripts (writing systems).

The way around this may be to make use of the ActualText attribute. In PDF Techniques for WCAG 2.0 [2], PDF7 explicitly refers to the use of ActualText via OCR where the PDF contains images of text rather than a text layer. But I assume that ActualText would also be the appropriate way forward in a pseudo-Unicode or legacy font scenario?

Use of unsupported legacy encodings within PDF files has been fairly common for languages written in complex scripts, due to historical limitations in the ability of typesetting applications to handle the OpenType features required by complex-script languages. So it is a wider problem than just the two languages I have been discussing.

Do my assumptions sound reasonable from an accessibility perspective? Or are there alternative approaches from an accessibility perspective that you think I may have overlooked? Or have I totally lost the plot?

Feedback and input welcome.

Andrew

[1] http://www.unicode.org/faq/myanmar.html
[2] https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf.html

Andrew Cunningham
andj.cunningham@gmail.com
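P.S. For what it's worth, here is a minimal sketch of the kind of detection and conversion check I have in mind for the HTML side. It assumes Google's myanmar-tools detector and the ICU "Zawgyi-my" transliterator are available; those tool choices are my own assumption, not anything mandated by WCAG.

    # Sketch: flag Burmese text that is probably Zawgyi-encoded and convert it
    # to standard Unicode. Assumes the `myanmar-tools` and `PyICU` packages.
    from myanmar_tools import ZawgyiDetector
    from icu import Transliterator

    detector = ZawgyiDetector()
    # ICU (version 63 and later) ships a Zawgyi -> standard Unicode transliterator.
    converter = Transliterator.createInstance('Zawgyi-my')

    def to_unicode_burmese(text: str, threshold: float = 0.95) -> str:
        # get_zawgyi_probability() returns a probability in [0, 1], or -inf
        # when the text contains no Myanmar-range codepoints at all.
        score = detector.get_zawgyi_probability(text)
        if score > threshold:
            return converter.transliterate(text)
        return text

The same check could sit in a publishing workflow to stop Zawgyi-encoded pages being published with a UTF-8 declaration in the first place.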
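A similar check could be used to audit the existing PDFs by running the extracted text layer through the same detector: if the ToUnicode mapping is mis-declared, extraction yields either Zawgyi-like codepoints or text with no Myanmar codepoints at all, rather than standard Unicode Burmese. A sketch, assuming pypdf for the extraction (again, just an assumption about tooling):

    # Sketch: extract a PDF's text layer and score it, to flag files whose
    # ToUnicode mappings do not yield standard Unicode Burmese.
    from pypdf import PdfReader
    from myanmar_tools import ZawgyiDetector

    detector = ZawgyiDetector()

    def audit_pdf_text_layer(path: str) -> float:
        reader = PdfReader(path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        # A high score suggests the text layer is effectively Zawgyi; a score
        # of -inf means no Myanmar codepoints were extracted at all, which for
        # a supposedly Burmese document also points to a broken text layer.
        return detector.get_zawgyi_probability(text)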
Received on Monday, 9 January 2017 03:00:13 UTC