[alreq] Character tables: separating language-dependent characters from Formatting Characters from Behnam Esfahbod via GitHub on 2017-06-27 (public-i18n-archive@w3.org from April to June 2017)

From: Behnam Esfahbod via GitHub <sysbot+gh@w3.org>
Date: Tue, 27 Jun 2017 10:40:36 +0000
To: public-i18n-archive@w3.org
Message-ID: <issues.opened-238808571-1498560035-sysbot+gh@w3.org>

behnam has just created a new issue for https://github.com/w3c/alreq:

== Character tables: separating language-dependent characters from Formatting Characters ==
This is specifically about [Section A.5 Control characters](https://w3c.github.io/alreq/#h_character_tables_control_characters).

## Issues

1. The table contains many characters that are not language-dependent; whether they appear in text depends on the text format (plain text, HTML, etc.). IMHO, we *should not* expect these characters to be handled correctly in CLDR, and the fact that some of them appear in the CLDR exemplars is not enough to make CLDR a reliable source for them.

1. The Bidi Directional Formatting Characters are explicitly defined by UBA ([TR9](http://www.unicode.org/reports/tr9/#Directional_Formatting_Characters)), which is *normatively* referenced from ALReq in [Section 2.3 Direction](https://w3c.github.io/alreq/#h_direction); therefore, no extra source (like the CLDR exemplars) is needed to demonstrate the need for these characters.

1. ZWJ and ZWNJ are the exceptions in this list: they are Joining Control characters, and their usage in the Arabic script is described in [Section 2.4 Joining](https://w3c.github.io/alreq/#h_joining), based on the [Unicode Arabic Cursive Joining](http://www.unicode.org/versions/Unicode9.0.0/ch09.pdf), another normative reference of ALReq. These characters are **expected** to be present in the content, and no higher-level protocol is expected to handle them. (NOTE: Maintaining joining during hyphenation, and similar cases, are not cases of ZWNJ/ZWJ in action.)

1. U+FEFF should be explicitly excluded, regardless of what the ISIRI spec says about it. It is tied to the UTF encodings of Unicode text and has nothing to do with content or anything script-specific.

1. U+2060 (WORD JOINER), U+2028 (LINE SEPARATOR), and U+2029 (PARAGRAPH SEPARATOR) are the only non-ASCII characters left over, all in limbo with no good documentation or corpus supporting them. I recommend simply blacklisting them explicitly (similar to U+FEFF, if that also needs to be blacklisted).

1. And CR/LF are the only ones left, which, again, are talked about in the document, their stats are mixed up, and they only add to the confusion. (For example, why is U+0009 TAB not there?)

1. If we want to keep CR/LF or U+2028/U+2029, IMHO, we need to have a section about them, explaining their use in line break/paragraph separation, AND anything platform-specific. Again, IMO, that's out of the scope of ALReq.
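
To make the proposed separation concrete, here is a minimal Python sketch of the grouping argued for above. The group names, the `classify` and `summarize` helpers, and the exact membership of each set are my own illustration, not something taken from ALReq or CLDR; the bidi set follows the Directional Formatting Characters listed in TR9, and the "excluded" set is simply the characters I suggest blacklisting.

```python
from collections import Counter

# Joining controls: expected to appear in Arabic-script content itself.
JOINING_CONTROLS = {0x200C, 0x200D}  # ZWNJ, ZWJ

# Bidi directional formatting characters, as listed in TR9 (UBA).
BIDI_FORMATTING = {
    0x061C,                                  # ALM
    0x200E, 0x200F,                          # LRM, RLM
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,  # LRE, RLE, PDF, LRO, RLO
    0x2066, 0x2067, 0x2068, 0x2069,          # LRI, RLI, FSI, PDI
}

# Characters this issue proposes to exclude (an assumption of this sketch).
EXCLUDED = {0xFEFF, 0x2060, 0x2028, 0x2029}

# CR/LF: left in their own group because their status is still open.
LINE_CONTROLS = {0x000A, 0x000D}


def classify(ch: str) -> str:
    """Return the hypothetical group name for a single character."""
    cp = ord(ch)
    if cp in JOINING_CONTROLS:
        return "joining-control"
    if cp in BIDI_FORMATTING:
        return "bidi-formatting"
    if cp in EXCLUDED:
        return "excluded"
    if cp in LINE_CONTROLS:
        return "line-control"
    return "other"


def summarize(text: str) -> Counter:
    """Count the non-'other' groups that occur in a piece of text."""
    return Counter(g for g in map(classify, text) if g != "other")
```

For example, `summarize("\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645")` (a Persian word written with a ZWNJ) reports a single joining control and nothing else, which is the kind of content-level usage that only ZWJ/ZWNJ have among the characters in the table.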

Please view or discuss this issue at https://github.com/w3c/alreq/issues/127 using your GitHub account

Received on Tuesday, 27 June 2017 10:40:42 UTC