[charmod-norm] Editorial comments on http://aphillips.github.io/charmod-norm/ from r12a via GitHub on 2017-12-06 (public-i18n-archive@w3.org from October to December 2017)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Wed, 06 Dec 2017 15:22:57 +0000
To: public-i18n-archive@w3.org
Message-ID: <issues.opened-279792394-1512573776-sysbot+gh@w3.org>
r12a has just created a new issue for https://github.com/w3c/charmod-norm:

== Editorial comments on http://aphillips.github.io/charmod-norm/ ==
Comments from a review of Addison's changes, all listed here because they are editorial suggestions:

# 3.1.1 Regular expressions
I think this should be a section one level up, either as 3.1 or 3.2 or the last section in 3.1, since it's not really relevant to 3.1's matching algorithm, and breaks the correspondence between the list and the following sections.

# 3.1.1
I'm generally in favour of stating the recommendation/best practice succinctly, then following up with explanations and examples.  One reason is that if you link to the BP you can easily read the related explanations, rather than having to scroll back up to find the start of the relevant text; another is that i think it helps people skimming the document to home in on the info they want.  And i generally prefer to read "DO THIS, and now here's why", rather than read these docs like a novel.
So i'd like to see the mustard before the para in this section.

# 3.1.1 
> [S] Specifications that define a regular expression syntax MUST provide at least Basic Unicode Level 1 support per [UTS18] and SHOULD provide Extended or Tailored (Levels 2 and 3) support.

I think it would be helpful for readers to provide a very short explanation of what that means.
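
As a rough illustration of the kind of thing Level 1 covers, here is a minimal Python sketch. It assumes the standard `re` module purely as a convenient test bed (it is not claimed to be Level 1 conformant; for instance it has no `\p{...}` property syntax), and the variable names are only for illustration:

```python
import re

# Code points: "." should consume one whole code point,
# including one outside the Basic Multilingual Plane.
supplementary = "\U0001F600"
dot_matches_code_point = re.fullmatch(r".", supplementary) is not None

# Hex notation: a code point can be written by number inside the pattern.
hex_notation_works = re.fullmatch(r"\u00e9", "\u00e9") is not None

# Simple loose matches: case-insensitive matching works beyond ASCII.
loose_match_works = re.fullmatch("\u00e9", "\u00c9", re.IGNORECASE) is not None

print(dot_matches_code_point, hex_notation_works, loose_match_works)
```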

# 3.1.2 Converting to sequence of Unicode CPs
> the resulting character sequence might still be partially de-normalized (for example, if it begins with a combining mark).

That's only denormalised in the full-normalisation sense, right?  A combining mark on its own is not denormalised with respect to NFC or NFD. Perhaps we need to qualify or drop that text.
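
A quick way to check the point about a lone combining mark (a minimal Python sketch using the standard `unicodedata` module; the helper name `is_in_form` is just for illustration):

```python
import unicodedata

def is_in_form(text: str, form: str) -> bool:
    # A string is in a normalization form if normalizing it changes nothing.
    return unicodedata.normalize(form, text) == text

lone_mark = "\u0300"    # COMBINING GRAVE ACCENT with no base character
decomposed = "e\u0300"  # base letter followed by the same mark

print(is_in_form(lone_mark, "NFC"))   # the lone mark is left untouched by NFC
print(is_in_form(lone_mark, "NFD"))   # and by NFD
print(is_in_form(decomposed, "NFC"))  # this one is not NFC: it composes to U+00E8
```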

# 3.1.2
> This means that the [Encoding] specification 

This means that the Encoding specification [Encoding]

# 3.1.2
Replace the semi-colons with commas?  Otherwise it reads a bit odd at worst, a bit stop-start at best.

# 3.1.2
Again, i think the mustard needs to be higher up, rather than presented as a conclusion. 

# 3.1.2
> [C] For content authors, when converting content from a legacy character encoding to Unicode, it is RECOMMENDED that the text be normalized to Unicode Normalization Form C unless the mapping of specific characters interferes with the meaning.

I don't think we say _why_ this is recommended(?)
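
One possible justification, sketched in Python: two converters can emit canonically equivalent but differently encoded strings, and these only compare equal once both are brought to the same form. The two strings below are stand-ins for the output of such converters, not the result of any real legacy decoding:

```python
import unicodedata

# Hypothetical outputs of two different legacy-to-Unicode converters.
precomposed = "caf\u00e9"   # uses U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # uses "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization, a code point comparison treats them as different.
equal_before = precomposed == decomposed

# After normalizing both to NFC they are identical.
equal_after = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(equal_before, equal_after)
```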

# 3.1.3 para 1
> Most document formats and protocols provide a means for encoding characters or

i think that should be

Most document formats and protocols provide a means for **escaping** characters **and**

(emphasis just for clarifying this comment - not for the doc)

# 3.1.3 para 4
> the combining mark U+0300

the combining mark  ̀ [U+0300 COMBINING GRAVE ACCENT]

then remove the name further down

# 3.1.3 para 5
> the general rule is to expand escapes on the same "level" as the user is interacting with

what does this mean?  Is it mustard?

# 3.1.3 para 6
> escapes should be converted to the character sequence they represent before the processing of the syntax, unless explicitly forbidden by the format's processing rules

should be mustard
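
To make the quoted rule concrete, here is a small Python sketch. It assumes HTML-style numeric character references as the escape syntax and uses the standard `html.unescape` function; other formats would need their own unescaping step:

```python
import html
import unicodedata

# "e" followed by an escaped U+0300 COMBINING GRAVE ACCENT.
escaped = "e&#x300;"

# Normalizing the raw, still-escaped text cannot see the combining mark.
normalized_raw = unicodedata.normalize("NFC", escaped)

# Expanding the escape first exposes the real character sequence...
expanded = html.unescape(escaped)

# ...which normalization can then compose into U+00E8.
normalized_expanded = unicodedata.normalize("NFC", expanded)

print(normalized_raw == escaped)
print(normalized_expanded == "\u00e8")
```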

# 3.1.4 last para
> Specifications should avoid the NFKD and NFKC normalization forms unless there is a compelling reason. Implementations must not apply these normalization forms unless specifically requested by the user.

suggest we mustardise, and move just before the para it's in.  Also, spell out 'these', so that the mustard can be read alone.
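
For readers wondering why the compatibility forms deserve the warning, a minimal Python sketch using the standard `unicodedata` module shows NFKC discarding distinctions that NFC preserves (the sample characters are my own choice):

```python
import unicodedata

samples = {
    "ligature fi": "\ufb01",      # LATIN SMALL LIGATURE FI
    "superscript two": "\u00b2",  # SUPERSCRIPT TWO
    "circled one": "\u2460",      # CIRCLED DIGIT ONE
}

for name, char in samples.items():
    nfc = unicodedata.normalize("NFC", char)
    nfkc = unicodedata.normalize("NFKC", char)
    # NFC leaves each character alone; NFKC replaces it with a plain equivalent.
    print(name, nfc == char, repr(nfkc))
```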

# 3.1.4 
> Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content

A 'SHOULD' seems a bit strong. A 'should' may be better, or 'It is recommended that'.

# 3.1.4.1 2nd mustard
it would be good to have an example to illustrate and clarify the first mustard

# 3.1.4.1 last mustard
> warn users or prevent the input or creation of textual constructs starting with a combining mark

textual constructs -> syntactic constructs ?
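
For reference, the check such a tool would need is small. A minimal Python sketch, assuming "combining mark" means any character with a General Category of M (Mn, Mc or Me); the function name is hypothetical:

```python
import unicodedata

def starts_with_combining_mark(construct: str) -> bool:
    # An empty construct has no first character to test.
    if not construct:
        return False
    # General Category values beginning with "M" are the mark categories.
    return unicodedata.category(construct[0]).startswith("M")

print(starts_with_combining_mark("\u0300abc"))  # begins with COMBINING GRAVE ACCENT
print(starts_with_combining_mark("abc"))
```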

# 3.1.5 para 2
should be mustard (it even has a MUST in it)

# 3.1.5.1
> [S] Case-sensitive matching is RECOMMENDED for new protocols and formats.

i think this is only relevant for syntax, and not for user text, natural lang search, etc. so perhaps we should say so
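
As a reminder of why case-insensitive matching is risky for syntax, here is a short Python sketch of the default (non-tailored) case mappings for the Turkish letters. It relies only on the built-in `str.lower` and `str.upper`, which apply no locale tailoring:

```python
dotted_capital = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# The default lowercase of U+0130 is two code points: "i" + U+0307.
lowered = dotted_capital.lower()

# Dotless i uppercases to plain "I", which then lowercases to plain "i",
# so a round trip through case conversion does not return the original.
round_trip = dotless_small.upper().lower()

print(len(lowered), lowered == "i\u0307")
print(round_trip == dotless_small)
```

A case-sensitive comparison sidesteps all of this because the identifiers are compared code point for code point.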

# 3.1.5.1
> Turkish examples above

add link

# 3.1.5.2
move mustard to start of section

# 3.1.5.3 
the first para and the first mustard essentially say the same thing - so move the mustard to replace the first para ?

# 3.1.5.4 para 1
> Locale- or language-specific tailoring is most appropriate when it is part of natural language processing operations.

Locale- or language-specific tailoring is most appropriate when it is part of natural language processing operations, i.e. not within the scope of this document.

# 3.1.5.4 last mustard
the example of text-transform would only apply to natural language text, wouldn't it?  Which means that it's out of scope for this doc.

# 3.1.6 para 2
> Care should be taken not to interfere with the encoding of different languages

Sounds like, don't mess with UTF-8. Perhaps we should say:

Care should be taken not to interfere with the character repertoire for different languages


hth

Please view or discuss this issue at https://github.com/w3c/charmod-norm/issues/152 using your GitHub account
Received on Wednesday, 6 December 2017 15:22:59 UTC