Re: i18n comments on Polyglot Markup from Leif Halvard Silli on 2010-07-15 (public-html@w3.org from July 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 15 Jul 2010 22:15:59 +0400
To: Richard Ishida <ishida@w3.org>
Cc: public-html@w3.org, Eliot Graff <eliotgra@microsoft.com>
Message-ID: <20100715221559971777.15542567@xn--mlform-iua.no>
Richard Ishida, Tue, 13 Jul 2010 20:40:24 +0100:

> I am about to raise 8 bugs in bugzilla.  These comments have been 
> discussed by the i18n WG.  I hope you find them helpful.
> 
> FWIW, the i18n group keeps track of comments on your doc at 
> http://www.w3.org/International/reviews/1007-polyglot/


These are comments on some of the 8 issues/bugs on that page:

 2nd issue: 
  ]] In-document declarations always useful [...] So it's true to say 
that you strictly don't need it, but we would prefer that people do. 
Please could you reflect that in your document. [[
 Comment: I don't have the Polyglot Markup spec in front of me, but I 
believe only UTF-8 and UTF-16 are permitted encodings. At least, I 
filed bug 9962 long ago, which says that only UTF-8 and UTF-16 should 
be permitted. [1] Then, as Anne explained, there is no 
HTML5-compatible way to have an in-document UTF-16 declaration. Thus 
your 2nd issue is not relevant for UTF-16, at least. For UTF-8, an 
in-document declaration is _necessary_ (unless you want to rely on 
HTTP or the BOM). No other encodings should be allowed, as there is no 
HTML5-compatible way to specify them. When using UTF-8 without a BOM, 
the <meta charset="UTF-8"/> element should be required, since 
otherwise the document may default to Windows-1252 (or something 
similar) when parsed offline as HTML.
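
 As a rough illustration of that fallback (a sketch only; the sample 
word is made up, and Python's "cp1252" codec stands in for whatever 
default a given HTML parser would actually pick):

```python
# Hypothetical sample text; any non-ASCII character shows the effect.
text = "blåbær"

# The bytes as they would be stored in a UTF-8 file without a BOM.
utf8_bytes = text.encode("utf-8")

# What a reader sees if the parser falls back to Windows-1252
# because nothing declares the encoding.
misread = utf8_bytes.decode("cp1252")

# What a reader sees when the UTF-8 declaration is honoured.
correct = utf8_bytes.decode("utf-8")

print(misread)
print(correct)
```

 Each non-ASCII letter takes two bytes in UTF-8, and each of those 
bytes is shown as a separate character when misread as Windows-1252.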

 3rd issue:
  ]] … This could be read "use utf-8 with the appropriate BOM or UTF-16 
with the appropriate BOM", but a utf-8 bom (or signature) is not 
strictly necessary, and some would argue that it may cause problems, 
and it's use should be discouraged here. [[
 Comment: 
  On the first point: if it is possible to read the Polyglot Markup 
spec as saying that a BOM is needed together with UTF-8, then of course 
that detail should be fixed. 
  On the latter point: the HTML5 spec allows the BOM and has no 
warnings against it. Thus, unless HTML5 proper also advises against 
using the BOM, the Polyglot Markup spec must not warn against it 
either. (And unless the BOM causes issues for XML parsers, XML cannot 
be used to justify a warning against it.) 
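
 For reference, the BOMs in question are just fixed byte sequences at 
the start of the file. A minimal sketch of detecting them (the 
function name sniff_bom is hypothetical, and this is not the HTML5 
sniffing algorithm, only the BOM check on its own):

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"<html/>"))
print(sniff_bom(b"<html/>"))
```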

 4th issue:
  ]]
     … Character Encoding. Omit the either/or list. " In short, for 
correct character encoding, polyglot markup must either: " The MUST is 
too strong. There is no problem with using more than one declaration, 
and in an earlier comment we said that we recommend that you have a 
readable declaration in the source in addition to a UTF8/16 encoding.
 I think it is better just to omit the list and it's lead-in paragraph 
"In short, for correct ...".
 The information is contained in the following paragraph that starts 
with "If polyglot markup uses an encoding other than..."
  [[
 Comment: This issue indeed seems very similar to the 2nd issue. 
Otherwise, the Polyglot Markup spec seeks to specify what is 
HTML-compatible. That requires some either/or language, I think. But 
I'll study your bug.

 5th issue: 
  ]] No mention is made of the lang and xml:lang attributes. The 
document should say that both should be used when language attributes 
are used.[[
 Comment: Indeed, that is a very unforgivable bug. ;-) But, as the 
focus of this document is to be a _spec_, the document MUST say that 
both xml:lang and lang have to be used - neither can be used alone.
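
 A sketch of what checking that rule could look like, assuming the 
document is well-formed XML. The helper name lang_mismatches and the 
sample markup are made up for illustration:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def lang_mismatches(xhtml: str):
    """Return tags of elements that carry only one of lang / xml:lang,
    or carry both with different values."""
    bad = []
    for el in ET.fromstring(xhtml).iter():
        lang = el.get("lang")
        xml_lang = el.get(XML_LANG)
        if lang is None and xml_lang is None:
            continue  # no language attribute at all: nothing to check
        # Language tags are compared case-insensitively.
        if lang is None or xml_lang is None or lang.lower() != xml_lang.lower():
            bad.append(el.tag)
    return bad

sample = '<html lang="no" xml:lang="no"><p lang="en">hi</p></html>'
print(lang_mismatches(sample))
```

 Here the html element passes, while the p element is reported 
because it has lang without xml:lang.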

  ]]
  It may also recommend the use of the language attributes in the html 
element to set the default language for the document, and mention that 
the meta Content-Language element has no usefulness at all in XML for 
setting the language of content.
  [[
 Comment: This feels like it may be a separate issue.

 6th issue:
  ]]
   6.2.3 Attribute values Case requirements 
            " however, case requirements do not apply to non-ASCII
              letters such as Greek, Cyrillic, or non-ASCII Latin 
letters. "
  We are confused by this text. Scripts such as Greek, Cyrillic, and 
Armenian do have case distinctions, and those distinctions are 
significant in XML if you have attribute names or values in those 
scripts. But we are not clear when any characters from those scripts or 
non-ASCII Latin letters are used for attribute names or values in HTML.

Please clarify for us what the intent is.

(There is similar text in 6.2.2)
  [[
 Comment: I think I may have had a say in what the spec says here. The 
purpose is to express that while ASCII letters are generally treated 
case-insensitively in HTML (in contrast to XHTML), the same is not the 
case for non-ASCII letters. Thus XHTML and HTML agree that non-ASCII 
letters are treated case _sensitively_, whereas they disagree about 
ASCII letters: XHTML treats them case-sensitively, while HTML treats 
them case-insensitively. For programmers, it is perhaps obvious that 
ASCII case sensitivity and non-ASCII case sensitivity differ. But for 
more ordinary people, it is not logical that some letters are treated 
case-sensitively while others are not. It is also common to say that 
XML is case-sensitive, in contrast to HTML. But the fact is that HTML 
and XML only differ with regard to case sensitivity when it comes to 
ASCII.

For the record, HTML5, when it talks about the data-* attributes, says 
the same thing: data-ASCII="" is treated case-insensitively, whereas 
data-ÆØÅ="" is not.
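
 The distinction can be sketched as follows. The function names are 
hypothetical, and the comparison is a simplification of what an HTML 
or XML parser actually does with attribute names:

```python
import string

# Maps only A-Z to a-z; every other character is left untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def ascii_lower(name: str) -> str:
    """Lowercase ASCII letters only, leaving non-ASCII letters alone."""
    return name.translate(_ASCII_LOWER)

def same_name_html(a: str, b: str) -> bool:
    """HTML-style comparison: ASCII case-insensitive."""
    return ascii_lower(a) == ascii_lower(b)

def same_name_xml(a: str, b: str) -> bool:
    """XML-style comparison: every character is case-sensitive."""
    return a == b

print(same_name_html("data-ASCII", "data-ascii"))  # ASCII letters fold
print(same_name_html("data-ÆØÅ", "data-æøå"))      # non-ASCII letters do not
print(same_name_xml("data-ASCII", "data-ascii"))   # XML folds nothing
```

 Note that Python's own str.lower() folds non-ASCII letters too, so it 
is deliberately not used here.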

(Btw, I just read in the RDFa working group's last telcon resolutions 
that ARIA role treats ASCII letters case-sensitively.)

 7th issue: 
  ]]
   8. Named Entity References Named entity references 
   " For example, polyglot markup uses &#160; instead of &nbsp;. " 
   We would prefer your example to use the hexadecimal NER &#xA0; 
rather than the decimal. See 
http://www.w3.org/TR/2005/REC-charmod-20050215/#C048

   [[
 Comment: Why? Is that a special recommendation for just the 
non-breaking space character? As far as I know, the I18N WG has some 
documents which recommend using hexadecimal rather than decimal NCRs. 
Is that the point you want to get across? However, how can Polyglot 
Markup have stronger requirements than XHTML and HTML have? Here I get 
the feeling that it is your "this spec should not be a spec, but a 
friendly authoring guide" view which comes through. You feel that you 
can give stricter (but still friendlier?) requirements in a guide than 
in a spec.
 I can agree that the Polyglot Markup spec should mention the 
hexadecimal _as well as_ the decimal. But I see no reason not to 
mention the decimal.
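
 To make the equivalence concrete, here is a small sketch using 
Python's standard library as a stand-in for an HTML and an XML 
consumer (the helper name xml_accepts is made up):

```python
import html
import xml.etree.ElementTree as ET

NBSP = "\u00a0"

# HTML-style decoding: decimal, hexadecimal and named forms all work.
decimal = html.unescape("&#160;")
hexadecimal = html.unescape("&#xA0;")
named = html.unescape("&nbsp;")

# XML parsing: both numeric forms resolve to the same character.
xml_decimal = ET.fromstring("<p>&#160;</p>").text
xml_hex = ET.fromstring("<p>&#xA0;</p>").text

def xml_accepts(fragment: str) -> bool:
    """Return True if the fragment parses as XML without a DTD."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(decimal == hexadecimal == named == NBSP)
print(xml_decimal == xml_hex == NBSP)
print(xml_accepts("<p>&nbsp;</p>"))  # the named form is undefined without a DTD
```

 Both numeric forms yield U+00A0 in either syntax; only the named 
reference fails in XML, which is why polyglot markup avoids it.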

[1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=9962

-- 
leif halvard silli
Received on Thursday, 15 July 2010 18:17:06 UTC