[whatwg] Consecutive hyphen-minus characters in comments/in ACE-strings of IDNs

In section 10.1.6, "Comments", the current HTML spec http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#comments says:

> Following this sequence, the comment may have text, with the additional
> restriction that the text must not [...] contain two consecutive U+002D
> HYPHEN-MINUS characters (--) [...]

Section 5 of RFC 3490 http://tools.ietf.org/html/rfc3490#section-5 defines the ACE-prefix in Internationalized Domain Names to be "xn--", i.e. always containing two consecutive hyphen-minus characters.
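
To make this concrete, here is a minimal Python sketch using the standard library's built-in "idna" codec, which implements the RFC 3490 ToASCII operation. The host name bücher.example is only an illustrative assumption, not taken from any real page.

```python
# Illustrative host name "bücher.example" (an assumption for this sketch).
unicode_host = "b\u00fccher.example"

# The built-in "idna" codec converts each non-ASCII label to its ACE form.
ace_host = unicode_host.encode("idna").decode("ascii")

print(ace_host)

# The converted label carries the ACE prefix, so the result always
# contains two consecutive hyphen-minus characters.
print(ace_host.startswith("xn--"), "--" in ace_host)
```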

This leads to the odd situation that correctly ASCII-compatible-encoded IDNs cannot be used in HTML comments. For example, the widespread habit of commenting out parts of a web page's HTML code fails when that code contains such otherwise valid URLs. This really happens in practice when working with IDNs (I have run into it myself). I assume this incompatibility will make a growing number of pages invalid in the future as the number of IDNs in use grows, which is bound to happen now that ICANN has approved internationalized top-level domain names this year.
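
As a rough illustration of the failure, the following Python sketch checks a piece of markup against the one restriction quoted above (no two consecutive U+002D characters). The function name and the sample markup are hypothetical, and the parts of the comment rule elided in the quote are deliberately not checked.

```python
def breaks_comment_rule(text: str) -> bool:
    """Return True if text contains two consecutive U+002D characters.

    Only the restriction quoted from the spec is checked here; this is
    not a full validator for HTML comment text.
    """
    return "--" in text


# Hypothetical markup that an author might want to comment out.
idn_link = '<a href="http://xn--bcher-kva.example/">Books</a>'
plain_link = '<a href="http://example.com/">Books</a>'

# The link with the ACE-encoded host name cannot be placed in a comment.
print(breaks_comment_rule(idn_link))
# The link with a plain ASCII host name can.
print(breaks_comment_rule(plain_link))
```

Any URL with an ACE-encoded label trips this check, regardless of what else the commented-out markup contains.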

Can this problem be prevented, e.g. by making "xn--" and "XN--" valid in comments?

Might it even be justified to make "--" valid in comments again? As I understand http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2006-May/006337.html and the replies that follow it, "--" used to be valid in an earlier version of the spec and was then disallowed to make HTML more compatible with SGML, although HTML(5) is explicitly not SGML anymore. Making "--" valid would not affect any previously valid or invalid HTML page in any negative way, would it?

Martin Janecke

Received on Tuesday, 2 November 2010 03:44:36 UTC