[whatwg] Consecutive hyphen-minus characters in comments/in ACE-strings of IDNs

On Tue, 2 Nov 2010, Martin Janecke wrote:
>
> In 10.1.6 Comments the current HTML spec 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#comments 
> says:
> 
> > Following this sequence, the comment may have text, with the additional
> > restriction that the text must not [...] contain two consecutive U+002D
> > HYPHEN-MINUS characters (--) [...]
> 
> Section 5 of RFC 3490 http://tools.ietf.org/html/rfc3490#section-5 
> defines the ACE-prefix in Internationalized Domain Names to be "xn--", 
> i.e. always containing two consecutive hyphen-minus characters.
> 
> This leads to the odd situation that correctly ASCII-compatible encoded 
> IDNs cannot be used in HTML comments. For example, the widespread habit 
> of commenting out parts of HTML code in web pages fails when the code 
> contains those otherwise valid URLs. This really happens in practice 
> when working with IDNs (my personal experience), and I assume this 
> incompatibility will make a growing number of pages invalid in 
> future as the number of IDNs in use grows, which is bound to happen now 
> that ICANN has approved internationalized top-level domain names this year.
> 
> Can the problems be prevented? E.g. by making "xn--" and "XN--" valid in 
> comments?
> 
> May it even be justified to make "--" valid in comments again? As I 
> understand 
> http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2006-May/006337.html 
> and following replies, "--" used to be valid earlier in the spec and was 
> then changed to make HTML more compatible with SGML, although HTML(5) is 
> explicitly not SGML anymore. Making "--" valid won't affect any 
> previously valid or invalid HTML page in any negative way, will it?

The main reasons, IIRC, that we have disallowed "--" in comments in 
text/html are that it is disallowed in XML comments, and that flagging 
it helps authors catch cases where they are commenting out comments.
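
As a rough illustration of the XML side of this, the sketch below feeds 
two small documents to Python's standard XML parser: one with an ordinary 
comment and one whose comment contains an ACE-encoded host name. The 
helper name xml_accepts, the sample label and the .example domain are 
assumptions made up for the illustration; they come from neither the 
spec nor the original report.

```python
import xml.etree.ElementTree as ET


def xml_accepts(markup: str) -> bool:
    """Return True if the standard XML parser treats markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# IDNA ToASCII (RFC 3490) via Python's built-in "idna" codec.
# "b\u00fccher" is a hypothetical sample label containing U+00FC.
ace_label = "b\u00fccher".encode("idna").decode("ascii")

plain = "<p><!-- a harmless comment --></p>"
with_ace = f"<p><!-- http://{ace_label}.example/ --></p>"

print(ace_label)              # the ACE form begins with the "xn--" prefix
print(xml_accepts(plain))     # ordinary comment: well-formed XML
print(xml_accepts(with_ace))  # "--" inside the comment: rejected as XML
```

The second document is otherwise identical in shape to the first; only 
the "--" inside the ACE prefix makes the XML parser reject it.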

The question, I guess, is which of the following do we think is more 
important:

 * Helping authors not write HTML markup that might be hard to convert to 
   XML, and helping authors avoid nesting comments accidentally, by 
   flagging "--" sequences in comments

 * Getting out of the way of authors who want to put "--" sequences in 
   comments, e.g. because they use "--" as a long dash (as I do all the 
   time!), or because they want to comment out punycoded URLs.
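
To make the trade-off concrete, here is a minimal sketch of the kind of 
check a conformance checker might apply to comment text. It covers only 
the "--" restriction quoted above, not the parts of the rule that the 
quotation elides, and the function name and sample strings are 
hypothetical.

```python
def violates_comment_rule(comment_text: str) -> bool:
    """Return True if the text contains two consecutive U+002D characters."""
    return "--" in comment_text


# Hypothetical comment texts, i.e. what sits between "<!--" and "-->".
samples = {
    "nested comment": " <!-- inner note --> ",
    "long dash": " fix this later -- see the bug tracker ",
    "punycoded URL": " http://xn--bcher-kva.example/ ",
    "no hyphen pair": " a single - hyphen is fine ",
}

flagged = {name: violates_comment_rule(text) for name, text in samples.items()}

for name, is_flagged in flagged.items():
    print(f"{name}: {'flagged' if is_flagged else 'ok'}")
```

A check this simple cannot tell the accidental nested comment apart from 
the deliberate long dash or the punycoded URL; all three are flagged 
alike, which is the tension between the two options above.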

Currently the spec assumes the former is more important. Personally, I 
think the latter is rather more useful, but then I use "--" as long 
dashes all the time! When this was last studied, the weight of argument 
was on the stricter "disallow --" side of things, presumably.

I'm open to changing this back; does anyone else have an opinion on this?

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 6 January 2011 17:10:26 UTC