Re: <code> element and scripting languages from Alexandre Louis Marc Morgaut on 2015-03-30 (public-html-comments@w3.org from March 2015)

From: Alexandre Louis Marc Morgaut <alexandre.morgaut@gmail.com>
Date: Mon, 30 Mar 2015 09:31:23 +0200
To: Andrea Rendine <master.skywalker.88@gmail.com>
Cc: "public-html-comments@w3.org" <public-html-comments@w3.org>, blink-dev@chromium.org, "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
Message-Id: <98B262BC-4E49-4E7E-AAE3-2E8FCF34EFAC@gmail.com>
I am copying this answer to the WAI and Chromium blink-dev mailing lists, as I think they should probably be concerned.
In short, for them: the issue raised in this thread is that the way to tell HTML which programming language is shown in <code> elements is probably too loosely defined, or at least too loosely respected in practice. Worse, by extension, <code> elements are sometimes omitted entirely in favor of <pre> elements.


I agree with Andrea that the current situation is far from ideal with such widely inconsistent practices (@data-language=css , @class=language-css, @class=css, ...)
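
To make the inconsistency concrete, here is a minimal TypeScript sketch of what a tool has to do today to guess the language of a <code> element. The function name detectLanguage and the KNOWN_LANGUAGES list are made up for illustration; they come from no spec or library. It covers only the conventions named in this thread (@data-language, class="language-*", class="brush: js", and a bare class name), so it should be read as a sketch under those assumptions, not as a complete detector.

```typescript
// Hypothetical list: a bare class such as "html" only counts as a language
// if it appears here. Real tools would need a much longer list.
const KNOWN_LANGUAGES: string[] = ["css", "html", "js", "javascript", "php", "python"];

// Guess the language from the raw @class value and an optional @data-language value.
// Returns null when none of the known conventions matches.
function detectLanguage(className: string, dataLanguage?: string): string | null {
  // 1. Explicit attribute: data-language="css"
  if (dataLanguage !== undefined && dataLanguage.trim() !== "") {
    return dataLanguage.trim().toLowerCase();
  }

  // 2. Pattern suggested by the HTML spec: class="language-css"
  const prefixed = /(?:^|\s)language-(\S+)/i.exec(className)?.[1];
  if (prefixed) {
    return prefixed.toLowerCase();
  }

  // 3. SyntaxHighlighter style: class="brush: js"
  const brush = /(?:^|\s)brush:\s*([\w#+-]+)/i.exec(className)?.[1];
  if (brush) {
    return brush.toLowerCase();
  }

  // 4. Bare class name: class="html" (ambiguous, hence the allow-list)
  for (const token of className.split(/\s+/)) {
    const lower = token.toLowerCase();
    if (KNOWN_LANGUAGES.indexOf(lower) !== -1) {
      return lower;
    }
  }
  return null;
}
```

Four separate branches for one piece of information is the point: with a single agreed convention, only one of them would be needed.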

It has an impact on how translation tools can handle such content, but also on how speech accessibility tools can render it as audio

I'm less categorical about the claim that @class can't do the job
It is true that, although the "language-*" value pattern is already encouraged, only very few syntax highlighter libraries respect it

If we stay with @class, a few actions might help:
1) Add at the very least a W3C Note, referenced in the <code> element spec, explaining why the "language-*" pattern should be respected
2) Also add, if it does not already exist, an accessibility-related Note to the WAI documentation, and refer to it from the <code> element spec
3) Maybe check with Microformats whether such a specification could be added on their end as a "source", "snippet", or related microformat

Of course, as I said before, using @language could be an interesting alternative
It may still benefit from a W3C Note and an Accessibility recommendation

Another solution that I didn't mention, and haven't seen mentioned yet, could be to go with an "aria-*" attribute
In that case the decision would rest not with the HTML working group but with the WAI-ARIA one

I may be going a little off topic for this thread, but while we are talking about the <code> element and the overuse, if not abuse, of the <pre> element to render programming languages, I think it would make a lot of sense if the <code> element had these default CSS rules:

code {
    white-space: pre;
    font-family: monospace;
}

I don't know of any programming language that would suffer from that (tell me if there are some), and to me the <pre> element exists only for presentational purposes, which have largely been handed over to CSS. The good thing is also that it maintains the possibility of using <code> either inline or as a block, depending on inner line breaks.
As has been mentioned in this thread, we have reached a point where the <pre> element is also used as a replacement for the <code> element, which is a big problem when we come back to good semantic support of such content by tools (accessibility, translation, automation, ...)


I'll end with another automation use case, to highlight the importance of <code> over <pre> and of the right way to define the programming language

I can easily imagine useful Chrome / Firefox extensions that would parse pages showing snippets in <code> elements and save all of them to files, whose extensions would be determined from the "programming language" information, and whose names would come from either:
- the @id of the <code> element
- or its label (via <label> or @aria-labelledby)
- or the best related <Hn> title (potentially with a suffix number if shared by several <code> elements)
- and/or a generic name (like "script", "style", "source") followed by a suffix number if there are many
potentially all grouped in a zip archive
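
The naming rules in the list above could be sketched roughly as follows in TypeScript. Everything here is an assumption made for illustration: the Snippet shape, the function names, and the two lookup tables (language to file extension, language to generic name) are hypothetical, and the sketch assumes the language, @id, label and heading have already been extracted from the page. The zip step is left out.

```typescript
// Hypothetical shape of one extracted <code> snippet.
interface Snippet {
  language: string;  // e.g. "css", however it was detected
  id?: string;       // @id of the <code> element
  label?: string;    // text of its <label> / aria-labelledby target
  heading?: string;  // text of the best related <Hn>
}

// Assumed lookup tables; a real extension would need far more entries.
const EXTENSIONS: { [language: string]: string } =
  { css: "css", html: "html", javascript: "js", js: "js", php: "php", python: "py" };
const GENERIC_NAMES: { [language: string]: string } =
  { css: "style", javascript: "script", js: "script" };

// Own-property lookup, so a language like "constructor" cannot hit Object.prototype.
function lookup(table: { [key: string]: string }, key: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Turn free text into a file-name-safe slug ("Getting Started" becomes "getting-started").
function slugify(text: string): string {
  return text.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Pick the base name: @id, then label, then heading, then a generic name.
function baseName(snippet: Snippet, language: string): string {
  for (const candidate of [snippet.id, snippet.label, snippet.heading]) {
    const slug = slugify(candidate || "");
    if (slug !== "") {
      return slug;
    }
  }
  return lookup(GENERIC_NAMES, language) || "source";
}

// Build one unique file name per snippet, adding a numeric suffix on collisions.
function fileNames(snippets: Snippet[]): string[] {
  const taken: { [name: string]: boolean } = {};
  return snippets.map((snippet) => {
    const language = snippet.language.toLowerCase();
    const extension = lookup(EXTENSIONS, language) || "txt";
    const base = baseName(snippet, language);
    let name = `${base}.${extension}`;
    for (let n = 2; taken[name] === true; n++) {
      name = `${base}-${n}.${extension}`;
    }
    taken[name] = true;
    return name;
  });
}
```

Note that the whole thing depends on the language field being reliable, which is exactly what the current markup practices do not guarantee.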

I'm pretty sure some other developers would love that too


Alexandre Morgaut
http://about.me/amorgaut

On Mar 27, 2015, at 2:05 PM, Andrea Rendine <master.skywalker.88@gmail.com> wrote:

> Martin, IDK about plang but I guess Michael Pieters wasn't serious when suggesting that (was he?)
> However, language is well-known to authors as I said before. There was a non-canonical (and pretty useless) habit of specifying @language on <script> elements, with "javascript" + the intended version, in order to hide higher version JS scripts from legacy user agents. I.e.
> <script type="text/javascript" language="JavaScript1.3"> would have masked this script from UAs whose compatibility wasn't above JS 1.2.
> So I guess that nobody would use @language for "en-US". Also consider that the spec uses class="language-python" as an example of programming language markup.
> 
> About why the way things are now is not good (in my opinion?) Look:
> Prism.js => uses class="language-****" as per spec suggestion
> SyntaxHighlighter 3.0 (by Alex Gorbatchev) => class="brush: js" (and I'll spare you the other parameters in the example for your sanity) (also, it applies to <pre> but not to <code> which is pointless on a strictly semantical POV)
> SyntaxHighlighter Evolved (WP plugin, so completely different approach, but still worth mentioning) => "wrap your code in [language], such as [php]code" (as per documentation). Outputs a series of <code class="htmlscript plain"> elements (its use of <code> is impressively wrong...)
> highlightjs.org => class="html"
> Do you notice any consistency? I'm not speaking about authors changing highlighter, though it'd be worth considering. I also talk about semantic value. A feature that UAs can look at and know, consistently and interoperably, what kind of script we are talking about?
Received on Monday, 30 March 2015 12:33:20 UTC