
Re: <code> element and scripting languages

From: Alexandre Morgaut <Alexandre.Morgaut@4d.com>
Date: Wed, 18 Mar 2015 09:25:21 +0100
To: Andrea Rendine <master.skywalker.88@gmail.com>
CC: "public-html-comments@w3.org" <public-html-comments@w3.org>
Message-ID: <B9946D82-B787-4A74-91FC-0F5B6F16B9CF@4d.com>
I'm a bit late to this thread, but here is my input.

On Mar 13, 2015, at 7:19 PM, Andrea Rendine <master.skywalker.88@gmail.com> wrote:

> I came up the idea I am going to write after reading these lines:
> "There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element. (http://www.w3.org/html/wg/drafts/html/master/semantics.html#the-code-element)"
> I don't think this is the best way to recognize code snippets. @class attribute is not meant to convey any semantic meaning.

@class might not be ideal, but it is at least better than data-* attributes. The HTML 5.1 draft itself says about class values:

"(...) but authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content."

> http://www.w3.org/TR/html51/dom.html#classes

There is also a W3C QA tip named "Use class with semantics in mind":

> http://www.w3.org/QA/Tips/goodclassnames

Microformats, which are all about semantics, rely heavily on @class attributes to enrich HTML documents with semantic data
(the most widely used being hCard and hCalendar)

> http://microformats.org/wiki/class-design-pattern

Even if a "language-" class name pattern might be better, I see the HTML structure

<pre><code class="MyLanguage">  MyCode </code></pre>

as almost good enough, and it is probably one of the most widely supported, for example by:

- the excellent highlight.js: https://highlightjs.org/usage/

- the Disqus comments platform: https://help.disqus.com/customer/portal/articles/665057-syntax-highlighting

Prism uses almost the same markup, with the "language-" variant for the class name: http://prismjs.com/

SyntaxHighlighter unfortunately doesn't use the <code> element, but it also uses the @class attribute: https://code.google.com/p/syntaxhighlighter/wiki/Usage

Rainbow prefers the @data-language option, which I dislike because data-* attributes are explicitly private to a page and carry no meaning for general-purpose tools: http://craig.is/making/rainbows

(some other tool, library, or framework may use @data-language to indicate which human-language translations are available)
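
To make the conventions above concrete, here is a minimal Python sketch of how a tool might guess the programming language of a <code> element. The function and class names are mine, and the order of checks ("language-" class first, then @data-language, then the first bare class name) is an assumption on my side, not something any of these libraries prescribes.

```python
from html.parser import HTMLParser


def language_from_attrs(attrs):
    """Guess the programming language from a <code> element's attributes.

    Checks, in order: a "language-xxx" class (HTML spec hint, Prism),
    a data-language attribute (Rainbow), then the first bare class name
    (highlight.js style). Returns None when nothing matches.
    """
    prefix = "language-"
    classes = (attrs.get("class") or "").split()
    for name in classes:
        if name.startswith(prefix) and len(name) > len(prefix):
            return name[len(prefix):]
    if attrs.get("data-language"):
        return attrs["data-language"]
    return classes[0] if classes else None


class CodeLanguageCollector(HTMLParser):
    """Collects the guessed language of every <code> start tag."""

    def __init__(self):
        super().__init__()
        self.languages = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.languages.append(language_from_attrs(dict(attrs)))


def code_languages(html):
    """Return one guessed language (or None) per <code> element, in order."""
    collector = CodeLanguageCollector()
    collector.feed(html)
    collector.close()
    return collector.languages
```

A bare class name is only a guess (it could just as well be a styling class), which is one reason the "language-" prefix is the safer pattern.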

> On the other hand, I had a funny experience some days ago while looking at an automated translation of a page in my language. This page contained PHP and JS code snippets, as well as a native scripting language. This means that it was full of control expressions such as "if ... else", "while", "function", "print" and so on.
> As you can easily imagine, these words in the snippet had been translated, thus making the snippets themselves useless.

I run into that issue myself sometimes.
Page translators should probably not try to translate text inside <code> blocks.
They could offer an option to recognize syntax highlighter markup, determine the programming language from it, and then translate only the comments.
Translating variable names might also be possible, but with more risk.
It should probably be limited to variables declared in the snippet itself, and the translation tool would then need to know the programming language's syntax very well, including string template syntaxes (such as HEREDOC notation in PHP, template strings in JavaScript, ...)

> So I thought: @lang could be used for this purpose on code-snippet elements and generally speaking in HTML documents.
> Why @lang? Well, I took this idea from seeing WHATWG HTML spec, which is written in a language denoted by lang="en-GB-x-hixie" and I thought to an extension of this concept. Actually, it would be a compact way to declare 2 things:
> 1. apart from strings and comments, the core of the related element is NOT the same "language" as the text and it is not meant to be translated. It doesn't stretch the meaning of the "lang" concept: first off, it's always a matter of language e.g. in non-English pages because control expressions are generally in English (or in a natural language which must not be translated, anyway), then it contains contraptions and abstract terms which define it as a real "language" which is different from the plain text.
> 2. the element is to be identified according to its programming language, e.g. for highlighting syntax. As a side note, there's a CSS selector based on lang attributes, and jQuery-based highlighting plugins, as well as any other library based on CSS selectors, would benefit from this.

If we were to add an attribute to <code> to specify the programming language, I would personally reuse the @language attribute, which is now deprecated but was previously used on the <script> element to name a programming language explicitly. Another option might be the @type attribute with the language's MIME type, as is now done for the <script> element. Custom or vendor MIME types might then be required for languages that don't already have one.

My main issue with using the @lang attribute is that I know at least one programming language that supports human-language localization: the 4D language
> http://doc.4d.com/4Dv14R4/4D/14-R4/Preface.300-1707921.en.html

The same code, depending on which edition of the code editor is used, has its commands and instructions displayed either in English or in French
(they are all stored as tokens in the source file)

As an example, in the French edition, one would read and edit this code:

 CONFIRMER("Add a new record?") ` The user wants to add a record?
 Tant que(OK=1) ` Loop as long as the user wants to
    AJOUTER ENREGISTREMENT([aTable]) ` Add a new record
 Fin tant que ` The loop always ends with End while


In the English edition of the code editor, the same code would read:

 CONFIRM("Add a new record?") ` The user wants to add a record?
 While(OK=1) ` Loop as long as the user wants to
    ADD RECORD([aTable]) ` Add a new record
 End while ` The loop always ends with End while


Note that the tool doesn't translate the developer's variable names or comments (comments start with ` )
Only the language instructions and commands are localized
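
As a rough illustration of "translate only the comments", here is a small Python sketch for snippets like the 4D ones above. It assumes a single-character comment marker (the backtick), double-quoted strings without escape sequences, and comments that run to the end of the line; the `translate` callable is a hypothetical stand-in for whatever translation service a tool would use.

```python
def split_comment(line, marker="`"):
    """Split one line into (code, comment) at the first marker outside a string.

    Assumptions: `marker` is a single character, strings are double-quoted
    with no escape sequences, and a comment runs to the end of the line.
    Returns (line, None) when the line has no comment.
    """
    in_string = False
    for index, char in enumerate(line):
        if char == '"':
            in_string = not in_string
        elif char == marker and not in_string:
            return line[:index], line[index + 1:]
    return line, None


def translate_comments(snippet, translate, marker="`"):
    """Apply `translate` to comment text only; code is left untouched."""
    result = []
    for line in snippet.splitlines():
        code, comment = split_comment(line, marker)
        if comment is None:
            result.append(line)
        else:
            result.append(code + marker + " " + translate(comment.strip()))
    return "\n".join(result)
```

A real tool would need a proper tokenizer per language; this only shows that the commands and the string literals can stay intact while the comments change.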

So comments and developer variable names are what the @lang attribute makes sense for, while the variants of the 4D language could still be differentiated as 4DFR and 4DEN

I would probably use this HTML markup to differentiate them:

<pre><code class="4D 4DFR" lang="en"> my 4D FR code with english comments </code></pre>

<pre><code class="4D" lang="en"> my 4D EN code with english comments </code></pre>

taking 4D EN as the default localization (EN and FR still share the same syntax rules, of course)
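
Read back by a tool, that markup could be interpreted along these lines. This is only a sketch of the convention I describe; the function name and the returned keys are made up for the example.

```python
def describe_4d_snippet(attrs):
    """Interpret class/lang on a <code> element holding 4D code.

    Returns None when the element is not marked as 4D. Otherwise reports
    the command localization (EN unless the 4DFR class is present) and
    the natural language of comments and variable names (from @lang).
    """
    classes = (attrs.get("class") or "").split()
    if "4D" not in classes:
        return None
    return {
        "language": "4D",
        "localization": "FR" if "4DFR" in classes else "EN",
        "natural_language": attrs.get("lang"),
    }
```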

> In "public-html" mailing list I received suggestions such as using translate="no" in order to prevent automated translation; and separately create private use attributes such as data-code-lang, or propose new attributes like programming-lang, to express the programming language. I should add lang="" however, because as said above, it is difficult to consider a code snippet like a paragraph in natural language.

translate="no" can be useful for more general purposes, but offering to translate parts of the source code can still be worthwhile if done correctly, and when I see what transpilers and minifiers can do, I'm pretty sure translation tools could manage to translate only what makes sense

@programming-lang, why not, but it seems a bit cumbersome to me, all the more so since a lot of programmers still remember the @language attribute and its semantic meaning
We may want something less easily confused with @lang, but since it would be restricted to the <code> element use case, it should probably be fine (my 2 cents)

> The different proposals have something really good but they're partial - they only focus on preventing translation or programming language indication. Maybe there's a way to achieve both.

In summary, I think achieving both is possible:

- As of today, translation tools can detect the "standard" and recommended patterns (often via @class) to find the programming language name and thus know which syntax is used
        (note that some syntax highlighting tools also offer automatic language detection, but of course that's not always possible, especially on short snippets)

- Keep using @lang to identify what could be translated: comments and variable names (unless something like translate="no" is specified)

- Maybe later they could use @programming-lang, the simpler @language, or the more common @type attribute
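
Put together, the decision a translation tool would make for a <code> element could look like the Python sketch below. The function name and return shape are hypothetical, and treating "no class at all" as "syntax unknown, so translate nothing" is my own conservative assumption.

```python
def translation_plan(attrs):
    """Decide what a page translator may touch inside a <code> element.

    - translate="no": nothing is translated.
    - no class hint: the syntax is unknown, so nothing is translated.
    - otherwise: only comments, whose source language comes from @lang.
    """
    natural_language = attrs.get("lang")
    opted_out = (attrs.get("translate") or "").strip().lower() == "no"
    has_language_hint = bool((attrs.get("class") or "").split())
    if opted_out or not has_language_hint:
        return {"translate": [], "natural_language": natural_language}
    return {"translate": ["comments"], "natural_language": natural_language}
```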



Alexandre Morgaut
Wakanda Community Manager

60, rue d'Alsace
92110 Clichy

Standard : +33 1 40 87 92 00
Email :    Alexandre.Morgaut@4d.com
Web :      www.4D.com

Received on Wednesday, 18 March 2015 08:28:23 UTC
