
<code> element and scripting languages

From: Andrea Rendine <master.skywalker.88@gmail.com>
Date: Fri, 13 Mar 2015 19:19:01 +0100
Message-ID: <CAGxST9kOiS1gO+u7dRdtRko8PDqFJN255rby3oGDRPXOWeR+jA@mail.gmail.com>
To: public-html-comments@w3.org
I came up with the idea I am about to describe after reading these lines:
"There is no formal way to indicate the language of computer code being
marked up. Authors who wish to mark code elements with the language used,
e.g. so that syntax highlighting scripts can use the right rules, can use
the class attribute, e.g. by adding a class prefixed with "language-" to
the element."
(http://www.w3.org/html/wg/drafts/html/master/semantics.html#the-code-element)

I don't think this is the best way to identify code snippets: the @class
attribute is not meant to convey any semantic meaning.

On the other hand, I had a funny experience some days ago while looking at
an automated translation of a page in my language. This page contained PHP
and JS code snippets, as well as snippets in a native scripting language,
so it was full of control expressions such as "if ... else", "while",
"function", "print" and so on.
As you can easily imagine, these words in the snippet had been translated,
thus making the snippets themselves useless.

So I thought: @lang could be used for this purpose on code-snippet elements
and, more generally, in HTML documents.
Why @lang? Well, I took the idea from the WHATWG HTML spec, which is
written in a language denoted by lang="en-GB-x-hixie", and I thought of
extending this concept. It would be a compact way to declare 2 things:
1. Apart from strings and comments, the content of the element is NOT in
the same "language" as the surrounding text and is not meant to be
translated. This doesn't stretch the meaning of the "lang" concept: first,
it is always a matter of language, e.g. in non-English pages, because
control expressions are generally in English (or in a natural language
which must not be translated, anyway); second, code contains constructs and
abstract terms which make it a real "language", distinct from plain text.
2. The element is to be identified according to its programming language,
e.g. for syntax highlighting. As a side note, there's a CSS selector based
on the lang attribute, so jQuery-based highlighting plugins, as well as any
other library based on CSS selectors, would benefit from this.

When I thought about this, I did not have a change to the BCP47 spec in
mind, which I consider out of scope. Instead, I looked at what that
specification already allows.
It leaves room for partial customization through "private use subtags":
a string consisting of "x-" followed by up to 8 alphanumeric characters.
Such a subtag can either follow a primary/regional language tag or stand
alone.
Private use subtags are "private" by definition, and they are meant to be
used within limited groups according to agreements specific to those groups.
But nothing prevents the HTML community from building such an "agreement"
into the spec, so that a series of "private use subtags" such as "x-perl" or
"x-php" can be used by Web authors (an agreement would be necessary for
language names such as Javascript or C++, because they are either too long
or contain characters that are not allowed in a subtag).
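
To make the length and character constraints concrete, here is a minimal
Python sketch that checks whether a candidate tag is a well-formed
private-use tag. It assumes my reading of the BCP 47 (RFC 5646) grammar,
where each subtag after "x" is 1 to 8 letters or digits. The function name
is_private_use_tag and the short names "x-js" and "x-cpp" are made up for
this example; they are not part of any agreement.

```python
import re

# Assumed grammar, as I read RFC 5646:
#   privateuse = "x" 1*("-" (1*8alphanum))
PRIVATE_USE = re.compile(r"x(-[a-z0-9]{1,8})+", re.IGNORECASE)


def is_private_use_tag(tag: str) -> bool:
    """Return True if the whole tag is a well-formed private-use tag."""
    return PRIVATE_USE.fullmatch(tag) is not None


# "x-js" and "x-cpp" are hypothetical short names, not agreed anywhere.
for candidate in ("x-php", "x-perl", "x-javascript", "x-c++", "x-js", "x-cpp"):
    print(candidate, is_private_use_tag(candidate))
```

"x-javascript" fails because the subtag has 10 characters, and "x-c++"
fails because "+" is not allowed, which is why an agreed short name would
be needed for such languages.
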
This means that a snippet in the form <code lang="x-php">, for example,
would be easy to understand, easy to target for syntax highlighting
extensions, and able to tell its content apart from parent elements
defining a language for the whole document. If, on the other hand, the
natural language of strings or comments in the snippet has to be taken
into account, something like <code lang="en-US-x-php"> could be used.
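
One caveat for the combined form, sketched in Python. As far as I
understand it, the CSS :lang() selector does prefix matching: a value
matches if it equals the requested tag or starts with it followed by "-".
So a tool looking for "x-php" would not find "en-US-x-php" by prefix alone
and would have to split the value at the "x" subtag. Both function names
below are made up for this example.

```python
from typing import Optional, Tuple


def lang_matches(requested: str, element_lang: str) -> bool:
    """Emulate prefix matching as I understand :lang() to work."""
    requested = requested.lower()
    element_lang = element_lang.lower()
    return element_lang == requested or element_lang.startswith(requested + "-")


def split_code_lang(tag: str) -> Tuple[str, Optional[str]]:
    """Split a lang value into (natural-language part, private-use part)."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, None  # no private-use part at all
    i = lowered.index("x")
    natural = "-".join(subtags[:i])
    private = "-".join(lowered[i + 1:])
    return natural, private or None


print(lang_matches("x-php", "x-php"))
print(lang_matches("x-php", "en-US-x-php"))
print(split_code_lang("en-US-x-php"))
print(split_code_lang("en-GB-x-hixie"))
```

The last call returns "hixie" as the private-use part, which is obviously
not a programming language, so a consumer would still need the agreed list
of names to tell the two cases apart.
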

On the "public-html" mailing list I received suggestions such as using
translate="no" to prevent automated translation and, separately, creating
private use attributes such as data-code-lang, or proposing new attributes
like programming-lang, to express the programming language. I would still
add lang="", however, because, as said above, it is difficult to treat a
code snippet like a paragraph in a natural language.
The different proposals each have something really good, but they are
partial: each focuses only on preventing translation or only on indicating
the programming language. Maybe there's a way to achieve both.
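
As a rough sketch of how a translation tool could honour both signals,
the Python below collects only the text it may translate, skipping any
element that has translate="no" or a lang value whose private-use part
names a code language. The KNOWN_CODE_LANGS set is a hypothetical stand-in
for the agreed list, and the class and function names are made up. It
assumes well-formed markup with explicit end tags, and it skips a marked
snippet entirely, including any strings or comments inside it.

```python
from html.parser import HTMLParser

# Hypothetical agreed list of code-language subtags (not defined by any spec).
KNOWN_CODE_LANGS = {"php", "perl"}
# Void elements never get an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def is_code_lang(lang: str) -> bool:
    """True if the private-use part of a lang value is a known code language."""
    subtags = lang.lower().split("-")
    if "x" not in subtags:
        return False
    return "-".join(subtags[subtags.index("x") + 1:]) in KNOWN_CODE_LANGS


class TranslatableText(HTMLParser):
    """Collect text outside elements marked as code or as not translatable."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        marked = (attrs.get("translate") == "no"
                  or is_code_lang(attrs.get("lang") or ""))
        if self.skip_depth or marked:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


parser = TranslatableText()
parser.feed(
    '<p>If the condition holds:</p>'
    '<pre><code lang="x-php">if ($a) { print "ok"; }</code></pre>'
    '<p><span translate="no">PHP</span> prints the text.</p>'
)
parser.close()
print(parser.chunks)
```
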
Please tell me what you think about it.
Thanks.
Received on Friday, 13 March 2015 18:19:43 UTC
