Re: <code> element and scripting languages from Stuart Wakefield on 2015-03-15 (public-html-comments@w3.org from March 2015)

From: Stuart Wakefield <me@stuartwakefield.co.uk>
Date: Sun, 15 Mar 2015 08:33:41 +0000
To: Andrea Rendine <master.skywalker.88@gmail.com>
Cc: "public-html-comments@w3.org" <public-html-comments@w3.org>
Message-Id: <90398DA9-6D18-4B78-B5C4-44EC38694E3C@stuartwakefield.co.uk>
Hi Andrea,

Using the lang attribute to identify the programming language within an element's text content does seem appropriate.

I couldn't find specific guidance in the current recommendations, but I note that the HTML 4.01 recommendation gave the following guidance in section 8.1.1:
"The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes."

It is unclear whether the HTML5 recommendation drops this guidance on purpose. The HTML 4.01 guidance on the lang attribute does seem much clearer, in both usage and intent, than the corresponding advice in the HTML5 recommendation. HTML 4.01 says:

"Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include:

- Assisting search engines
- Assisting speech synthesizers
- Helping a user agent select glyph variants for high quality typography
- Helping a user agent choose a set of quotation marks
- Helping a user agent make decisions about hyphenation, ligatures, and spacing
- Assisting spell checkers and grammar checkers"

All of these would seem to apply to this use case.

How would it be appropriate to handle, for example, comments in a natural language within a section marked up as a machine-readable language?

It would be useful to know why the HTML 4.01 authors originally excluded computer (machine-readable) languages from the recommendation, and whether the HTML5 recommendation omits this advice intentionally.

In the interim, my advice would be to use translate="no" together with data attributes. To the best of my knowledge this is the most widely accepted way of achieving this, even though it does not supply useful metadata about the content in a meaningful way.
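
As a rough sketch of what that interim markup could look like, and how a highlighting or translation tool might read it, here is a small Python example using only the standard library. The attribute name data-code-lang, the sample markup, and the helper names are illustrative assumptions, not anything defined by a specification.

```python
from html.parser import HTMLParser

# Hypothetical markup: "data-code-lang" is an illustrative attribute name,
# not one defined by the HTML specification.
SAMPLE = """
<p>Call the function like this:</p>
<pre><code translate="no" data-code-lang="php">if ($ok) { print "done"; }</code></pre>
<pre><code translate="no" data-code-lang="js">while (busy) { wait(); }</code></pre>
"""


class CodeLangCollector(HTMLParser):
    """Collect a (language, translatable) pair for every <code> start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "code":
            return
        attributes = dict(attrs)
        # The data attribute carries the programming language name.
        language = attributes.get("data-code-lang")
        # translate="no" asks translation tools to leave the content alone.
        translatable = attributes.get("translate") != "no"
        self.found.append((language, translatable))


def collect_code_languages(markup):
    """Return the (language, translatable) pairs found in the markup."""
    collector = CodeLangCollector()
    collector.feed(markup)
    collector.close()
    return collector.found


if __name__ == "__main__":
    print(collect_code_languages(SAMPLE))
```

Note that the two concerns stay separate here: one attribute blocks translation, the other names the programming language, and neither says anything about the natural language of comments or strings.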

The improvements you've suggested would, if accepted, certainly increase the semantic richness of this type of content over that approach.
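
To make the private-use form from your message concrete, here is a minimal Python sketch of how a consumer might split a value such as "en-US-x-php" into its natural-language part and its private-use part. The function name split_code_lang is hypothetical, the check follows my reading of BCP 47 (each private-use subtag is 1 to 8 ASCII letters or digits), and it is not a full language-tag validator.

```python
def split_code_lang(tag):
    """Split a lang value into (natural_language, private_use).

    Hypothetical helper: it only looks for the "x" singleton that starts
    a private-use sequence and does not validate the rest of the tag.
    """
    subtags = tag.split("-")
    lowered = [subtag.lower() for subtag in subtags]
    if "x" not in lowered:
        # No private-use part: the whole value is a natural-language tag.
        return tag, None
    index = lowered.index("x")
    natural = "-".join(subtags[:index]) or None
    private = lowered[index + 1:]
    if not private:
        raise ValueError("'x' must be followed by at least one subtag")
    for subtag in private:
        # Assumed rule: 1 to 8 ASCII alphanumeric characters per subtag.
        if not (1 <= len(subtag) <= 8 and subtag.isascii() and subtag.isalnum()):
            raise ValueError("invalid private-use subtag: %r" % subtag)
    return natural, "-".join(private)


if __name__ == "__main__":
    for value in ("x-php", "en-US-x-php", "en-GB"):
        print(value, split_code_lang(value))
```

Under that rule a value like "x-javascript" is rejected because the subtag is longer than 8 characters, which is the naming problem you mention for JavaScript and C++.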

Stuart

> On 13 Mar 2015, at 18:19, Andrea Rendine <master.skywalker.88@gmail.com> wrote:
> 
> I came up with the idea I am going to describe after reading these lines:
> "There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element. (http://www.w3.org/html/wg/drafts/html/master/semantics.html#the-code-element)"
> 
> I don't think this is the best way to recognize code snippets. The @class attribute is not meant to convey any semantic meaning.
> 
> On the other hand, I had a funny experience some days ago while looking at an automated translation of a page in my language. This page contained PHP and JS code snippets, as well as a native scripting language. This means that it was full of control expressions such as "if ... else", "while", "function", "print" and so on.
> As you can easily imagine, these words in the snippet had been translated, thus making the snippets themselves useless.
> 
> So I thought: @lang could be used for this purpose on code-snippet elements and generally speaking in HTML documents.
> Why @lang? Well, I took this idea from seeing the WHATWG HTML spec, which is written in a language denoted by lang="en-GB-x-hixie", and I thought of an extension of this concept. Actually, it would be a compact way to declare 2 things:
> 1. Apart from strings and comments, the core of the related element is NOT in the same "language" as the surrounding text and is not meant to be translated. This doesn't stretch the meaning of the "lang" concept. First, it is always a matter of language, e.g. in non-English pages, because control expressions are generally in English (or in a natural language which must not be translated anyway). Second, it contains constructs and abstract terms which define it as a real "language", different from the plain text.
> 2. The element is to be identified according to its programming language, e.g. for syntax highlighting. As a side note, there's a CSS selector based on lang attributes, and jQuery-based highlighting plugins, as well as any other library based on CSS selectors, would benefit from this.
> 
> When I thought about this, I didn't think about a change in the BCP47 spec, which I consider out of scope. Instead, I looked at that specification.
> It leaves room for partial customization through the use of "private use subtags" in the form of a string consisting of "x-" followed by up to 8 alphanumeric characters. This subtag can either follow a primary/regional language tag, or be present stand-alone.
> Private use subtags are "private" by definition, and they are meant to be used in limited groups according to agreements specific to these groups. But nothing would prevent the HTML community from building such an "agreement" into the spec, so that a series of "private use subtags" such as "x-perl" or "x-php" can be used by Web authors (an agreement would be necessary for language names such as Javascript or C++, because they are either too long or contain non-alphanumeric characters).
> This means that a snippet in the form <code lang="x-php">, for example, would be easy to understand, easy to target for syntax highlighting extensions, and able to tell its content apart from parent elements defining a language for the whole document. If, on the other hand, strings or comments in a natural language within the snippet are to be considered, something like <code lang="en-US-x-php"> could be used.
> 
> In the "public-html" mailing list I received suggestions such as using translate="no" to prevent automated translation, and separately creating private use attributes such as data-code-lang, or proposing new attributes like programming-lang, to express the programming language. I would still have to add lang="", however, because, as said above, it is difficult to consider a code snippet like a paragraph in natural language.
> The different proposals each have something really good, but they're partial: each focuses only on preventing translation or only on indicating the programming language. Maybe there's a way to achieve both.
> Please tell me what you think about it.
> Thanks.
Received on Monday, 16 March 2015 07:49:52 UTC