Re: <code> element and scripting languages from Gannon Dick on 2015-03-16 (public-html-comments@w3.org from March 2015)

From: Gannon Dick <gannon_dick@yahoo.com>
Date: Mon, 16 Mar 2015 11:46:33 -0700
To: Andrea Rendine <master.skywalker.88@gmail.com>, Stuart Wakefield <me@stuartwakefield.co.uk>
Cc: "public-html-comments@w3.org" <public-html-comments@w3.org>
Message-ID: <1426531593.26060.YahooMailBasic@web122906.mail.ne1.yahoo.com>

Hi Andrea,

In addition to the properties of the @lang attribute mentioned below there is another, and it overshadows the rest ... Web Specifications are "for the tourists". This is why EUROPA.EU has 24 languages, VATICAN.VA has 10 and why "Americans" speak 108 languages @Home, last time I counted the US Census report.

The @lang attribute is appropriate to use in conjunction with the <code> element, IMHO, as long as the effect of a partisan implementer audience is neutralized. This is especially so with semantic data. This problem is 1000 years old - Iceland is in fact energy self-sufficient and Greenland, well let's just say Erik the Red earned his place in the PageRank Hall of Fame.

ISO Standards have a different outlook with respect to the Open World Assumption - they would like it to work properly sometime before the tourists realize they have been played. To do this they must account for seasonal parity shifts in time. It is a matter of intellectual integrity for the specifications, if I didn't just answer my own question about the evolution of the WWW and GPS Navigation.

The ISO language specification is by type:
1) Terminology 2 character codes "for the tourists".
2) Bibliographic 3 character codes for scholarship without Confirmation Bias or suggesting a BCP/best answer.
3) Miscellaneous, to discourage "obvious" conclusions about a finite set of language labels. In the case of "zxx" the code does mean "not a human language". To be frank, it means to gadgets (user agents) that "they" have not located and identified intelligent life (language tool users). The web of things has not passed its Turing Test yet, glowing reports of imminent success notwithstanding.

--Gannon

--------------------------------------------
On Sun, 3/15/15, Stuart Wakefield <me@stuartwakefield.co.uk> wrote:

Subject: Re: <code> element and scripting languages
To: "Andrea Rendine" <master.skywalker.88@gmail.com>
Cc: "public-html-comments@w3.org" <public-html-comments@w3.org>
Date: Sunday, March 15, 2015, 3:33 AM

Hi Andrea,

Using the lang attribute to identify the programming language within an element's text content does seem appropriate.

I couldn't find specific guidance in current recommendations. I do note that the HTML 4.01 recommendation did have the following guidance in section 8.1.1:

"The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes."

It is unclear whether the HTML 5 recommendations drop this guidance on purpose. The HTML 4.01 guidance on the lang attribute does seem much clearer in general in usage and intent than the corresponding advice in the HTML5 recommendation:

"Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include:
- Assisting search engines
- Assisting speech synthesizers
- Helping a user agent select glyph variants for high quality typography
- Helping a user agent choose a set of quotation marks
- Helping a user agent make decisions about hyphenation, ligatures, and spacing
- Assisting spell checkers and grammar checkers"

All would seem to be appropriate for this use case.

How would it be appropriate to handle, for example, comments in a natural language within a section marked up as a machine readable language?

It would be useful to know what the initial reasons were for HTML 4.01 authors to discount computer / machine readable languages from the original recommendation and whether the HTML5 recommendation omits this advice intentionally.

In the interim my advice would be to use translate="no" and data attributes. To the best of my knowledge this is the most widely accepted way of achieving this, despite it not supplying useful metadata about the content in a meaningful way.
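
The interim approach above can be sketched as a small Python helper that emits the markup. This is only an illustration: the function name mark_up_snippet and the attribute name data-code-lang are assumptions made for the example (any data-* name could be agreed on), and nothing here is prescribed by the HTML recommendations.

```python
from html import escape


def mark_up_snippet(source, code_lang):
    """Wrap source code in a <code> element that opts out of translation.

    data-code-lang is an illustrative attribute name, not a standardised one.
    """
    return '<code translate="no" data-code-lang="{}">{}</code>'.format(
        escape(code_lang, quote=True),   # attribute value: escape quotes too
        escape(source, quote=False),     # element content: escape &, < and >
    )
```

A translation tool that honours translate="no" would leave the keywords alone, while a highlighter would still have to know about the ad hoc data attribute, which is the weakness noted above.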

The improvements you've suggested would, if accepted, certainly increase the semantic richness of this type of content over that approach.

Stuart

On 13 Mar 2015, at 18:19, Andrea Rendine <master.skywalker.88@gmail.com> wrote:

I came up with the idea I am going to describe after reading these lines:

"There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element. (http://www.w3.org/html/wg/drafts/html/master/semantics.html#the-code-element)"

I don't think this is the best way to recognize code snippets. The @class attribute is not meant to convey any semantic meaning.

On the other hand, I had a funny experience some days ago while looking at an automated translation of a page in my language. This page contained PHP and JS code snippets, as well as a native scripting language. This means that it was full of control expressions such as "if ... else", "while", "function", "print" and so on.

As you can easily imagine, these words in the snippet had been translated, thus making the snippets themselves useless.

So I thought: @lang could be used for this purpose on code-snippet elements and generally speaking in HTML documents.

Why @lang? Well, I took this idea from seeing the WHATWG HTML spec, which is written in a language denoted by lang="en-GB-x-hixie", and I thought of an extension of this concept. Actually, it would be a compact way to declare 2 things:

1. apart from strings and comments, the core of the related element is NOT the same "language" as the text and it is not meant to be translated. It doesn't stretch the meaning of the "lang" concept: first off, it's always a matter of language e.g. in non-English pages because control expressions are generally in English (or in a natural language which must not be translated, anyway), then it contains contraptions and abstract terms which define it as a real "language" which is different from the plain text.

2. the element is to be identified according to its programming language, e.g. for highlighting syntax. As a side note, there's a CSS selector based on lang attributes, and jQuery-based highlighting plugins, as well as any other library based on CSS selectors, would benefit from this.
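
To make the second point concrete, here is a rough Python sketch of how a highlighting tool might pick out snippets by their lang value. It uses the standard html.parser module rather than a CSS selector engine; the class name CodeLangCollector is made up for this example, and nested <code> elements are deliberately not handled.

```python
from html.parser import HTMLParser


class CodeLangCollector(HTMLParser):
    """Collect (lang, text) pairs for <code> elements that carry a lang attribute.

    Hypothetical helper for illustration; it does not handle nested <code>.
    """

    def __init__(self):
        super().__init__()
        self.snippets = []   # finished (lang, text) pairs
        self._lang = None    # lang of the <code> element being read, if any
        self._chunks = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "code" and self._lang is None:
            lang = dict(attrs).get("lang")
            if lang is not None:
                self._lang = lang
                self._chunks = []

    def handle_data(self, data):
        if self._lang is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "code" and self._lang is not None:
            self.snippets.append((self._lang, "".join(self._chunks)))
            self._lang = None
```

After feeding a page to an instance with feed() and calling close(), the snippets list holds one entry per tagged <code> element, so a highlighter could choose its rules from the first item of each pair.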
When I thought about this, I didn't think about a change in the BCP47 spec, which I consider out of scope. Instead, I looked at that specification.

It leaves room for partial customization through the use of "private use subtags" in the form of a string consisting of "x-" followed by up to 8 alphanumeric characters. This subtag can either follow a primary/regional language tag, or be present as stand-alone.

Private use subtags are "private" by definition, and they are meant to be used in limited groups according to agreements specific to these groups. But nothing would prevent the HTML community from building such an "agreement" into the spec, so that a series of "private use subtags" such as "x-perl" or "x-php" can be used by Web authors (an agreement would be necessary for language names such as Javascript or C++, because they are either too long or contain non-alphanumeric characters).

This means that a snippet in the form <code lang="x-php">, for example, would be easy to understand, easy to target for syntax highlight extensions, and able to tell its content apart from parent elements defining a language for the whole document. If, on the other hand, strings or comments in a natural language in the snippet are to be considered, something like <code lang="en-US-x-php"> could be used.
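
The private-use rule can be checked mechanically. The sketch below is a minimal Python illustration, assuming the RFC 5646 grammar in which each subtag after the "x" singleton is 1 to 8 letters or digits; the helper name split_code_lang is invented for this example, and the natural-language part of the tag is passed through without validation.

```python
import re

# RFC 5646 grammar: privateuse = "x" 1*("-" (1*8alphanum))
_SUBTAG = re.compile(r"^[A-Za-z0-9]{1,8}$")


def split_code_lang(tag):
    """Split a lang value into (natural_language, private_subtags).

    Hypothetical helper: returns None when the private-use part is malformed.
    The natural-language part is not validated here.
    """
    parts = tag.split("-")
    lowered = [p.lower() for p in parts]
    if "x" not in lowered:
        return (tag, [])              # no private-use section at all
    i = lowered.index("x")
    private = parts[i + 1:]
    if not private or not all(_SUBTAG.match(p) for p in private):
        return None                   # e.g. "x-javascript" or "x-c++"
    return ("-".join(parts[:i]), private)
```

With this rule "x-php" and "en-US-x-php" are accepted, while "x-javascript" (ten characters) and "x-c++" (non-alphanumeric characters) are rejected, which is why shorter agreed names would be needed for those languages.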

In the "public-html" mailing list I received suggestions such as using translate="no" in order to prevent automated translation, and separately creating private use attributes such as data-code-lang, or proposing new attributes like programming-lang, to express the programming language. I would add lang="" however because, as said above, it is difficult to consider a code snippet like a paragraph in natural language.

The different proposals have something really good but they're partial - they only focus on preventing translation or programming language indication. Maybe there's a way to achieve both.

Please tell me what you think about it.

Thanks.

Received on Monday, 16 March 2015 18:47:01 UTC