Re: <code> element and scripting languages from Gannon Dick on 2015-03-13 (public-html-comments@w3.org from March 2015)

From: Gannon Dick <gannon_dick@yahoo.com>
Date: Fri, 13 Mar 2015 12:17:48 -0700
To: public-html-comments@w3.org, Andrea Rendine <master.skywalker.88@gmail.com>
Cc: iso639@dkuug.dk
Message-ID: <1426274268.36935.YahooMailBasic@web122905.mail.ne1.yahoo.com>
Hi Andrea,

I wouldn't want a flash mob showing up at your door with pitchforks and torches (the American kind that burn).  Linguists and Librarians in mobs can be dangerous.  Doubly so when they grouse mob-like with big words at great length because of their professional training.

The US Library of Congress is the registration authority for the ISO 639-[X] Language codes [1].

ISO 639-1 (Terminology Languages) has no default, that is, no "Not a Language" code analogous to NaN (Not a Number) for numbers.
ISO 639-2 (Bibliographic Languages) has a default; the set is generally understood to be of human or historical human origin.  The code is identical to the ISO 639-3 one.

ISO 639-3 has the default you seek, with the registration requirements wiggle room you need:

zxx	No linguistic content	No linguistic content
zxx	Not applicable	Not applicable

So, to avoid flash mob disruptions, especially on Beer Friday, it is wise to use the proper ISO 639-3 codes.  Depending upon the HTML schema version location you specify, this could be a validation anomaly, since some schemas call for a 2 Alpha string, and a 3 Lower Case Alpha string will fail validation.  There is no semantically correct ISO 639-1 2 Alpha string.
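
To illustrate the validation anomaly, here is a minimal Python sketch. The two patterns are hypothetical stand-ins for a schema's language-code rule (assumed lower case in both cases); they are not taken from any actual HTML schema.

```python
import re

# Hypothetical patterns standing in for a schema's language-code rule.
# Neither is copied from a real HTML schema.
TWO_ALPHA = re.compile(r"[a-z]{2}")            # accepts only 2-letter codes
TWO_OR_THREE_ALPHA = re.compile(r"[a-z]{2,3}")  # also accepts 3-letter codes


def is_valid(code: str, pattern: re.Pattern) -> bool:
    """Return True when the whole code matches the given pattern."""
    return pattern.fullmatch(code) is not None


for code in ("en", "zxx"):
    # Show how each code fares under the strict and the relaxed rule.
    print(code, is_valid(code, TWO_ALPHA), is_valid(code, TWO_OR_THREE_ALPHA))
```

Under the stricter rule "zxx" is rejected even though it is the right code, which is the anomaly described above.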

Be safe & Cheers,

--Gannon  




[1] http://www.loc.gov/standards/iso639-2/iso639jac.html


--------------------------------------------
On Fri, 3/13/15, Andrea Rendine <master.skywalker.88@gmail.com> wrote:

 Subject: <code> element and scripting languages
 To: public-html-comments@w3.org
 Date: Friday, March 13, 2015, 1:19 PM
 
 I came up with the idea I
 am going to write about after reading these lines:
 "There is no
 formal way to indicate the language of computer code being
 marked up. Authors who wish to
 mark code elements with the language used, e.g. so that
 syntax highlighting scripts can use the right rules, can use
 the class attribute, e.g. by adding a class prefixed with
 "language-" to the element. (http://www.w3.org/html/wg/drafts/html/master/semantics.html#the-code-element)"
 I don't
 think this is the best way to recognize code snippets.
 @class attribute is not meant to convey any semantic
 meaning.
 On the
 other hand, I had a funny experience some days ago while
 looking at an automated translation of a page in my
 language. This page contained PHP and JS code snippets, as
 well as a native scripting language. This means that it was
 full of control expressions such as "if ... else",
 "while", "function", "print"
 and so on.
 As you can easily
 imagine, these words in the snippet had been translated,
 thus making the snippets themselves useless.
 So I
 thought: @lang could be used for this purpose on
 code-snippet elements and generally speaking in HTML
 documents.
 Why @lang? Well, I
 took this idea from seeing the WHATWG HTML spec, which is
 written in a language denoted by
 lang="en-GB-x-hixie", and I thought of an extension
 of this concept. Actually, it would be a compact way to
 declare 2 things:
 1. apart from strings
 and comments, the core of the related element is NOT the
 same "language" as the text and it is not meant to
 be translated. It doesn't stretch the meaning of the
 "lang" concept: first off, it's always a
 matter of language e.g. in non-English pages because control
 expressions are generally in English (or in a natural
 language which must not be translated, anyway), then it
 contains contraptions and abstract terms which define it as
 a real "language" which is different from the
 plain text.
 2. the element is to
 be identified according to its programming language, e.g.
 for highlighting syntax. As a side note, there's a CSS
 selector based on lang attributes, and jQuery-based
 highlighting plugins, as well as any other library based on
 CSS selectors, would benefit from this.
 When I
 thought about this, I didn't think about a change in the
 BCP47 spec, which I consider out of range. Instead, I looked
 at that specification.
 It leaves room for
 partial customization through the use of "private use
 subtags" in the form of a string consisting of
 "x-" followed by up to 8 alphabetic characters.
 This subtag can either follow a primary/regional language
 tag, or be present as stand-alone.
 Private use subtags
 are "private" by definition, and they are meant to
 be used in limited groups according to agreements specific
 for these groups. But nothing would prevent the HTML community from
 building such an "agreement" in the spec, so that a
 series of "private use subtags" such as
 "x-perl" or "x-php" can be used by Web
 authors (an agreement would be necessary for language names
 such as Javascript or C++, because they are either too long or
 contain non-alphabetic characters).
 This means that a
 snippet in the form <code lang="x-php">, for
 example, would be both easy to understand, easy to target
 for syntax highlight extensions, and able to tell its
 content apart from parent elements defining a language for
 the whole document. If, on the other hand, the snippet
 contains strings or comments in a natural language that should be
 taken into account, something like <code
 lang="en-US-x-php"> could be used.
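
 As a rough illustration of how a highlighting tool might read such a value, here is a minimal Python sketch. The function name code_language is hypothetical, and the sketch assumes a single trailing private-use subtag of 1 to 8 ASCII letters or digits after "x-" (BCP 47 allows digits as well as letters in a subtag). It is not a full BCP 47 parser.

```python
import re

# Assumption: the programming language is carried by one trailing
# private-use subtag, "x-" followed by 1 to 8 ASCII letters or digits.
PRIVATE_USE = re.compile(r"(?:^|-)x-([A-Za-z0-9]{1,8})$")


def code_language(tag: str):
    """Return the trailing private-use subtag in lower case, or None."""
    match = PRIVATE_USE.search(tag)
    return match.group(1).lower() if match else None


for tag in ("x-php", "en-US-x-php", "en-GB", "x-javascript", "x-c++"):
    # Tags without a well-formed trailing private-use subtag give None.
    print(tag, code_language(tag))
```

 Names such as "javascript" (too long) or "c++" (non-alphanumeric characters) do not fit this shape, which is why an agreed short form would be needed for them.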
 In
 "public-html" mailing list I received suggestions
 such as using translate="no" in order to prevent
 automated translation; and separately create private use
 attributes such as data-code-lang, or propose new attributes
 like programming-lang, to express the programming language.
 I should add lang="" however, because as said
 above, it is difficult to consider a code snippet like a
 paragraph in natural language.
 The different
 proposals have something really good but they're partial
 - they only focus on preventing translation or programming
 language indication. Maybe there's a way to achieve
 both.
 Please tell me what you think about it.
 Thanks.
Received on Friday, 13 March 2015 19:18:16 UTC