
Re: <code> element and scripting languages

From: Gannon Dick <gannon_dick@yahoo.com>
Date: Mon, 16 Mar 2015 11:46:33 -0700
Message-ID: <1426531593.26060.YahooMailBasic@web122906.mail.ne1.yahoo.com>
To: Andrea Rendine <master.skywalker.88@gmail.com>, Stuart Wakefield <me@stuartwakefield.co.uk>
Cc: "public-html-comments@w3.org" <public-html-comments@w3.org>
Hi Andrea,

In addition to the properties of the @lang attribute mentioned below there is another, and it overshadows the rest ... Web Specifications are "for the tourists".  This is why EUROPA.EU has 24 languages, VATICAN.VA has 10 and why "Americans" speak 108 languages @Home, last time I counted the US Census report. 

The @lang attribute is appropriate to use in conjunction with the <code> element, IMHO, as long as the effect of a partisan implementer audience is neutralized.  This is especially so with semantic data.  This problem is 1000 years old - Iceland is in fact energy self-sufficient and Greenland, well let's just say Erik the Red earned his place in the PageRank Hall of Fame.

ISO Standards have a different outlook with respect to the Open World Assumption - they would like it to work properly sometime before the tourists realize they have been played.  To do this they must account for seasonal parity shifts in time.  It is a matter of intellectual integrity for the specifications, if I didn't just answer my own question about the evolution of the WWW and GPS Navigation.

The ISO language specification is by type:
1) Terminology 2 character codes "for the tourists".
2) Bibliographic 3 character codes for scholarship without Confirmation Bias or suggesting a BCP/best answer.
3) Miscellaneous, to discourage "obvious" conclusions about a finite set of language labels.  In the case of "zxx" the code does mean "not a human language". To be frank, it means to gadgets (user agents) that "they" have not located and identified intelligent life (language tool users).  The web of things has not passed its Turing Test yet, glowing reports of imminent success notwithstanding.

--Gannon




--------------------------------------------
On Sun, 3/15/15, Stuart Wakefield <me@stuartwakefield.co.uk> wrote:

 Subject: Re: <code> element and scripting languages
 To: "Andrea Rendine" <master.skywalker.88@gmail.com>
 Cc: "public-html-comments@w3.org" <public-html-comments@w3.org>
 Date: Sunday, March 15, 2015, 3:33 AM
 
 Hi Andrea,

 Using the lang attribute to identify the programming language within an
 element's text content does seem appropriate.

 I couldn't find specific guidance in current recommendations. I do note that
 the HTML 4.01 recommendation had the following guidance in section 8.1.1:

 "The lang attribute's value is a language code that identifies a natural
 language spoken, written, or otherwise used for the communication of
 information among people. Computer languages are explicitly excluded from
 language codes."

 It is unclear whether the HTML5 recommendation drops this guidance on
 purpose. The HTML 4.01 guidance on the lang attribute does seem much clearer
 in general in usage and intent than the corresponding advice in the HTML5
 recommendation:

 "Language information specified via the lang attribute may be used by a user
 agent to control rendering in a variety of ways. Some situations where
 author-supplied language information may be helpful include:
 - Assisting search engines
 - Assisting speech synthesizers
 - Helping a user agent select glyph variants for high quality typography
 - Helping a user agent choose a set of quotation marks
 - Helping a user agent make decisions about hyphenation, ligatures, and
   spacing
 - Assisting spell checkers and grammar checkers"

 All would seem to be appropriate for this use case.

 How would it be appropriate to handle, for example, comments in a natural
 language within a section marked up as a machine readable language?

 It would be useful to know what the initial reasons were for the HTML 4.01
 authors to exclude computer / machine readable languages from the original
 recommendation, and whether the HTML5 recommendation omits this advice
 intentionally.

 In the interim my advice would be to use translate="no" and data attributes.
 To the best of my knowledge this is the most widely accepted way of
 achieving this, despite its failure to supply useful metadata about the
 content in a meaningful way.
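
As an illustration of the interim advice above, here is a minimal Python sketch of how a tool might read that markup. It assumes snippets are written as `<code translate="no" data-code-lang="...">`; the attribute name `data-code-lang` is only a suggestion raised elsewhere in this thread and is not defined by any specification, and `CodeSnippetScanner` and `scan` are hypothetical names. The sketch looks only at attributes set directly on each `<code>` element, so it ignores the fact that `translate` is inherited from ancestors.

```python
from html.parser import HTMLParser


class CodeSnippetScanner(HTMLParser):
    """Record (programming language, translatable?) for each <code> start tag."""

    def __init__(self):
        super().__init__()
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag != "code":
            return
        attributes = dict(attrs)
        # Hypothetical attribute name, taken from a suggestion in this thread.
        language = attributes.get("data-code-lang")
        # translate="no" asks translation tools to leave the content alone.
        # A valueless attribute is reported as None, hence the "or".
        translatable = (attributes.get("translate") or "").lower() != "no"
        self.snippets.append((language, translatable))


def scan(markup):
    """Return the list of (language, translatable) pairs found in markup."""
    scanner = CodeSnippetScanner()
    scanner.feed(markup)
    scanner.close()
    return scanner.snippets


if __name__ == "__main__":
    page = ('<p>Use <code translate="no" data-code-lang="php">print</code> '
            'or <code>while</code>.</p>')
    for language, translatable in scan(page):
        print(language, translatable)
```

The second snippet in the usage example carries neither attribute, which is the case where a translation tool would rewrite keywords such as "while".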

 The improvements you've suggested would, if accepted, certainly increase the
 semantic richness of this type of content over that approach.

 Stuart

 On 13 Mar 2015, at 18:19, Andrea Rendine <master.skywalker.88@gmail.com> wrote:
 
 I came up with the idea I am going to describe after reading these lines:

 "There is no formal way to indicate the language of computer code being
 marked up. Authors who wish to mark code elements with the language used,
 e.g. so that syntax highlighting scripts can use the right rules, can use
 the class attribute, e.g. by adding a class prefixed with "language-" to
 the element. (http://www.w3.org/html/wg/drafts/html/master/semantics.html#the-code-element)"

 I don't think this is the best way to recognize code snippets. The @class
 attribute is not meant to convey any semantic meaning.

 On the other hand, I had a funny experience some days ago while looking at
 an automated translation of a page in my language. This page contained PHP
 and JS code snippets, as well as a native scripting language. This means
 that it was full of control expressions such as "if ... else", "while",
 "function", "print" and so on.

 As you can easily imagine, these words in the snippets had been translated,
 thus making the snippets themselves useless.

 So I thought: @lang could be used for this purpose on code-snippet elements
 and, generally speaking, in HTML documents.

 Why @lang? Well, I took this idea from seeing the WHATWG HTML spec, which is
 written in a language denoted by lang="en-GB-x-hixie", and I thought of an
 extension of this concept. Actually, it would be a compact way to declare 2
 things:

 1. Apart from strings and comments, the core of the related element is NOT
 in the same "language" as the text and it is not meant to be translated.
 This doesn't stretch the meaning of the "lang" concept: first off, it's
 always a matter of language, e.g. in non-English pages, because control
 expressions are generally in English (or in a natural language which must
 not be translated, anyway); then, it contains contraptions and abstract
 terms which define it as a real "language" which is different from the
 plain text.

 2. The element is to be identified according to its programming language,
 e.g. for syntax highlighting. As a side note, there's a CSS selector based
 on lang attributes, and jQuery-based highlighting plugins, as well as any
 other library based on CSS selectors, would benefit from this.

 When I thought about this, I didn't think about a change in the BCP47 spec,
 which I consider out of scope. Instead, I looked at that specification.

 It leaves room for partial customization through the use of "private use
 subtags" in the form of a string consisting of "x-" followed by up to 8
 alphabetic characters. This subtag can either follow a primary/regional
 language tag, or be present stand-alone.

 Private use subtags are "private" by definition, and they are meant to be
 used in limited groups according to agreements specific to those groups.
 But nothing would prevent the HTML community from building such an
 "agreement" in the spec, so that a series of "private use subtags" such as
 "x-perl" or "x-php" can be used by Web authors (an agreement would be
 necessary for language names such as Javascript or C++, because they are
 either too long or contain non-alphabetic characters).

 This means that a snippet in the form <code lang="x-php">, for example,
 would be easy to understand, easy to target for syntax highlighting
 extensions, and able to tell its content apart from parent elements
 defining a language for the whole document. If, on the other hand, strings
 or comments in a natural language within the snippet are to be considered,
 something like <code lang="en-US-x-php"> could be used.
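
To make the proposed tag shapes concrete, here is a small Python sketch that splits a lang value such as "x-php" or "en-US-x-php" into a natural-language part and a private-use part. Reading the private-use part as a programming language is only the convention proposed in this message, not anything BCP 47 or HTML defines, and `CodeLang` and `split_code_lang` are hypothetical names. One detail differs from the description above: RFC 5646 allows digits as well as letters in private-use subtags (1 to 8 alphanumeric characters each), so the pattern accepts both.

```python
import re
from typing import NamedTuple, Optional

# Private-use sequence per RFC 5646: "x" followed by one or more subtags,
# each 1 to 8 alphanumeric characters, running to the end of the tag.
_PRIVATE_USE = re.compile(r"(?:^|-)x((?:-[a-z0-9]{1,8})+)$", re.IGNORECASE)


class CodeLang(NamedTuple):
    natural: Optional[str]      # e.g. "en-US"; None if the tag is private-use only
    programming: Optional[str]  # e.g. "php"; None if no private-use part was found


def split_code_lang(tag: str) -> CodeLang:
    """Split a lang value into natural-language and private-use parts."""
    match = _PRIVATE_USE.search(tag)
    if match is None:
        # No well-formed private-use sequence: treat the whole tag as natural.
        return CodeLang(natural=tag or None, programming=None)
    natural = tag[: match.start()] or None
    programming = match.group(1)[1:].lower()  # drop the leading hyphen
    return CodeLang(natural=natural, programming=programming)


if __name__ == "__main__":
    for value in ("x-php", "en-US-x-php", "en", "x-javascript"):
        print(value, split_code_lang(value))
```

The last value in the usage loop, "x-javascript", is not a well-formed private-use tag because "javascript" exceeds 8 characters, which is the length problem noted above for some language names.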

 In the "public-html" mailing list I received suggestions such as using
 translate="no" in order to prevent automated translation, and separately
 creating private use attributes such as data-code-lang, or proposing new
 attributes like programming-lang, to express the programming language. I
 should add lang="" however, because, as said above, it is difficult to
 consider a code snippet like a paragraph in natural language.

 The different proposals have something really good but they're partial:
 they focus only on preventing translation or on indicating the programming
 language. Maybe there's a way to achieve both.

 Please tell me what you think about it.

 Thanks.
 
Received on Monday, 16 March 2015 18:47:01 UTC
