Re: [whatwg] Proposal: Change HTML spec to allow any arbitrary value for the <meta> "name" attribute

On Tue, 4 Jun 2013, Michael[tm] Smith wrote:
> 
> The context of the proposal is the following language in the HTML spec:
> 
>   "Conformance checkers must use the information given on the WHATWG Wiki
>   MetaExtensions page to establish if a value is allowed or not: values
>   defined in this specification or marked as "proposed" or "ratified" must
>   be accepted, whereas values marked as "discontinued" or not listed in
>   either this specification or on the aforementioned page must be rejected
>   as invalid."
> 
>   http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#other-metadata-names
> 
> I propose we remove that language from the spec [...]

For the past few years, we've been running an experiment here, getting 
people to register values on the wiki:

   http://wiki.whatwg.org/wiki/MetaExtensions

We also have the rel extensions registry on the microformats wiki.

Before we go further, we should probably discuss the problem that we're 
trying to solve with these registries, instead of just allowing any value 
to be used. These are the same reasons as for most things we make non- 
conforming at the authoring level:
 - helping authors catch cases where their intent is unclear
 - helping authors catch typos
 - helping authors avoid known interoperability issues
 - helping authors avoid wasting time

(See also: http://whatwg.org/html#conformance-requirements-for-authors )

There's also the goal of just documenting what's out there, to help people 
who want to invent new values to avoid reinventing the wheel, or at least, 
to avoid reinventing it poorly.

I think these goals are reasonable, and worth pursuing. We could just have 
wide-open extension points here, allowing any value, not bothering to 
define any. But I think this would be a net loss.


>From the aforementioned experiments with wiki registries, we've learnt 
several things, which I shall now discuss. Some of these are taken from 
mailing list discussions on this thread; thanks to the contributors 
thereto. Others are taken from IRC discussions earlier this week; thanks 
to Tantek and MikeSmith in particular for their comments:

   http://krijnhoetmer.nl/irc-logs/whatwg/20140122#l-552
   http://krijnhoetmer.nl/irc-logs/whatwg/20140123#l-88


The first big lesson is that, maybe surprisingly, there's a lot of demand 
for these features. The long tail of needs for these values is very long 
indeed. Well over a hundred <meta name> keywords have been registered, and 
that ignores all the ones that people haven't bothered to register.

Simultaneously, people are unhappy when validators don't know about their 
meta names. The modern HTML validators point people to the Wiki; for 
example, validator.nu says:

   "You can register metadata names on the WHATWG wiki yourself."

People follow the link and attempt to register values all the time. The 
barrier to adding values has been low, but still not trivial: you have to 
e-mail a request for a wiki account, then when you get it a few days 
later, you have to add the item to the wiki page which means editing 
MediaWiki's table markup. We get a few requests for accounts each week. We 
also get a number of people each week filing unclear bug reports on the 
spec asking, I think, for certain values to be registered.

One lesson from this is that we could make this much easier, e.g. by 
having validators offer to register the keyword directly. The wikis for 
<meta name> and <link rel> have been wildly more successful than IETF/IANA 
registries, at least in terms of how many keywords they document. This is, 
presumably, due to the lower barrier for entry. If a goal is to document 
values used on the Web, then lowering the barrier to entry even further 
might increase the volume of documentation.

Another lesson, though, is that if we make certain values non-conforming, 
we'd better have a rather convincing message for the validators to give 
the authors. Such messages should probably include an explanation as to 
why a value is non-conforming and a description of how to achieve the 
desired effect instead. Right now, the message is just "Bad value ... for 
attribute name on element meta: Keyword ... is not registered". This may 
be the best we can do for unknown values (though maybe we could do better 
for typos, pointing people to the keyword they probably meant to use?), 
but if we start marking values as non-conforming because of issues with 
the values, we need to be clearer. Right now, for <link rel> extensions we 
have the concept of "synonyms". This probably isn't sufficient for 
validators, though, and we don't have it at all for <meta>.

Another anectodal data point is that based on the conversations I've had 
with people trying to register accounts on the WHATWG wiki so they can 
register keywords, many authors have no idea what these keywords are 
really for, despite being very sure they want to have them and not wanting 
validators to complain about them. I don't know what we can do about that, 
though. At the end of the day, an author who doesn't care, can't be taught.

Once authors have registered a keyword, one complaint that I have heard 
several times is that the validators don't immediately get updated. I 
think there would be value to having a lower latency between registration 
and validation. Ideally, one could imagine a situation where someone 
validates a page, finds a not-yet-registered value, registers it, goes to 
a different validator, and that other validator is already updated and 
knows that the values is registered. Obviously, this would require some 
real-time communication without humans in the loop after the registration.


Looking at the keywords themselves, one is struck by how very few of the 
keywords have high quality specifications. Most have either what can be 
best described as author's guides, just vague descriptions of how to use 
the values, or, in some cases, specifications that only give authoring 
conformance criteria, and don't mention consumers at all. I admittedly 
didn't perform a comprehensive search, but I couldn't find any <meta> 
keywords that had a specification that described anything like error 
handling rules for bogus values. Many keywords get registered without a 
specification at all. While we require a spec link currently, many people 
just ignore this, or provide a link to, as Mike put it, "just general 
overview documents that are only marginally related".

Many of the values are very poorly designed. Even major vendors like 
Microsoft do crazy things, as in:

   <meta name="msapplication-square70x70logo" 
         content="images/tinylogo.png">

...which is redundant with existing features (<link rel=icon>, which was 
itself a Microsoft extension), uses the wrong extension mechanism (URLs 
should use <link>), explicitly has a vendor name in the keyword despite 
being a generic concept (logos aren't just useful for Windows), and uses a 
different keyword for each size, rather than using an approach like the 
<link sizes> attribute in HTML.

There are also many duplicate ideas. For example, there are several ways 
to mark up license information registered in the MetaExtensions list; 
they're all redundant with rel=license.

There are values that make sense as part of a wider vocabulary, but that 
don't really fit the HTML story, like "da_pageTitle", which of course is 
redundant with <title>. Despite this, because those vocabularies are used 
with HTML, these values get used. Presumably, consumers find it easier to 
just use the data from the vocabulary, than to interpret HTML semantics. 
(It's easier just to grab all your <meta> values, than to grab those and 
then also look for <title>, <link rel=license>, etc.)

There are multiple groups of values that form vocabularies, but they each 
use a different syntax. For example, twitter uses the syntax 
"twitter:...", Microsoft has used the syntax "msapplication-*", Decibel 
Insight uses the "da_*" form, Webtrends uses "wt.*", and Dublin Core 
supposedly uses arbitrary prefixes resulting in keywords of the form 
"prefix.*", with the "prefix" part being declared in a <link 
rel="schema.prefix"> link, which amusingly violates both the <meta> and 
<link> extension mechanisms (I'd be curious to see if anyone actually 
implements this as specified [1]), though only one prefix is actually 
registered.

[1] http://dublincore.org/documents/dc-html/ section 3.2.1, I guess, 
though that doesn't actually say how interpreters should parse it.

We could probably help this a lot by providing guidance to extension 
creators:

 - how to create groups of keywords
 - where to first check to make sure there's no existing values that do 
   the same thing
 - best practices like using <link> rather than <meta> for URLs
 - suggesting that specifications should include rules on how consumers 
   are to interpret the data, including error handling


One concern that has been raised, which I share, is that many authors are 
marking up their pages with a lot of <meta>data that is never actually 
consumed by anyone. This is, fundamentally, a waste of the author's time. 
I don't really see what we can do about this, though. For example, while I 
think it's unlikely that most people marking up their pages with Dublin 
Core metadata will ever benefit from it, some people, e.g. government 
archivists, mark up all their pages with ample accurate metadata that is 
then processed by their own tools. Even if 90% of authors should avoid a 
feature, we should probably not make it non-conforming if 10% of authors 
have a perfectly valid and simple use for it.

It would be tempting to have the validators ask the author for which 
consuming software package they are planning on processing their metadata 
with (and mark uses that have no target software as non-conforming), but I 
can't really see how to make that work.

Other authors do use the values, but do so within a small community, where 
the benefits of public registration are unclear, and where typos would be 
caught by their own software, without the help of a validator.


Currently (past few years), the rel= registry is maintained (mainly in 
terms of spam gardening) by the Microformats community, and the <meta> 
registry is maintained mostly by Mike, Anne, and Henri (mainly in terms of 
filtering poorly registered values), but in general the registries are not 
that opinionated -- there's nobody throwing out "application-url" as a 
meta name, for example, despite the fact that it's not really appropriate 
(rel=canonical or some other rel= values would be better).

Maybe we could improve this, e.g. by more aggressively categorising 
keywords, maybe automatically rejecting keywords that linger for a period 
of time (like a year or two), maybe basing it in part on volume of 
validator traffic for a keyword.

We also have some rel="" and <meta> keywords that have a preferred 
position, in the HTML spec itself. I think this makes sense for values 
that have clear and broadly applicable use cases, and, for values that 
have a long history, a proven implementation record: it indicates to 
authors that these are not going to go anywhere, it indicates to 
implementors that these are values worth implementing. One of the things 
that has been lacking recently is the transfer of these kinds of values 
from the registries into the spec itself.

This kind of review, however, is not obviously compatible with a mechanism 
by which anyone can register a keyword and have it immediately supported 
by all validators. This suggests we need several tiers, e.g. unknown, 
provisionally registered, accepted.


So where does all this get us.

Imagine the following scenarios:

A user goes to http://validator.nu/ and validates their page. The one 
problem they find is that they used <meta name="target-audience" 
content="expert">, and that name isn't recognised.

The validator says:

   Error: The keyword "target-audience" is not a known metadata name, yet 
   it used on the <meta> element on line 5:
     </title>↩ <meta name="target-audience"↩content="expert">↩</hea
                           ^^^^^^^^^^^^^^^
   _View_all_registered_metadata_names_.
   ▶ Register the "target-audience" metadata name

The user clicks the "Register" disclosure triangle, and down pops:

   ▼ Register the "target-audience" metadata name

       Metadata keyword name: [ target-audience     ]

       Briefly describe the purpose of this keyword:
         __________________________________
        |                                  |
        |                                  |
        |__________________________________|

       What kind of metadata keyword is this?

        (o) This keyword is defined in a public specification.
            Specification: [ http://                             ]

        ( ) This keyword is for private use within my organisation or
            for personal use on my site.

       Why are you providing this metadata?

        (o) I am providing it so that specific software can read the data.
            Name of software: [                                  ]

        ( ) I am providing it in the hope that it will one day be useful.

        ( ) Other: [                                             ]            

       (( Submit Registration ))
       The URL of the Web page you validated, as well as your IP address, 
       will be recorded for spam fighting purposes. The information 
       provided on this form will be publicly visible.

The user fills in the form, and presses the button. The validator contacts 
a central registry system for <meta name> keywords, and provides the 
information. The central registry informs all the validators that the 
registry has been updated.

The user validates the page with a different validator. The page is marked 
as valid, since the keyword was just registered.

I think this would be a pretty good experience for authors.


The user later goes back to http://validator.nu/ and validates a new page. 
The one problem they find is that they used <meta name="targetaudience" 
content="beginner">, and that name isn't recognised.

The validator says:

   Error: The keyword "targetaudience" is not a known metadata name, yet 
   it used on the <meta> element on line 5:
     </title>↩ <meta name="targetaudience"↩content="beginner">↩</hea
                           ^^^^^^^^^^^^^^
   Did you mean: _target-audience_ ?

   _View_all_registered_metadata_names_.
   ▶ Register the "targetaudience" metadata name

Here, the link with the text "target-audience" is a link to the 
specification of that link type, as registered earlier, or maybe a link to 
http://meta.registries.whatwg.org/target-audience or some such, which 
could list historical information about that keyword, as well as a link to 
the specification and any other annotations about it.


Later, we determine that "target-audience" is redundant with the 
earlier-registered, more widely used, and more widely supported, 
"dcterms.audience". So, Mike updates the registry accordingly. Later 
still, the aforementioned author revalidates the original page, and gets:

   Error: The <meta> element on line 5 uses an obsolete metadata name.
   Instead of using the metadata name "target-audience", consider
   using the metadata name "dcterms.audience". This keyword is more widely 
   implemented.
     </title>↩ <meta name="target-audience"↩content="expert">↩</hea
                           ^^^^^^^^^^^^^^^
   _View_further_information_about_"dcterms.audience"_.
   _View_further_information_about_"target-audience"_.
   _View_all_registered_metadata_names_.
   ▶ Update registration for "target-audience" metadata name

If the user drops down the disclosure triangle, they get:

   ▼ Update registration for "target-audience" metadata name

       Metadata keyword name: [ target-audience     ]

       This metadata name is non-conforming and has the following 
       associated message:

          Instead of using the metadata name "target-audience", consider 
          using the metadata name "dcterms.audience". This keyword is more 
          widely implemented.

       This was the result of _an_update_ on 2014-09-01 by MikeSmith™‬.

       If you believe this is an error, please report this problem to the 
       _WHATWG_mailing_list_ or _file_a_bug_.

The last two links would point to http://www.whatwg.org/mailing-list and 
http://whatwg.org/newbug accordingly. The "an update" link would point to 
the history of the keyword's registration.


Let's say the author in question contacts us to inform us that 
"target-audience" is used by his organisation internally, and we update 
the registry accordingly. Now when the author validates, they see (after 
opening the first disclosure triangle):

   This document is valid HTML!
   ▼ This document used one proprietary <meta> metadata name.
       The <meta> element on line 5 uses the keyword "target-audience".
       This metadata name has been registered for private use only.
       Instead of using the metadata name "target-audience", consider 
       using the metadata name "dcterms.audience". This keyword is more 
       widely implemented.

       _View_further_information_about_"dcterms.audience"_.
       _View_further_information_about_"target-audience"_.
       _View_all_registered_metadata_names_.
       ▶ Update registration for "target-audience" metadata name


Are these scenarios a good idea?

One could imagine going further, e.g. supporting groups of keywords that 
come from a single vocabulary, and having higher-level information about 
the vocabulary itself, e.g. "This document used 5 keywords from the 
obsolete Irish Core vocabulary", rather than listing all five keywords 
independently. But I've not attempted to consider doing this at this time.

We may also wish to distinguish between keywords that have full specs and 
keywords that have inadequate documentation. For example:

   This document is valid HTML!
   ▶ This document used three <meta> metadata names that lack complete
     specifications. Interoperability problems may result.


If we want to go down this route, I think we'd probably want to set up a 
server for tracking the keywords, rather than using a generic wiki. This 
would allow us to provide an API for validators to send and receive 
updates (we could do this over HTTP, or, probably better, long-lived TCP 
connections so that validators don't have to poll the server).

Keywords would basically be records with the following information, along 
with a history which would track what was changed, who changed it, and 
when they changed it:

 - keyword name: probably immutable, and used as a key

 - status, one of:

    - unknown: anything that hasn't ever been registered. These keywords 
      would get the message in the first scenario, and offer to have the 
      keyword registered. (I guess this isn't really a status any of the 
      records could have, it's the implicit status of anything not in the 
      registry so far.)

    - provisionally registered public: the initial state of a keyword when 
      it is added to the system as a specced keyword.

    - provisionally registered proprietary: the initial state of a keyword 
      when it is added to the system as a proprietary keyword.

    - conforming interoperable good practice: the keyword is good. It has 
      a specification.

    - conforming but inadequately documented: the keyword is good, but it 
      lacks a comprehensive specification.

    - conforming but proprietary: the keyword is intended just for private 
      use; it has been deemed not useful for wide use, not implemented by 
      widely used UAs, but used in private and not harmful.

    - non-conforming: the keyword has been examined and deemed poor 
      practice; it has better alternatives.

 - brief description

 - spec link

 - suggestion text (the "Instead of..." text in the examples above), in a 
   form that allows related metadata names to be recognised (so that the 
   link to dcterms.audience can be pulled out). Blank for the first four 
   statuses above, non-blank for the last three.

There would be a protocol for validators to speak, which would have to 
have at least the following API features:

 - Get a list of all the registered keywords, and their current state, 
   and information about when the last change was made.

 - Add a keyword.

 - Notification that a new type has been added or the state of an existing 
   type has been changed.


This is a proposal; I haven't set anything up to do this yet. I would be 
interested in knowing whether people see any problems with this, or see 
any better ways to address the underlying problems listed earlier (or 
indeed, if people think the problem description is incomplete or wrong).


We could do more clever things, for example, checking that the provided 
spec link actually contains the relevant keyword name. I'm not sure if 
it's worth it, but it might be interesting to do.

Another thing that we could look at is the metadata _values_. Right now, 
we're not doing any checking at all of the content="" attribute, but many 
of the keywords have limited value spaces that we could check. It's 
probably best to have the validators implement checking for the 
"conforming interoperable good practice" keywords explicitly, though, 
rather than trying to make this more generic.

The above is mainly focused on <meta>, but many extension points need 
serious work, not just <meta name>. <link rel>, for instance, but also 
Content-Type, URL schemes, etc. I think we should probably start with 
<meta name>, and move forward from there once we have more experience with 
that one, since that's the only one for which the WHATWG is alone in 
trying to document the name space. Once we've proven that the model works, 
assuming it works, we can approach other registries and show them our 
experience. (And if it doesn't work, then we can keep trying to work on 
<meta name> instead.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 24 January 2014 21:06:06 UTC