Re: [whatwg] Proposal: Change HTML spec to allow any arbitrary value for the <meta> "name" attribute from Ian Hickson on 2014-01-24 (public-whatwg-archive@w3.org from January 2014)

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 24 Jan 2014 21:05:33 +0000 (UTC)
To: "whatwg@lists.whatwg.org" <whatwg@lists.whatwg.org>
Message-ID: <alpine.DEB.2.00.1401230002130.26647@ps20323.dreamhostps.com>

On Tue, 4 Jun 2013, Michael[tm] Smith wrote:
>
> The context of the proposal is the following language in the HTML spec:
>
> "Conformance checkers must use the information given on the WHATWG Wiki
> MetaExtensions page to establish if a value is allowed or not: values
> defined in this specification or marked as "proposed" or "ratified" must
> be accepted, whereas values marked as "discontinued" or not listed in
> either this specification or on the aforementioned page must be rejected
> as invalid."
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#other-metadata-names
>
> I propose we remove that language from the spec [...]

For the past few years, we've been running an experiment here, getting
people to register values on the wiki:

http://wiki.whatwg.org/wiki/MetaExtensions

We also have the rel extensions registry on the microformats wiki.

Before we go further, we should probably discuss the problem that we're
trying to solve with these registries, instead of just allowing any value
to be used. These are the same reasons as for most things we make non-
conforming at the authoring level:
- helping authors catch cases where their intent is unclear
- helping authors catch typos
- helping authors avoid known interoperability issues
- helping authors avoid wasting time

(See also: http://whatwg.org/html#conformance-requirements-for-authors )

There's also the goal of just documenting what's out there, to help people
who want to invent new values to avoid reinventing the wheel, or at least,
to avoid reinventing it poorly.

I think these goals are reasonable, and worth pursuing. We could just have
wide-open extension points here, allowing any value, not bothering to
define any. But I think this would be a net loss.

From the aforementioned experiments with wiki registries, we've learnt
several things, which I shall now discuss. Some of these are taken from
mailing list discussions on this thread; thanks to the contributors
thereto. Others are taken from IRC discussions earlier this week; thanks
to Tantek and MikeSmith in particular for their comments:

http://krijnhoetmer.nl/irc-logs/whatwg/20140122#l-552
http://krijnhoetmer.nl/irc-logs/whatwg/20140123#l-88

The first big lesson is that, maybe surprisingly, there's a lot of demand
for these features. The long tail of needs for these values is very long
indeed. Well over a hundred <meta name> keywords have been registered, and
that ignores all the ones that people haven't bothered to register.

Simultaneously, people are unhappy when validators don't know about their
meta names. The modern HTML validators point people to the Wiki; for
example, validator.nu says:

"You can register metadata names on the WHATWG wiki yourself."

People follow the link and attempt to register values all the time. The
barrier to adding values has been low, but still not trivial: you have to
e-mail a request for a wiki account, then when you get it a few days
later, you have to add the item to the wiki page which means editing
MediaWiki's table markup. We get a few requests for accounts each week. We
also get a number of people each week filing unclear bug reports on the
spec asking, I think, for certain values to be registered.

One lesson from this is that we could make this much easier, e.g. by
having validators offer to register the keyword directly. The wikis for
<meta name> and <link rel> have been wildly more successful than IETF/IANA
registries, at least in terms of how many keywords they document. This is,
presumably, due to the lower barrier for entry. If a goal is to document
values used on the Web, then lowering the barrier to entry even further
might increase the volume of documentation.

Another lesson, though, is that if we make certain values non-conforming,
we'd better have a rather convincing message for the validators to give
the authors. Such messages should probably include an explanation as to
why a value is non-conforming and a description of how to achieve the
desired effect instead. Right now, the message is just "Bad value ... for
attribute name on element meta: Keyword ... is not registered". This may
be the best we can do for unknown values (though maybe we could do better
for typos, pointing people to the keyword they probably meant to use?),
but if we start marking values as non-conforming because of issues with
the values, we need to be clearer. Right now, for <link rel> extensions we
have the concept of "synonyms". This probably isn't sufficient for
validators, though, and we don't have it at all for <meta>.

Another anecdotal data point is that based on the conversations I've had
with people trying to register accounts on the WHATWG wiki so they can
register keywords, many authors have no idea what these keywords are
really for, despite being very sure they want to have them and not wanting
validators to complain about them. I don't know what we can do about that,
though. At the end of the day, an author who doesn't care, can't be taught.

Once authors have registered a keyword, one complaint that I have heard
several times is that the validators don't immediately get updated. I
think there would be value to having a lower latency between registration
and validation. Ideally, one could imagine a situation where someone
validates a page, finds a not-yet-registered value, registers it, goes to
a different validator, and that other validator is already updated and
knows that the value is registered. Obviously, this would require some
real-time communication without humans in the loop after the registration.

Looking at the keywords themselves, one is struck by how very few of the
keywords have high quality specifications. Most have either what can be
best described as author's guides, just vague descriptions of how to use
the values, or, in some cases, specifications that only give authoring
conformance criteria, and don't mention consumers at all. I admittedly
didn't perform a comprehensive search, but I couldn't find any <meta>
keywords that had a specification that described anything like error
handling rules for bogus values. Many keywords get registered without a
specification at all. While we require a spec link currently, many people
just ignore this, or provide a link to, as Mike put it, "just general
overview documents that are only marginally related".

Many of the values are very poorly designed. Even major vendors like
Microsoft do crazy things, as in:

...which is redundant with existing features (<link rel=icon>, which was
itself a Microsoft extension), uses the wrong extension mechanism (URLs
should use <link>), explicitly has a vendor name in the keyword despite
being a generic concept (logos aren't just useful for Windows), and uses a
different keyword for each size, rather than using an approach like the
<link sizes> attribute in HTML.

There are also many duplicate ideas. For example, there are several ways
to mark up license information registered in the MetaExtensions list;
they're all redundant with rel=license.

There are values that make sense as part of a wider vocabulary, but that
don't really fit the HTML story, like "da_pageTitle", which of course is
redundant with <title>. Despite this, because those vocabularies are used
with HTML, these values get used. Presumably, consumers find it easier to
just use the data from the vocabulary, than to interpret HTML semantics.
(It's easier just to grab all your <meta> values, than to grab those and
then also look for <title>, <link rel=license>, etc.)

There are multiple groups of values that form vocabularies, but they each
use a different syntax. For example, twitter uses the syntax
"twitter:...", Microsoft has used the syntax "msapplication-*", Decibel
Insight uses the "da_*" form, Webtrends uses "wt.*", and Dublin Core
supposedly uses arbitrary prefixes resulting in keywords of the form
"prefix.*", with the "prefix" part being declared in a <link
rel="schema.prefix"> link, which amusingly violates both the <meta> and
<link> extension mechanisms (I'd be curious to see if anyone actually
implements this as specified [1]), though only one prefix is actually
registered.

[1] http://dublincore.org/documents/dc-html/ section 3.2.1, I guess,
though that doesn't actually say how interpreters should parse it.

We could probably help this a lot by providing guidance to extension
creators:

- how to create groups of keywords
- where to first check to make sure there are no existing values that do
the same thing
- best practices like using <link> rather than <meta> for URLs
- suggesting that specifications should include rules on how consumers
are to interpret the data, including error handling

One concern that has been raised, which I share, is that many authors are
marking up their pages with a lot of <meta>data that is never actually
consumed by anyone. This is, fundamentally, a waste of the author's time.
I don't really see what we can do about this, though. For example, while I
think it's unlikely that most people marking up their pages with Dublin
Core metadata will ever benefit from it, some people, e.g. government
archivists, mark up all their pages with ample accurate metadata that is
then processed by their own tools. Even if 90% of authors should avoid a
feature, we should probably not make it non-conforming if 10% of authors
have a perfectly valid and simple use for it.

It would be tempting to have the validators ask the author for which
consuming software package they are planning on processing their metadata
with (and mark uses that have no target software as non-conforming), but I
can't really see how to make that work.

Other authors do use the values, but do so within a small community, where
the benefits of public registration are unclear, and where typos would be
caught by their own software, without the help of a validator.

Currently (past few years), the rel= registry is maintained (mainly in
terms of spam gardening) by the Microformats community, and the <meta>
registry is maintained mostly by Mike, Anne, and Henri (mainly in terms of
filtering poorly registered values), but in general the registries are not
that opinionated -- there's nobody throwing out "application-url" as a
meta name, for example, despite the fact that it's not really appropriate
(rel=canonical or some other rel= values would be better).

Maybe we could improve this, e.g. by more aggressively categorising
keywords, maybe automatically rejecting keywords that linger for a period
of time (like a year or two), maybe basing it in part on volume of
validator traffic for a keyword.
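
As a rough illustration of the "linger for a period of time" idea, here is a
minimal Python sketch. Everything in it is an assumption made for the example:
the two-year cut-off, the function name, the keyword names, and the dates are
all made up, and nothing like this exists in the registry today.

```python
from datetime import date, timedelta

# Hypothetical cut-off; the text above only suggests "a year or two".
MAX_PROVISIONAL_AGE = timedelta(days=2 * 365)


def lingering_keywords(registered_on, today, max_age=MAX_PROVISIONAL_AGE):
    """Return provisional keywords registered more than max_age ago, sorted."""
    return sorted(
        name for name, when in registered_on.items()
        if today - when > max_age
    )


# Made-up keyword names and registration dates, for illustration only.
provisional = {
    "example-new": date(2014, 1, 24),
    "example-old": date(2011, 6, 1),
}
stale = lingering_keywords(provisional, today=date(2014, 9, 1))
print(stale)
```

A real policy would presumably also weigh validator traffic for the keyword,
as suggested above, rather than age alone.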

We also have some rel="" and <meta> keywords that have a preferred
position, in the HTML spec itself. I think this makes sense for values
that have clear and broadly applicable use cases, and, for values that
have a long history, a proven implementation record: it indicates to
authors that these are not going to go anywhere, it indicates to
implementors that these are values worth implementing. One of the things
that has been lacking recently is the transfer of these kinds of values
from the registries into the spec itself.

This kind of review, however, is not obviously compatible with a mechanism
by which anyone can register a keyword and have it immediately supported
by all validators. This suggests we need several tiers, e.g. unknown,
provisionally registered, accepted.

So where does all this get us?

Imagine the following scenarios:

A user goes to http://validator.nu/ and validates their page. The one
problem they find is that they used <meta name="target-audience"
content="expert">, and that name isn't recognised.

The validator says:

Error: The keyword "target-audience" is not a known metadata name, yet
it is used on the <meta> element on line 5:
</title>↩ <meta name="target-audience"↩content="expert">↩</hea
^^^^^^^^^^^^^^^
_View_all_registered_metadata_names_.
▶ Register the "target-audience" metadata name

The user clicks the "Register" disclosure triangle, and down pops:

▼ Register the "target-audience" metadata name

Metadata keyword name: [ target-audience ]

Briefly describe the purpose of this keyword:
__________________________________
| |
| |
|__________________________________|

What kind of metadata keyword is this?

(o) This keyword is defined in a public specification.
Specification: [ http:// ]

( ) This keyword is for private use within my organisation or
for personal use on my site.

Why are you providing this metadata?

(o) I am providing it so that specific software can read the data.
Name of software: [ ]

( ) I am providing it in the hope that it will one day be useful.

( ) Other: [ ]

(( Submit Registration ))
The URL of the Web page you validated, as well as your IP address,
will be recorded for spam fighting purposes. The information
provided on this form will be publicly visible.

The user fills in the form, and presses the button. The validator contacts
a central registry system for <meta name> keywords, and provides the
information. The central registry informs all the validators that the
registry has been updated.

The user validates the page with a different validator. The page is marked
as valid, since the keyword was just registered.

I think this would be a pretty good experience for authors.

The user later goes back to http://validator.nu/ and validates a new page.
The one problem they find is that they used <meta name="targetaudience"
content="beginner">, and that name isn't recognised.

The validator says:

Error: The keyword "targetaudience" is not a known metadata name, yet
it is used on the <meta> element on line 5:
</title>↩ <meta name="targetaudience"↩content="beginner">↩</hea
^^^^^^^^^^^^^^
Did you mean: _target-audience_ ?

_View_all_registered_metadata_names_.
▶ Register the "targetaudience" metadata name

Here, the link with the text "target-audience" is a link to the
specification of that metadata name, as registered earlier, or maybe a link to
http://meta.registries.whatwg.org/target-audience or some such, which
could list historical information about that keyword, as well as a link to
the specification and any other annotations about it.
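
One way a validator might produce the "Did you mean" line is plain fuzzy string
matching against the registered names. The sketch below uses Python's standard
difflib module; the function name, the 0.8 similarity cut-off, and the sample
list of registered names are assumptions for illustration, not anything
validator.nu is known to do.

```python
import difflib


def suggest_keyword(unknown, registered, cutoff=0.8):
    """Return the closest registered name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(unknown, registered, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Sample registry contents, chosen only to mirror the scenario above.
registered = ["description", "keywords", "dcterms.audience", "target-audience"]
print(suggest_keyword("targetaudience", registered))  # target-audience
```

Returning None when nothing clears the cut-off lets the validator fall back to
the plain "not a known metadata name" message.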

Later, we determine that "target-audience" is redundant with the
earlier-registered, more widely used, and more widely supported,
"dcterms.audience". So, Mike updates the registry accordingly. Later
still, the aforementioned author revalidates the original page, and gets:

Error: The <meta> element on line 5 uses an obsolete metadata name.
Instead of using the metadata name "target-audience", consider
using the metadata name "dcterms.audience". This keyword is more widely
implemented.
</title>↩ <meta name="target-audience"↩content="expert">↩</hea
^^^^^^^^^^^^^^^
_View_further_information_about_"dcterms.audience"_.
_View_further_information_about_"target-audience"_.
_View_all_registered_metadata_names_.
▶ Update registration for "target-audience" metadata name

If the user drops down the disclosure triangle, they get:

▼ Update registration for "target-audience" metadata name

Metadata keyword name: [ target-audience ]

This metadata name is non-conforming and has the following
associated message:

Instead of using the metadata name "target-audience", consider
using the metadata name "dcterms.audience". This keyword is more
widely implemented.

This was the result of _an_update_ on 2014-09-01 by MikeSmith™.

If you believe this is an error, please report this problem to the
_WHATWG_mailing_list_ or _file_a_bug_.

The last two links would point to http://www.whatwg.org/mailing-list and
http://whatwg.org/newbug accordingly. The "an update" link would point to
the history of the keyword's registration.

Let's say the author in question contacts us to inform us that
"target-audience" is used by his organisation internally, and we update
the registry accordingly. Now when the author validates, they see (after
opening the first disclosure triangle):

This document is valid HTML!
▼ This document used one proprietary <meta> metadata name.
The <meta> element on line 5 uses the keyword "target-audience".
This metadata name has been registered for private use only.
Instead of using the metadata name "target-audience", consider
using the metadata name "dcterms.audience". This keyword is more
widely implemented.

_View_further_information_about_"dcterms.audience"_.
_View_further_information_about_"target-audience"_.
_View_all_registered_metadata_names_.
▶ Update registration for "target-audience" metadata name

Are these scenarios a good idea?

One could imagine going further, e.g. supporting groups of keywords that
come from a single vocabulary, and having higher-level information about
the vocabulary itself, e.g. "This document used 5 keywords from the
obsolete Irish Core vocabulary", rather than listing all five keywords
independently. But I've not attempted to consider doing this at this time.

We may also wish to distinguish between keywords that have full specs and
keywords that have inadequate documentation. For example:

This document is valid HTML!
▶ This document used three <meta> metadata names that lack complete
specifications. Interoperability problems may result.

If we want to go down this route, I think we'd probably want to set up a
server for tracking the keywords, rather than using a generic wiki. This
would allow us to provide an API for validators to send and receive
updates (we could do this over HTTP, or, probably better, long-lived TCP
connections so that validators don't have to poll the server).

Keywords would basically be records with the following information, along
with a history which would track what was changed, who changed it, and
when they changed it:

- keyword name: probably immutable, and used as a key

- status, one of:

- unknown: anything that hasn't ever been registered. These keywords
would get the message in the first scenario, and offer to have the
keyword registered. (I guess this isn't really a status any of the
records could have, it's the implicit status of anything not in the
registry so far.)

- provisionally registered public: the initial state of a keyword when
it is added to the system as a specced keyword.

- provisionally registered proprietary: the initial state of a keyword
when it is added to the system as a proprietary keyword.

- conforming interoperable good practice: the keyword is good. It has
a specification.

- conforming but inadequately documented: the keyword is good, but it
lacks a comprehensive specification.

- conforming but proprietary: the keyword is intended just for private
use; it has been deemed not useful for wide use, not implemented by
widely used UAs, but used in private and not harmful.

- non-conforming: the keyword has been examined and deemed poor
practice; it has better alternatives.

- brief description

- spec link

- suggestion text (the "Instead of..." text in the examples above), in a
form that allows related metadata names to be recognised (so that the
link to dcterms.audience can be pulled out). Blank for the first four
statuses above, non-blank for the last three.
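
To make the shape of such a record concrete, here is a minimal Python sketch.
The class, field, and constant names are my own assumptions, the change
history is left out, and the consistency check only encodes the "blank for the
first four statuses, non-blank for the last three" rule from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNKNOWN = "unknown"
    PROVISIONAL_PUBLIC = "provisionally registered public"
    PROVISIONAL_PROPRIETARY = "provisionally registered proprietary"
    CONFORMING_GOOD = "conforming interoperable good practice"
    CONFORMING_UNDERDOCUMENTED = "conforming but inadequately documented"
    CONFORMING_PROPRIETARY = "conforming but proprietary"
    NON_CONFORMING = "non-conforming"


# The last three statuses carry suggestion text; the first four leave it blank.
NEEDS_SUGGESTION = {
    Status.CONFORMING_UNDERDOCUMENTED,
    Status.CONFORMING_PROPRIETARY,
    Status.NON_CONFORMING,
}


@dataclass
class KeywordRecord:
    name: str              # used as the key; treated as immutable by convention
    status: Status
    description: str = ""
    spec_link: str = ""
    suggestion: str = ""   # the "Instead of..." text

    def is_consistent(self):
        """True if suggestion text is present exactly when the status needs it."""
        return bool(self.suggestion) == (self.status in NEEDS_SUGGESTION)


record = KeywordRecord(
    name="target-audience",
    status=Status.NON_CONFORMING,
    suggestion='Instead of "target-audience", consider "dcterms.audience".',
)
print(record.is_consistent())  # True
```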

There would be a protocol for validators to speak, which would have to
have at least the following API features:

- Get a list of all the registered keywords, and their current state,
and information about when the last change was made.

- Add a keyword.

- Notification that a new keyword has been added or the state of an existing
keyword has been changed.
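
The three features above can be sketched as an in-memory Python class. This is
only a stand-in to show the shape of the API: there is no networking, the
callbacks merely play the role of connected validators, and all names and the
string representation of states and timestamps are assumptions.

```python
class Registry:
    """In-memory stand-in for the central <meta name> registry."""

    def __init__(self):
        self._records = {}    # keyword name -> (state, last-changed timestamp)
        self._listeners = []  # callbacks standing in for connected validators

    def subscribe(self, callback):
        """Register a callback invoked as callback(name, state, changed_at)."""
        self._listeners.append(callback)

    def list_keywords(self):
        """Return a copy of all registered keywords with state and last change."""
        return dict(self._records)

    def add_keyword(self, name, state, changed_at):
        """Add a new keyword and notify every subscriber."""
        if name in self._records:
            raise ValueError(f"{name!r} is already registered")
        self._records[name] = (state, changed_at)
        self._notify(name)

    def set_state(self, name, state, changed_at):
        """Change the state of an existing keyword and notify every subscriber."""
        if name not in self._records:
            raise KeyError(name)
        self._records[name] = (state, changed_at)
        self._notify(name)

    def _notify(self, name):
        state, changed_at = self._records[name]
        for callback in self._listeners:
            callback(name, state, changed_at)


registry = Registry()
seen = []
registry.subscribe(lambda name, state, when: seen.append((name, state)))
registry.add_keyword("target-audience", "provisionally registered public", "2014-01-24")
registry.set_state("target-audience", "non-conforming", "2014-09-01")
print(seen)
```

Pushing changes to subscribers, rather than having them poll, is what would
let a second validator accept a keyword moments after it was registered.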

This is a proposal; I haven't set anything up to do this yet. I would be
interested in knowing whether people see any problems with this, or see
any better ways to address the underlying problems listed earlier (or
indeed, if people think the problem description is incomplete or wrong).

We could do more clever things, for example, checking that the provided
spec link actually contains the relevant keyword name. I'm not sure if
it's worth it, but it might be interesting to do.
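
A crude version of that check could look like the Python sketch below. It
assumes a case-insensitive substring match is good enough, that the spec is
served as UTF-8 text, and that the function names are free to choose; none of
this is an existing validator feature.

```python
from urllib.request import urlopen


def text_mentions_keyword(text, keyword):
    """True if keyword occurs anywhere in text, ignoring case."""
    return keyword.casefold() in text.casefold()


def spec_mentions_keyword(spec_url, keyword, timeout=10):
    """Fetch spec_url and report whether the document mentions keyword."""
    with urlopen(spec_url, timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return text_mentions_keyword(text, keyword)


sample = 'Authors may use <meta name="Target-Audience" content="expert">.'
print(text_mentions_keyword(sample, "target-audience"))  # True
```

A substring match will produce false positives for short or generic names, so
at best this would be a hint for whoever reviews the registration.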

Another thing that we could look at is the metadata _values_. Right now,
we're not doing any checking at all of the content="" attribute, but many
of the keywords have limited value spaces that we could check. It's
probably best to have the validators implement checking for the
"conforming interoperable good practice" keywords explicitly, though,
rather than trying to make this more generic.

The above is mainly focused on <meta>, but many extension points need
serious work, not just <meta name>. <link rel>, for instance, but also
Content-Type, URL schemes, etc. I think we should probably start with
<meta name>, and move forward from there once we have more experience with
that one, since that's the only one for which the WHATWG is alone in
trying to document the name space. Once we've proven that the model works,
assuming it works, we can approach other registries and show them our
experience. (And if it doesn't work, then we can keep trying to work on
<meta name> instead.)

--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 24 January 2014 21:06:06 UTC