- From: Ian Hickson <ian@hixie.ch>
- Date: Fri, 24 Jan 2014 21:05:33 +0000 (UTC)
- To: "whatwg@lists.whatwg.org" <whatwg@lists.whatwg.org>
- Message-ID: <alpine.DEB.2.00.1401230002130.26647@ps20323.dreamhostps.com>
On Tue, 4 Jun 2013, Michael[tm] Smith wrote:
>
> The context of the proposal is the following language in the HTML spec:
>
> "Conformance checkers must use the information given on the WHATWG Wiki
> MetaExtensions page to establish if a value is allowed or not: values
> defined in this specification or marked as "proposed" or "ratified" must
> be accepted, whereas values marked as "discontinued" or not listed in
> either this specification or on the aforementioned page must be rejected
> as invalid."
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#other-metadata-names
>
> I propose we remove that language from the spec [...]

For the past few years, we've been running an experiment here, getting people to register values on the wiki:

   http://wiki.whatwg.org/wiki/MetaExtensions

We also have the rel extensions registry on the microformats wiki.

Before we go further, we should probably discuss the problem that we're trying to solve with these registries, instead of just allowing any value to be used. These are the same reasons as for most things we make non-conforming at the authoring level:

 - helping authors catch cases where their intent is unclear
 - helping authors catch typos
 - helping authors avoid known interoperability issues
 - helping authors avoid wasting time

(See also: http://whatwg.org/html#conformance-requirements-for-authors )

There's also the goal of just documenting what's out there, to help people who want to invent new values to avoid reinventing the wheel, or at least, to avoid reinventing it poorly.

I think these goals are reasonable, and worth pursuing. We could just have wide-open extension points here, allowing any value, not bothering to define any. But I think this would be a net loss.

From the aforementioned experiments with wiki registries, we've learnt several things, which I shall now discuss. Some of these are taken from mailing list discussions on this thread; thanks to the contributors thereto.
Others are taken from IRC discussions earlier this week; thanks to Tantek and MikeSmith in particular for their comments:

   http://krijnhoetmer.nl/irc-logs/whatwg/20140122#l-552
   http://krijnhoetmer.nl/irc-logs/whatwg/20140123#l-88

The first big lesson is that, maybe surprisingly, there's a lot of demand for these features. The long tail of needs for these values is very long indeed. Well over a hundred <meta name> keywords have been registered, and that ignores all the ones that people haven't bothered to register.

Simultaneously, people are unhappy when validators don't know about their meta names. The modern HTML validators point people to the Wiki; for example, validator.nu says: "You can register metadata names on the WHATWG wiki yourself." People follow the link and attempt to register values all the time.

The barrier to adding values has been low, but still not trivial: you have to e-mail a request for a wiki account, then when you get it a few days later, you have to add the item to the wiki page, which means editing MediaWiki's table markup. We get a few requests for accounts each week. We also get a number of people each week filing unclear bug reports on the spec asking, I think, for certain values to be registered.

One lesson from this is that we could make this much easier, e.g. by having validators offer to register the keyword directly.

The wikis for <meta name> and <link rel> have been wildly more successful than IETF/IANA registries, at least in terms of how many keywords they document. This is, presumably, due to the lower barrier to entry. If a goal is to document values used on the Web, then lowering the barrier to entry even further might increase the volume of documentation.

Another lesson, though, is that if we make certain values non-conforming, we'd better have a rather convincing message for the validators to give the authors.
Such messages should probably include an explanation as to why a value is non-conforming and a description of how to achieve the desired effect instead. Right now, the message is just "Bad value ... for attribute name on element meta: Keyword ... is not registered". This may be the best we can do for unknown values (though maybe we could do better for typos, pointing people to the keyword they probably meant to use?), but if we start marking values as non-conforming because of issues with the values, we need to be clearer.

Right now, for <link rel> extensions we have the concept of "synonyms". This probably isn't sufficient for validators, though, and we don't have it at all for <meta>.

Another anecdotal data point is that, based on the conversations I've had with people trying to register accounts on the WHATWG wiki so they can register keywords, many authors have no idea what these keywords are really for, despite being very sure they want to have them and not wanting validators to complain about them. I don't know what we can do about that, though. At the end of the day, an author who doesn't care can't be taught.

Once authors have registered a keyword, one complaint that I have heard several times is that the validators don't immediately get updated. I think there would be value in having a lower latency between registration and validation. Ideally, one could imagine a situation where someone validates a page, finds a not-yet-registered value, registers it, goes to a different validator, and that other validator is already updated and knows that the value is registered. Obviously, this would require some real-time communication without humans in the loop after the registration.

Looking at the keywords themselves, one is struck by how very few of the keywords have high quality specifications.
Most have either what can best be described as author's guides, just vague descriptions of how to use the values, or, in some cases, specifications that only give authoring conformance criteria, and don't mention consumers at all. I admittedly didn't perform a comprehensive search, but I couldn't find any <meta> keywords that had a specification that described anything like error handling rules for bogus values.

Many keywords get registered without a specification at all. While we require a spec link currently, many people just ignore this, or provide a link to, as Mike put it, "just general overview documents that are only marginally related".

Many of the values are very poorly designed. Even major vendors like Microsoft do crazy things, as in:

   <meta name="msapplication-square70x70logo" content="images/tinylogo.png">

...which is redundant with existing features (<link rel=icon>, which was itself a Microsoft extension), uses the wrong extension mechanism (URLs should use <link>), explicitly has a vendor name in the keyword despite being a generic concept (logos aren't just useful for Windows), and uses a different keyword for each size, rather than using an approach like the <link sizes> attribute in HTML.

There are also many duplicate ideas. For example, there are several ways to mark up license information registered in the MetaExtensions list; they're all redundant with rel=license.

There are values that make sense as part of a wider vocabulary, but that don't really fit the HTML story, like "da_pageTitle", which of course is redundant with <title>. Despite this, because those vocabularies are used with HTML, these values get used. Presumably, consumers find it easier to just use the data from the vocabulary than to interpret HTML semantics. (It's easier just to grab all your <meta> values than to grab those and then also look for <title>, <link rel=license>, etc.)
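
To make that last point concrete, here is a minimal sketch, in Python, of a consumer that reads a page's metadata. The class name MetadataCollector and the sample page are invented for illustration; this is not how any particular consumer works. It shows that collecting every <meta name> pair is one branch, while also honouring <title> and <link rel=license> needs extra state and extra cases.

```python
from html.parser import HTMLParser


class MetadataCollector(HTMLParser):
    """Hypothetical consumer: collects <meta name> pairs, <title>, and rel=license."""

    def __init__(self):
        super().__init__()
        self.meta = {}       # name -> content, from <meta name=... content=...>
        self.title = None    # text found inside <title>
        self.licenses = []   # href values from <link rel=license>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "meta" and name:
            # The easy path: every vocabulary ends up in one dictionary.
            self.meta[name] = attrs.get("content") or ""
        elif tag == "link":
            # rel is a space-separated list of keywords.
            rels = (attrs.get("rel") or "").lower().split()
            if "license" in rels and attrs.get("href"):
                self.licenses.append(attrs["href"])
        elif tag == "title":
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Sample page, invented for this sketch.
page = """<!DOCTYPE html>
<title>Example page</title>
<link rel="license" href="https://example.org/license">
<meta name="da_pageTitle" content="Example page">
<meta name="twitter:card" content="summary">
"""

collector = MetadataCollector()
collector.feed(page)
collector.close()
print(collector.meta)
print(collector.title, collector.licenses)
```

A consumer that only cares about its own vocabulary can stop after the first branch of handle_starttag, which is presumably why duplicates like "da_pageTitle" exist.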
There are multiple groups of values that form vocabularies, but they each use a different syntax. For example, Twitter uses the syntax "twitter:...", Microsoft has used the syntax "msapplication-*", Decibel Insight uses the "da_*" form, Webtrends uses "wt.*", and Dublin Core supposedly uses arbitrary prefixes resulting in keywords of the form "prefix.*", with the "prefix" part being declared in a <link rel="schema.prefix"> link, which amusingly violates both the <meta> and <link> extension mechanisms (I'd be curious to see if anyone actually implements this as specified [1]), though only one prefix is actually registered.

[1] http://dublincore.org/documents/dc-html/ section 3.2.1, I guess, though that doesn't actually say how interpreters should parse it.

We could probably help this a lot by providing guidance to extension creators:

 - how to create groups of keywords
 - where to first check to make sure there are no existing values that do the same thing
 - best practices like using <link> rather than <meta> for URLs
 - suggesting that specifications should include rules on how consumers are to interpret the data, including error handling

One concern that has been raised, which I share, is that many authors are marking up their pages with a lot of <meta>data that is never actually consumed by anyone. This is, fundamentally, a waste of the author's time. I don't really see what we can do about this, though. For example, while I think it's unlikely that most people marking up their pages with Dublin Core metadata will ever benefit from it, some people, e.g. government archivists, mark up all their pages with ample accurate metadata that is then processed by their own tools. Even if 90% of authors should avoid a feature, we should probably not make it non-conforming if 10% of authors have a perfectly valid and simple use for it.
It would be tempting to have the validators ask the author which consuming software package they are planning to process their metadata with (and mark uses that have no target software as non-conforming), but I can't really see how to make that work.

Other authors do use the values, but do so within a small community, where the benefits of public registration are unclear, and where typos would be caught by their own software, without the help of a validator.

Currently (past few years), the rel= registry is maintained (mainly in terms of spam gardening) by the Microformats community, and the <meta> registry is maintained mostly by Mike, Anne, and Henri (mainly in terms of filtering poorly registered values), but in general the registries are not that opinionated -- there's nobody throwing out "application-url" as a meta name, for example, despite the fact that it's not really appropriate (rel=canonical or some other rel= values would be better). Maybe we could improve this, e.g. by more aggressively categorising keywords, maybe automatically rejecting keywords that linger for a period of time (like a year or two), maybe basing it in part on volume of validator traffic for a keyword.

We also have some rel="" and <meta> keywords that have a preferred position, in the HTML spec itself. I think this makes sense for values that have clear and broadly applicable use cases, and, for values that have a long history, a proven implementation record: it indicates to authors that these are not going to go anywhere, and it indicates to implementors that these are values worth implementing. One of the things that has been lacking recently is the transfer of these kinds of values from the registries into the spec itself.

This kind of review, however, is not obviously compatible with a mechanism by which anyone can register a keyword and have it immediately supported by all validators. This suggests we need several tiers, e.g. unknown, provisionally registered, accepted.
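
As a rough illustration of how such tiers might drive validator output, here is a minimal Python sketch. The names Tier, REGISTRY, tier_of and validator_message are hypothetical, the registry contents are made up for the example, and the policy (error for unknown, warning for provisional, silence for accepted) is only one possible mapping, not something that has been decided.

```python
from enum import Enum


class Tier(Enum):
    UNKNOWN = "unknown"
    PROVISIONAL = "provisionally registered"
    ACCEPTED = "accepted"


# Hypothetical registry snapshot, for illustration only.
REGISTRY = {
    "application-url": Tier.PROVISIONAL,
    "description": Tier.ACCEPTED,
}


def tier_of(name, registry=REGISTRY):
    """Return the tier of a metadata name; anything unregistered is UNKNOWN."""
    return registry.get(name.strip().lower(), Tier.UNKNOWN)


def validator_message(name, registry=REGISTRY):
    """Map a tier to a message; this policy is an assumption of the sketch."""
    tier = tier_of(name, registry)
    if tier is Tier.ACCEPTED:
        return None  # nothing to report
    if tier is Tier.PROVISIONAL:
        return f'Warning: "{name}" is registered but has not been reviewed yet.'
    return f'Error: "{name}" is not a known metadata name.'


for keyword in ("description", "application-url", "target-audience"):
    print(keyword, "->", validator_message(keyword))
```

The point of separating tier_of from validator_message is that the registry can accept registrations instantly (moving a name from unknown to provisional) while review, which moves a name to accepted or elsewhere, happens later and at its own pace.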
So where does all this get us? Imagine the following scenarios:

A user goes to http://validator.nu/ and validates their page. The one problem they find is that they used <meta name="target-audience" content="expert">, and that name isn't recognised. The validator says:

   Error: The keyword "target-audience" is not a known metadata name, yet
   it is used on the <meta> element on line 5:

      </title>↩ <meta name="target-audience"↩content="expert">↩</hea
                            ^^^^^^^^^^^^^^^

   _View_all_registered_metadata_names_.

   ▶ Register the "target-audience" metadata name

The user clicks the "Register" disclosure triangle, and down pops:

   ▼ Register the "target-audience" metadata name

     Metadata keyword name: [ target-audience          ]

     Briefly describe the purpose of this keyword:
      __________________________________
     |                                  |
     |                                  |
     |__________________________________|

     What kind of metadata keyword is this?
      (o) This keyword is defined in a public specification.
          Specification: [ http://                       ]
      ( ) This keyword is for private use within my organisation or for
          personal use on my site.

     Why are you providing this metadata?
      (o) I am providing it so that specific software can read the data.
          Name of software: [                            ]
      ( ) I am providing it in the hope that it will one day be useful.
      ( ) Other: [                                       ]

     (( Submit Registration ))

     The URL of the Web page you validated, as well as your IP address,
     will be recorded for spam fighting purposes. The information provided
     on this form will be publicly visible.

The user fills in the form, and presses the button. The validator contacts a central registry system for <meta name> keywords, and provides the information. The central registry informs all the validators that the registry has been updated.

The user validates the page with a different validator. The page is marked as valid, since the keyword was just registered.

I think this would be a pretty good experience for authors.

The user later goes back to http://validator.nu/ and validates a new page.
The one problem they find is that they used <meta name="targetaudience" content="beginner">, and that name isn't recognised. The validator says:

   Error: The keyword "targetaudience" is not a known metadata name, yet
   it is used on the <meta> element on line 5:

      </title>↩ <meta name="targetaudience"↩content="beginner">↩</hea
                            ^^^^^^^^^^^^^^

   Did you mean: _target-audience_ ?

   _View_all_registered_metadata_names_.

   ▶ Register the "targetaudience" metadata name

Here, the link with the text "target-audience" is a link to the specification of that metadata name, as registered earlier, or maybe a link to http://meta.registries.whatwg.org/target-audience or some such, which could list historical information about that keyword, as well as a link to the specification and any other annotations about it.

Later, we determine that "target-audience" is redundant with the earlier-registered, more widely used, and more widely supported "dcterms.audience". So, Mike updates the registry accordingly.

Later still, the aforementioned author revalidates the original page, and gets:

   Error: The <meta> element on line 5 uses an obsolete metadata name.
   Instead of using the metadata name "target-audience", consider using
   the metadata name "dcterms.audience". This keyword is more widely
   implemented.

      </title>↩ <meta name="target-audience"↩content="expert">↩</hea
                            ^^^^^^^^^^^^^^^

   _View_further_information_about_"dcterms.audience"_.
   _View_further_information_about_"target-audience"_.
   _View_all_registered_metadata_names_.

   ▶ Update registration for "target-audience" metadata name

If the user drops down the disclosure triangle, they get:

   ▼ Update registration for "target-audience" metadata name

     Metadata keyword name: [ target-audience          ]

     This metadata name is non-conforming and has the following associated
     message:

        Instead of using the metadata name "target-audience", consider
        using the metadata name "dcterms.audience". This keyword is more
        widely implemented.
     This was the result of _an_update_ on 2014-09-01 by MikeSmith™.

     If you believe this is an error, please report this problem to the
     _WHATWG_mailing_list_ or _file_a_bug_.

The last two links would point to http://www.whatwg.org/mailing-list and http://whatwg.org/newbug respectively. The "an update" link would point to the history of the keyword's registration.

Let's say the author in question contacts us to inform us that "target-audience" is used by their organisation internally, and we update the registry accordingly. Now when the author validates, they see (after opening the first disclosure triangle):

   This document is valid HTML!

   ▼ This document used one proprietary <meta> metadata name.

     The <meta> element on line 5 uses the keyword "target-audience".
     This metadata name has been registered for private use only.

     Instead of using the metadata name "target-audience", consider using
     the metadata name "dcterms.audience". This keyword is more widely
     implemented.

     _View_further_information_about_"dcterms.audience"_.
     _View_further_information_about_"target-audience"_.
     _View_all_registered_metadata_names_.

     ▶ Update registration for "target-audience" metadata name

Are these scenarios a good idea?

One could imagine going further, e.g. supporting groups of keywords that come from a single vocabulary, and having higher-level information about the vocabulary itself, e.g. "This document used 5 keywords from the obsolete Irish Core vocabulary", rather than listing all five keywords independently. But I've not attempted to consider doing this at this time.

We may also wish to distinguish between keywords that have full specs and keywords that have inadequate documentation. For example:

   This document is valid HTML!

   ▶ This document used three <meta> metadata names that lack complete
     specifications. Interoperability problems may result.

If we want to go down this route, I think we'd probably want to set up a server for tracking the keywords, rather than using a generic wiki.
This would allow us to provide an API for validators to send and receive updates (we could do this over HTTP, or, probably better, long-lived TCP connections so that validators don't have to poll the server).

Keywords would basically be records with the following information, along with a history which would track what was changed, who changed it, and when they changed it:

 - keyword name: probably immutable, and used as a key

 - status, one of:

    - unknown: anything that hasn't ever been registered. These keywords would get the message in the first scenario, and offer to have the keyword registered. (I guess this isn't really a status any of the records could have, it's the implicit status of anything not in the registry so far.)

    - provisionally registered public: the initial state of a keyword when it is added to the system as a specced keyword.

    - provisionally registered proprietary: the initial state of a keyword when it is added to the system as a proprietary keyword.

    - conforming interoperable good practice: the keyword is good. It has a specification.

    - conforming but inadequately documented: the keyword is good, but it lacks a comprehensive specification.

    - conforming but proprietary: the keyword is intended just for private use; it has been deemed not useful for wide use, not implemented by widely used UAs, but used in private and not harmful.

    - non-conforming: the keyword has been examined and deemed poor practice; it has better alternatives.

 - brief description

 - spec link

 - suggestion text (the "Instead of..." text in the examples above), in a form that allows related metadata names to be recognised (so that the link to dcterms.audience can be pulled out). Blank for the first four statuses above, non-blank for the last three.

There would be a protocol for validators to speak, which would have to have at least the following API features:

 - Get a list of all the registered keywords, and their current state, and information about when the last change was made.
 - Add a keyword.
 - Notification that a new type has been added or the state of an existing type has been changed.

This is a proposal; I haven't set anything up to do this yet. I would be interested in knowing whether people see any problems with this, or see any better ways to address the underlying problems listed earlier (or indeed, if people think the problem description is incomplete or wrong).

We could do more clever things, for example, checking that the provided spec link actually contains the relevant keyword name. I'm not sure if it's worth it, but it might be interesting to do.

Another thing that we could look at is the metadata _values_. Right now, we're not doing any checking at all of the content="" attribute, but many of the keywords have limited value spaces that we could check. It's probably best to have the validators implement checking for the "conforming interoperable good practice" keywords explicitly, though, rather than trying to make this more generic.

The above is mainly focused on <meta>, but many extension points need serious work, not just <meta name>. <link rel>, for instance, but also Content-Type, URL schemes, etc. I think we should probably start with <meta name>, and move forward from there once we have more experience with that one, since that's the only one for which the WHATWG is alone in trying to document the name space. Once we've proven that the model works, assuming it works, we can approach other registries and show them our experience. (And if it doesn't work, then we can keep trying to work on <meta name> instead.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 24 January 2014 21:06:06 UTC