WebIDL crawler

Hi,

As reported separately to the TAG [1], François (cc'd) and I have
recently been working on a set of tools that crawl WebIDL data from
the various Web Platform specifications that use it.

More specifically, these tools try to extract WebIDL fragments from the
latest identified version of each spec, and at the same time extract
information on the normative references that these specs declare. The
WebIDL fragments are then parsed into a JSON AST using webidl2.js.
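To give a rough idea of the extraction step, here is a minimal Python
sketch. It assumes that WebIDL fragments are marked up as <pre> elements
carrying an "idl" class, which is a common convention but not a
universal one; the actual tools may well use a different strategy, and
the IdlExtractor name is made up for this illustration.

```python
from html.parser import HTMLParser


class IdlExtractor(HTMLParser):
    """Collects the text content of <pre class="idl"> elements.

    Hypothetical helper: assumes IDL fragments are flagged with an
    "idl" class and that such <pre> elements are not nested.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []
        self._in_idl = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "pre" and "idl" in classes:
            self._in_idl = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_idl:
            self.fragments.append("".join(self._buffer).strip())
            self._in_idl = False

    def handle_data(self, data):
        # Inline markup (<dfn>, <a>, ...) is dropped; only text is kept.
        if self._in_idl:
            self._buffer.append(data)


def extract_idl(html):
    parser = IdlExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


sample_html = (
    '<p>Intro</p>'
    '<pre class="idl">interface <dfn>Foo</dfn> { };</pre>'
    '<pre class="example">not IDL</pre>'
)
idl_fragments = extract_idl(sample_html)
```

Each extracted fragment would then be handed to a WebIDL parser to
obtain the AST.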

This lets us build a complete map of WebIDL usage across the
specifications of the Open Web Platform (OWP), which in turn has
enabled us to run various analyzers:
* we can run diagnostics on specs to ensure their normative references
are consistent with the various WebIDL "names" they import
* we can detect duplicate or missing definitions
* we can easily detect specs with invalid WebIDL fragments
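
The first two analyzers can be sketched as follows in Python. This is
only an illustration under stated assumptions: the shape of the "crawl"
structure, the spec names and the function names are all made up here
and do not reflect the actual data model of the tools.

```python
from collections import defaultdict

# Hypothetical crawl result: for each spec, the names it defines, the
# names it uses, and the specs it lists as normative references.
crawl = {
    "spec-a": {"defines": {"Foo", "JSON"}, "uses": {"Bar"}, "refs": {"spec-b"}},
    "spec-b": {"defines": {"Bar", "JSON"}, "uses": {"Baz"}, "refs": set()},
    "spec-c": {"defines": {"Qux"}, "uses": {"Foo"}, "refs": set()},
}


def definers(crawl):
    """Maps each name to the set of specs that define it."""
    index = defaultdict(set)
    for spec, data in crawl.items():
        for name in data["defines"]:
            index[name].add(spec)
    return index


def duplicate_definitions(crawl):
    """Names defined by more than one spec."""
    return {n: s for n, s in definers(crawl).items() if len(s) > 1}


def missing_definitions(crawl):
    """Names used somewhere but defined nowhere in the crawl."""
    index = definers(crawl)
    missing = defaultdict(set)
    for spec, data in crawl.items():
        for name in data["uses"]:
            if name not in index:
                missing[name].add(spec)
    return dict(missing)


def missing_references(crawl):
    """Names a spec imports without referencing any spec defining them."""
    index = definers(crawl)
    report = defaultdict(set)
    for spec, data in crawl.items():
        for name in data["uses"] - data["defines"]:
            defined_in = index.get(name, set())
            if defined_in and not (defined_in & data["refs"]):
                report[spec].add(name)
    return dict(report)
```

With the made-up data above, "JSON" is reported as a duplicate, "Baz"
as missing, and spec-c is flagged for using "Foo" without referencing
spec-a. A real analyzer would also need to ignore built-in WebIDL types.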

A question that emerged when we built the analyzer for the first point
was which WebIDL names are "exported" by a spec (i.e. reusable in other
specs). Interface names obviously are, and there are plenty of reasons
for dictionaries and enums to be; it wasn't initially as clear to me
that typedefs were, but usage clearly shows they are (e.g.
EventHandler). One consequence is that the very generic "JSON" name is
defined (differently!) as a typedef in two different specs. I'm not
sure whether callback names are necessarily meant to be part of the
exported names.
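
For illustration, here is one way the notion of exported names could be
expressed in Python. The AST shape is a simplified assumption (a list of
definitions with "type", "name" and an optional "partial" flag), and
treating callbacks as an opt-in category simply mirrors the open
question above; none of this is taken from the WebIDL spec.

```python
# Definition types assumed to export their name to other specs.
EXPORTED_TYPES = {"interface", "dictionary", "enum", "typedef"}
# Types whose status is unclear; see the question about callbacks.
UNDECIDED_TYPES = {"callback"}


def exported_names(definitions, include_undecided=False):
    """Returns the sorted names a spec exports, given its definitions.

    Partial definitions extend a name defined elsewhere, so they are
    not counted as exporting it (an assumption of this sketch).
    """
    kinds = set(EXPORTED_TYPES)
    if include_undecided:
        kinds |= UNDECIDED_TYPES
    return sorted({
        d["name"]
        for d in definitions
        if d.get("type") in kinds and "name" in d and not d.get("partial")
    })


# Made-up definitions for a hypothetical spec.
sample_definitions = [
    {"type": "interface", "name": "Foo"},
    {"type": "interface", "name": "Window", "partial": True},
    {"type": "typedef", "name": "FooHandler"},
    {"type": "callback", "name": "FooCallback"},
    {"type": "enum", "name": "FooState"},
]
```

Calling exported_names(sample_definitions) yields the interface, typedef
and enum names only; passing include_undecided=True adds "FooCallback".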

I think it would make sense for the WebIDL spec itself to be more
explicit about the notion of exported names, if only to make it clearer
to spec editors which names they need to protect from clashes.

For an example of the reports the tool can produce, see one instance
produced recently:
https://github.com/tidoust/reffy/wiki/Report-per-anomaly-(20160711)
(there are known false positives)

On top of that, we also built a more general explorer of WebIDL usage
across specifications:
https://dontcallmedom.github.io/webidlpedia

This explorer lists all the defined WebIDL names (interfaces,
dictionaries, typedefs, enums), with information on which specs define
them and which specs make use of them.

An interesting way to look at these lists is the one sorted by
"popularity" (i.e. highest level of usage by other specs):
https://dontcallmedom.github.io/webidlpedia/?full=popularity
It might be particularly interesting to explore in more depth the
patterns that lead to some dictionaries and enums having 0 usage.
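
As a rough sketch of what "popularity" means here, the following Python
counts, for each name, the specs that use it other than the ones that
define it. The input structures, names and specs are made up for the
example, and the explorer may compute its ranking differently.

```python
def popularity(definers, users):
    """Ranks names by the number of *other* specs that use them.

    definers: name -> set of specs defining the name
    users:    name -> set of specs using the name
    """
    scores = {}
    for name, defining_specs in definers.items():
        others = users.get(name, set()) - defining_specs
        scores[name] = len(others)
    # Most used first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def unused(ranking):
    """Names that no other spec uses."""
    return [name for name, score in ranking if score == 0]


# Made-up data for three hypothetical specs.
sample_definers = {
    "AlphaHandler": {"spec-a"},
    "BetaOptions": {"spec-b"},
    "GammaState": {"spec-c"},
}
sample_users = {
    "AlphaHandler": {"spec-a", "spec-b", "spec-c"},
    "BetaOptions": {"spec-b"},
    "GammaState": {"spec-a"},
}
ranking = popularity(sample_definers, sample_users)
```

In this example "BetaOptions" is only used by the spec that defines it,
so it ends up in the 0-usage list.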

A similar view shows the list of strings that are used as enum values
across specifications:
https://dontcallmedom.github.io/webidlpedia/?enums=popularity
That view could hopefully help bring more consistency to these names
across specifications.
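
To illustrate how such a view could help with consistency, here is a
small Python sketch that counts enum values across specs and groups
values that differ only in case or punctuation. The AST shape (enum
definitions with a "values" list of strings) is a simplifying
assumption, and the normalization rule is just one possible heuristic.

```python
import re
from collections import defaultdict


def enum_value_usage(asts_by_spec):
    """Maps each enum value string to the set of specs that use it."""
    usage = defaultdict(set)
    for spec, definitions in asts_by_spec.items():
        for definition in definitions:
            if definition.get("type") == "enum":
                for value in definition.get("values", []):
                    usage[value].add(spec)
    return usage


def ranked(usage):
    """Values sorted by number of specs, most used first."""
    counts = [(value, len(specs)) for value, specs in usage.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


def spelling_variants(usage):
    """Groups values that are equal once case and punctuation are dropped."""
    groups = defaultdict(set)
    for value in usage:
        key = re.sub(r"[^a-z0-9]", "", value.lower())
        groups[key].add(value)
    return {key: values for key, values in groups.items() if len(values) > 1}


# Made-up enums from three hypothetical specs.
sample_asts = {
    "spec-a": [{"type": "enum", "name": "AState", "values": ["closed", "open"]}],
    "spec-b": [{"type": "enum", "name": "BState", "values": ["open", "in-progress"]}],
    "spec-c": [{"type": "enum", "name": "CState", "values": ["inprogress"]}],
}
enum_usage = enum_value_usage(sample_asts)
```

Here "open" comes out as the most widely used value, while
"in-progress" and "inprogress" are flagged as spelling variants of one
another.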

There are obviously many other ways the collected data could be
exploited, for instance by exploring which specs make use of which
extended attributes.
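
As an example of that last idea, a Python sketch along these lines
could index extended attributes by spec. It assumes a simplified AST in
which definitions and their members carry an "extAttrs" list of objects
with a "name" field; the real AST may differ in its details.

```python
from collections import defaultdict


def extended_attribute_usage(asts_by_spec):
    """Maps each extended attribute name to the specs that use it."""
    usage = defaultdict(set)

    def walk(node, spec):
        for attr in node.get("extAttrs", []):
            usage[attr["name"]].add(spec)
        # Members (attributes, operations, ...) can carry their own.
        for member in node.get("members", []):
            walk(member, spec)

    for spec, definitions in asts_by_spec.items():
        for definition in definitions:
            walk(definition, spec)
    return dict(usage)


# Made-up definitions from two hypothetical specs.
sample_specs = {
    "spec-a": [{
        "type": "interface",
        "name": "Foo",
        "extAttrs": [{"name": "Exposed"}],
        "members": [
            {"type": "attribute", "name": "bar",
             "extAttrs": [{"name": "SameObject"}]},
        ],
    }],
    "spec-b": [{
        "type": "interface",
        "name": "Baz",
        "extAttrs": [{"name": "Exposed"}],
        "members": [],
    }],
}
ext_attr_usage = extended_attribute_usage(sample_specs)
```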

It might be of particular interest for the WebIDL spec designers to
exploit this data to evaluate usage of specific patterns, and determine
if some of them deserve further formalization in WebIDL.

The tools are available at:
https://github.com/tidoust/reffy
https://github.com/dontcallmedom/webidlpedia

François and I will likely keep working on these tools time permitting;
we also welcome pull requests on the repos, and feedback on possible
future directions that would be useful to this group.

Thanks,

Dom & François

1. https://lists.w3.org/Archives/Public/www-tag/2016Jul/0003.html

Received on Friday, 15 July 2016 08:00:50 UTC