Re: Meaningless: towards a real-world web semantics observatory from Alex Russell on 2013-03-27 (www-tag@w3.org from March 2013)

From: Alex Russell <slightlyoff@google.com>
Date: Wed, 27 Mar 2013 15:25:43 +0000
To: Noah Mendelsohn <nrm@arcanedomain.com>
Cc: "www-tag@w3.org List" <www-tag@w3.org>, Tantek Çelik <tcelik@mozilla.com>
Message-ID: <CANr5HFU=hWs41EXdSWCME-JjtrNLhhdMHcLCM5_pe5DMThNnYg@mail.gmail.com>

On Wednesday, March 27, 2013, Noah Mendelsohn wrote:

> (Leaving off most of the cc: list to avoid cross-posted discussion.
> Nothing sensitive here -- feel free to forward if useful.)
>
> This looks very cool. Would it be easy/reasonable/in-the-spirit-of-the-thing
> to extend it to start gathering statistics on JSON, XML, various forms of RDF,
> RDFa, etc?


So the way it works is by analyzing nodes as you browse. If you can think
of a lightweight way to characterize an element as being in one of these
buckets, patches welcome!
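
As a rough illustration of what such a lightweight characterization might
look like, here is a sketch only. The bucket names, the attribute and MIME
type lists, and the `classify` helper are all assumptions made for this
example and are not taken from the extension; it works on a tag name plus an
attribute dictionary rather than on a live DOM node.

```python
# Hypothetical rules: none of these names or lists come from the extension.
RDFA_ATTRS = {"vocab", "typeof", "property", "resource", "prefix"}

# MIME types on <script> or <link> mapped to assumed bucket names.
DATA_MIME_BUCKETS = {
    "application/json": "json",
    "application/ld+json": "json-ld",
    "application/xml": "xml",
    "text/xml": "xml",
    "application/rdf+xml": "rdf-xml",
    "text/turtle": "turtle",
}


def classify(tag, attrs):
    """Return the sorted list of bucket names that one element falls into."""
    tag = tag.lower()
    attrs = {name.lower(): value for name, value in attrs.items()}
    buckets = set()

    # RDFa: any RDFa-specific attribute on the element counts.
    if RDFA_ATTRS.intersection(attrs):
        buckets.add("rdfa")

    # Embedded or linked data: look at the type attribute of <script>/<link>.
    if tag in ("script", "link"):
        mime = attrs.get("type", "").split(";")[0].strip().lower()
        if mime in DATA_MIME_BUCKETS:
            buckets.add(DATA_MIME_BUCKETS[mime])

    return sorted(buckets)


if __name__ == "__main__":
    print(classify("span", {"property": "dc:title"}))
    print(classify("script", {"type": "application/ld+json"}))
    print(classify("div", {"class": "vcard"}))
```

A check like this only needs the tag name and attributes, so it stays cheap
per node and records nothing about the page's actual content.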


> For that matter, it would also be >really< interesting to watch things
> like content that will be interpreted differently by the HTML5 sniffing
> rules than by following authoritative metadata.
>

How would we detect such a thing?


> In general, you seem to be on a very nice slippery slope of building a
> dashboard for the Web's data/content encoding. Are you interested in
> heading further down the slope?


Happy to extend this to gather whatever data can be both truly anonymous
and inexpensively characterized.


> Noah
>
> On 3/27/2013 9:58 AM, Alex Russell wrote:
>
>> Hi all,
>>
>> These lists host many debates about the semantics (or lack thereof) of
>> HTML. Good data that bears on these questions is often hard to come by.
>> This isn't anyone's fault per se, but it sure would be nice if we had
>> better data to use as the baseline for discussions about what should (and
>> shouldn't) be in HTML.next.
>>
>> In the interest of building such a corpus, I've created a small extension
>> to help gather information on the real-world semantics that users
>> encounter on the web: both semantic HTML and extensions to it like
>> Microformats, schema.org <http://schema.org> markup, and ARIA roles and
>> states. Crawlers miss a lot as they (generally) aren't running scripts
>> and interacting deeply with sites, so this anonymizing system attempts to
>> fill that gap by observing the semantic content of pages both when they
>> load and as they change over time.
>>
>> Why cross-post this so broadly? Because I need your help! If you think
>> evolving the web based on data is better than trying to do it without and
>> you happen to use Chrome as your browser, please install the extension:
>>
>> https://chrome.google.com/webstore/detail/meaningless/gmmhpelpfhlofjjolcegdddjadkmincn/details
>>
>> If you're a developer and use another browser, I'd love your help in
>> porting the extension to other platforms (FF, Safari, etc.):
>>
>> https://github.com/slightlyoff/meaningless
>>
>> If you're interested in the data, a sparse reporting front-end is
>> currently in place:
>>
>> http://meaningless-stats.appspot.com/global
>>
>> Help is needed to analyze the data in more meaningful ways, visualize it,
>> etc. Filing tickets and submitting pull requests are the easiest ways to
>> help: https://github.com/slightlyoff/meaningless/issues
>>
>> Thanks for your help and attention.
>>
>

Received on Wednesday, 27 March 2013 15:26:21 UTC