[whatwg] RDFa Problem Statement (was: Creative Commons Rights Expression Language)

On Mon, 25 Aug 2008, Manu Sporny wrote:
> 
> Web browsers currently do not understand the meaning behind human 
> statements or concepts on a web page. While this may seem academic, it 
> has direct implications on website usability. If web browsers could 
> understand that a particular page was describing a piece of music, a 
> movie, an event, a person or a product, the browser could then help the 
> user find more information about the particular item in question.

Is this something that users actually want? How would this actually work? 
Personally I find that if I'm looking at a site with music tracks, say 
Amazon's MP3 store, I don't have any difficulty working out what the 
tracks are or interacting with the page. Why would I want to ask the 
computer to do something with the tracks?

It would be helpful if you could walk me through some examples of what UI 
you are envisaging in terms of "helping the user find more information". 
Why is Safari's "select text and then right click to search on Google" not 
good enough? Have any usability studies been made to test these ideas? 
(For example, paper prototype usability studies?) What were the results?


> It would help automate the browsing experience.

Why does the browsing experience need automating?


> Not only would the browsing experience be improved, but search engine 
> indexing quality would be better due to a spider's ability to understand 
> the data on the page with more accuracy.

This I can speak to directly, since I work for a search engine and have 
learnt quite a bit about how it works.

I don't think more metadata is going to improve search engines. In 
practice, metadata is so highly gamed that it cannot be relied upon. In 
fact, search engines probably already "understand" pages with far more 
accuracy than most authors will ever be able to express.


You started by saying:

> Web browsers currently do not understand the meaning behind human 
> statements or concepts on a web page.

This is true, and I even agree that fixing this problem, letting browsers 
understand the meaning behind human statements and concepts, would open up 
a giant number of potentially killer applications. I don't think 
"automating the browser experience" is necessarily that killer app, but 
let's assume that it is for the sake of argument.

You continue:

> If we are to automate the browsing experience and deliver a more usable 
> web experience, we must provide a mechanism for describing, detecting 
> and processing semantics.

This statement seems obvious, but actually I disagree with it. It is not 
the case that providing a mechanism for describing, detecting, and 
processing semantics is the only way to let browsers understand the 
meaning behind human statements or concepts on a web page. In fact, I 
would argue it's not even the most plausible solution.

A mechanism for describing, detecting, and processing semantics (that is, 
new syntax, new vocabularies, new authoring requirements) fundamentally 
relies on authors actually writing the information using this new syntax.

If there's anything we can learn from the Web today, however, it is that 
authors will reliably output garbage at the syntactic level. They misuse 
HTML semantics and syntax uniformly (to the point where 90%+ of pages are 
invalid in some way). Use of metadata mechanisms is at a pitifully low 
level, and when used is inaccurate (Content-Type headers for non-HTML data 
and character encoding declarations for all text types are both widely 
wrong, to the point where browsers have increasingly complex heuristics to 
work around the errors). Even "successful" formats for metadata publishing 
like hCard have woefully low penetration.

Yet, for us to automate the browsing experience by having computers 
understand the Web, for us to have search engines be significantly more 
accurate by understanding pages, the metadata has to be widespread, 
detailed, and reliable.

So to get this data into Web pages, we have to get past the laziness and 
incompetence of authors.

Furthermore, even if we could get authors to reliably put out this data 
widely, we would have to then find a way to deal with spammers and black 
hat SEOs, who would simply put inaccurate data into their pages in an 
attempt to game search engines and browsers.

So to get this data into Web pages, we have to get past the inherent greed 
and evilness of hostile authors.


As I mentioned earlier, there is another solution, one that doesn't rely 
on either getting authors to be any more accurate or precise than they are 
now, one that doesn't require any effort on the part of authors, and one 
that can be used in conjunction with today's anti-spam tools to avoid 
being gamed by them and potentially to in fact dramatically improve them: 
have the computers learn the human languages themselves.

Instead of making all the humans of the world learn a computer language, 
or tools for writing that computer language, have the computers learn the 
human language. Not only does this not require us to solve a fundamentally 
unsolvable pair of problems (making humans not be lazy and making humans 
not be evil), but it also means that the computers would gain an 
understanding of all the legacy content that would otherwise never be seen 
by computers.

This kind of thing is already being done, for example with automated 
language translation where the software learns for itself how to translate 
text, or in search engines that extract information like byline dates and 
author credits, without the need for pages to have special markup, or in 
data clustering, where tools can examine large sets of data and sort the 
content into buckets based on topics without any special markup or user 
intervention. Similarly, developments in image processing are making huge 
steps, with tools that can derive depth mapping data from moving video, or 
that can convert a set of static 2D images to a 3D point field. It's clear 
that over the coming years, this will only get better and better.
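
To make the byline case concrete, here is a minimal sketch of pulling an 
author credit and a date out of unmarked text. It is only a toy heuristic 
built on two regular expressions; the function name, the patterns, and the 
sample page are all assumptions made for illustration, and real extraction 
systems are presumably far more sophisticated than this.

```python
import re

# Hypothetical heuristics, not any search engine's actual rules:
# "By <Capitalised Name>" for the author, "<day> <Month> <year>" for the date.
BYLINE = re.compile(r"\b[Bb]y\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)")
DATE = re.compile(
    r"\b(\d{1,2})\s+"
    r"(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+(\d{4})\b"
)


def extract_credits(text):
    """Guess the author and date from plain text with no special markup."""
    author = BYLINE.search(text)
    date = DATE.search(text)
    return {
        "author": author.group(1) if author else None,
        "date": " ".join(date.groups()) if date else None,
    }


# Made-up sample page text.
page = "RDFa and the Web\nBy Jane Example, 26 August 2008\nBrowsers do not..."
print(extract_credits(page))
```

The author of the page did nothing beyond writing the credit the way 
people already write credits; all of the work is on the consuming side.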


However, let's pretend for now that we can find a way to solve laziness 
and evilness and continue with your e-mail:

> If one understands web semantics to be an important part of the web's 
> future, the question then becomes, why RDFa? Why not Microformats?
> 
> While there are a number of technical merits that speak in favor of RDFa 
> over Microformats (fully qualified vocabulary terms

Why is this better?

> prefix short-hand via CURIEs

This is definitely not better.

> accessibility-friendly

How is not reusing HTML semantics better than using them? With the 
exception of the now-resolved <time> issue, it seems like Microformats has 
the better accessibility story.

> unified processing rules, etc)

Microformats could certainly benefit from a more consistent parsing model, 
but that can be obtained without going to RDFa.


> [...] this issue really boils down to one of centralized innovation vs. 
> distributed innovation.

I don't see what the syntax has to do with whether the formats 
are developed centrally or not. Nothing is stopping anyone from creating 
another Microformats-like organisation that does the same thing without 
going through the Microformats.org process. There could be millions of 
them, in fact. So long as they pick names that are suitably unique (e.g. 
URIs, or Java-like identifiers), or so long as they don't promote the use 
of their formats outside of their own site, I don't see a problem.

In fact, this is happening every day, with each author making up his own 
class values for use on his own site.
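
As a rough sketch of the naming point, assume two independently invented 
vocabularies that use a Java-like identifier and a URI as prefixes for 
their class values. Both prefixes, the helper function, and the sample 
class attribute below are hypothetical; nothing here comes from a real 
registry or specification.

```python
# Hypothetical vocabularies, invented for this example only.
STARGATE = "ch.example.stargate."        # Java-like identifier prefix
REVIEWS = "http://reviews.example/v1#"   # URI-style prefix


def understood_terms(class_attr, vocab_prefixes):
    """Return the class tokens that belong to vocabularies this consumer knows."""
    tokens = class_attr.split()
    return [t for t in tokens if any(t.startswith(p) for p in vocab_prefixes)]


# A class attribute mixing both vocabularies with an ordinary local class.
class_attr = "ch.example.stargate.episode http://reviews.example/v1#rating sidebar"

# A consumer that only knows the Stargate vocabulary ignores everything else.
print(understood_terms(class_attr, [STARGATE]))
```

Because each prefix is unique to whoever coined it, the two vocabularies 
cannot collide, and no central process had to approve either of them.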


> The Microformats community, and all communities like it, require a group 
> of people to come together, collaborate and create a standard vocabulary 
> to express ALL semantics.

Well, for any one person to do anything useful with the data on the Web, 
they have to have a core vocabulary (or set of vocabularies) that they 
understand. So a set of standard vocabularies to express all the semantics 
that that one person is interested in is needed, yes.

It doesn't have to be _all_ semantics, however. I might want to have a 
format for annotating Stargate analysis Web pages that me and my friends 
write, but so long as me and my friends agree on it it doesn't have to 
involve anyone else.


> A somewhat strained analogy would be bringing in representatives from 
> all of the cultures of the world and having them agree on a universal 
> vocabulary.

That's pretty much exactly what Unicode did. Or what we're doing with 
HTML. That doesn't seem untenable, it seems quite reasonable.

However, I'm not suggesting that it should be necessary.


> In short, RDFa addresses the problem of a lack of a standardized 
> semantics expression mechanism in HTML family languages. RDFa not only 
> enables the use cases described in the videos listed above, but all use 
> cases that struggle with enabling web browsers and web spiders 
> understand the context of the current page.

I'm not convinced the problem you describe can be solved in the manner you 
describe. It seems to rely on getting authors to do something that they 
have shown themselves incapable of doing over and over since the Web 
started. It seems like a much better solution would be to get computers to 
understand what humans are doing already.

Even if we ignore that, it doesn't seem like the above discussion would 
lead one to a set of requirements that would lead one to design a language 
like RDFa.


Thanks for the explanation, by the way. This is by far the most useful 
explanation of RDFa that I have ever seen.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 26 August 2008 03:02:29 UTC