[whatwg] RDFa Problem Statement

Ian,

I am addressing these questions both personally and as a representative
of our company, Digital Bazaar. I am certainly not speaking in any way
for the W3C SWD, RDFa Task Force, or Microformats community.

Ian Hickson wrote:
> On Mon, 25 Aug 2008, Manu Sporny wrote:
>> Web browsers currently do not understand the meaning behind human 
>> statements or concepts on a web page. While this may seem academic, it 
>> has direct implications on website usability. If web browsers could 
>> understand that a particular page was describing a piece of music, a 
>> movie, an event, a person or a product, the browser could then help the 
>> user find more information about the particular item in question.
> 
> Is this something that users actually want? 

These are fairly broad questions, so I will attempt to address them in a
general sense. We can go into the details at a later date if that would
benefit the group in understanding how RDFa addresses this perceived need.

Both the Microformats community and the RDFa community believe that
users want a web browser that can help them navigate the web more
efficiently. One of the best ways that a browser can provide this
functionality is by understanding what the user is currently browsing
with more accuracy than is possible today.

The Microformats community currently has 1,145 members on the
discussion mailing list and 350 members on the vocabulary
specification mailing list. The community has a common goal of making
web semantics a ubiquitous technology. It should also be noted that
the Microformats community ARE the users that want this technology.

There are very few commercial interests in that community - we have
people from all walks of life contributing to the concept that the
semantic web is going to make the browsing experience much better by
helping computers to understand the human concepts that are being
discussed on each page.

I should also point out that XHTML1.1 and XHTML2 will have RDFa
integrated because it is the best technology that we have at this
moment to address the issue of web semantics. You don't have to agree
with the "best technology" aspect of that statement, just that there
is some technology X that has been adopted to provide semantics in HTML.

The Semantic Web Deployment group at the W3C also believes this to be
a fundamental issue with the evolution of the Web. We are also working
on an HTML4 DTD to add RDFa markup to legacy websites. I say this not
to make the argument that "everybody is doing it", but to point out
that there seems to be fairly wide representation, both from standards
bodies and from web communities, that semantics are a requirement of
near-term web technologies.

> How would this actually work? 

I don't know if you mean from a societal perspective, a standards
perspective, a technological perspective or some other philosophical
perspective. I am going to assume that you mean from a "technological
perspective" and a "societal perspective" since I believe those to be
the most important.

The technological perspective is the easiest to answer - we have
working code, to the tune of nine RDFa parser implementations and two
browser plug-ins. Here's the implementation report for RDFa:

http://www.w3.org/2006/07/SWD/RDFa/implementation-report/#TestResults

To see how it works in practice, the Fuzzbot plug-in shows what we have
right now. It's rough, but demonstrates the simplest use case (semantic
data on a web page that is extracted and acted upon by the browser):

http://www.youtube.com/watch?v=oPWNgZ4peuI
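
To give a concrete flavor of the simplest use case, here is a minimal,
illustrative RDFa snippet of the kind Fuzzbot extracts (the URIs and
values below are placeholders, not markup from an actual demo page):

    <div xmlns:dc="http://purl.org/dc/elements/1.1/"
         about="http://example.org/track/42">
      <!-- These attributes turn display text into machine-readable
           statements: the resource has a title and a creator. -->
      <span property="dc:title">My Favorite Song</span> by
      <span property="dc:creator">Example Artist</span>
    </div>

The plug-in reads those attributes, builds the corresponding RDF
triples, and can then offer actions based on what the page is actually
about.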

All of the code to do this stuff is available under an Open Source
license. librdfa, one of the many RDFa parsers, is available here:

http://rdfa.digitalbazaar.com/librdfa/

and Fuzzbot, the semantic web processor, is available here:

http://rdfa.digitalbazaar.com/fuzzbot/

From a societal perspective, it frees up the people working on this
problem to focus on creating vocabularies. We're wasting most of our
time in the Microformats community arguing over the syntax of the
vocabulary expression language, which isn't what we want to talk
about - we want to talk about web semantics.

More accurately, RDFa relies on technologies that are readily accepted
on the web (URIs, URLs, etc.) to express semantic information. So, RDFa
frees up users to focus on expressing semantics by creating vocabularies
either through a standards body, an ad-hoc group, or individually.
Anybody can create a vocabulary, then you let the web decide which
vocabularies are useful and which ones are not. The ones that aren't
useful get ignored and the ones that are useful find widespread usage.
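
For example, minting an ad-hoc vocabulary requires nothing more than
choosing a URI that you control; the "music" prefix and URI below are
hypothetical:

    <div xmlns:music="http://example.org/vocab/music#"
         about="#track1" typeof="music:Track">
      <!-- Anyone can publish markup against this vocabulary today;
           adoption by other sites is what makes it a de facto
           standard. -->
      <span property="music:genre">Jazz</span>
    </div>

If other publishers find the vocabulary useful, they adopt it; if not,
it is simply ignored.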

From a societal perspective, this is how the web already operates, and
it is the defining feature that makes the web such a great tool for
humanity.

> Personally I find that if I'm looking at a site with music tracks, say 
> Amazon's MP3 store, I don't have any difficulty working out what the 
> tracks are or interacting with the page. Why would I want to ask the 
> computer to do something with the tracks?

Yes, that's absolutely correct. /You/ don't have difficulty working
out what the tracks are or interacting with the page... but /your
browser/ has a terrible time figuring out which text is talking about
a musical track, which button leads to a purchase of that track, and
which one is asking you to open your Amazon Gold Box reward. The
browser has no clue as to what is going on in the page.

"Computer, find more information on this artist."

"Computer, find the cheapest price for this musical track."

"Computer, find a popular blog talking about this album."

"Computer, what other artists has this artist worked with?"

"Computer, is this a popular track?"

Without some form of semantic markup, the computer cannot answer any
of those questions for the user.
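
As a sketch of what would make those questions answerable (the
vocabulary URI and values are illustrative, not a published standard):

    <div xmlns:audio="http://example.org/vocab/audio#"
         about="urn:example:track:1234" typeof="audio:Track">
      <!-- The explicit type and properties tell the browser that this
           region of the page describes a musical track by a named
           artist. -->
      <span property="audio:title">Track Title</span> by
      <span property="audio:artist">Artist Name</span>
    </div>

With the track's type, title and artist stated explicitly, each of the
questions above becomes a simple query against data the browser
already has.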

> It would be helpful if you could walk me through some examples of what UI 
> you are envisaging in terms of "helping the user find more information". 

A first cut at a UI can be found in the Fuzzbot demo videos:

http://www.youtube.com/watch?v=oPWNgZ4peuI
http://www.youtube.com/watch?v=PVGD9HQloDI

However, it's rough, and we have spent a total of four days on the UI
for expressing semantics on a page. Operator is another example of a
UI that does semantics detection and display today:

http://www.youtube.com/watch?v=Kjp4BaJOd0M

However, we believe that people will perform more manipulation of
data objects once semantic web technologies go mainstream. The Mozilla
Labs Aurora project shows the type of data manipulation that we'd like
to see in the future - where semantic data objects are exchanged as
freely as URLs are today:

Aurora - Part 1 - Collaboration, History, Data Objects, Basic Navigation
http://www.vimeo.com/1450211

Aurora - Part 4 - Personal Data Portability
http://www.vimeo.com/1488633

I'd be happy to write up some formal use cases and usability scenarios
if that is what would be required to discuss the ideas expressed in the
videos above in more detail.

> Why is Safari's "select text and then right click to search on Google" not 
> good enough? 

The browser has no understanding of the text in that approach. This is a
fundamental difference between a regular web page and one that contains
embedded semantics. The computer doesn't know how to deal with the
plain-text example in any particular way... other than asking the user
"What should I do with this amorphous blob of text you just highlighted?"

A page with semantics allows the browser to, at the very least, give
the user a choice of web services/sites that match the type of data
being manipulated. If it's music, highly frequented music sites are
the target of the semantic data object query. If the semantic data
object is director information about a particular movie, IMDB could be
queried in the background to retrieve information about the movie...
or the browser could display local show times based on a browser
preference for a movie theater selection service that the user favors.

"Computer, I'd like to go to see this movie this week, what times is it
playing at the Megaplex 30 that doesn't conflict with my events in my
Google Calendar?"

> Have any usability studies been made to test these ideas? 
> (For example, paper prototype usability studies?) What were the
> results?

Yes/maybe to the first two questions - there is frequent feedback to
Mike Kaply and us on how to improve the UIs for Operator and Fuzzbot,
respectively. However, the UI ideas are quite separate from the
fundamental concept of marking up semantic data. While we can talk
about the UIs and dream a little, it will be very hard to get to the
UI stage unless there is some way to express semantics in HTML5.

As for the results, those are ongoing. People are downloading and
using Operator and Fuzzbot. My guess is that they are being used
mostly as a curiosity at this point - no REAL work is getting done
using those plug-ins, since the future of web semantics is uncertain.
It always is until a standard is finalized and a use for that standard
is identified. These are the early days, however - nobody is quite
sure what the ideal user experience is yet.

What we do know and can demonstrate, however, is that web semantics
enable a plethora of new user experiences.

>> It would help automate the browsing experience.
> 
> Why does the browsing experience need automating?

In short, because users hate performing repetitive tasks and would
enjoy a browser that enables them to find the information they need
faster and with more accuracy than the current web is capable of
delivering.

No amount of polishing is going to turn the steaming pile of web
semantics that we have today into the semantic web that we know can
exist with the proper architecture in place.

>> Not only would the browsing experience be improved, but search engine 
>> indexing quality would be better due to a spider's ability to understand 
>> the data on the page with more accuracy.
> 
> This I can speak to directly, since I work for a search engine and have 
> learnt quite a bit about how it works.
> 
> I don't think more metadata is going to improve search engines. In 
> practice, metadata is so highly gamed that it cannot be relied upon. In 
> fact, search engines probably already "understand" pages with far more 
> accuracy than most authors will ever be able to express.

You are correct that more erroneous metadata is not going to improve
search engines. More /accurate/ metadata, however, IS going to improve
search engines. Nobody is going to argue that the system could not be
gamed - I can guarantee that it will be gamed.

However, that's the reality that we have to live with when introducing
any new web-based technology. It will be mis-used, abused and corrupted.
The question is, will it do more good than harm? In the case of RDFa
/and/ Microformats, we do think it will do more good than harm.

We have put a great deal of thought into anti-gaming strategies for
search engines with regards to the semantic web. Most of them follow the
same principles that Google, Yahoo and others use to prevent link-based
and keyword-based gaming strategies.

I don't understand what you mean by "search engines probably already
'understand' pages with far more accuracy than most authors will ever
be able to express". That line of logic assumes that an author doesn't
know what they're saying to the extent that a search engine does,
which seems to be a fallacy.

I think you were addressing the concept that "search technology is
fairly good at understanding the content in web pages", and I do agree
with you if that was your point. However, to say that "search
technology is better at understanding human minds" is a bit of a
stretch. Could you explain in more depth if this is a cornerstone of
your thinking, please?

>> Web browsers currently do not understand the meaning behind human 
>> statements or concepts on a web page.
> 
> This is true, and I even agree that fixing this problem, letting browsers 
> understand the meaning behind human statements and concepts, would open up 
> a giant number of potentially killer applications. I don't think 
> "automating the browser experience" is necessarily that killer app, but 
> let's assume that it is for the sake of argument.

We don't have to agree on the killer app - I don't want the discussion
to turn into that. It would be like trying to agree on the "killer app"
for the web. People rarely agree on killer apps and I'd like us to focus
on what we can realistically accomplish with web semantics in the next
6-12 months instead of discussing what we think the "killer app" is and
is not.

What I, and many others in the semantic web communities, do think is
that there are a number of compelling use cases for a method of semantic
expression in HTML. I think documenting those use cases would be a more
effective use of everybody's time. What are your thoughts on that strategy?

>> If we are to automate the browsing experience and deliver a more usable 
>> web experience, we must provide a mechanism for describing, detecting 
>> and processing semantics.
> 
> This statement seems obvious, but actually I disagree with it. It is not 
> the case the providing a mechanism for describing, detecting, and 
> processing semantics is the only way to let browsers understand the 
> meaning behind human statements or concepts on a web page. In fact, I 
> would argue it's not even the the most plausible solution.
> 
> A mechanism for describing, detecting, and processing semantics; that is, 
> new syntax, new vocabularies, new authoring requirements, fundamentally 
> relies on authors actually writing the information using this new syntax.

I don't believe it does - case in point: MySpace, Facebook, Flickr,
Google Maps, Google Calendar and LinkedIn. Those are all examples of
websites whose users never write a line of code, but instead use
interfaces to add people, places, events, photos, locations and a
complex web of links between each of those concepts.

Take our website for example:

http://bitmunk.com/media/6995806

Our artists did not have to mark up any of the RDFa on the page; it
is there because our software tools converted their registration
information into the Audio RDF vocabulary and used RDFa to embed
semantics in each of their artist pages.
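
The generated markup looks roughly like the following simplified
sketch (the "audio:" prefix stands in for the Audio RDF vocabulary URI
used on the real pages, and the titles are placeholders):

    <!-- emitted by our publishing tools, not typed by the artist -->
    <div xmlns:audio="http://example.org/vocab/audio#"
         about="http://bitmunk.com/media/6995806" typeof="audio:Album">
      <span property="audio:title">Album Title</span> by
      <span property="audio:contributor">Artist Name</span>
    </div>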

Neither RDFa nor Microformats force authors to use the new syntax or
vocabularies if they do not want to do so. If the author doesn't care
about semantics, they don't have to use the RDFa-specific properties.

> If there's anything we can learn from the Web today, however, it is that 
> authors will reliably output garbage at the syntactic level. They misuse 
> HTML semantics and syntax uniformly (to the point where 90%+ of pages are 
> invalid in some way). Use of metadata mechanisms is at a pitifully low 
> level, and when used is inaccurate (Content-Type headers for non-HTML data 
> and character encoding declarations for all text types are both widely 
> wrong, to the point where browsers have increasingly complex heuristics to 
> work around the errors). Even "successful" formats for metadata publishing 
> like hCard have woefully low penetration.

Yes, I agree with you on all points.

> Yet, for us to automate the browsing experience by having computers 
> understand the Web, for us to have search engines be significantly more 
> accurate by understanding pages, the metadata has to be widespread, 
> detailed, and reliable.

I agree that it has to be reliable, but not that the metadata has to
be widespread or that detailed. The use cases that are enabled by
merely having the type and title of a creative work are profound. I
can go into detail on this as well if the community would like to hear
about it.

> So to get this data into Web pages, we have to get past the laziness and 
> incompetence of authors.
> 
> Furthermore, even if we could get authors to reliably put out this data 
> widely, we would have to then find a way to deal with spammers and black 
> hat SEOs, who would simply put inaccurate data into their pages in an 
> attempt to game search engines and browsers.
> 
> So to get this data into Web pages, we have to get past the inherent greed 
> and evilness of hostile authors.

Like I mentioned earlier, we have thoughts on how to deal with Black Hat
SEOs and their ilk. No approach is perfect, but the strategies follow
what Google and Yahoo are already doing to prevent Black Hat SEOs from
ruining search on the web.

Getting past the inherent greed and evilness of hostile authors is
something that many standards on the web have to deal with - how is
HTML5 or XHTML2 going to deal with hostile authors? Black hat SEOs?
People that don't know better?

If the standards we are creating need to get past the inherent greed and
evilness of a small minority of the world, then we are all surely
doomed. It is a good thing that most of us are optimists here, otherwise
nothing like HTML5 would have ever been started in the first place!

When confronting these issues, the answer should never be "let's not go
down that road", but rather "let's see if we can produce far more
goodness than evilness, eliminating evilness if we can".

> As I mentioned earlier, there is another solution, one that doesn't rely 
> on either getting authors to be any more accurate or precise than they are 
> now, one that doesn't require any effort on the part of authors, and one 
> that can be used in conjunction with today's anti-spam tools to avoid 
> being gamed by them and potentially to in fact dramatically improve them: 
> have the computers learn the human languages themselves.

I most wholeheartedly agree - machine learning at both the web spider
side and the web browser side will revolutionize the web as we know it!

Ian, I don't know how long it will be before we get to accurate machine
learning of human concepts - but I'm positive that it will happen
eventually.

That's the holy grail that all of us are after, but the research that
we have seen over the past decade shows that we are far from achieving
it. While impressive in tightly focused cognitive domains, most
cognitive psychology models fail when applied to a general problem.
Numenta[1] is one such company at the forefront of machine learning...
we have looked at their approach when attempting to auto-categorize
musical styles and genres.

We (Digital Bazaar) are of the opinion that RDFa, and RDF in general,
is one method that will be used on the road toward machine learning.

Take a step back and think about the web as we know it today. It is
the largest collection of human knowledge that has ever existed. You
could say that it represents global culture; most everything that
humanity has ever discovered exists on what we know as the Web today.

The web is a giant knowledge repository... it is THE knowledge
repository for the human race. Any logically coherent cognitive model
MUST have a database of facts and premises to work from. At present,
we can both model and express these interrelationships as knowledge
graphs, which have a mapping to RDF graphs.
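
As a small illustration of that mapping, each human-readable statement
marked up with RDFa corresponds to an edge in an RDF graph (the URIs
below are placeholders):

    <!-- "This work was created by Jane Doe" becomes the RDF triple:
         <#work>  dc:creator  "Jane Doe" . -->
    <p xmlns:dc="http://purl.org/dc/elements/1.1/" about="#work">
      Created by <span property="dc:creator">Jane Doe</span>
    </p>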

The Microformats community, along with the RDFa community, is a very
pragmatic bunch. Similarly, I care about solving real problems using
real solutions that are within arm's reach.

If your argument is that we should use machine learning to solve this
problem, I posit that RDFa, and RDF in general, is one of the first
steps toward building a machine that can understand human concepts.

I have broken the second part of your e-mail out into a separate
document... and will respond to it shortly.

-- manu

[1] http://www.numenta.com/

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: Bitmunk 3.0 Website Launches
http://blog.digitalbazaar.com/2008/07/03/bitmunk-3-website-launches

Received on Tuesday, 26 August 2008 12:23:18 UTC