Re: RDFa and Web Directions North 2009

This is a bulk reply to several e-mails on this thread. I apologise for 
its length.

On Fri, 13 Feb 2009, Ben Adida wrote:
> Ian Hickson wrote:
> > > So, can we look at the use cases as a whole?
> > 
> > In a word, no.
> > 
> > A very common architectural mistake that software engineers make is 
> > looking at five problems, seeing their commonality, and attempting to 
> > solve all five at once. The result is almost always a solution that is 
> > sub-par for all five problems.
> 
> I think you're taking a good piece of advice -- "don't over-generalize" 
> -- to the other and equally dangerous extreme. You want to refuse to 
> look for *any* common patterns?

Oh don't get me wrong, if there are solutions that have commonalities, 
then obviously we should reuse solutions where possible. For example, we 
had several use cases -- offline Hotmail, offline Google Spreadsheets, 
offline Flickr -- and we came up with a single solution that covers all of 
these. But we evaluated the solution against each case independently.

In other words, to get a good result, instead of:

 1. Find problems.
 2. Extract commonalities of problems.
 3. Adopt a solution that solves the commonalities.

...the process needs to be:

 1. Find problems.
 2. Propose solutions that solve one or more of those problems.
 3. Evaluate the solutions against each problem.
 4. If a solution is found that addresses many of the problems, adopt it.

That is, the use cases have to be used both at the start of the process 
_and_ at the end of the process. Otherwise, we risk ending up with 
something that doesn't actually solve any of the use cases we were 
attempting to solve.

The reason I bring this up is that I have noticed that whenever I am 
talking about RDF with someone, the conversation tends to go:

   me: "Tell me a problem that RDF solves."
other: "RDF solves X!"
   me: "Wouldn't Y be a better solution for X?"
other: "Well, X was a bad example."

I don't know that I've ever heard a _good_ example!


> Consider, for example, Creative Commons. We can't afford to get everyone 
> to build Creative Commons support into their tools if that involves 
> buying into a CC-specific language and toolset. Neither can Bitmunk with 
> respect to music. But if we come together, use the same markup and 
> parsing technology, and even share relevant pieces of our respective 
> vocabularies, then it becomes tractable. The work that Manu does 
> benefits me, and vice-versa.

IMHO, the syntax and data model are the easy part. If you had trouble 
getting adoption of your vocabulary with a trivial dedicated syntax, I 
don't think you're likely to have any more luck now that your vocabulary 
comes with a general-purpose data model and half a dozen different 
syntaxes. But your mileage may vary, I guess.

This line of argumentation (that small problems should share solutions so 
as to leverage each others' work) is not convincing to me.


> Using the same principle, we also future-proof our work. At CC, we're 
> not sure what other fantastic media will appear next. 3D video? Full 
> virtual reality? Who knows. But when those come out, with their custom 
> attributes to describe properties we don't even know about yet, we'll 
> still be able to use the same RDF and RDFa to express their licensing 
> terms, and the same parser to pick things up.

Personally I prefer to address today's problems today and tomorrow's 
problems tomorrow, so that as we meet new problems, they are addressed 
with surgical precision, rather than trying to come up with systems that 
can solve everything forever. But again, to each his own.

This line of argumentation (that we should design systems that solve all 
future needs, whether foreseeable or not) is also not convincing to me.


On Fri, 13 Feb 2009, Ben Adida wrote:
> 
> [...] we're not asking browsers to implement any specific features other 
> than make those attributes officially available in the DOM.

You presumably do want some user agents somewhere at some time to do 
something with these triples, otherwise what's the point? Whether this is 
through extensions, or through browsers in ten years when the state of the 
art is at the point where something useful can be done with any RDFa, or 
through search engines processing RDFa data, there has to be _some_ user 
agent somewhere that uses this data.


> In fact, I would say the cost of doing it *differently* is higher for 
> HTML5, too, since none of our test suite, none of our parsing rules, 
> none of our existing work could be reused. Currently, as Mark has 
> mentioned, a *lot* of our work can be easily reused by HTML5, including 
> our test suite.

I agree that if there are problems that are best solved through RDFa, 
it would make sense to use RDFa as is, and not using it would be silly.

Of course, it may be that there are no such problems, or that such 
problems aren't compelling enough to need to solve them in HTML5, or that 
all these problems that are solved through RDFa are in fact a subset of 
the problems that can all be solved using a common feature. In these 
cases, reusing RDFa wouldn't make sense -- we'd want to (respectively) not 
use anything, not use anything yet, or use something else from which one 
could obtain triples as well as other things.


On Sat, 14 Feb 2009, Kjetil Kjernsmo wrote:
> On Saturday 14 February 2009, you wrote:
> > 
> > Please don't take these questions as personal attacks. I honestly am 
> > trying to find out how RDF and RDFa are to work in HTML5, to see if 
> > they make sense to add.
> 
> Sure! Skepticism is sound, but you have to be aware that the questions 
> you raise have all been discussed at length elsewhere, and sometimes all this 
> advocacy seems to be a waste of time, time that would be better spent 
> actually writing code (and stick to XHTML for the web page needs) to 
> prove the case by actual running code. Thus, I will be very brief.

The problem is that every time I ask these questions, I get that reply -- 
we've answered these questions long ago, so the answers will be brief. 
Unfortunately this doesn't really end up answering my questions.


> > > > Note that you can already "ask questions" on the Web. For example, 
> > > > I just searched for "which country napolean", which is neither the 
> > > > right question nor correctly spelt (though that wasn't 
> > > > intentional), and Google answered:
> > >
> > > Well, you just proved that google sucks, didn't you? It couldn't get 
> > > the answer to that basic question right...
> >
> > Would a system based on RDF or RDFa give a better answer to the same 
> > question? How? Is there a system running somewhere that can 
> > demonstrate this? Does it require all data to be marked up as RDFa?
> 
> I suggested a SPARQL query builder for KDE yesterday. It would be very 
> good at cases as this.

A SPARQL query builder does nothing for most people. It is not a 
substitute for a freeform query UI.

How would a system based on RDF or RDFa give a better answer to the same 
freeform question?

Would such systems require all data to be marked up as RDFa or other RDF 
variants? I assume a SPARQL query builder can't do free-form searches 
across the Web corpus -- what should happen if the RDF stores of the world 
don't include the data you're looking for, or have contradictory data?

These aren't rhetorical questions. Without real, complete answers to these 
questions, the problem isn't solved.

I'm not trying to be difficult here. It would be far easier for me to just 
say "why yes, you're right, RDFa solves this problem" and just ignore all 
these problems. But I wouldn't be doing my job if I did that.


> > > Another example, I'd like to have the latest version of the SPARQL 
> > > Update spec, and I expect to get it if I ask for "sparql update".
> >
> > How does RDF or RDFa solve this problem?
> 
> dct:date

I beg your pardon?


> > Do we have reason to believe that it is more likely that we will get 
> > authors to widely and reliably include such relations than it is that 
> > we will get high quality natural language processing? Why?
> 
> Yeah. Because high quality natural language processing is very unlikely 
> to ever happen. It will remain a niche auxiliary system, and something 
> that is only half-decent for English.

IMHO, the odds of us getting authors to widely and reliably include such 
relations are zero, which is even less likely than "very unlikely".

Note that natural language processing today is the solution we are using 
to the problem of "finding information" (qv. Google, Yahoo! Search, etc). 
Sure, it's extremely primitive, but it works better than structured data 
analysis does today, and it doesn't rely on the authors really marking 
anything up -- most of the semantics of HTML documents are basically 
ignored by search engines. This is necessary because authors have 
difficulty doing the most basic things in HTML, such as using <h1> 
correctly, or using <table> only for tabular data, etc.

What makes you think we can get authors to widely and reliably include 
data relations?


> > How would an RDF/RDFa system deal with people gaming the system?
> 
> trust networks.

Ok, so let me describe a system I might expect to see in an ideal RDF- and 
RDFa-enabled world. Stop me if I go wrong.

In this world, Wikipedia, instead of being a presentational wiki, is a 
wiki with relationships marked up, so that, for instance, if I visit the 
page on Paul Desmond, my browser can tell that every instance of the name 
"Paul Desmond" on that page is actually a Person, described by that page, 
and further can tell that this Person has released Audio CDs, including 
one with the track "Caravan", recorded in 1969. It knows all this, so that 
after I have visited this page, I can later ask my browser (using some 
mechanism that I won't describe here) for it to give me the name of the 
person who played "Caravan" and the date that they played it.

It knows that if I ask this question, it can trust Wikipedia, because I 
trust Jimmy Wales, and Jimmy Wales trusts content on Wikipedia. And thus 
it can tell me that the answer is "Paul Desmond".

I also trust Amazon, and I visit the page for the MP3 for Caravan and it 
includes an assertion regarding the price. So later, when I ask the 
browser for the name and price of the track that was released in 1969 that 
was played by Desmond, it can tell me what it is: Caravan, $2.99.

Or so I think.

Unbeknownst to me, or Jimmy Wales, or Amazon, at the time I visited 
Wikipedia, there was another assertion on the Paul Desmond page. That 
assertion was written by another user, and was reverted shortly after I 
visited the page, but it was present while I visited. That assertion said 
that there was another track by Paul Desmond, released in 1969 (and indeed 
every year from 1900 to 2009), called "Viagra", which costs $0.99.

What stops my browser from telling me that this is the answer?
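
To make the scenario concrete, here is a minimal sketch of it, assuming a 
toy in-memory triple store and a trust list that works at the granularity 
of whole sites. Every name in it (Triple, TRUSTED, query, the "track:" 
identifiers, the predicate names) is hypothetical and is not taken from 
any real RDF toolkit; it only illustrates why site-level trust does not 
filter out the bad assertion.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: object
    source: str  # site the assertion was read from (hypothetical field)


# Assumption: trust is granted per site, as in the scenario above.
TRUSTED = {"wikipedia.org", "amazon.com"}

store = [
    Triple("track:caravan", "performer", "Paul Desmond", "wikipedia.org"),
    Triple("track:caravan", "year", 1969, "wikipedia.org"),
    Triple("track:caravan", "title", "Caravan", "wikipedia.org"),
    Triple("track:caravan", "price", 2.99, "amazon.com"),
    # Assertions added by another user and later reverted, but present
    # on the page at the time it was visited (only 1969 is modelled here).
    Triple("track:spam", "performer", "Paul Desmond", "wikipedia.org"),
    Triple("track:spam", "year", 1969, "wikipedia.org"),
    Triple("track:spam", "title", "Viagra", "wikipedia.org"),
    Triple("track:spam", "price", 0.99, "wikipedia.org"),
]


def query(triples, performer, year):
    """Return (title, price) pairs for tracks by `performer` in `year`,
    considering only triples whose source site is trusted."""
    trusted = [t for t in triples if t.source in TRUSTED]

    def has(subject, predicate, value):
        return any(
            t.subject == subject and t.predicate == predicate and t.obj == value
            for t in trusted
        )

    def get(subject, predicate):
        return next(
            t.obj for t in trusted
            if t.subject == subject and t.predicate == predicate
        )

    subjects = sorted({t.subject for t in trusted})
    return [
        (get(s, "title"), get(s, "price"))
        for s in subjects
        if has(s, "performer", performer) and has(s, "year", year)
    ]


answers = query(store, "Paul Desmond", 1969)
```

In this sketch the spam assertions pass the trust check because they came 
from a trusted site, so "Viagra" comes back next to "Caravan". A real 
system would need some finer-grained notion of trust than this, and it is 
that mechanism I am asking about.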


This isn't farfetched. There is a multibillion dollar industry doing this 
24/7, writing software that automates such spamming. Such software is 
actually on the leading edge of massively parallelised programming, with 
clusters of hundreds of thousands of nodes (computers owned by the 
unsuspecting people who can't even use <h1> correctly) posting on blogs, 
forums, wikis, etc, non-stop.

Right now it's not such a problem for the end user, because search engine 
vendors spend millions and millions of dollars every year to combat the 
problem with their own huge computational power.


> > How would an RDF/RDFa system deal with the problem of the _questions_ 
> > being unstructured natural language?
> 
> See my tuberculosis use case. You make the false assumption that the 
> user needs to formulate a question.

The assumption I'm making is that the user has a question, and that they 
want an answer to it. But I'm willing to accept that there might be other 
interfaces -- what are they? The tuberculosis example didn't include 
sample UI. Could you give me an example of what the aforementioned users 
are going to see?


> > How would an RDF/RDFa system deal with data provided by companies that 
> > have no interest in providing the data in RDF or RDFa? (e.g. companies 
> > providing data dumps in XML or JSON.)
> 
> I think we need something I've called GRLLA, i.e. the guerilla version 
> of GRDDL ;-)

If we're assuming that there'll be a way to convert from dedicated formats 
to RDF, why bother with RDFa? Why not just have people output the data in 
their most convenient format, and then convert from that? That way RDF 
isn't made special, and if an even better data model comes along, people 
can convert from the native format to that one too.


> > How would an RDF/RDFa system deal with companies that do not want to 
> > provide the data free of charge?
> 
> That's OK. As long as there are links to something that the rest of the 
> world likes, this is not a problem, it is a good thing.

I don't understand your answer.


> > How would an RDF/RDFa system deal with companies that want to track 
> > per-developer usage of their data?
> 
> Wrong question, developers as we see them today will be an anachronism, 
> that's part of the fun.

I don't understand your answer. Where are the developers going to go?


> > > > How does RDFa solve the problem that they have that I described 
> > > > but that you cut from the above quotes, namely that they want to 
> > > > track usage on a per-developer basis?
> > >
> > > OK, it doesn't.
> >
> > If the problem is that we want price data out of Amazon pages, and 
> > RDFa doesn't solve the problem to Amazon's satisfaction, then why is 
> > RDFa being put forward as a solution?
> 
> I think Amazon will realise that they do not act in their own best 
> interest, though it may take some time.

Do you not see the parallel here between what you just said and what you 
say you are hearing from non-RDF people? You can't tell someone that they 
are wrong and that your way is the right way and that they'll come to 
realise it eventually. You have to actually listen to their needs, and 
then actually address them. Just ignoring their needs and saying "they'll 
realise they're wrong in due course" is not going to make them adopt your 
solution. It'll just sideline your work and make it irrelevant.


> > What did you do with the genres once you had them all aligned with 
> > union, intersection, and same-as relationships? That doesn't seem like 
> > the most useful structure for data to be exposed to a random user.
> 
> We did a bit of reasoning, constructed a graph from it where all the 
> relations between genres are expressed, then found that we didn't 
> have the hardware to do what we wanted, so we chopped it up to a tree 
> again. So the user has a nice 2D tree on a ball that can be rotated at 
> 30 frames per second. With better hardware, we want to do a 3D rendering 
> of it.

Holy mackerel. And all this was easier than the few lines of code to read 
the ID3 tags out of the music files?! I certainly wouldn't know how to do 
all of the above in a dozen lines of code!


> > > I want to provide pointers to detailed descriptions of the things I 
> > > mention in what I write.
> >
> > Isn't an <a href=""> suitable for this already?
> 
> Nope, this should be self-evident.

Providing pointers to things is what <a href=""> does. So it's not really 
evident at all that it isn't suitable for providing pointers, no. Could 
you elaborate on this?


> > > I want to be able to express myself succinctly with pointers to 
> > > other places on the Web where descriptions of the people, places, 
> > > subject matter can be obtained.
> >
> > Again, <a href=""> seems to have solved this problem well until now, 
> > why does it no longer solve the problem?
> 
> I really don't understand that you cannot see the problem with how this 
> is done today...

How is what we have today broken?

I honestly don't understand how if you have a document, and you want to 
point to other documents, <a href=""> doesn't do what you want.


> > > Note, I don't want to point them to another chunk of blurb, I want 
> > > to point my readers to a page that has the sole function of 
> > > describing the aforementioned entities via their attributes and 
> > > relationships.
> >
> > Why?
> 
> Oh, please... This is the kind of questions that gives people a strong 
> impression that talking to you is a total waste of time...

I'm sorry if that is the impression I give. I haven't been in the RDF 
world, so things that may seem obvious to you are really not obvious to 
me. Could you humour me and explain why you would want to do this?


> > > As a page reader:
> > > I want to have access to the entities behind the blurb. Today I can 
> > > see an opaque but nice looking Web page, I can also see the markup 
> > > behind the page, but I cannot easily discern the description of 
> > > entities mentioned in a Web Page.
> >
> > What good are these entities? What is my dad supposed to do with them?
> 
> The same thing that the people talking with our librarians are doing 
> with them, actually find the information they look for.

Could you be more specific? Are these HTML files? PNG files? RDF triples? 
Is the user expected to store them? Read them? Print them and go to the 
library with them?


> > If the above represents the state of the art for RDF or RDFa, then we 
> > are a _long_ way from RDF being ready to be exposed to regular users.
> 
> Yeah... Well, it is a question of how you'd expose it... It is the data, 
> not the model that is interesting to expose to the user right now.

If the state of the art does not yet make this data actually usable by the 
user, then we shouldn't expose it, or we will permanently break the 
feature and make it unusable.

Here's an example of this actually happening:

HTML4 has a "longdesc" attribute on <img> elements that was intended to 
allow the author to provide a URI to a Web page that described the image.

The state of the art in accessibility tools wasn't really ready for this, 
and browsers didn't do anything with it. However, there was evangelisation 
for people to use it.

People didn't know how to use it, but knew they should use it. They had no 
feedback loop to determine if they were using it correctly. They ended up 
uniformly using it wrongly (on 99.9987% of pages it is either missing or 
used wrong, according to our data [1]). This made the feature essentially 
useless, because once the tools supported the feature, users were actually 
worse off for using it, even though it was intended to help them.

[1] http://blog.whatwg.org/the-longdesc-lottery

If the state of the art is not ready for RDF to be used widely, then we 
shouldn't expose it yet, because otherwise we will poison the well and 
make it unusable.


> > People have a hard enough time (as you point out!) doing simple 
> > natural language queries where all they have to do is express 
> > themselves in their own native tongue.
> >
> > Asking them to understand "yago:BattlesOfTheNapoleonicWars" or 
> > "dbpedia-owl:MilitaryConflict" isn't going to fly.
> 
> Actually, this is an easier problem than you might think, it just 
> hasn't had any attention yet. It is easy enough to attach an rdfs:label 
> to those URIs, in any language, which would make it a lot more friendly.

IMHO these kinds of problems need to be resolved _before_ we unleash RDF 
onto the world in HTML.

In practice, I fear you'll find localisation of a huge number of terms 
like the above is a far, far harder problem than just attaching an 
rdfs:label as you suggest.
 

On Fri, 13 Feb 2009, Jeremy Carroll wrote:
> 
> e.g. could these additional attributes be included in a script data block element?
> 
> <script type="text/rdfa">
>  about="http://example.org"
>  datatype="xsd:int"
> </script>

You could actually just include raw RDF/XML or n3 or any other RDF 
serialisation straight into the <script> block, no need for RDFa at that 
point. This is allowed and possible in HTML5 today. If it turns out that 
RDFa or some other data annotation mechanism isn't added to HTML5, this 
would be the (suboptimal, certainly) alternative.
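
As a rough illustration of how a consumer could get at such a block, here 
is a minimal sketch using Python's standard html.parser module. The 
"text/turtle" type value, the sample page, and the names 
DataBlockExtractor and extract_blocks are assumptions made for the 
example; nothing here is mandated by HTML5 or by any RDF specification, 
and the extracted text would still need to be handed to an RDF parser.

```python
from html.parser import HTMLParser


class DataBlockExtractor(HTMLParser):
    """Collects the raw text of <script> elements with a given type."""

    def __init__(self, media_type):
        super().__init__()
        self.media_type = media_type
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a match

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == self.media_type:
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in more than one chunk, so accumulate.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical page carrying one Turtle data block.
PAGE = """<!DOCTYPE html>
<title>Example</title>
<script type="text/turtle">
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/> dc:title "Example" .
</script>
<p>Hello</p>
"""


def extract_blocks(html, media_type="text/turtle"):
    parser = DataBlockExtractor(media_type)
    parser.feed(html)
    parser.close()
    return parser.blocks


blocks = extract_blocks(PAGE)
```

The browser ignores a script block whose type it does not recognise, so 
the data rides along with the page without affecting rendering.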

> (although not wanting to get back to the rdf/xml in comments within 
> HTML) ... are we allowed an XML element inside a script element - that 
> would be less ugly.

Yes, this is allowed, provided you don't have the "</script" sequence 
anywhere in the content.
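
A tool that generates such blocks could check for this before embedding. 
The following is a minimal sketch; the function name is hypothetical, and 
matching case-insensitively is my assumption, chosen to be on the 
conservative side.

```python
def safe_for_script_block(payload: str) -> bool:
    """Return True if `payload` does not contain the "</script" sequence.

    The comparison ignores case (an assumption, to be conservative).
    """
    return "</script" not in payload.lower()


ok = safe_for_script_block("<rdf:RDF></rdf:RDF>")
bad = safe_for_script_block("text </SCRIPT> more text")
```

If the check fails, the data has to be escaped in whatever way the 
embedded format allows, or served from a separate resource instead.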


On Sat, 14 Feb 2009, Kingsley Idehen wrote:
> 
> In a sense, we are actually playing out via this debate the very thing 
> we are hoping the Web will ultimately simplify: discourse discovery and 
> participation.

I assume you don't mean RDFa will actually literally help with discussions 
like this... if you do mean that, could you elaborate on how? That would 
be something that would be convincing. It doesn't sound like it needs 
broad uptake to work; is there something I can do to obtain this benefit 
immediately? Is there software that already helps with this?


> NLP is not the issue at hand here. This isn't about linguistics. It is 
> about structured data, more like a DBMS.

We haven't added any kind of declarative DBMS mechanism to HTML5 either. 
We have added an API for working with SQL, though. I could understand 
wanting an API for working with RDF data directly, but that doesn't seem 
to be what is being requested here.


> > How would an RDF/RDFa system deal with people gaming the system?
>
> Great question!
> 
> It would help identify the people gaming the system. See the recent 
> foaf+ssl [1] endeavor for instance.

It appears you didn't include the link. I would be quite interested in 
finding out more about this.

How would the people be identified when they are anonymous Wikipedia 
contributors posting using automated distributed malware running on the 
aforementioned unsuspecting users' computers, either generating new 
certificates for each edit, or hijacking the user's certificates?


> > How would an RDF/RDFa system deal with the problem of the _questions_ 
> > being unstructured natural language?
>
> RDF has a query language: SPARQL.

Ok, but none of these users are going to learn SPARQL, so that's mostly an 
academic concern. What is the UI going to look like? Where is it going to 
come from?


> > How would an RDF/RDFa system deal with data provided by companies that 
> > have no interest in providing the data in RDF or RDFa? (e.g. companies 
> > providing data dumps in XML or JSON.)
> 
> We transform, and then expose as RDFa, as per this example which does 
> expose RDFa:
> 
> http://linkeddata.uriburner.com/about/html/http://en.wikipedia.org/wiki/Napoleon_I_of_France

This seems to assume that you have the license to do this, which in most 
cases you would not.

Incidentally, this brings up an interesting question. The above Wikipedia 
page says "An autopsy concluded he died of stomach cancer". How would this 
be exposed in RDFa? Or is this not the kind of thing that we would expose?


> Amazon should not be a factor in this discussion. Ditto Google, or any 
> other entity. We are talking about the Web.

Amazon was brought up as a use case for RDFa:

   http://lists.w3.org/Archives/Public/public-rdfa/2009Feb/0035.html

...which is the only reason I mentioned it.

Google was brought up in the context of a search engine (I also brought up 
Microsoft's search engine at the same time), because search engines are 
how users find information on the Web today, and RDFa was put forward as a 
way to address the problem of users wanting to find out information.

These seem like reasonable reasons for them to be a factor in the 
discussion.

I'd also like to point out that both Google and Amazon are part of the 
Web, so they are reasonable topics of discussion when we are talking about 
the Web, as you put it.


> > This is what I mean by evaluating solutions, by the way. I don't 
> > personally care whether we use RDFa or something else. I _do_, 
> > however, want to make sure that whatever solution we end up using is a 
> > solution that actually solves the problems we set out to solve.
> > 
> > Here, if the problem is "associate price with item on Amazon pages", 
> > RDFa does not solve the problem.
>
> RDFa allows us to choose to associate price with an item in a structured 
> way.

Right, but if the problem is "associate price with item on Amazon pages", 
as opposed to the pages of someone else, then RDFa does not solve the 
problem, because, as demonstrated by Amazon's use of APIs rather than a 
simple class value, Amazon has needs that aren't addressed by RDFa.


> Here is a document (with RDFa) about "Mosquitoes" from the GeoSpecies 
> Linked Data space. A few email exchanges between the GeoSpecies kbase 
> author and me led to this:
>
> http://linkeddata.uriburner.com/about/html/http://species.geospecies.org/family_concept_uuid/1e0e9bfe-f1ee-4b14-9511-cb896e8ebf97/
> 
> The document above is itself a purveyor of structured data for anyone 
> else on the Web to exploit.
> 
> I don't want to be the only one capable of doing this on the Web, I want 
> anyone that uses the Web to be able to do this, and RDFa is a very low 
> cost mechanism for achieving this goal.
> 
> I want to express myself clearly and succinctly without compromising 
> clarity or brevity, when I publish documents on the Web. Likewise, I 
> want to read documents from others who are able to do the very same 
> thing: express themselves clearly and succinctly without compromising 
> clarity or brevity.

How does my biologist friend, who knows nothing about computers, but does 
know about mosquitoes, make use of this information? How does it help her 
more than this page would?

   http://species.geospecies.org/family_concept_uuid/1e0e9bfe-f1ee-4b14-9511-cb896e8ebf97/

This isn't a rhetorical question. I'm sure that there is indeed something 
that would help my friend here. I just don't know what it is.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 17 February 2009 06:48:05 UTC