Re: Multiple itemtypes in microdata from Ian Hickson on 2011-10-18 (public-html-data-tf@w3.org from October 2011)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 18 Oct 2011 06:51:24 +0000 (UTC)
To: Bradley Allen <bradley.p.allen@gmail.com>
cc: Jeni Tennison <jeni@jenitennison.com>, Stéphane Corlosquet <scorlosquet@gmail.com>, "public-html-data-tf@w3.org" <public-html-data-tf@w3.org>
Message-ID: <Pine.LNX.4.64.1110180629580.27449@ps20323.dreamhostps.com>
On Fri, 14 Oct 2011, Bradley Allen wrote:
> 
> Here is perhaps the archetypal example.
> 
> Around 1637, Pierre Fermat took a copy of Diophantus' Arithmetica and wrote 
> the equivalent of the following in the margin of one of the pages:
> 
> <p itemscope itemtype="http://purl.org/ao/core/Annotation
> http://swan.mindinformatics.org/ontologies/1.2/discourse-elements/ResearchStatement">
>   No three positive integers a, b, and c can satisfy the equation a^n
> + b^n = c^n for any integer value of n greater than two.
> </p>

You can do that today. Take a copy of the page, and add your microdata. 
Why would you need to mark it up as two things? You only need to mark up 
whatever you're adding; the original page still has the original 
annotation. No?

Or even more likely, just write something on your page and link to the 
original. I don't see why this would require microdata at all, let alone 
an item with two types, but even if we grant that someone might want to 
mark this up as microdata, it seems like what you want to do is say that 
the "annotation" is an annotation of a "research statement", not that the 
annotation and the research statement are one and the same.

I apologise if I'm being dense here. I'm failing to understand what would 
actually be marked up (the example above is meaningless in microdata -- 
the item is empty of any name-value pairs, so it conveys no data). I don't 
really understand what would be part of one vocabulary and what would be 
part of the other.
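
As a rough illustration of the "empty item" point, here is a toy extractor 
in Python. It is only a sketch, not the spec's extraction algorithm: the 
class name ItemSketch, the helper extract(), the example.org type URL and 
the "text" property name are all made up for the example, and nested 
items, itemref and URL-valued properties are ignored.

```python
from html.parser import HTMLParser


class ItemSketch(HTMLParser):
    """Toy microdata reader: flat items with text-valued itemprops only."""

    def __init__(self):
        super().__init__()
        self.items = []    # one dict per itemscope element seen
        self._prop = None  # property name whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # itemtype is read as a whitespace-separated list of URLs
            types = (attrs.get("itemtype") or "").split()
            self.items.append({"type": types, "properties": {}})
        elif "itemprop" in attrs and self.items:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        # Text only counts as data when it sits inside an itemprop element
        if self._prop is not None:
            props = self.items[-1]["properties"]
            props[self._prop] = props.get(self._prop, "") + data

    def handle_endtag(self, tag):
        self._prop = None


def extract(html):
    parser = ItemSketch()
    parser.feed(html)
    parser.close()
    return parser.items


# The markup from the quoted example: two types, but no itemprop anywhere.
empty = extract(
    '<p itemscope itemtype="http://purl.org/ao/core/Annotation '
    'http://swan.mindinformatics.org/ontologies/1.2/discourse-elements/ResearchStatement">'
    "No three positive integers a, b, and c can satisfy the equation.</p>"
)

# Hypothetical variant that actually carries a name-value pair.
with_prop = extract(
    '<p itemscope itemtype="http://example.org/Statement">'
    '<span itemprop="text">No three positive integers satisfy it.</span></p>'
)
```

Under these assumptions the first item ends up with two types and an empty 
set of properties, while the second has a single "text" property.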


> >> Individual items can have different senses. We can represent these 
> >> different senses as distinct types. Those types can be obtained from 
> >> different vocabularies.
> >
> > I'm not sure we are using the word "item" in the same way here. An 
> > "item" is just a self-contained group of name-value pairs, such as a 
> > particular instance of movie metadata, or a particular instance of the 
> > description of a hypothesis, or some such.
> 
> That's an intensional way of describing an item.

It's the way the spec defines it.


> I'm using an extensional way. They amount to the same thing.

I think they might in fact be opposites, and I'm not convinced the meaning 
you are using is correct in the context of microdata.

Two pages with exactly the same microdata annotations, including each 
having an item with the exact same itemid="", are two _different_ items. 
An item is something that exists only in the context of a page and its 
DOM. It does not have an independent existence outside the page. The data 
it describes can be extracted out (e.g. into a JSON form), but then that's 
not the item any more, it's just a copy of the item's data. An item that 
describes a Web page is not the Web page itself. An item that describes 
the movie is not the movie. Two items can describe the same Web page or 
the same movie.
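
The distinction between an item and a copy of its data can be sketched 
roughly as follows. The Item class, the page URLs, the itemid value and 
the property values here are all hypothetical stand-ins; nothing in this 
model comes from the spec's API.

```python
import json


class Item:
    """Stand-in for a microdata item that lives in one page's DOM."""

    def __init__(self, page_url, itemid, properties):
        self.page_url = page_url      # the page the item exists in
        self.itemid = itemid          # global identifier given in itemid=""
        self.properties = properties  # name -> list of values

    def to_json(self):
        # Extracting the data yields a copy, detached from the page
        return json.dumps(
            {"id": self.itemid, "properties": self.properties},
            sort_keys=True,
        )


# Two pages carrying identical annotations, including the same itemid
a = Item("http://example.com/page1", "urn:example:movie:1", {"name": ["Some Movie"]})
b = Item("http://example.org/page2", "urn:example:movie:1", {"name": ["Some Movie"]})

same_data = a.to_json() == b.to_json()  # the extracted copies match
same_item = a is b                      # but they are still two items
```

In this sketch the extracted JSON is identical for both, yet a and b 
remain distinct objects, each tied to its own page.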


> > To put it another way, you could annotate a research statement twice, 
> > right? And the annotations wouldn't be the same annotation.
> 
> The counterexample of Fermat's Last Theorem shows that we can have 
> research statements that are annotations.

I do not understand this example well enough to see how it is a 
counterexample. As far as I can tell, it's wrong. What if the annotation had 
contained two research statements? Indeed, my phraseology here betrays my 
assumption: that the annotation _contains_ the research statement, it 
isn't itself the research statement.

An annotation could contain a movie and a music clip and a document. It 
would not _be_ those things. It would contain them somehow. (At a 
microdata level, it might be even more indirect: maybe it would contain 
items that themselves refer to those things.)
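
A minimal sketch of the "contains" reading, as opposed to one item with 
two types, might look like this. The type URLs under example.org and the 
property names "contains" and "text" are invented for illustration and 
are not taken from either vocabulary under discussion.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    types: list
    properties: dict = field(default_factory=dict)


# Two separate research statements (hypothetical type URL and property)
statement_1 = Item(["http://example.org/ResearchStatement"], {"text": ["First claim."]})
statement_2 = Item(["http://example.org/ResearchStatement"], {"text": ["Second claim."]})

# One annotation that contains both, without being either of them
annotation = Item(
    ["http://example.org/Annotation"],
    {"contains": [statement_1, statement_2]},
)

contained = annotation.properties["contains"]
```

Modelled this way, each item keeps a single type, and an annotation 
holding two research statements poses no difficulty.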


> > It sounds somewhat like rather than wanting to put hypotheses and 
> > annotations and so forth in Web pages that are primarily prose, what 
> > you are describing and what I've seen in the documents you cited above 
> > is more a database that would be directly filled in, in which case 
> > microdata really has no bearing on the discussion. You wouldn't want 
> > to use microdata unless the document you are annotating is primarily 
> > prose -- articles, book chapters, and the like. If the data is 
> > primarily this structured information, HTML isn't the right place to 
> > put it. It should just be put straight into its native form in the 
> > database.
> 
> The problem that arises by leaving this kind of rich scientific content 
> in databases is that it becomes part of the deep Web, and hence 
> undiscoverable using resources like Google.

I don't see why such a database couldn't be exposed to Google, in any 
number of ways: either by putting a Web interface on the database that 
exposes all the data as individual pages, or by some sort of direct 
database API (the way Twitter was exposed, back in the day), or through 
some system like Google Base... I don't see how microdata would help here. 
It's not like Google is going to support these vocabularies and do 
anything different with ResearchStatements or Annotations.


> By expressing these statements in HTML, and using microdata to capture 
> structured data that places statements on the Web page in the correct 
> context of the (primarily prose-centric) scientific discourse of which 
> they are part, we can enable both better discoverability and a much 
> richer user experience for the researcher.

I disagree with the premise that discoverability would in any way be 
affected. Whether the Web pages are the canonical source of the data, 
crawled by tools and put into a database, or the database is the 
canonical source of the data, exposed to the world through generated Web 
pages, the discoverability is identical.

I definitely disagree that the user experience would be richer if the data 
is spread across many pages than if it is in a centralised (or indeed, 
decentralised) database system with dedicated data models, APIs, and UIs.


> In addition, embedding the structured data in the page and allowing 
> processing that uses the structured data to drive the user experience in 
> the browser, rather than requiring calls to a back-end database

I think it is naive to expect that browsers are going to do anything with 
this data, and dedicated tools can do a far better job with centralised 
data storage and querying. How do you perform any kind of data analysis if, 
to get this data, you have to follow links (either in an automated fashion 
or manually) and process the data client-side, rather than just having the 
server do everything in RAM and then return the result?

It's like saying that Web search would be better if it were done by the 
browser rather than requiring calls to a back-end database. It just 
doesn't work at Web scale.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 18 October 2011 06:55:34 UTC