Re: RDFa vs Microdata, and the separation of data and presentation

The RDF WG had an email discussion where it was suggested that an HTML document might have a <script type="text/turtle"></script> element in which the semantic information associated with the page could be embedded. If you don't want to combine the structure of the document with its semantics, this would be one way to approach it.
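
To make the idea concrete, here is a minimal sketch of how a consumer might pull such a block out of a page using only the Python standard library. The page content, the example.org IRI and the names TurtleScriptExtractor and extract_turtle are made up for illustration; the sketch only extracts the Turtle text and does not parse it.

```python
from html.parser import HTMLParser

# Hypothetical page: the Turtle block and the example.org IRI are illustrative only.
PAGE = """<html>
<head>
<title>Team meeting</title>
<script type="text/turtle">
@prefix dc: <http://purl.org/dc/terms/> .
<http://example.org/events/1> dc:title "Team meeting" .
</script>
</head>
<body><p>Team meeting, Friday at 10:00.</p></body>
</html>"""


class TurtleScriptExtractor(HTMLParser):
    """Collects the text of every <script type="text/turtle"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_turtle = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "text/turtle":
            self._in_turtle = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the open block.
        if self._in_turtle:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_turtle = False


def extract_turtle(html):
    """Return the stripped text of each embedded Turtle block, in document order."""
    parser = TurtleScriptExtractor()
    parser.feed(html)
    parser.close()
    return [block.strip() for block in parser.blocks]


if __name__ == "__main__":
    for block in extract_turtle(PAGE):
        print(block)
```

Note that nothing in the body of the page is connected to the extracted statements, which is exactly the trade-off of this approach.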

However, RDFa, Microdata and Microformats _are_ about giving the actual structure of the document semantics. So, for example, when I have a block of HTML that describes a calendar event, I can give it meaning by using appropriate markup. If you follow the Semantic HTML convention, where the HTML is really a data model and not a presentation, then it follows that you would give more formal semantic meaning to the markup itself. POSH (Plain Old Semantic HTML) [1] allows you to do this in a limited way with existing HTML elements such as <cite>, <abbr>, <blockquote> and <dl>, along with conventions for using @class, @rel, @id and other attributes.

In different ways, RDFa, Microdata and Microformats all play on this basic idea of imbuing the data representation within an HTML page with meaning. This is in contrast to the earlier view that there would be two separate, but equal, tracks, where a URI could have either an HTML representation or a semantic (RDF/XML) representation; that was shown not to really work in practice.

In my view, JSON-LD has a similar objective: providing semantics for JSON with a minimum of syntactic overhead. To the degree that we can hide the semantics in @context, I think we're doing pretty well. The concessions to this are for basic identity (@subject) and type (@type).
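
As a rough illustration of what hiding the semantics in @context means, the sketch below takes a small JSON document and replaces its short terms with the IRIs listed in @context. The document, the example.org IRIs and the function expand_terms are hypothetical, and this is a deliberately simplified stand-in for a real JSON-LD processor: it handles only flat documents and uses the draft keywords named above.

```python
import json

# Hypothetical document; apart from @context, only @subject and @type are
# visible as concessions to the semantics, everything else is plain JSON.
DOC = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": "http://xmlns.com/foaf/0.1/homepage",
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@subject": "http://example.org/people#jane",
  "@type": "Person",
  "name": "Jane Doe",
  "homepage": "http://example.org/"
}
""")


def expand_terms(doc):
    """Replace short terms with the IRIs given in @context (flat documents only)."""
    context = doc.get("@context", {})
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue  # the context is consumed, not copied
        if key == "@type":
            expanded[key] = context.get(value, value)  # type names are terms too
        elif key.startswith("@"):
            expanded[key] = value  # other keywords pass through unchanged
        else:
            expanded[context.get(key, key)] = value
    return expanded


if __name__ == "__main__":
    print(json.dumps(expand_terms(DOC), indent=2))
```

A consumer that ignores @context still sees ordinary JSON with "name" and "homepage" keys, which is the point of keeping the overhead low.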

Gregg

[1] http://microformats.org/wiki/posh

On Jul 1, 2011, at 10:49 AM, glenn mcdonald wrote:

I've been pondering RDFa and Microdata and what unifying them (or replacing them with a single new thing) might mean, in parallel to thinking about JSON graph-serialization.

The thing that bothers me about all the approaches to embedding ids and predicates and scopes and types and such in HTML is that, well, they involve embedding, or attempting to embed, machine-audience data structures inside human-audience presentation structures. Why are we doing this at all? It's terrible and we know better, and it isn't even helpful. We may very reasonably want to know when a bit of presentation corresponds to a bit of data, but I see no argument at all for why the entire structures need or even want to be interleaved.

Here is what might be a much, much simpler and yet better idea:

1. Add to HTML5 a new global attribute called "data". This takes, as a value, a space-separated list of absolute or relative IRIs, which identify data objects represented by the contents of the HTML element so-marked. The exact semantics of "represented by" are human, not technical, but we could provide many guiding examples.

2. Add to HTML5 a new element called "DATA". The contents of this are a canonical JSON serialization of the data structure underlying the contents of the page, presumably including (but not limited to) the objects referred to by "data" attributes on elements in the BODY.
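
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the proposal does not define the JSON shape, so the object keyed by IRI, the placement of DATA in the head, the sample page, and the names DataProposalParser and parse_page are all hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: "data" attributes point at objects, the DATA element holds them.
PAGE = """<html>
<head>
<data>
{"#event1": {"title": "Team meeting", "location": "#room5"},
 "#room5": {"name": "Room 5"}}
</data>
</head>
<body>
<h1 data="#event1">Team meeting</h1>
<p data="#event1 #room5">Friday at 10:00 in Room 5.</p>
</body>
</html>"""


class DataProposalParser(HTMLParser):
    """Collects "data" attribute references and the text of the DATA element."""

    def __init__(self):
        super().__init__()
        self.references = []  # list of (tag name, list of IRIs)
        self._in_data = False
        self._json_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "data":  # html.parser lowercases tag names
            self._in_data = True
            return
        value = dict(attrs).get("data")
        if value:
            # The attribute value is a space-separated list of IRIs.
            self.references.append((tag, value.split()))

    def handle_data(self, data):
        if self._in_data:
            self._json_text.append(data)

    def handle_endtag(self, tag):
        if tag == "data":
            self._in_data = False

    def graph(self):
        text = "".join(self._json_text).strip()
        return json.loads(text) if text else {}


def parse_page(html):
    """Return (references, graph) for a page written in the assumed style."""
    parser = DataProposalParser()
    parser.feed(html)
    parser.close()
    return parser.references, parser.graph()


if __name__ == "__main__":
    references, graph = parse_page(PAGE)
    for tag, iris in references:
        for iri in iris:
            print(tag, iri, graph.get(iri))
```

In this sketch the visible markup carries only references, and a machine agent reads the whole data structure from one place.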

Isn't this vastly simpler to understand, produce and consume than any of the existing embedding schemes, to at least the same benefit?

The embedding part of this is now concerned only with associating the visible content and its corresponding data, so we get from 3 embedding schemes to 1.

But maybe even more importantly, by separating the data-structure from the embedding we eliminate the need to have embedded encodings separate from the regular non-embedded encodings. And we provide the most compelling possible justification for using JSON as the canonical serialization (i.e., DATA is effectively a SCRIPT block with an implicit jsonp callback). And we eliminate the need for content negotiation in a vast number of cases, because a machine agent can just take the DATA from the page. And we ensure that people use IRIs for everything, because that's how it all works. And we start to establish the expectation that a data-backed page should have its data included.

And then the task of this mailing-list/group/whatever would become very specific: provide the rules for how the DATA element is written. That is, it's not just a JSON serialization, but the JSON serialization. In fact, it's not just a web data-graph serialization, but basically the web data-graph serialization.

glenn

Received on Friday, 1 July 2011 18:12:34 UTC