parsing documents that describe users from Michiel de Jong on 2012-07-24 (public-fedsocweb@w3.org from July 2012)

From: Michiel de Jong <michiel@unhosted.org>
Date: Tue, 24 Jul 2012 19:07:25 +0200
To: public-fedsocweb@w3.org
Message-ID: <CA+aD3u3QLyihXJ6nQ9dzkTtAGRRSfe8+v5NGysnzan_vJhqbTw@mail.gmail.com>

As I progress with the useraddress.net code, I have found that
Content-Type headers are at least as valuable as link relations in
deciding how to process a document. I divide the documents into the
following categories by Content-Type:

- json
- html
- rdf
- xrd


I'm learning the formats as I go along, and just make them work
heuristically, without too many strict rules. Apart from that I take
into account the link relation that brought us to the document (if
any), which can for instance tell us that something should be
interpreted as a poco document. In many other cases, the link relation
is useless for the document interpretation.
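
A minimal sketch of this kind of dispatch, not the actual useraddress.net
code. The media-type lists, the function name `categorize`, and the
Portable Contacts rel URI are assumptions made for illustration only:

```python
# Hypothetical mapping from link relations to a more specific interpretation.
# The poco rel URI below is assumed; most rels give no hint at all.
REL_HINTS = {
    "http://portablecontacts.net/spec/1.0": "poco",
}

# Assumed set of media types treated as RDF serialisations.
RDF_TYPES = {"application/rdf+xml", "text/turtle", "text/n3"}


def categorize(content_type, rel=None):
    """Return (category, interpretation) for a fetched document.

    category is one of "json", "html", "rdf", "xrd", or None.
    interpretation is a hint derived from the link relation, or None.
    """
    # Strip parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()

    if media_type == "application/xrd+xml":
        category = "xrd"
    elif media_type in RDF_TYPES:
        category = "rdf"
    elif media_type in ("text/html", "application/xhtml+xml"):
        category = "html"
    elif media_type == "application/json" or media_type.endswith("+json"):
        category = "json"
    else:
        category = None

    return category, REL_HINTS.get(rel)
```

The Content-Type picks the parser, and the link relation (when it says
anything useful) only refines how the parsed data is interpreted.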

But even using these hints, you can easily get to points where the
data is not unambiguously machine-readable. For instance, for Facebook
and Twitter API documents we need to take into account which API they
came from.

Also, I found that quite a few documents are served with the wrong
Content-Type (e.g. Diaspora serves its host-meta with an html
Content-Type), so for these I think I'll just send pull requests to get
them fixed.
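
One way to spot such mislabelled documents is to compare the declared
category with a rough guess based on the body. This is only a sketch
under assumed heuristics; the function names and the marker strings
checked are hypothetical, not what useraddress.net does:

```python
def sniff_category(body):
    """Guess "json", "xrd", "rdf" or "html" from the start of a document body."""
    # Only look at the beginning, ignoring leading whitespace and case.
    head = body.lstrip()[:512].lower()
    if head.startswith(("{", "[")):
        return "json"
    if "<xrd" in head:
        return "xrd"
    if "<rdf:rdf" in head:
        return "rdf"
    if "<!doctype html" in head or "<html" in head:
        return "html"
    return None


def is_mislabelled(declared_category, body):
    """True when the body clearly looks like a different category than declared."""
    sniffed = sniff_category(body)
    return sniffed is not None and sniffed != declared_category
```

For example, a host-meta body whose root element is XRD but which was
served with an html Content-Type would be flagged by
`is_mislabelled("html", body)`.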

Supporting StatusNet, Friendica, Diaspora and Google is relatively
straightforward, and Twitter and Facebook are super-simple once you
consult their custom and proprietary API documentation. But by far the
most work is all the custom domains. I'm trying to support Melvin,
Tantek and TimBL, but they each work in different ways. I hope to make
some progress on that soon, and try to support all of these before I
publish my proof-of-concept version.

Also, some people point their sameAs relation to the human-readable
profile page (like Tantek, http://www.facebook.com/tantek.celik ), and
some point it to the API (like TimBL,
http://graph.facebook.com/512908782 ). Both are sub-optimal, because
human-readable profile pages are not always marked up, and API
documents sometimes require knowledge of the proprietary API used. So
this means a lot of the sameAs links I've seen so far are actually
useless for building up search-engine data.
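
To make the distinction concrete, here is a small sketch that sorts
sameAs targets by hostname into "profile page" versus "API document".
The host table and the function name `classify_same_as` are assumptions
for illustration, covering only the two Facebook hosts mentioned above:

```python
from urllib.parse import urlparse

# Assumed host table: hostname -> (service, kind of document behind it).
KNOWN_HOSTS = {
    "www.facebook.com": ("facebook", "profile-page"),
    "graph.facebook.com": ("facebook", "api"),
}


def classify_same_as(url):
    """Return (service, kind) for a sameAs URL, or ("unknown", "unknown")."""
    host = (urlparse(url).hostname or "").lower()
    return KNOWN_HOSTS.get(host, ("unknown", "unknown"))
```

A crawler could then choose a scraper for "profile-page" targets and a
service-specific API parser for "api" targets, and skip the rest.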

What I haven't made much progress with is Buddycloud; there is an XMPP
client for Node.js, but I haven't looked into how I can retrieve a vCard
with it. So that will probably not make it into the current version.

I'll try to finish my proof-of-concept by the weekend, and then we can
compare it with openfollow.net to see how the two can integrate.


Ciao!
Michiel

Received on Tuesday, 24 July 2012 17:07:53 UTC