My RDF Manifesto

I have been watching the RDF working group's discussion about layers and
surfaces and named graphs with great interest. This is a topic of genuine
concern for me. When I first learned of RDF I felt it was very limited -
more like a CS 201 homework assignment run amok than anything that could be
applied to the real world. I only recently became interested in RDFa as a
means of embedding citation information within HTML-family formatted
content. Even though there are now triplestore databases that can store
billions of triples and reasoners that can suss out who is a friend of whom,
I still feel the same way about RDF. In my view, the "triples can do
everything" model is seriously limited. It really does seem as though the
triple model was created primarily to make it easier to write programs and
ever since then people have been trying to cram the real world into a
three-cornered box. 

So it is nice to see the RDF WG thinking about expanding the RDF model
beyond just a bunch of triples. However, I really feel that you aren't going
nearly far enough. Up to this point, I have kept my thoughts on this matter
to myself - limiting my comments to how to write good documentation for
RDFa. However, now that the RDF WG is tossing around ideas - and in the
spirit of Dan Brickley's post about not making hasty decisions that could
block future ideas - I thought it would be a good time to send you this
"manifesto" on my thoughts about RDF and where it should go in the future. 

=============================================
First, the limitations of RDF, as I see them:
=============================================
I can understand the notion of using a simple construct to build more
complex constructs. This does make programming easier and it certainly makes
embedding data within XML documents more feasible. But triples - as they are
currently used - don't tell the whole story.

1) There is no meta-metadata. 
-----------------------------
In other words, triples encode metadata about other things but there is no
way to encode metadata about the triples themselves. There is no way to
indicate where a triple came from, how well it is trusted, how old the
reference is, how much influence it should have on reasoning software, or
anything else. 
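
To make the gap concrete: the closest thing RDF offers today is
reification, which costs four bookkeeping triples per statement and carries
no special meaning for reasoners. A minimal sketch in Python using the
rdflib library; the ex:source and ex:confidence properties are hypothetical
stand-ins for the kind of meta-metadata I mean:

    from rdflib import Graph, Namespace, BNode, Literal, URIRef, RDF

    g = Graph()
    EX = Namespace("http://example.org/")
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    # The statement itself: one triple, no room for provenance.
    g.add((EX.alice, FOAF.knows, EX.bob))

    # Standard reification: four bookkeeping triples just to make the
    # statement addressable, plus the meta-metadata we actually wanted.
    stmt = BNode()
    g.add((stmt, RDF.type, RDF.Statement))
    g.add((stmt, RDF.subject, EX.alice))
    g.add((stmt, RDF.predicate, FOAF.knows))
    g.add((stmt, RDF.object, EX.bob))
    g.add((stmt, EX.source, URIRef("http://example.org/crawl/2012")))
    g.add((stmt, EX.confidence, Literal(0.8)))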

2) RDF is entirely Boolean.
---------------------------
I can see how an entirely Boolean system would appeal to computer
scientists. However - just as the world is not flat - the world is not
Boolean. The world is full of "somewhat"s, "probably"s, and "kinda-sorta"s.
Under RDF I either foaf:knows you or I do not. There is no way to tell if we
are like blood-brothers or if I just met you at a conference a couple of
times. If one wants to express different levels of "knowledge" - from
acquaintance up through "carnal" - then one has to create an entirely
different predicate for each different level. Sure, it is possible to create
an entire vocabulary expressing a dozen different levels of knowing someone,
use RDFS or OWL to rank them with some predicate that means "is stronger
than," and make them all subproperties of foaf:knows. However, if a
reasoner has access to some RDF data which uses this vocabulary but doesn't
have access to the vocabulary definition files themselves, then it will have
no idea that "acquaintance" is similar to "Buddy," differing primarily by
degree.
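
Here is a sketch of that workaround, with hypothetical ex: properties
declared as rdfs:subPropertyOf foaf:knows (Python, using the rdflib
library). The point is that the ranking lives in a separate vocabulary file
the reasoner may never see:

    from rdflib import Graph

    # A hypothetical vocabulary ranking levels of "knowing" someone.
    # A reasoner that cannot fetch this file sees only opaque predicate
    # IRIs and has no idea the levels differ merely by degree.
    vocab = """
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/levels#> .

    ex:acquaintedWith rdfs:subPropertyOf foaf:knows .
    ex:buddyOf        rdfs:subPropertyOf foaf:knows .
    ex:bloodBrotherOf rdfs:subPropertyOf foaf:knows .
    """

    g = Graph()
    g.parse(data=vocab, format="turtle")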

3) RDF is fragile and impermanent.
----------------------------------
RDF is based upon IRIs. People in the internet community like to think of
these IRIs as relatively permanent, but they are suffering from a delusion.
As far as I know, every IRI that uses the http:// URI-Scheme is dependent
upon the owner of the domain re-registering that domain name on a regular
basis. I am unaware of any means of registering a domain name in
perpetuity. This means that RDF data I create today could be useless as
early as tomorrow. (Not likely, but possible.) In addition, if a domain name
is forfeited and taken over by someone else, that second party could
redefine the vocabulary, completely changing the meaning of legacy RDF data.
I am primarily concerned with scientific and educational information,
therefore I am thinking in terms of hundreds of years. Finally, IRIs are
often at the mercy of web site administrators who may not know that a URL on
their site has been used as an IRI in some RDF data somewhere. So, said
administrator may reorganize a web site and totally destroy years of work in
one afternoon. 

Now, someone could attempt to maintain an archive of all older vocabularies
as well as the IRIs from which they were retrieved and the date-range within
which they were valid. However: A) How would any RDF reasoner know which
date range to use if the stored triples don't have date metadata attached to
them? And B) How can such a system know which IRIs to archive?

4) Blank Nodes are too ambiguous yet not fuzzy enough.
------------------------------------------------------
Within one document or file it is possible to know with certainty that two
blank nodes are the same node. However, across documents - even if those
documents have been "merged" into one data store - it is impossible to know
for sure if two blank nodes refer to the exact same entity, especially if
you consider very long term storage of RDF data. Many tutorials on RDF give
the example of using e-mail addresses to "pin down" a blank node. The
reasoning goes: "If two blank nodes from two different sources are
associated with the same e-mail address then it can be assumed the blank
nodes refer to the same entity." However, it is entirely possible for one
person to give up an e-mail address and then - after what seems like a
reasonable period - that e-mail address could be assigned to a different
person. But a reasonable period for daily use by people is different from a
reasonable period for very long term archival of data. A hundred years from
now, your current e-mail address may have been used by five different
people. 
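
A sketch of that tutorial heuristic in Python with rdflib, to show how
thin it is: the merge decision rests entirely on string equality of the
mailbox, with no time dimension at all:

    from rdflib import Graph, Namespace

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    doc_a = Graph().parse(data="""
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        _:p foaf:name "J. Smith" ;
            foaf:mbox <mailto:jsmith@example.com> .
    """, format="turtle")

    doc_b = Graph().parse(data="""
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .
        _:q foaf:name "John" ;
            foaf:mbox <mailto:jsmith@example.com> .
    """, format="turtle")

    # The heuristic: a shared mailbox means the same person. Nothing
    # records WHEN each assertion was true, so a mailbox reassigned to
    # a new owner years later silently merges two different people.
    shared = (set(doc_a.objects(None, FOAF.mbox)) &
              set(doc_b.objects(None, FOAF.mbox)))
    print(shared)  # a non-empty intersection triggers the merge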

There does not seem to be any mechanism for indicating - with certainty -
that two different blank nodes from two different original sources both
refer to the exact same entity. All that reasoners can do is conjecture.
And, while I have nothing against conjecture, there is also no means to
indicate the degree of certainty with which said conjectures are stated.
Either two blank nodes are assumed to be the same or they are not associated
at all, which goes back to the Boolean nature of RDF.

5) I hate RDF-Linked-Lists too!
-------------------------------
Like Manu, I am really NOT a fan of RDF-Linked-Lists and, by extension, not
very interested in the RDF-Linked-List part of the RDFa Core 1.1 spec at
all. First of all, what's up with all those extra blank nodes? (See the
sketch at the end of this section.) You could have just left them out
entirely. Secondly, as Manu stated, most web designers don't know from
linked lists. As far as they are concerned, the data structure used is so
complicated that they will just assume that lists are "Too Hard" and ignore
them altogether. If there was ever a part of a spec that was doomed to be
largely ignored, this is it. Third, the attribute chosen in the RDFa Core
1.1 spec to indicate that something should be in an RDF-Linked-List is
inappropriate. By using "inlist" you make it seem
as if that is the only type of list that could ever occur. You are closing
the door to simple ordered lists, etcetera.

One should never choose a generic term to refer to a very specific entity.
What, then, does one do when yet another entity that also falls under that
generic term comes into use? Do you then use a more specific term for that
entity? That is guaranteed to lead to confusion. Unfortunately,
RDF-Linked-Lists are now part of the RDFa Candidate Recommendation and it
would be specification suicide to try to take it out now. However, the
situation can still be salvaged by simply changing the name of the attribute
used to something more specific like "rdf-linked-list" that conveys what is
really going to be done with that data. Then, future specifications can use
"orderedlist" (and "unorderedlist") to mean a simple list that is ordered
based on the order of appearance within the document (or not, respectively).
These are terms with which regular web developers are intimately familiar.
In fact, it may even be possible to figure out a way so that web designers
could incorporate an "orderedlist" or "unorderedlist" directly into an <ol>
or <ul> tag structure in an HTML document. 
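
As promised, the sketch: what the rdf:List structure actually expands to,
using Python and rdflib. This is the machinery being asked of web
designers:

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/> .
        ex:doc ex:authors ( "Ann" "Bob" "Cy" ) .
    """, format="turtle")

    # The one-line list above expands into a chain of three blank
    # nodes, each holding an rdf:first value and an rdf:rest pointer,
    # terminated by rdf:nil: seven triples for a three-item list.
    for s, p, o in g:
        print(s, p, o)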


========================================================================
Next, the limitations of - or issues I have with - the current proposed
surface, layers, named graph model being discussed.
========================================================================
I have mentioned how it seems as if RDF is like a three-cornered box. The
current discussion makes me think that the RDF community has gotten so used
to that box that they can no longer think outside of it. The changes being
discussed are radical changes, to be sure, but it has the feel of merely
expanding the box from the inside, using what cardboard you have lying
around. I may have missed it, but I have not seen any discussion about where
you really want to go with these changes; what people will do with them; or
how these changes will allow for creating more accurate models of the real
world. Before you go any further, it seems to me that you all need to sit
down and map out a long term vision for what you want people to be able to
encode in RDF data as well as a pragmatic look at what will and won't be
possible. 


A) N-Quads is only a single step in the right direction. 
--------------------------------------------------------
N-Quads allows one to assign a "layer" (or "level" or "tag" or "name" or
whatever you want to call it) to a triple. However, what if a triple needs
to be in more than one layer? Do you repeat the triple? How many times? Now,
I know that this repetition can be normalized in a data store but a
serialized version of this data structure would be excessively verbose. It
seems it would be better to allow any number of additional "layer IRIs" to
be listed along with a triple. Let the author of the document decide which
is more efficient: repeating triples or repeating additional "layer IRIs." 
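
A sketch of the verbosity problem, in Python for concreteness; the layer
IRIs are hypothetical:

    # Today, putting one triple in three "layers" means three full quads:
    nquads = """\
    <http://ex.org/s> <http://ex.org/p> <http://ex.org/o> <http://ex.org/layerA> .
    <http://ex.org/s> <http://ex.org/p> <http://ex.org/o> <http://ex.org/layerB> .
    <http://ex.org/s> <http://ex.org/p> <http://ex.org/o> <http://ex.org/layerC> .
    """

    # The alternative argued for here: state the triple once and attach
    # any number of layer IRIs alongside it.
    triple = ("http://ex.org/s", "http://ex.org/p", "http://ex.org/o")
    layers = {"http://ex.org/layerA", "http://ex.org/layerB",
              "http://ex.org/layerC"}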

B) N-Quads seems to be telling the users of the data
   what they can do with it.
---------------------------------------------------
If a triple has a particular "layer IRI" associated with it this seems to
imply that it must go on that "layer" in a data model. I think it would be
best to merely call these additional values "tags" and let users of the data
do with them whatever they choose. Leave it up to developers of RDF analysis
software to invent new metaphors like layers and surfaces and such. They can
take the information in the triples along with the information in the
associated tags and filter, sort, display, or otherwise rejigger it any way
they please. By pre-imposing a metaphor, and then choosing your terminology
and writing your specs around that metaphor, you are locking in people's
thinking about what they can do or how they can re-envision that data. 

C) Metaphors are nice, but don't lock people in.
------------------------------------------------
Discussed above. I just wanted to reiterate this point. Choose terminology
that is as UNlimiting as possible, rather than picking a metaphor and
choosing terminology to match it. Just provide the means to express
as much information as possible and let people decide what they want to
express and what they don't. Leave it to user documentation writers like me
to explain to web authors or data managers all the different ways they can
make use of the information that may or may not be stored in RDF data (as
well as encourage them to be creative and invent new ways to envision that
same information). Specifications should only tell people how to encode the
information, not tell them what to do with it or how to think about it. 


============
My proposal:
============
Fortunately, it is possible to solve most of these issues with relatively
simple changes. 

Terminology:
------------
Call the extra values, which you are adding with N-Quads, "tags" and leave
it at that.

Don't try to apply any metaphor at all. Let software developers and users do
that. 

N-Tuples:
---------
Basically just allow any number of additional tags instead of only one.

If two identical triples exist ANYWHERE but with different tags, treat them
as one triple with a union of all the tags. 
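
A minimal sketch of that merge rule in Python, assuming tuples arrive as
(subject, predicate, object, tag, tag, ...):

    from collections import defaultdict

    # Minimal n-tuple store: each unique triple maps to the union of
    # every tag it has ever been seen with, no matter the source.
    store = defaultdict(set)

    def add_tuple(s, p, o, *tags):
        store[(s, p, o)].update(tags)

    add_tuple("ex:s", "ex:p", "ex:o", "ex:layerA")
    add_tuple("ex:s", "ex:p", "ex:o", "ex:layerB", "ex:layerC")

    print(store[("ex:s", "ex:p", "ex:o")])
    # one triple, three tags: layerA, layerB, layerC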

Permanent IRIs:
---------------
Create some new official URI-Schemes[1]. Under those, create a new "domain"
system, organized based on the purpose of the domain rather than on who
paid money to register a name. Create a legal structure that allows you to
register sub-domains in perpetuity. 

For instance: For my DEMML project[2], I intend to register an official
URI-Scheme called "demml://" under which I will organize a vast tree of
topic codes for every subject that may need to be learned about by anyone.
(Yes, it will be a huge tree, but I have figured out how to encode around
10^31 topics with only 40 characters, including slashes[3].) When someone
attempts to dereference a demml:// URI, plugins or other software on their
system will first look in a specified folder on their hard drive, then look
at a nearby server, then a more distant mirror, etc., automatically
following links made available on each of those machines to find the next
machine up the chain where the sought-after material may exist. Those
topic codes could also be used as IRIs to specify the topic of a web page or
something. However, the IRI may not be directly dereferenceable. Instead, it
may only be dereferenceable via the aforementioned mechanism, and
possibly only asynchronously[4]. At the same time, that IRI may be
used to mark millions of other documents, which could be found by searching
for that unique string of characters. This unique string of characters will
be valid in perpetuity because I didn't have to register a domain
through ICANN.
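
A sketch of that lookup chain in Python. Everything here is hypothetical -
the folder layout, the mirror URLs, and the idea that a topic code maps
directly onto a path - since the demml:// scheme is not yet registered:

    import os
    import urllib.request

    # Local cache first, then successively more distant mirrors.
    LOCAL_ROOT = os.path.expanduser("~/.demml")
    MIRRORS = ["https://near.example.org/demml",
               "https://far.example.net/demml"]

    def resolve(topic_code):
        local = os.path.join(LOCAL_ROOT, topic_code.lstrip("/"))
        if os.path.exists(local):
            with open(local, "rb") as f:
                return f.read()
        for mirror in MIRRORS:
            url = mirror + "/" + topic_code.lstrip("/")
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except OSError:
                continue  # try the next machine up the chain
        return None  # fall back to asynchronous retrieval, per [4]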

Reduce dependence on the DNS system:
------------------------------------
At any point the DNS system can be co-opted by governments or hackers. IRIs
could be redirected or made entirely unavailable. However, the entire
internet is very rarely blocked all at once. Redundantly posting RDF
data and vocabularies on many different servers makes it much more
difficult for malicious agents to block access to this information. It also
simply protects information from being destroyed by catastrophic disasters.
Then, instead of relying on the DNS to locate the ONE copy of that data
stored on ONE vulnerable server, we can rely on search engines to index this
information for us. I know, it sounds radical and even more fragile but
think about it. There can be an almost infinite number of search engines
available to choose from. It would be nearly impossible to block them all.
Sure, any one search engine may filter out some content, but that search
engine would not be very popular. Or it may filter content in just the way
you want, something the DNS doesn't do. At first people would have to browse
to a specific web site to search in a specific search engine. However,
software would soon be written to automatically search via a selection of
preferred search engines so that the user of RDF analysis software would
never notice that the file had not necessarily been directly dereferenced.
Besides, a lot of IRIs used in RDF are never meant to be directly
dereferenced in normal use anyway. They are, in fact, merely labels. The
only difference here is that - if that IRI needs to be dereferenced -
software can do so even if the original server is blocked or down. 

The best part is that all that is really required is to change one's
conception of what an IRI is used for. Simply use IRIs like labels to be
searched for rather than addresses representing locations (real or
imaginary). You can still use the same formatting. And there is nothing to
prevent one from storing an original copy at the location specified by the
IRI. It is just that you don't depend upon that IRI to be dereferenceable at
all. Instead, you expect to be able to find many instances of it used within
metadata on a page, spread all over the internet. Additional metadata will
indicate whether a found instance of an IRI is a label for an exact copy of
the original content or is a subset (like a quote) or simply a reference
back to that original content. 

This search-engine-based dereferencing system can be combined with the above
notion of "permanent IRIs via newly-created URI-Schemes" by the creation of
a special URI-Scheme with the following simple rule: First try to
dereference via HTTP or HTTPS but if that fails, then dereference via search
engines. Leave it up to software developers to determine various innovative
means to optimize this searching process. They could cache previous results
on the desktop, or use a master list stored on some server. Software users
will vote on the best methods with their dollars and their feet (or their
clicks).
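
A sketch of the two-step rule in Python. The search_engines argument is an
assumption standing in for whatever innovative lookup methods developers
invent; caching and ranking are deliberately left out:

    import urllib.request

    def dereference(iri, search_engines):
        """Try direct HTTP(S) first; if the original server is blocked
        or gone, treat the IRI as a unique label and ask search engines
        where copies of it appear."""
        try:
            with urllib.request.urlopen(iri, timeout=10) as resp:
                return resp.read()
        except OSError:
            pass  # fall through to the search-based path
        # Each engine is assumed to be a callable mapping an IRI string
        # to a list of candidate URLs that mention it.
        for engine in search_engines:
            for candidate in engine(iri):
                try:
                    with urllib.request.urlopen(candidate) as resp:
                        return resp.read()
                except OSError:
                    continue
        return None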

Give RDF more granularity while making it all warm and fuzzy:
-------------------------------------------------------------
My primary beef with RDF is the Boolean nature mentioned above. There is no
way to create a weighted graph. This can be solved by making yet another
relatively simple conceptual change. Currently, the entire string of an IRI
is considered when comparing two IRIs for equivalence. (At least according
to answers to my question on Stack Overflow[5].) All that is necessary is
to, instead, say that only the path and fragment parts of an IRI are
considered the official resource identifier for RDF purposes. Any query
string or CGI data is treated as additional metadata about that IRI. This
allows a weight factor to be assigned to a specific edge in a graph simply
by adding a key-value pair to the IRI of the predicate. Now, instead of an
all-or-nothing "foaf:knows" I can say I kind-of know you by using
"foaf:knows?weight=.5". One might list a good friend using
"foaf:knows?weight=.9" and perhaps claim "foaf:knows?weight=1.1" for one's
wife (though she might say it is closer to "foaf:knows?weight=.4").

Additional key-value pairs could be added to indicate additional metadata.
One could indicate the original source of an IRI, the dc:creator of a
triple, or the date that IRI was born on. Some metadata would be more
appropriate for use in the subject-IRI and other metadata would be best in
the predicate or object IRIs. Only certain keys would be used for IRI
matching purposes. For instance, two subject IRIs with identical paths and
fragments but different "born on" dates might be considered to be two
distinct nodes. However, two triples with identical paths & fragments for
each of the subject, predicate, and object but with different "triple
created on" and "source" values listed in the predicate might be considered
to be identical triples that just happened to say the same thing in
different places, on different days. Kind of like retweeting. Naturally, two
predicates with different weight values would be considered the same but
different. So, if two identical triples have different weight values then
some kind of calculation could be done to determine the weight to use for
reasoning purposes. Different reasoners could use different calculations, as
they see fit. 
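
A sketch of the matching rule in Python, assuming (as proposed above) that
everything except the query string counts toward identity and that weights
ride in a "weight" key. The averaging at the end is only one possible
calculation:

    from urllib.parse import urlsplit, parse_qs

    def split_iri(iri):
        """Identity is everything except the query string; the query
        string rides along as metadata about the IRI."""
        parts = urlsplit(iri)
        identity = (parts.scheme, parts.netloc, parts.path, parts.fragment)
        return identity, parse_qs(parts.query)

    a = "http://xmlns.com/foaf/0.1/knows?weight=.5"
    b = "http://xmlns.com/foaf/0.1/knows?weight=.9"

    (id_a, meta_a), (id_b, meta_b) = split_iri(a), split_iri(b)
    assert id_a == id_b  # same predicate for matching purposes

    # Identical triples with different weights: one possible
    # combination rule is a plain average; each reasoner may choose
    # its own calculation.
    weights = [float(meta_a["weight"][0]), float(meta_b["weight"][0])]
    print(sum(weights) / len(weights))  # roughly 0.7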

Some may say that this change could break some IRIs currently in use. But
does anyone really use the query string in their IRIs? Considering that all
that would currently achieve is creating an entirely different IRI, I cannot
see anyone choosing that method of creating different IRIs over the method
of simply using fragments. Therefore, I can't see that making this change
would break more than a handful of IRIs.

Give blank nodes identifiers and make them fuzzy too:
-----------------------------------------------------
This seems counterintuitive. A blank node is supposed to be blank, right?
But how do you tell the difference between one blank node and another? Well,
within parsing and reasoning software, blank nodes are assigned identifiers.
This is similar to the index number that is added to a row in a database
table but is rarely seen by users of the database. Rather than saying that
internal identifiers should be ignored when transmitting blank nodes from
one system to another, provide a means of labeling those blank nodes with a
globally unique serial number. This can be achieved by, again, using the
query string to add metadata to the existing blank node label. Instead of
just '_:reference' (where "reference" is usually a short string that is
unique within a particular document), we could allow
'_:reference?query=string'.
Using this query string, one can add metadata to a blank node indicating
date and time of creation, a global serial number, etcetera. Then reasoners
can use this additional metadata along with the information in the triples
that contain the blank nodes and calculate a probability that those two
blank nodes are, in fact, the exact same node. This probability can then be
expressed by a weight factor in the query string of the predicate connecting
the two blank nodes. 

Now, web authors should not be required to devise and insert a globally
unique serial number into the metadata of each and every one of their blank
nodes. This information could instead be derived from the unique IRI of the
document and the short name given within the document. So a processor,
when parsing the RDF data embedded within a document, would simply convert
something like '_:john' to
'_:john?source="http://example.com/aboutjohn.html"&dateretrieved="2012-03-14^^xsd:date"'
(with appropriate escaping and conversion to a format compatible with query
strings). Implicitly created blank nodes (created
through chaining, etc.) could simply be given a document-unique serial
number to use along with the IRI of the document as above. However, I have
yet to derive an algorithm that could ensure that the same blank nodes
within a document get assigned the same serial number upon different
parsings, even if the document has been edited between parsings. So that
will require some more thought. Perhaps web authors could assign IDs to
blank nodes through the use of an additional predicate and literal object
each time they intentionally use chaining. I know, this would be a pain, but
software tools could make this easier.
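
A sketch of the qualification step such a processor might perform, in
Python; the key names "source" and "dateretrieved" follow the example above
and are not part of any standard:

    from datetime import date
    from urllib.parse import quote

    def qualify_bnode(short_name, doc_iri, retrieved):
        """Turn a document-local label like '_:john' into a globally
        distinguishable one by appending query-string metadata derived
        from the source document and the retrieval date."""
        src = quote(doc_iri, safe="")
        return "_:%s?source=%s&dateretrieved=%s" % (
            short_name, src, retrieved.isoformat())

    print(qualify_bnode("john", "http://example.com/aboutjohn.html",
                        date(2012, 3, 14)))
    # _:john?source=http%3A%2F%2Fexample.com%2Faboutjohn.html
    #       &dateretrieved=2012-03-14   (printed as one line)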


===========
Conclusion:
===========
From its conception RDF was both ingenious and incredibly limited. By making a
few conceptual changes, throwing in some outside help in the form of special
URI-Schemes, and opening up to the possibility that the DNS is not
infallible, it is possible to grow RDF into a data standard that truly
models the real world instead of some tri-cornered, cookie-cutter version of
it. No one should be required to use any of these additional features.
However, allowing the use of this additional metadata, embedded within the
query string of an IRI, would give RDF reasoners a vast amount of additional
information with which to, well, reason. Yes, this will then require the
definition of certain standard query-string field-names and the approved
datatypes that can be used with them. But just enough to avoid chaos. Allow
web designers and experimenters to come up with their own field names as
they see fit (with the warning that any one of those could be co-opted and
put into a standard some day) and then sit back and see what happens. As has
been mentioned many times, the best innovation sometimes occurs outside of
the working groups. Give people the incredible flexibility that these
changes will provide, and I believe a whole new world of
data-interoperability will open up. 


[1] http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
[2] http://www.demml.org
[3] http://www.demml.org/standard/classification/
[4] http://www.ideationizing.com/2009/07/intelligent-epidemic-routing.html
[5] http://stackoverflow.com/questions/9171416/is-a-query-string-allowed-in-a-uri-used-in-rdf
