- From: Danny Ayers <danny.ayers@gmail.com>
- Date: Fri, 11 May 2012 14:33:24 +0200
- To: grantsr@gmail.com
- Cc: www-rdf-comments@w3.org
Hi Grant,

I'm not a member of the group, but would like to respond to some of your comments.

tl;dr I believe RDF with named graphs does actually cover most of your issues when viewed as part of a layered system. Add the bits you believe are missing on top. But I too have been worrying a little about the IRI/HTTP/DNS issue. I'm looking into how the magnet: URI scheme may help. (It's essentially like a data: URI with a hash of the representation).

1) There is no meta-metadata.

Named graphs appear to be the best solution here. Whatever they ultimately wind up looking like, there is always the HTTP model to fall back on: put an RDF-friendly format representation of a resource online (set of triples), talk about that.

2) RDF is entirely Boolean.

Not quite, "true" vs. "unknown" is a slightly different species. But I had exactly the same kind of issue with RDF when I first encountered it, and others have been there before. I believe Aaron Swartz came up with a property :kindaLike. But the fact that you can put numeric values in literals means that it is possible to connect to number-crunching systems. I'd suggest that putting

3) RDF is fragile and impermanent.

It's more or less as fragile and impermanent as Platonic solids (though YMMV as far as specs and mindshare are concerned :) But I agree very much that dependency on HTTP's dependency on DNS is troublesome.

4) Blank Nodes are too ambiguous yet not fuzzy enough

Then don't use them :)

5) I hate RDF-Linked-Lists too!

Me too! Had so much grief with them. But the underlying list model is clearly sound. The projection into syntax gets a bit messy, but I can't see a better alternative. Having said that, in some recent code for pragmatic reasons I'm leaning towards using lists expressed as property-numbered resources, more like the old rdf:Seq.

Incidentally, re. registering "demml://" - new URI schemes are rarely the best approach, check http://www.w3.org/TR/webarch/#URI-registration

Cheers,
Danny.
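To make the named-graph answer to point 1 concrete, here is a minimal sketch in plain Python, with no RDF library involved. It treats each statement as a quad (subject, predicate, object, graph name) and then makes further statements about the graph name itself, which is roughly where provenance or dates could live. The example.org IRIs, the tuple layout, and the helper names are all assumptions made for illustration, not anything the working group has specified.

```python
# Quads are (subject, predicate, object, graph_name) tuples.
# Every example.org IRI below is made up for illustration.
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
DCT_SOURCE = "http://purl.org/dc/terms/source"
DCT_CREATED = "http://purl.org/dc/terms/created"

G_CONF = "http://example.org/graphs/conference"  # hypothetical graph name
G_META = "http://example.org/graphs/meta"        # hypothetical graph name

quads = [
    # An ordinary statement, placed in the graph named G_CONF.
    ("http://example.org/grant", FOAF_KNOWS, "http://example.org/danny", G_CONF),
    # Statements whose subject is the graph name: metadata about the triples.
    (G_CONF, DCT_SOURCE, "http://example.org/attendees.html", G_META),
    (G_CONF, DCT_CREATED, "2012-05-11", G_META),
]


def graphs_containing(quads, s, p, o):
    """Return the sorted names of all graphs that assert the triple (s, p, o)."""
    return sorted({g for (qs, qp, qo, g) in quads if (qs, qp, qo) == (s, p, o)})


def describe_graph(quads, graph_name):
    """Collect the predicate/object pairs stated about a graph name."""
    return {p: o for (s, p, o, _g) in quads if s == graph_name}


for g in graphs_containing(quads, "http://example.org/grant", FOAF_KNOWS,
                           "http://example.org/danny"):
    print(g, describe_graph(quads, g))
```

A consumer that wants to know where a triple came from looks up the graphs that contain it and then reads what is said about those graphs; nothing about the triple itself has to change.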
On 11 May 2012 03:32, Grant Robertson <grantsr@gmail.com> wrote: > I have been watching the RDF working group's discussion about layers and > surfaces and named graphs with great interest. This is a topic of genuine > concern for me. When I first learned of RDF I felt it was very limited. More > like a CS 201 homework assignment run amok than anything that could be > applied to the real world. I only recently became interested in RDFa as a > means of embedding citation information within HTML (family) formatted > content. Even though there are now triplestore databases that can store > billions of triples and reasoners that can sus out who is a friend of whom, > I still feel the same way about RDF. In my view, the "triples can do > everything" model is seriously limited. It really does seem as though the > triple model was created primarily to make it easier to write programs and > ever since then people have been trying to cram the real world into a > three-cornered box. > > So it is nice to see the RDF WG thinking about expanding the RDF model > beyond just a bunch of triples. However, I really feel that you aren't going > nearly far enough. Up to this point, I have kept my thoughts on this matter > to myself - limiting my comments to how to write good documentation for > RDFa. However, now that the RDF WG is tossing around ideas - and in the > spirit of Dan Brickley's post about not making hasty decisions that could > block future ideas - I thought it would be a good time to send you this > "manifesto" on my thoughts about RDF and where it should go in the future. > > ============================================= > First, the limitations of RDF, as I see them: > ============================================= > I can understand the notion of using a simple construct to build more > complex constructs. This does make programming easier and it certainly makes > embedding data within XML documents more feasible. 
But triples - as they are > currently used - don't tell the whole story. > > 1) There is no meta-metadata. > ----------------------------- > In other words, triples encode metadata about other things but there is no > way to encode metadata about the triples themselves. There is no way to > indicate where a triple came from, how well it is trusted, how old is the > reference, how much influence it should have on reasoning software, or > anything else. > > 2) RDF is entirely Boolean. > --------------------------- > I can see how an entirely Boolean system would appeal to computer > scientists. However - just as the world is not flat - the world is not > Boolean. The world is full of "somewhat"s, "probably"s, and "kinda-sorta"s. > Under RDF I either foaf:knows you or I do not. There is no way to tell if we > are like blood-brothers or if I just met you at a conference a couple of > times. If one wants to express different levels of "knowledge" - from > acquaintance up through "carnal" - then one has to create an entirely > different predicate for each different level. Sure, it is possible to create > an entire vocabulary expressing a dozen different levels of knowing someone > and then use RDFS or OWL to rank them using some predicate that means "is > stronger than" and then subclass them all under foaf:knows However, if a > reasoner has access to some RDF data which uses this vocabulary but doesn't > have access to the vocabulary definition files themselves, then it will have > no idea that "acquaintance" is similar to "Buddy," differing primarily by > degree. > > 3) RDF is fragile and impermanent. > ---------------------------------- > RDF is based upon IRIs. People in the internet community like to think of > these IRIs as relatively permanent, but they are suffering from a delusion. > As far as I know, every IRI (which uses the http:// URI-Scheme) is dependent > upon the owner of the domain reregistering that domain name on a regular > basis. 
I am unaware of any means of registering a domain name for > perpetuity. This means that RDF data I create today could be useless as > early as tomorrow. (Not likely, but possible.) In addition, if a domain name > is forfeited and taken over by someone else, that second party could > redefine the vocabulary, completely changing the meaning of legacy RDF data. > I am primarily concerned with scientific and educational information, > therefore I am thinking in terms of hundreds of years. Finally, IRIs are > often at the mercy of web site administrators who may not know that a URL on > their site has been used as an IRI in some RDF data somewhere. So, said > administrator may reorganize a web site and totally destroy years of work in > one afternoon. > > Now, someone could attempt to maintain an archive of all older vocabularies > as well as the IRIs from which they were retrieved and the date-range within > which they were valid. However: A) How would any RDF reasoner know which > date range to use if the stored triples don't have date metadata attached to > them? And B) How can such a system know which IRIs to archive. > > 4) Blank Nodes are too ambiguous yet not fuzzy enough. > ------------------------------------------------------ > Within one document or file it is possible to know with certainty that two > blank nodes are the same node. However, across documents - even if those > documents have been "merged" into one data store - it is impossible to know > for sure if two blank nodes refer to the exact same entity. Especially if > you consider very long term storage of RDF data. Many tutorials on RDF give > the example of using e-mail addresses to "pin down" a blank node. The > reasoning goes: "If two blank nodes from two different sources are > associated with the same e-mail address then it can be assumed the blank > nodes refer to the same entity." 
However, it is entirely possible for one > person to give up an e-mail address and then - after what seems like a > reasonable period - that e-mail address could be assigned to a different > person. But a reasonable period for daily use by people is different from a > reasonable period for very long term archival of data. A hundred years from > now, your current e-mail address may have been used by five different > people. > > There does not seem to be any mechanism for indicating - with certainty - > that two different blank nodes from two different original sources both > refer to the exact same entity. All that reasoners can do is conjecture. > And, while I have nothing against conjecture, there is also no means to > indicate the degree of certainty with which said conjectures are stated. > Either two blank nodes are assumed to be the same or they are not associated > at all, which goes back to the Boolean nature of RDF. > > 5) I hate RDF-Linked-Lists too! > ------------------------------- > Like Manu, I am really NOT a fan of RDF-Linked-Lists. And, by extension, not > too very interested in the RDF-Linked-List part of the RDFa Core 1.1 spec at > all. First of all, what's up with all those extra blank nodes? You could > have just left them out entirely. Secondly, as Manu stated, most web > designers don't know from linked lists. The data structure that is used is > so complicated, as far as they will be concerned, that they will just assume > that lists are "Too Hard" and ignore them altogether. If there was ever a > part of a spec that was doomed to be largely ignored, this is it. Third, the > attribute chosen in the RDFa Core 1.1 spec to indicate that something should > be in a RDF-Linked-List is inappropriate. By using "inlist" you make it seem > as if that is the only type of list that could ever occur. You are closing > the door to simple ordered lists, etcetera. > > One should never choose a generic term to refer to a very specific entity. 
> What, then, does one do when yet another entity that also falls under that > generic term comes into use. Do you then use a more specific term for that > entity? That is guaranteed to lead to confusion. Unfortunately, > RDF-Linked-Lists are now part of the RDFa Candidate Recommendation and it > would be specification suicide to try to take it out now. However, the > situation can still be salvaged by simply changing the name of the attribute > used to something more specific like "rdf-linked-list" that conveys what is > really going to be done with that data. Then, future specifications can use > "orderedlist" (and "unorderedlist") to mean a simple list that is ordered > based on the order of appearance within the document (or not, respectively). > These are terms with which regular web developers are intimately familiar. > In fact, it may even be possible to figure out a way so that web designers > could incorporate an "orderedlist" or "unorderedlist" directly into an <ol> > or <ul> tag structure in an HTML document. > > > ================================================== > Next, the limitations of - or issues I have with - the current proposed > surface, layers, named graph model being discussed. > ================================================== > I have mentioned how it seems as if RDF is like a three-cornered box. The > current discussion makes me think that the RDF community has gotten so used > to that box that they can no longer think outside of it. The changes being > discussed are radical changes, to be sure, but it has the feel of merely > expanding the box from the inside, using what cardboard you have laying > around. I may have missed it, but I have not seen any discussion about where > you really want to go with these changes; what people will do with them; or > how these changes will allow for creating more accurate models of the real > world. 
Before you go any further, it seems to me that you all need to sit > down and map out a long term vision for what you want people to be able to > encode in RDF data as well as a pragmatic look at what will and won't be > possible. > > > A) N-Quads is only a single step in the right direction. > -------------------------------------------------------- > N-Quads allow one to assign a "layer" (or "level" or "tag" or "name" or > whatever you want to call it) to a triple. However, what if a triple needs > to be in more than one layer? Do you repeat the triple? How many times? Now, > I know that this repetition can be normalized in a data store but a > serialized version of this data structure would be excessively verbose. It > seems it would be better to allow any number of additional "layer IRIs" to > be listed along with a triple. Let the author of the document decide which > is more efficient: repeating triples or repeating additional "layer IRIs." > > B) N-Quads seem to be telling the users of the data > what they can do with it. > --------------------------------------------------- > If a triple has a particular "layer IRI" associated with it this seems to > imply that it must go on that "layer" in a data model. I think it would be > best to merely call these additional values "tags" and let users of the data > do with them whatever they choose. Leave it up to developers of RDF analysis > software to invent new metaphors like layers and surfaces and such. They can > take the information in the triples along with the information in the > associated tags and filter, sort, display, or otherwise rejigger it any way > they please. By pre-imposing a metaphor, and then choosing your terminology > and writing your specs around that metaphor, you are locking in peoples' > thinking about what they can do or how they can re-envision that data. > > C) Metaphors are nice, but don't lock people in. > ------------------------------------------------ > Discussed above. 
I just wanted to reiterate this point. Choose terminology > that is as UNlimiting as possible, rather than picking a metaphor, and > choosing terminology to match that metaphor. Just provide means to express > as much information as possible and let people decide what they want to > express and what they don't. Leave it to user documentation writers like me > to explain to web authors or data managers all the different ways they can > make use of the information that may or may not be stored in RDF data (as > well as encourage them to be creative and invent new ways to envision that > same information). Specifications should only tell people how to encode the > information, not tell them what to do with it or how to think about it. > > > ============ > My proposal: > ============ > Fortunately, it is possible to solve most of these issues with relatively > simple changes. > > Terminology: > ------------ > Call the extra values, which you are adding with N-Quads, "tags" and leave > it at that. > > Don't try to apply any metaphor at all. Let software developers and users do > that. > > N-Tuples: > --------- > Basically just allow any number of additional tags instead of only one. > > If two identical triples exist ANYWHERE but with different tags, treat them > as one triple with a union of all the tags. > > Permanent IRIs: > --------------- > Create some new official URI-Schemes[1]. Under those create a new "domain" > system, organized based on the purpose of the domain, rather than who paid > money to register a name. Create a legal structure that allows you to > register sub-domains in perpetuity. > > For instance: For my DEMML project[2], I intend to register an official > URI-Scheme called "demml://" under which I will organize a vast tree of > topic codes for every subject that may need to be learned about by anyone. > (Yes, it will be a huge tree but I have figured out how to encode around > 10^31 topics with only 40 characters, including slashes[3]. 
When someone > attempts to dereference a demml:// URI, plugins or other software on their > system will first look in a specified folder on their hard drive, then look > at a near-by server, then a more distant mirror, etc., automatically > following links made available on each of those machines to find the next > machine up in the chain where the sought-after material may exist. Those > topic codes could also be used as IRIs to specify the topic of a web page or > something. However, the IRI may not be directly dereferenceable. Instead, it > may only be able to be dereferenced via the aforementioned mechanism and > possibly only asynchronously[4]. While at the same time, that IRI may be > used to mark millions of other documents which could be found by searching > for that unique string of characters. This unique string of characters will > be valid for all of perpetuity because I didn't have to register a domain > through ICANN. > > Reduce dependence on DNS system: > -------------------------------- > At any point the DNS system can be co-opted by governments or hackers. IRIs > could be redirected or made entirely unavailable. However, very rarely is > the entire internet blocked from availability. By redundantly posting RDF > data and vocabularies on various different servers it makes it much more > difficult for malicious agents to block access to this information. It also > simply protects information from being destroyed by catastrophic disasters. > Then, instead of relying on the DNS to locate the ONE copy of that data > stored on ONE vulnerable server, we can rely on search engines to index this > information for us. I know, it sounds radical and even more fragile but > think about it. There can be an almost infinite number of search engines > available to choose from. It would be nearly impossible to block them all. > Sure, any one search engine may filter out some content, but that search > engine would not be very popular. 
Or it may filter content in just the way > you want, something the DNS doesn't do. At first people would have to browse > to a specific web site to search in a specific search engine. However, > software would soon be written to automatically search via a selection of > preferred search engines so that the user of RDF analysis software would > never notice that the file had not necessarily been directly dereferenced. > Besides, a lot of IRIs used in RDF are never meant to be directly > dereferenced in normal use anyway. They are, in fact, merely labels. The > only difference here, is that - if that IRI needs to be dereferenced - > software can do so even if the original server is blocked or down. > > The best part is that all that is really required is to change one's > conception of what an IRI is used for. Simply use IRIs like labels to be > searched for rather than addresses representing locations (real or > imaginary). You can still use the same formatting. And there is nothing to > prevent one from storing an original copy at the location specified by the > IRI. It is just that you don't depend upon that IRI to be dereferencable at > all. Instead, you expect to be able to find many instances of it used within > metadata on a page, spread all over the internet. Additional metadata will > indicate whether a found instance of an IRI is a label for an exact copy of > the original content or is a subset (like a quote) or simply a reference > back to that original content. > > This search-engine-based dereferencing system can be combined with the above > notion of "permanent IRIs via newly-created URI-Schemes" by the creation of > a special URI-Scheme with the following simple rule: First try to > dereference via HTTP or HTTPS but if that fails, then dereference via search > engines. Leave it up to software developers to determine various innovative > means to optimize this searching process. 
They could cache previous results > on the desktop, or use a master list stored on some server. Software users > will vote on the best methods with their dollars and their feet (or their > clicks). > > Give RDF more granularity while making it all warm and fuzzy: > ------------------------------------------------------------- > My primary beef with RDF is the Boolean nature mentioned above. There is no > way to create a weighted graph. This can be solved by making yet another > relatively simple conceptual change. Currently, the entire string of an IRI > is considered when comparing two IRIs for equivalence. (At least according > to answers to my question on Stack Overflow[5].) All that is necessary is > to, instead, say that only the path and fragment parts of an IRI are > considered the official resource identifier for RDF purposes. Any query > string or CGI data is treated as additional metadata about that IRI. This > allows a weight factor to be assigned to a specific edge in a graph simply > by adding a key-value pair to the IRI of the predicate. Now, instead of an > all-or-nothing "foaf:knows" I can say I kind-of know you by using > "foaf:knows?weight=.5". One might list a good friend using > "foaf:knows?weight=.9" and perhaps claim "foaf:knows?weight=1.1" for one's > wife (though she might say it is closer to "foaf:knows?weight=.4"). > > Additional key-value pairs could be added to indicate additional metadata. > One could indicate the original source of an IRI; The dc:creator of a > triple; or the date that IRI was born on. Some metadata would be more > appropriate for use in the subject-IRI and other metadata would be best in > the predicate or object IRIs. Only certain keys would be used for IRI > matching purposes. For instance, two subject IRIs with identical paths and > fragments but different "born on" dates might be considered to be two > distinct nodes. 
However, two triples with identical paths & fragments for > each of the subject, predicate, and object but with different "triple > created on" and "source" values listed in the predicate might be considered > to be identical triples that just happened to say the same thing in > different places, on different days. Kind of like retweeting. Naturally, two > predicates with different weight values would be considered the same but > different. So, if two identical triples have different weight values then > some kind of calculation could be done to determine the weight to use for > reasoning purposes. Different reasoners could use different calculations, as > they see fit. > > Some may say that this change could break some IRIs currently in use. But > does anyone really use the query string in their IRIs? Considering that all > that would currently achieve is creating an entirely different IRI, I cannot > see anyone choosing that method of creating different IRIs over the method > of simply using fragments. Therefore, I can't see that making this change > would break more than a handful of IRIs. > > Give blank nodes identifiers and make them fuzzy too: > ----------------------------------------------------- > This seems counterintuitive. A blank node is supposed to be blank, right? > But how do you tell the difference between one blank node and another? Well, > within parsing and reasoning software, blank nodes are assigned identifiers. > This is similar to the index number that is added to a row in a database > table but is rarely seen by users of the database. Rather than saying that > internal identifiers should be ignored when transmitting blank nodes from > one system to another, provide a means of labeling those blank nodes with a > globally unique serial number. This can be achieved by, again, using the > query string to add metadata to the existing blank node IRI. 
Instead of just > '_:reference' (where "reference" is usually a short string that is unique > within a particular document), we could allow '_:reference?query="string'. > Using this query string, one can add metadata to a blank node indicating > date and time of creation, a global serial number, etcetera. Then reasoners > can use this additional metadata along with the information in the triples > that contain the blank nodes and calculate a probability that those two > blank nodes are, in fact, the exact same node. This probability can then be > expressed by a weight factor in the query string of the predicate connecting > the two blank nodes. > > Now, web authors should not be required to devise and insert a globally > unique serial number into the metadata of each and every one of their blank > nodes. This information could instead be derived from the unique IRI of the > document and the short name given within the document. So a processor, when > parsing the RDF data embedded within a document would simply convert > something like '_:john' to > '_:john?source="http://example.com/aboutjohn.html"&dateretrieved="2012-03-14 > ^^xsd:date"' (with appropriate escaping and conversion to a format > compatible with query strings). Implicitly created blank nodes (created > through chaining, etc.) could simply be given a document-unique serial > number to use along with the IRI of the document as above. However, I have > yet to derive an algorithm that could ensure that the same blank nodes > within a document get assigned the same serial number upon different > parsings, even if the document has been edited between parsings. So that > will require some more thought. Perhaps web authors could assign IDs to > blank nodes through the use of an additional predicate and literal object > each time they intentionally use chaining. I know, this would be a pain, but > software tools could make this easier. 
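The labelling step described above can be sketched in a few lines of Python. This is only an illustration of the proposal's notation under stated assumptions: the function names are hypothetical, the key names "source" and "dateretrieved" are taken from the example in the text, and the "appropriate escaping" is assumed to be ordinary percent-encoding. As far as I know, current RDF syntaxes such as Turtle do not accept a "?" inside a blank node label, so this shows the idea rather than anything a parser would take today.

```python
from urllib.parse import parse_qs, urlencode


def label_blank_node(local_name, source_iri, date_retrieved):
    """Build a proposal-style label: '_:name?source=...&dateretrieved=...'.

    The values are percent-encoded so they fit in a query string.
    """
    query = urlencode({"source": source_iri, "dateretrieved": date_retrieved})
    return f"_:{local_name}?{query}"


def split_label(label):
    """Split a proposal-style label into its local part and decoded metadata."""
    name, _, query = label.partition("?")
    metadata = {key: values[0] for key, values in parse_qs(query).items()}
    return name, metadata


label = label_blank_node("john", "http://example.com/aboutjohn.html", "2012-03-14")
print(label)
print(split_label(label))
```

Two labels produced from the same document and local name come out identical, which is the property the text is after. The harder case the text raises, implicitly created blank nodes in a document that is edited between parsings, is not addressed by this sketch.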
>
>
> ===========
> Conclusion:
> ===========
> From its conception RDF was both genius and incredibly limited. By making a
> few conceptual changes, throwing in some outside help in the form of special
> URI-Schemes, and opening up to the possibility that the DNS is not
> infallible, it is possible to grow RDF into a data standard that truly
> models the real world instead of some tri-cornered, cookie-cutter version of
> it. No one should be required to use any of these additional features.
> However, allowing the use of this additional metadata, embedded within the
> query string of an IRI, would give RDF reasoners a vast amount of additional
> information with which to, well, reason. Yes, this will then require the
> definition of certain standard query-string field-names and the approved
> datatypes that can be used with them. But just enough to avoid chaos. Allow
> web designers and experimenters to come up with their own field names as
> they see fit (with the warning that any one of those could be co-opted and
> put into a standard some day) and then sit back and see what happens. As has
> been mentioned many times, the best innovation sometimes occurs outside of
> the working groups. Give people the incredible flexibility that these
> changes will provide, and I believe a whole new world of
> data-interoperability will open up.
>
>
> [1] http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
> [2] www.demml.org
> [3] http://www.demml.org/standard/classification/
> [4] http://www.ideationizing.com/2009/07/intelligent-epidemic-routing.html
> [5] http://stackoverflow.com/questions/9171416/is-a-query-string-allowed-in-a-uri-used-in-rdf
>

-- 
http://dannyayers.com
http://webbeep.it - text to tones and back again
Received on Friday, 11 May 2012 12:33:54 UTC