- From: Seaborne, Andy <Andy_Seaborne@HPLB.HPL.HP.COM>
- Date: Fri, 9 May 2003 15:30:13 +0100
- To: "Butler, Mark" <Mark_Butler@HPLB.HPL.HP.COM>, "'SIMILE public list'" <www-rdf-dspace@w3.org>
Mark,

One of the things that any web system has to cope with is the fact that the web (semantic or otherwise) is sufficiently large that it doesn't all work at once. Any system that depends on others also has to take into account that there will be temporary problems. So the ground rules are: ontologies may not be perfect; connectivity is not guaranteed; systems do the best they can in the circumstances. There is value in working with information from other systems, but it has implications. Or, to put it into N3:

<> rdfs:seeAlso <http://lists.w3.org/Archives/Public/www-rdf-interest/2003May/0026.html> .

In RDF, lack of a schema for a namespace does not stop the system doing anything with it. Example:

@prefix x: <http://example.org/WebLibraryTerms#> .

<http://dspace.org/>
    rdf:type x:LibrarySite ;
    x:administrator "Mick" .

and the WebLibraryTerms namespace isn't one that a system knows about. What does it do? It can decide to read from the namespace URL; it can choose not to. While good style says that the schema should be available at the place the namespace indicates, it may not be.

So when asked the query 'what are the websites run by "Mick"?', the system will return nothing. It does not know that

x:LibrarySite rdfs:subClassOf knownSchema:WebSite .

unless it reads in the schema. The answers to queries on the semantic web aren't "yes" and "no", they are "yes" and "don't know". There is no global processing model or global consistency. There are local decisions on what to do about things.

Maybe some community using SIMILE does know something about WebLibraryTerms: it can ask for everything known about http://dspace.org/ and the server can ship it useful stuff. The fact that the server does not fully understand all the implications of the data isn't important. Later, the community can ask the SIMILE system to install WebLibraryTerms so they can do their searches on the server side - if the system has not automatically read it already, or logged the fact that an unknown namespace was used a lot so that the admins have already decided to get it.

[[If I visit a website today, and get a webpage, and the image logo for "graded good by SomeOrg-I-trust" does not display properly, then I don't know that fact. The page is still useful even though I don't know it is graded good.]]

So a key question: how often do schemas change? How often do unknown schemas turn up? If a new schema arises and is important to some community of SIMILE users, they ask the system to use that schema. It may not happen immediately and it may involve a person doing some configuration, but it does deliver useful value to people in the presence of a less than perfect global semantic web.

Specifically for the history store: I would expect it to have a site cache of schemas/ontologies, indexed by namespace. Boringly practical - if some schemas are deemed "bad" (unhelpful) at a site, they just don't use them. A cache is prudent because even if the schema does reside at its namespace URL, using HTTP, it may be unreachable just at the moment it is needed. Schemas are slow-changing things, so using a cached copy - even one fixed up if the master copy is trivially broken (a local decision again) - is reasonable. The use of the cache is a local choice. I would like it to read in any namespaces it encounters, but it isn't necessary that it do so.

On the query side, there will be a few (2? 3?) key schemas that the history store has to deal with and is tested against. On the data input side, the validation process can fetch new schemas as encountered.
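[[A minimal sketch, in Python, of the behaviour described above. Every name in it (fetch_schema, knownSchema:WebSite, the shape of the cache) is made up for illustration; it is not SIMILE or history-store code. It shows the answer to the query going from "don't know" to "yes" once a cached or fetched schema has been read in.]]

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Instance data from the example above, as (subject, predicate, object) tuples.
data = {
    ("http://dspace.org/", RDF_TYPE, "x:LibrarySite"),
    ("http://dspace.org/", "x:administrator", "Mick"),
}

# Site cache of schemas, indexed by namespace (a local, boringly practical
# decision: entries can be pre-installed, fetched on demand, or fixed up).
schema_cache = {}

def fetch_schema(namespace):
    # Stand-in for an HTTP GET with a short timeout; the real thing must
    # cope with the schema being absent or unreachable.
    if namespace == "http://example.org/WebLibraryTerms#":
        return {("x:LibrarySite", SUBCLASS, "knownSchema:WebSite")}
    return set()   # nothing known - "don't know", not an error

def websites_run_by(admin, triples):
    # Which classes are (transitively) subclasses of knownSchema:WebSite,
    # given only the triples we have actually read in?
    website_classes = {"knownSchema:WebSite"}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == SUBCLASS and o in website_classes and s not in website_classes:
                website_classes.add(s)
                changed = True
    return {s for s, p, o in triples
            if p == RDF_TYPE and o in website_classes
            and (s, "x:administrator", admin) in triples}

# Without the schema the answer is empty: "don't know", not "no".
print(websites_run_by("Mick", data))                     # set()

# Local decision: read the unknown namespace and cache it.
ns = "http://example.org/WebLibraryTerms#"
schema_cache.setdefault(ns, fetch_schema(ns))
print(websites_run_by("Mick", data | schema_cache[ns]))  # {'http://dspace.org/'}

Whether to call fetch_schema at all, and whether to keep or repair what it returns, are exactly the local decisions described above.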
We are not in a fully magic world. If a new schema is encountered, say it has a property foo:articleTitle that is equivalent to dc:title, then until the system uses a rule that these are equivalent it treats them as different. Is foo:articleTitle really, truly, exactly equivalent to dc:title? It depends. It depends on who is asking, and it depends on what they want the information for. A good system admits these alternatives and does the best it can in the circumstances.

In many ways, I don't see this as specific to the semantic web. It is a consequence of being a large federation, not a centrally managed system.

Because SIMILE has an ingest and validation process, I hope that it is seen by other systems as a high quality source of information. If it is perceived as such it will get used; if it is not seen as such, it will not get used.

Andy

-----Original Message-----
From: Butler, Mark [mailto:Mark_Butler@hplb.hpl.hp.com]
Sent: 7 May 2003 17:22
To: 'SIMILE public list'
Subject: Use of www-rdf-dspace for comments re: early draft note, DSpace History System

These comments are much more general than the other comments, so apologies for this in advance. I'm sure some of the following points are controversial but hopefully they will create further discussion.

One of the promises of the semantic web is that "if person A writes his data in one way, and person B writes her data in another way, as long as they have both used semantic web tools, then we can leverage those tools to merge data from A and data from B declaratively, i.e. without having to rewrite the software used by A or by B, and without necessitating them to change their individual data sets". Before the semantic web, we could have used data A with data B, but it would have necessitated some changes to the data and software of one or both of the parties.

However in this proposal, it seems that rather than exploring the first path, i.e. "we have a load of data in the history system format. This was similar to Harmony and Dublin Core, but since then those technologies have moved on. Let's see if we can map between these different data formats by using schema and ontology languages, without changing any code", we are taking the second by default, i.e. "we have a load of data in the history system format but it's incompatible with the latest versions of ABC and Dublin Core. Let's rewrite the software that generates it so it complies with their latest specifications".

The problem with adopting this second approach is that we aren't really demonstrating the utility of the semantic web. Now the history system may be sufficiently broken that it's just not possible to use the first approach. Alternatively the SW tools available may not yet be sufficiently advanced to support the first approach. However ideally I think we ought to at least explore the alternatives that try to follow approach one, assuming this is possible within the time constraints.

So ideally I would like to see the descriptive note discuss more alternative solutions and then evaluate those solutions. At the moment it just describes a single solution. The outcome of the document may still be the same, i.e. the approach we use to solve the problem, but I think a bit more of the thinking about how we arrived at this point could be made explicit. I would like to illustrate this by concentrating on the use of namespaces in the current DSpace history system.
If I understand the document correctly, one of the criticisms made of the current DSpace history system is that it uses eight different namespaces to refer to what are effectively different classes. There are a number of reasons why this is undesirable:

- the classes all belong to the same conceptualization, or to use the jargon "maintain the same ontological commitment". Therefore common practice is to use a common namespace to indicate this.

- the document notes that if the history system was to use certain well known schemas, e.g. Dublin Core and ABC, then it is possible that processors might know something about those schemas and be able to process this information.

However, a lot of articles that discuss why we need the semantic web describe how the SW will allow things to work together automagically. My guess is the enabling technology for this is automated discovery, by which I mean some mechanism that a processor can use to automatically configure itself so it can process a document or model. So next I will outline several different approaches to automated discovery (or "processing models") that can be applied to RDF, and then consider how they might be used to solve some of the issues outlined in the document.

(Schema discovery via namespace processing model)

"The processor gets a piece of RDF, inspects the namespaces and tries to retrieve the schema from the namespace. If it can retrieve the schema, it processes it and is able to map the DSpace information into another schema that it is familiar with, e.g. Dublin Core."

Now I have a lot of sympathy with the processing model (PM) above (a small sketch of it appears below), but in fact it turns out to be quite controversial, because namespaces do not indicate schemas. Just because a piece of RDF defines a namespace with a URI that uses HTTP, this doesn't mean HTTP can be used to retrieve an RDF schema that gives you more information about that RDF. This is because:

- RDF does not formally require this. We could overcome this by formally requiring it for our RDF application (by application I mean a usage of RDF, rather than a piece of software), but how does the generalised processor know we've done this?

- if there is nothing at that URI, the only way the processor will determine this is via a timeout, which will cause any requests that are invoking the processor to fail also.

- it is not clear which resource you should have at the HTTP address, e.g. an XML Schema, an RDF Schema, an OWL ontology etc. See [2] and [3] for related discussions.

I observe there is some disagreement amongst the SW community on this: e.g. in conversation with Tim Berners-Lee it seems his implicit assumption is that a namespace should point to a schema, whereas I remember Dan Connolly expressing the opinion that RDF must be processable as is, without the schema. Furthermore it seems to me the recent RDF Core datatyping decision - that datatypes must be declared explicitly in the RDF instance data, rather than defined in the associated RDF schema - was arrived at from the latter viewpoint.

There have also been proposals about how to overcome this, e.g.:

- use URNs if a namespace just indicates identity, whereas use HTTP if it indicates identity and points to additional resources.

- the RDF graph could define "processing instructions" that indicate how to process it. CC/PP does this for some things but not all, as it provides a method for subgraph inclusion called defaults.
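(An aside: to make the failure modes concrete, the sketch below shows this first processing model in Python. The list of namespaces is assumed to have been extracted from the RDF already, and none of the names come from any real processor.)

import urllib.error
import urllib.request

def discover_schema(namespace, timeout=2):
    # Naive "schema discovery via namespace": try to dereference the
    # namespace URI over HTTP and hope something useful is there.
    if not namespace.startswith("http"):
        return None                      # identity only, nothing retrievable
    try:
        with urllib.request.urlopen(namespace, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "unknown")
            body = response.read()
    except (urllib.error.URLError, OSError):
        # Nothing at the URI, or it is unreachable: the only way we find
        # out is by waiting, which stalls whoever invoked the processor.
        return None
    # Even on success we do not know what we have retrieved - an XML Schema,
    # an RDF Schema, an OWL ontology, or just an HTML page about the
    # vocabulary - so the caller still has to sniff content_type and body.
    return content_type, body

# e.g. for the namespaces found in a piece of history-system RDF:
for ns in ["http://purl.org/dc/elements/1.1/",
           "http://example.org/WebLibraryTerms#"]:
    result = discover_schema(ns)
    print(ns, "->", "nothing usable" if result is None else "retrieved " + result[0])

Everything that can go wrong here - the timeout, the empty URI, the wrong kind of resource coming back - is one of the objections listed above.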
Of course, applications defining their own "processing instructions" would not be sufficient, as these processing instructions would need to be standardised in order to support automated discovery. Let's try to concretise this with some other processing models:

(Manifest discovery via namespace processing model)

"The processor receives a piece of RDF and inspects the namespaces used. The processor also knows what data languages* it supports, e.g. RDF Schema, OWL, XForms, XML Schema, XSLT etc. It tries to retrieve information from the HTTP address indicated by the namespace, performing content negotiation so that it retrieves all the resources that it can process. This solves the problem of needing to know what type of resource should be at the namespace URI, as the processor retrieves any that are useful to it. The processor then uses these resources to try to help process the RDF."

(* "data languages" probably isn't the best term here. This is similar to Rick Jelliffe's proposal in [3].)

(An aside: this processing model is probably a bit controversial as it admits XML based languages to the SW stack, and the SW folks often argue that we need to replace all the XML in the world with RDF. I disagree with this, especially as currently we have a bunch of useful tools that use XML and a bunch of tools that use RDF. Re-engineering all the XML tools to be written in RDF will take years, so let's see if we can tweak them so they work together.)

(Schema discovery via processing instruction processing model)

"The processor receives a piece of RDF, and inspects the RDF model for statements using the swpm (semantic web processing model) namespace. These statements give processing instructions about how to process the model. The processor follows these instructions, e.g. retrieves the relevant schemas. It then uses this information to process the RDF as outlined in the previous PMs."

(Two asides: first, as is probably becoming obvious now, there is great potential for these processing models to be inter-mixed. Second, we can use the processing instruction processing model and manifests to leverage XML for the SW: if we just add processing instructions to XML we can keep our data in XML, but the processing instruction points at a manifest that includes an XSLT stylesheet that converts the XML to RDF/XML, so the data is now SW compatible. Via manifests, it can also retrieve a large bunch of other resources.)

(Schema discovery via namespace with transport dependence processing model)

"When the processor receives a piece of RDF, it inspects the namespaces used. If a namespace starts with HTTP, this indicates a resource is retrievable from that address. If it starts with another transport, e.g. URN, then the processor regards the namespace as simply defining identity. In the event of a retrievable resource it retrieves it and uses it to process as necessary."

Other reasons why we need automated discovery

One of Tim Berners-Lee's dictums for good design on the web is "good URIs don't change". This causes problems. Let's say I create a schema but it's wrong, e.g. it's not compliant with a "clarification" in RDF. However I've published it, so I can't fix it because I can't change the contents of the URI. So what are my options? I can republish all my data and schema so it is correct using a new namespace. Alternatively I can just say to the people with the RDF processors "it's your problem, you deal with it".

Consider another problem.
Let's say a new format comes along which ends up dominating the user base. I may want to add information to my schema that explains how to map my data to that format. However I can't get at the schema of the new format (because I don't own it), and I can't change the contents of my URI to change my schema. In light of these issues, is this advice right? There is a whole host of issues here that is probably beyond the scope of this document.

The point is, say we fix the system as we outline, but then a new version of Dublin Core or ABC is released. Do we have to recode the history system again? At the moment, yes, because we can't add additional data to the schema once we've created it. Due to the "good URIs don't change" advice, it's now cast in stone. This is why we need to consider the other processing models. Another alternative is to use dereferencing URIs as PURL does:

(Schema discovery via dereferenced namespace)

"The processor receives a piece of RDF, and inspects the namespaces used. It queries each namespace against an intermediate server that stores the dereferences. The server could be identified via the namespace, e.g. as in PURL, or some other approach could be used. The dereference points to a particular schema, optionally on another server. This server could contain several dated versions of the schema, but the dereference just points to the most up to date one."

Then if we want to update the schema so it has additional information that maps it onto a newly released version of Dublin Core, we can do so, because the contents of the URIs never change, but the contents of the dereferenced URIs do. Or to put it another way, I think TBL's dictum is too draconian: we may have URIs on the web that change and those that don't, we just need an explicit way of distinguishing between those two types of URIs.

OWL's processing model

In OWL, the processor loads an OWL ontology that can use includes to load other OWL ontologies, and it then has data about those ontologies. But that's it: there's no way to automatically load ontologies on demand, it has to be explicitly configured. Now I may be wrong here as I'm not an expert on OWL, but my guess is this design decision is deliberate, because you can't just combine ontologies arbitrarily, you need to do consistency checks first. Typically this is done at ontology creation time (see OilEd) as there is a large processing overhead associated with it. Of course in RDF you don't need to do these consistency checks prior to combination because the model theory avoids inconsistencies. OWL may change in the future, but this is another processing model.

In fact, it's the model I use in DELI, because we found that most people publishing RDF schemas just got them totally wrong, and the people producing instance data just seemed to make up namespaces as they went along. So instead we loaded all the information we needed up front, and also defined some equivalences so we could deal with the most commonly encountered mistakes in the instance data, e.g.:

(Start-up schema load processing model)

"The processor loads a set of schemas at start-up time. When it receives RDF, it makes a best attempt to process it. If it recognises it via the start-up schemas, it processes it. If not, it tries to process it, but at the end of the day, if the schema is not recognised, responsibility passes to the application sitting on the processor. However it is fairly easy to reconfigure the processor to deal with new schemas, it's just a matter of changing some kind of configuration script. This allows whoever is configuring the processor to do some kind of "quality control" on the schemas."
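(An aside: a minimal sketch of this start-up model in Python. The configuration layout, the equivalence table and the class name are invented for illustration; this is not DELI's actual code.)

# Hypothetical start-up configuration: which schemas to pre-load, plus
# equivalences for the mistakes most often seen in instance data.
CONFIG = {
    "schemas": {
        "http://purl.org/dc/elements/1.1/": "cache/dublin-core.rdfs",
        "http://example.org/history-update#": "cache/update-schema.rdfs",
    },
    "equivalences": {
        # commonly seen wrong/legacy property -> property we actually use
        "http://example.org/foo#articleTitle": "http://purl.org/dc/elements/1.1/title",
    },
}

class StartupProcessor:
    def __init__(self, config):
        # Record the configured schemas once, at start-up (a real processor
        # would parse the cached files); adding a schema is a config edit,
        # not a code change.
        self.known_namespaces = set(config["schemas"])
        self.equivalences = config["equivalences"]

    def process(self, triples):
        recognised, unrecognised = [], []
        for s, p, o in triples:
            p = self.equivalences.get(p, p)          # apply local fix-ups
            if any(p.startswith(ns) for ns in self.known_namespaces):
                recognised.append((s, p, o))
            else:
                unrecognised.append((s, p, o))       # best effort only
        # Responsibility for the unrecognised part passes to the application.
        return recognised, unrecognised

proc = StartupProcessor(CONFIG)
ok, unknown = proc.process([
    ("http://dspace.org/item/1", "http://example.org/foo#articleTitle", "Some title"),
    ("http://dspace.org/item/1", "http://made-up.example/ns#weird", "???"),
])
print(len(ok), "recognised,", len(unknown), "handed back to the application")

The "quality control" is then whatever review happens before a schema or fix-up is added to the configuration.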
Okay, so I've proposed a lot of ideas here. So how does this map back onto the history document? Well, we can solve the "usage of external schemas", "duplicate properties" and "usage of outdated Harmony properties" issues in a number of ways:

i) we modify the code to change the namespaces to the official DC and ABC namespaces and to use the updated Harmony properties, i.e. the approach proposed in the document.

ii) we add a processing instruction to the RDF generated by the history system. Of course the processing instructions need to be standardised, but that's a side-issue. This processing instruction points at a piece of RDFS or OWL that resolves the three issues above. Let's call this the "update schema".

iii) the processor could look up any of the namespaces used in a "schema namespace server". This server would know that these namespaces are defined in the "update schema", so it returns that to the processor.

iv) the processor uses start-up schema loading, so we just make the "update schema" available and it is then the responsibility of the person configuring the processor to add that schema to the start-up configuration.

So the history system document has decided to go with approach i). I think with approaches ii), iii) and iv) there are two questions we can ask:

a) is RDFS or OWL sufficiently rich that we can solve the "usage of external schemas", "duplicate properties" and "usage of outdated Harmony properties" issues? My guess is OWL can probably do the first two, although with RDFS it is harder, as RDFS cannot define equivalences, only subclasses and subproperties. Arguably these are not the same, as they are not symmetric. I'm not so sure about what the outdated Harmony properties involve though, so I can't make a call on whether this can be solved with OWL or not.

b) assuming we can map between the data formats declaratively, what are the pros and cons of approaches i), ii), iii) and iv)? As a result of this, which is the best approach? (I guess this is a general question for the RDF community.)

However, this leaves us with seven other issues (lack of type information, empty or missing properties, expressions of qualified properties, relationships expressed using local identifiers, usage of local URIs, formatted text in property values, and references to non-existent states) that it is not possible to solve this way, but this is okay, as these issues seem to be more along the lines of "things that are broken" rather than "things that have changed, that we ought to be able to fix with the SW tools".

[2] http://www.intertwingly.net/stories/2002/09/09/gentleIntroductionToNamespaces.html
[3] http://www.xml.com/pub/a/2001/01/10/rddl.html

Dr Mark H. Butler
Research Scientist
HP Labs Bristol
mark-h_butler@hp.com
Internet: http://www-uk.hpl.hp.com/people/marbut/
Received on Friday, 9 May 2003 10:38:24 UTC