Re: Introducing myself - SOA organised with RDF from Frank Carvalho on 2007-08-22 (semantic-web@w3.org from August 2007)

From: Frank Carvalho <dko4342@vip.cybercity.dk>
Date: Wed, 22 Aug 2007 15:22:37 -0700 (PDT)
To: semantic-web@w3.org
Message-ID: <12283991.post@talk.nabble.com>

Hi, and thanks already for some very good and relevant answers.

Richard Newman wrote:

>   I would very much suggest using a dedicated RDF store (any one  
>would do), rather than storing the XML serialization of the RDF graph  
>in an XML database. You will gain the ability to run queries against  
>the graph, rather than just one of its possible tree serializations,  
>and your scalability problem goes away (for a while, at least).

Well, I don't really understand if there is any theoretical difference
between querying the XML serialization and the graph itself, if the
serialization is in fact a representation of the graph. What do you mean
when you say "tree serialization", BTW? The only serialization I work with
is a large set of triples. 
I do reckon though that a dedicated store of course may be a lot more
efficient than a general purpose XML database.

>   cwm is not really designed for large-scale storage. 

No, I kind of suspected that from its behaviour. It's really a shame. It
was described as a sort of RDF Swiss army knife, and on small graphs it seems
to be able to merge graphs nicely. But when I started to load large graphs,
it came up with odd errors. 

>Take a look at  
>this list of alternative systems on the ESW Wiki:
><http://esw.w3.org/topic/SemanticWebTools#head-805c63479c854babe4657d5184de605910f6d3e2>
>
>   If you're dealing with large graphs (>100M triples), you might find this
> list useful.
>
><http://esw.w3.org/topic/LargeTripleStores>

Very helpful, thank you. I will take a look at those. Eventually I suspect
we will be using very large graphs. The current ones are perhaps up to 20M,
but given all the tasks we plan on using the graphs for, we are likely to
increase this number significantly.

>   If you need to do reasoning on large graphs, your choices are more  
>limited, and the kind of reasoning you want to use might dictate your  
>solution. (I won't reveal any biases on a public forum :D)

In fact we don't need reasoning so much yet. It is the "resource
description" aspect that currently has the biggest importance for us. We
need to be able to do a lot of forward and backward chaining, but if I am
not mistaken that really is not the same as reasoning. I do expect to assign
some proper OWL interpretations to the UML class diagrams - and probably the
contents of the entire modelling tool we're using - some day, but as I don't
really see how I can explain any specific benefits to my organization by
doing so, that idea has a low priority right now. (I drive this project by
visible benefits).

Brian McBride also wrote, and thank you also, Brian, for your inspiring
answers. 

Comments to the comments:

>> Second we are facing a challenge of controlling our
>> suppliers, rather than being controlled by them.
>
>I'm wondering what you mean by control there.  It is well known that if
>a customer invests heavily in implementing systems that depend on the
>characteristics of system components, e.g. using proprietary data
>formats or APIs, then this creates a barrier to changing suppliers.  I
>was expecting you to write that because RDF is based on standards, it
>would be in customer's interests to promote its use to give them the
>flexibility to change supplier.  But that's not what you wrote ...

What I meant was that the cooperation between us as customers and our suppliers
traditionally has been on the terms of the suppliers. Our organization has a
lot of business knowledge, but very little professional IT experience. So
historically the suppliers have had success convincing the organisation to
buy suboptimal solutions at too high a price. Our department is there to
change that, and to professionalize us as customers. "Control" was perhaps a
bad word. "In charge" would have been better. 
Technically what we do is to establish well-defined web services between the
many systems we have, and our SOA infrastructure. We have no intentions of
dictating the internal designs of the systems - some of them are old COBOL
systems anyway. In a SOA, the systems are characterized entirely by their
interfaces  - as a black box. So we only dictate the interfaces and leave
the internal system design to the suppliers (roughly speaking). 
We rather want to collect the documentation of those (heterogeneous)
solutions, connect the documentation to the main graphs (by generating more
bits of RDF from the documentation), and thus enable impact analysis into
the system. The purpose is, of course, to be able to assess the extent and
cost of changes, by analysing the amount of derived change it may require.
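
As a rough illustration of the impact analysis described above, here is a
minimal Python sketch. It assumes the metadata has already been reduced to
(subject, predicate, object) triples, and that dependencies are recorded
with a single predicate. The dependsOn predicate and all the example.org
URIs are invented for the example, not taken from any real model.

```python
from collections import defaultdict, deque

# Hypothetical vocabulary and node URIs, invented for this illustration.
DEPENDS_ON = "http://example.org/meta#dependsOn"
SYS = "http://example.org/sys/"

TRIPLES = {
    (SYS + "Portal", DEPENDS_ON, SYS + "Billing"),
    (SYS + "Billing", DEPENDS_ON, SYS + "CustomerLookup"),
    (SYS + "CustomerLookup", DEPENDS_ON, SYS + "CobolCore"),
    (SYS + "Archive", DEPENDS_ON, SYS + "Reporting"),
}


def impacted_by(triples, changed, predicate=DEPENDS_ON):
    """Return every node that depends, directly or indirectly, on `changed`."""
    # Index the edges backwards: object -> set of subjects pointing at it.
    dependents = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            dependents[o].add(s)

    # Breadth-first walk along the reversed edges.
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(changed)  # only relevant if the graph contains a cycle
    return seen


if __name__ == "__main__":
    for node in sorted(impacted_by(TRIPLES, SYS + "CobolCore")):
        print(node)
```

The size of the returned set gives a first, crude measure of how far a
change to one node may ripple. A real cost estimate would need weights per
node or per edge, which this sketch leaves out.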

>Ah right.  I think there are number of existing solutions that do this -
>though not using RDF - e.g. IBM's metadata server.  Have you looked at
>that.  Is there something missing from that solution that RDF would
>address?

We are frequently being contacted by vendors of metadata systems, and I am
not surprised that IBM also has such a product. We are using Telelogic
System Architect here. Also, our experiences with suppliers of metadata
repositories have been very bad so far.

However, my main concern is to avoid vendor lock-in and proprietary internal
formats. To do so I believe it is paramount to use open portable standards
to carry the meta-information. This is where RDF comes in. RDF is easier to
migrate between platforms, using the same core graphs. And it will be
much easier to integrate different sources of metadata, without proprietary
point-to-point system integrations. Currently we have a big issue here about
carrying information between Systinet Information Manager and Telelogic
System Architect. A direct integration may prove costly. But if both tools
had an import/export facility for RDF, they could at least add useful
information to the same pool of metadata. I am sure I could extract all
essential information from both tools entirely into RDF and make it useful
in other tools. It will not solve all problems as that information still has
to be interpreted to be useful, but even so, if two different tools share
nodes, their graphs will be able to connect, and new information can be
extracted. I think it is a big step in the right direction.
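
To make the point about shared nodes concrete, here is a small Python
sketch. It assumes two hypothetical exports, one from a service registry
and one from a modelling tool, that happen to use the same URI for the same
system. The vocabulary (implementedBy, ownedBy) and all URIs are invented,
and blank nodes are ignored, since they cannot be merged by a plain set
union.

```python
# Hypothetical vocabulary, invented for this illustration.
EX = "http://example.org/meta#"
SERVICE = "http://example.org/svc/CustomerLookup"
SYSTEM = "http://example.org/sys/Billing"
DEPARTMENT = "http://example.org/org/Finance"

# Export from a (hypothetical) service registry.
registry_export = {(SERVICE, EX + "implementedBy", SYSTEM)}
# Export from a (hypothetical) modelling tool.
modelling_export = {(SYSTEM, EX + "ownedBy", DEPARTMENT)}


def merge(*graphs):
    """Merge triple sets; triples about the same URI simply end up together."""
    merged = set()
    for graph in graphs:
        merged.update(graph)
    return merged


def nodes(graph):
    """All subjects and objects occurring in a set of triples."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


def owners_of_service(graph, service):
    """Follow implementedBy, then ownedBy, starting from a service."""
    systems = {o for s, p, o in graph
               if s == service and p == EX + "implementedBy"}
    return {o for s, p, o in graph
            if s in systems and p == EX + "ownedBy"}


if __name__ == "__main__":
    print(nodes(registry_export) & nodes(modelling_export))
    print(owners_of_service(merge(registry_export, modelling_export), SERVICE))
```

Neither export alone can say which department owns the system behind a
service. The merged graph can, because both exports refer to the same node.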

>It is important to bear in mind that its best to think of RDF in terms
>of its abstract syntax, i.e. a graph of nodes, rather than the RDF/XML
>concrete syntax.  

Well, this is also how I think of it. 

>There are a number of systems around that will store
>significant numbers of RDF triples in a relational store.  We do one,
>Jena (http://jena.sourceforge.net) and there are others - sesame,
>mulgari, redland, etc.  I'd strongly suggest you take a look at these,
>or, if you really feel an XML database is the way to go - I'd like to
>understand why.
>
>An issue with using XML is that that same RDF graph can be represented
>many different ways in RDF/XML.  This would make your queries dependent
>on the particular way that an RDF/XML document happened to represent a
>graph - and that's just - well - wrong - you would be programming to an
>inappropriate level of abstraction.

Yes, this is my experience too.
It took me some time to understand the different weird RDF/XML notations I
found in the W3C specification, until I started to see them as "syntactic
sugar", which in turn means that each block of RDF/XML could be broken down
into a number of simple triples. After realising that, I started to ignore
the more "user-friendly" syntaxes of the W3C spec, and stick to the simplest
form.
In fact I always reduce the graphs to simple form before I load them into
the database. I looked at cwm mainly to see if it could work as a tool to
break the graphs down into triples. My first attempts to break down compound
expressions into triples with XQuery were not successful, so currently I'm
doing it externally before I ever load the data into the store.
The RDF I generate myself is always RDF/XML in its simplest form - as
triples - and the way I use the XML database is from the assumption that I
deal with triples exclusively. This makes it much easier to build sensible
database indexes in the XML database, where you index the node ids.
Performance is not spectacular, but is currently at an acceptable enough
level to be useful. 
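
A minimal Python sketch of what "simplest form" buys you, under stated
assumptions: every rdf:Description carries an rdf:about attribute, and every
property element holds either an rdf:resource attribute or plain text. It
handles none of the abbreviated RDF/XML syntaxes, the sample data is
invented, and it only illustrates the idea of indexing by node id. It is
not how eXist builds its indexes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Invented sample data in the assumed "one triple per property" form.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/meta#">
  <rdf:Description rdf:about="http://example.org/sys/Billing">
    <ex:dependsOn rdf:resource="http://example.org/svc/CustomerLookup"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/sys/Billing">
    <ex:label>Billing system</ex:label>
  </rdf:Description>
</rdf:RDF>"""


def read_simple_triples(xml_text):
    """Read triples from RDF/XML that is already in the simple form."""
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; the predicate
            # URI is the namespace followed by the local name.
            predicate = prop.tag[1:].replace("}", "", 1)
            resource = prop.get(f"{{{RDF}}}resource")
            # Note: this sketch does not record whether the object was a
            # URI or a literal; both end up as plain strings.
            obj = resource if resource is not None else (prop.text or "")
            triples.append((subject, predicate, obj))
    return triples


def index_triples(triples):
    """Build lookup tables keyed on subject and on object."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for triple in triples:
        by_subject[triple[0]].append(triple)
        by_object[triple[2]].append(triple)
    return by_subject, by_object


if __name__ == "__main__":
    by_subject, by_object = index_triples(read_simple_triples(SAMPLE))
    for triple in by_subject["http://example.org/sys/Billing"]:
        print(triple)
```
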
I use an XML database (eXist) mainly because I have a long history with XML,
XSD, XSL, and now XQuery, so I can use knowledge I already have. It is also
because of the portable nature of XML, XSL and XQuery, and the numerous
products supporting XML (the vendor lock-in issue again), and because I like
how the database integrates with web browsers, and is easy to load and
maintain.
In any case, as long as the core data are triples, I think that a move from
RDF/XML to a dedicated RDF store can be done at any time, should it be
necessary for performance reasons.
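
One way such a move could look, sketched in Python: once the data are plain
triples, they can be written out as N-Triples, a line-based format that RDF
stores commonly accept for bulk loading. The sketch assumes each triple
carries a flag saying whether the object is a literal, handles only plain
literals (no datatypes or language tags), and leaves non-ASCII characters
as they are. The example URIs are invented.

```python
def nt_escape(text):
    """Escape a plain literal for use inside N-Triples double quotes."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t"))


def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_literal) tuples."""
    lines = []
    for s, p, o, o_is_literal in triples:
        obj = '"' + nt_escape(o) + '"' if o_is_literal else "<" + o + ">"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Invented example data.
    data = [
        ("http://example.org/sys/Billing",
         "http://example.org/meta#dependsOn",
         "http://example.org/svc/CustomerLookup", False),
        ("http://example.org/sys/Billing",
         "http://example.org/meta#label",
         "Billing system", True),
    ]
    print(to_ntriples(data), end="")
```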

>I'd be very interested in talking with you; I'm happy to share our
>experience with you and am hoping to learn more about your applications
>and requirements to aid in our development efforts.

Well, I also hope we can continue this discussion. I have already gotten
some useful links.

Best

Frank Carvalho
Received on Wednesday, 22 August 2007 22:22:41 UTC