- From: Frank Manola <fmanola@acm.org>
- Date: Wed, 22 Feb 2006 10:37:29 -0500
- To: Misha Wolf <Misha.Wolf@reuters.com>
- CC: Semantic Web <semantic-web@w3.org>, newsml-2@yahoogroups.com
Misha Wolf wrote:
> Hi Frank,
>
> Thanks for your long and thoughtful mail of 18 Jan 2006 and
> apologies for the delay in my reply, caused by the usual syndrome of
> too much to do and not enough hours in the day (or night).
> Yesterday I heard on the radio about some new pills which cut down
> the amount of sleep one requires.  Maybe I need to get a pack ...

Misha--

Thanks yourself. And I'm sorry not to have responded sooner myself. As usual, other stuff intervened, and I needed to do some additional research anyway. I don't think the pills you mention would be a good idea in any case: if you used them, you probably would then need to take other pills to go to sleep!

Your message clarified (I think!) some things for me about the issues you raised. First, some general comments.

It seems to me what you're looking for is basically some "schema design" or data organization discussion here, i.e., how to handle application-related situations X and Y. The SWBP group is doing some of that, but their stuff is pretty generic. I agree that this would be one forum where such discussions could take place. But the details of the specific problems need to be laid out.

You note that no one but the news industry has apparently embraced the Semantic Web to this extent. Unfortunately, the result of pioneering is sometimes that you run into uncharted territory, even for dealing with what appear to be simple and generic things. I think the situation is probably that different people have worked out ways to do some of these things. What may well be the case, though, is that there aren't fully-worked-out "best practices" or general agreements (let alone "standards") about how to do some of the things you need to do. That means there will be many opinions and alternatives. That may ultimately be the case anyway: we're essentially talking "information engineering" here, and what is a good solution given one set of constraints may be less than ideal in a different situation.
I have to say this seems natural. The situation with respect to the Semantic Web can be compared, it seems to me, with the situation that used to exist with respect to databases. The RDF and OWL specs are more-or-less comparable to the definition of the relational model for databases. But the definition of the relational model itself didn't provide much guidance as to how to design relational databases to handle specific data modeling requirements. A lot of additional work, some of it generic (e.g., working out the various normal forms), and some of it specific to certain applications, was necessary to work those additional details out. But even now (unfortunately in my opinion), it's hard to find repositories of good database design practice for various situations, except for the most obvious ones. The same situation is true for the Semantic Web (with the Web hopefully making it easier to share design experience than was true decades ago). While it may be unfortunate from the point of view of someone like yourself who has to get a design task accomplished, I wouldn't have expected all the variety of requirements you're faced with to have been worked out ahead of time (and certainly not by the W3C!).

Now to more specific comments:

> I'm responding here to two of your points.
>
>>Why don't you explain a little more (in particular, say why you
>>think NewsML is the most important application to enter the
>>Semantic Web arena), if you think it's that worthwhile as a matter
>>of general interest?
>
> The News Industry (ie the News Agencies, represented by the
> International Press Telecommunications Council) has decided to
> develop a standard for B2B News exchange which is designed to be
> compatible with the Semantic Web.  I know of no other industry whose
> entire output will form part of the Semantic Web.  Do you?

No, and you raise a good point here.
However, I don't recall this point having been made this clearly before, which may affect how much attention it got/gets.

>>First, I've reviewed some of your past email on NewsML.  It seems
>>to me that, to the extent you explicitly asked for help, you got
>>it (I recall a reification question; were there others?).
>
> This is absolutely correct.  My concern is that something as
> fundamental as how to express:
>
>  - who said that the subject/genre/creator/etc of the story/picture/
>    video/etc is foo?
>
>  - when did they say it?
>
>  - with what confidence did they say it?
>
>  - etc
>
> seems to still be at the level of "Well, you could do it like this,
> or you could do it like that".
>
> Forgive me for being naive, but I do not comprehend how it will be
> possible to formulate successful queries along the lines of:
>
>    Find me all stories about which X says with more than 70%
>    confidence that they have subject Y.
>
> if there isn't an agreed way of making these assertions, supported
> by an adequate range of off-the-shelf tools.

I think I need some additional information (or clarification) here. It seems to me that when you have a specific way of recording information about such things as "regarding story FOO, X says with more than 70% confidence, that it has subject Y", you can certainly formulate queries on such NewsML information (or RDF derived from it), and you could use, e.g., SPARQL tools to do it. It's as if you'd designed a particular database schema to represent that information, and given that you have such a schema, you can query a database designed in accordance with that schema. But it seems to me that you're asking for more than this, namely that everyone in the Semantic Web should agree on how to represent such assertions, not just that NewsML needs to have a way of doing it.
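
To make the schema analogy concrete, here is a minimal sketch in plain Python rather than SPARQL. It assumes one hypothetical way of recording "who said it, when, and with what confidence"; the record layout, field names, identifiers, and sample data are all invented for illustration and are not taken from NewsML or any RDF vocabulary. The point is only that once some representation is fixed, the 70% query becomes straightforward.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SubjectAssertion:
    """Hypothetical record: who says a story has a subject, when, how confidently."""
    story: str
    subject: str
    asserted_by: str
    asserted_on: date
    confidence: float  # 0.0 to 1.0


def stories_about(assertions, asserted_by, subject, min_confidence):
    """Return the stories that `asserted_by` tags with `subject`
    at a confidence strictly greater than `min_confidence`."""
    return sorted({
        a.story
        for a in assertions
        if a.asserted_by == asserted_by
        and a.subject == subject
        and a.confidence > min_confidence
    })


# Made-up sample data, for illustration only.
ASSERTIONS = [
    SubjectAssertion("story:1", "subject:Y", "X", date(2006, 2, 1), 0.90),
    SubjectAssertion("story:2", "subject:Y", "X", date(2006, 2, 2), 0.50),
    SubjectAssertion("story:3", "subject:Y", "Z", date(2006, 2, 3), 0.95),
    SubjectAssertion("story:4", "subject:Q", "X", date(2006, 2, 4), 0.80),
]

# "Find me all stories about which X says with more than 70%
# confidence that they have subject Y."
matches = stories_about(ASSERTIONS, asserted_by="X", subject="subject:Y",
                        min_confidence=0.7)
print(matches)
```

Someone who chose a different record layout would need a different query, which is exactly the gap between "NewsML has a way of doing it" and "everyone agrees on a way of doing it".
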
While it would certainly be a Good Thing to have that global agreement, because then everyone could query everyone's assertions about stories and subjects the same way whether they use NewsML or not, I don't think you're going to have that level of agreement "out of the box" so to speak. But surely you can formulate successful queries on NewsML material once you've decided how to represent that kind of information, can't you?

> There are also a number of more detailed issues on which we've got no
> help at all.  I don't recall on which list we aired them, so it may
> not have been here.  These include:
>
>  - The inability of various RDF-related formats etc to deal with
>    numeric codes.

I don't recall seeing this issue. Is this a reference to the issue described in Section 4.3 in the NewsML Architecture document, e.g., the example from the CURIE document of wanting to use something like iptc:10112244 (your description here doesn't make that entirely clear)? If it is, certainly CURIEs would address this issue, as you note below. However, wouldn't using XML entities also work to help abbreviate the URIs? (This is illustrated in Example 8 in the RDF Primer for use in abbreviating typed literals, and the OWL specs use XML entities for other abbreviations as well.) Here, it's necessary to read your reference to "RDF-related formats" very precisely, since the problem isn't with the RDF model per se, but rather with the use of various notations (RDF/XML in this case) for encoding it. Alternatively, have you looked at PRISM's approach to dealing with controlled vocabularies?

>  - The problem of how to reconcile having 20-30 taxonomies in a
>    document with keeping the document reasonably small.  We have
>    asked about alternative mechanisms for declaring alias/URI
>    correspondence, but all we have got back is:  Use XML Namespaces.
>    This is despite the fact that we are not declaring namespaces
>    for elements/attributes etc, and so do not need to be bound by
>    the constraints specified for those.

Again, I don't recall seeing this issue, and I'm not exactly sure what you have in mind here. Can you give an example or point to something in the NewsML specs that illustrates it? It sounds like entities might be a possible approach here as well, though you might have reasons for not wanting to use them (and, as I said, I'm not necessarily sure of what you have in mind).

> The RDF-in-XHTML task force is well on the path to specifying
> CURIEs, which will address the first of these two concerns, but not
> the second.  Consequently, we are having to invent our own
> declaration mechanism, which is regrettable.

I can certainly see how this seems regrettable from your point of view, but after all *someone* has to invent the declaration mechanism, don't they? And hopefully they would do so on the basis of detailed application requirements, such as those you are bringing to the table, rather than "out of thin air". I hope and believe that this sort of feedback will help improve the whole Semantic Web infrastructure, so from that point of view this isn't at all regrettable, however unfortunate it may seem to you now.

--Frank
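
On the numeric-code point discussed above, the following is a minimal sketch in Python, under stated assumptions, of why a code such as iptc:10112244 is awkward as an XML QName and how either an XML entity or a CURIE-style prefix lookup could abbreviate the full URI. The base URI, the `doc` element, the `subject` attribute, and the helper names are all hypothetical; the name check is a simplified ASCII-only approximation of the XML name rules, and the prefix expansion is plain concatenation, not an implementation of any CURIE specification.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical base URI; the real vocabulary URI is not given in this thread.
IPTC_BASE = "http://example.org/iptc/"

# Simplified, ASCII-only approximation of an XML NCName
# (the local part of a QName): it may not begin with a digit.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def is_valid_qname_local_part(local):
    """True if `local` could serve as the local part of a QName (approximate)."""
    return NCNAME.match(local) is not None


def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' by simple concatenation with the prefix's URI."""
    prefix, _, reference = curie.partition(":")
    return prefixes[prefix] + reference


# An internal XML entity abbreviates the URI inside an attribute value.
DOC = f"""<!DOCTYPE doc [
  <!ENTITY iptc "{IPTC_BASE}">
]>
<doc subject="&iptc;10112244"/>"""

root = ET.fromstring(DOC)
entity_uri = root.get("subject")  # the parser expands &iptc; in the attribute
curie_uri = expand_curie("iptc:10112244", {"iptc": IPTC_BASE})

print(is_valid_qname_local_part("10112244"))
print(entity_uri)
print(curie_uri)
```

Both routes arrive at the same full URI. The entity route needs one declaration per vocabulary in the document's DOCTYPE, so with 20-30 taxonomies the size concern raised in the quoted text still applies to it.
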
Received on Wednesday, 22 February 2006 15:49:39 UTC