- From: Karsten Otto <otto@math.fu-berlin.de>
- Date: Fri, 12 Jul 2002 16:41:37 +0200 (CEST)
- To: Brian McBride <bwm@hplb.hpl.hp.com>
- cc: www-rdf-comments@w3.org
Hello!

Short answer:

I vote in favor of Test D. When trying to find information on the
Semantic Web, I prefer to get exact results, not some fuzzy almost-match.
However, I am also aware of the importance of Test A1. I believe the
enforcement of canonical data representation provides a solution here.
Special cases like Test A1 will work, and search engines do not have to
"know" thousands of possible data types for simple comparison. This may
not look pretty in the triple dumps, but then RDF is primarily intended
for machine processing.

Long answer follows...

> Let's explain the basic ideas behind our approach to datatyping. The aim
> is to define how datatype values, e.g. integers, dates etc should be
> represented in RDF. We are building on the XML Schema datatypes
> specification.

To answer this question, I want to distinguish between smart
applications (agents etc.) that understand certain vocabularies, that
is, have intrinsic knowledge about them, and generic applications
(search engines etc.) which do not and therefore have to work solely
with the available triples.

I assume a scope of generic applications, especially search engines, for
this discussion. I believe they are crucial for the success of the
Semantic Web, because they will enable people (and agents, for that
matter) to find the information they are looking for. Scattered pieces
of information that are formally correct according to some style do not
help if they cannot be found and used because they do not match the
style of the request.

> It is important in getting the semantics correct that we distinguish
> between a datatype value, e.g. the integer 10 and a lexical representation
> of the value, e.g. the string "10".

In that light, my answer also depends on the restriction to canonical
representation styles for literals, that is, if two values are equal,
their string representations must be equal too.

>Test A:
>
>   <Jenny> <ageInYears> "10" .
>   <John> <ageInYears> "10" .
>
>Should an RDF processor conclude that the value of the ageInYears
>properties for Jenny and John are the same?

Yes, but only with a canonical representation style. In that case the
use of the same property means the same range type, whatever it is,
which in turn means the same canonical representation style. Thus, as
the literals have the same representation, they also denote the same
value.

No, if we do not have a canonical representation style. In that case,
there is insufficient data. We cannot determine the type or actual value
of the literal. Remember that <ageInYears> looks like <vhfgdj> to a
generic application, and its values might as well be encoded in binary,
for example.

>Test A2:
>
>   <Jenny> <ageInYears> "10" .
>   <Jenny> <testScore> "10" .
>
>Should an RDF processor conclude that the value of Jenny's ageInYears
>property is the same as the value of Jenny's testScore property?

No, in any case. Again we have insufficient data. <ageInYears> and
<testScore> could very well have completely different range types. We
cannot determine whether the given literals have the same canonical
representation style, and thus cannot compare them.

Note that we could compare them if we knew the range types. If both
were, for example, known to be decimals, we could compare them and find
them to be equal.

>Test A3:
>
>   <Jenny> <ageInYears> "10" .
>   <Film> <title> "10" .
>
>Should an RDF processor conclude that the value of Jenny's age property is
>the same as the value of the Film's title property? If the value the
><ageInYears> property is an integer, and the value of the <title> property
>is a string, they are not the same thing and are thus not equal.

No, again we have insufficient data.

Note that a smart application could do this, however. Assume it knows
the meaning and type of <ageInYears> and <title>. Then it could
"typecast" the <title> value to an <ageInYears> value (a decimal in this
case) and see whether they match.

>Now for a different kind of test.
>How do the values of the two idioms relate?
>
>Test D:
>
>   <Jenny> <ageInYears> "10" .
>   <ageInYears> rdfs:range xsd:decimal .
>
>   <John> <ageInYears> _:a .
>   _:a xsdr:decimal "10" .
>
>Should an RDF processor conclude that Jenny and John have the same
>age? [Note: in this example the range constraint is expressed using
>rdfs:range. We may have to introduce a special datatyping range property,
>but that is an independent detail for now.]
>

Yes, we know the literals have the same type. Besides, if we have two
ways of expressing type, of course they should be treated as equivalent.

If we have canonical representation styles, we can compare the literal
strings and conclude the value is the same.

Otherwise, we have to know the given data type xsdr:decimal, that is, we
must know how to parse its various representations, and compare the
resulting value as appropriate for this type. This is easy for simple
types, but just think of some xsdr:date...

Conclusion:

I vote for strong data typing, while enforcing the use of canonical
representation styles. That way, search engines can provide exact and
type-safe query results. At the same time, they can safely compare
values by their string representation, without having to really
understand a given type. If we have an extensible type system, this
ability is crucial for handling a potentially very large number of
obscure and complex data types.

Cheers,
Karsten
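
Illustration of the canonical-representation argument: the following is
only a rough sketch in Python, not part of any RDF or XML Schema
specification. The function names canonical_decimal and
same_value_generic are hypothetical, and the canonical form chosen here
(no sign for positives, no leading or trailing zeros) is a simplifying
assumption rather than the official XML Schema canonical form. It shows
that once a publisher canonicalizes literals, a generic application can
compare them as plain strings without knowing the type.

```python
from decimal import Decimal


def canonical_decimal(lexical: str) -> str:
    """Hypothetical canonicalizer: map a decimal lexical form to one fixed string.

    Assumed canonical style (a simplification, not the XML Schema rule):
    no leading '+', no leading zeros, no trailing fractional zeros.
    """
    value = Decimal(lexical.strip())
    text = format(value, "f")  # fixed-point notation, no exponent
    if "." in text:
        # Drop trailing fractional zeros, then a dangling decimal point.
        text = text.rstrip("0").rstrip(".")
    if text == "-0":
        text = "0"
    return text


def same_value_generic(a: str, b: str) -> bool:
    """What a generic application can do: compare literal strings only."""
    return a == b


# Without canonical forms, equal values may look different to a generic tool.
print(same_value_generic("10", "10.0"))

# With canonical forms enforced at publication time, string equality suffices.
print(same_value_generic(canonical_decimal("10"), canonical_decimal("10.0")))
```

The type-specific parsing lives only in the canonicalizer, which the
publisher of the data runs; the comparing side needs no knowledge of the
type at all.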
Received on Friday, 12 July 2002 10:41:41 UTC