- From: Ivan Mikhailov <imikhailov@openlinksw.com>
- Date: Fri, 26 Mar 2010 20:00:14 +0600
- To: "public-rdb2rdf-wg@w3.org" <public-rdb2rdf-wg@w3.org>
Colleagues,

The below contains some highlights of the conversation of March 23, 2010. I then expand on some of these issues and finish with a description of what ought to be proven and what restrictions will be practically necessary for dealing with the impedance mismatch between SPARQL and SQL.

Regards,
Orri

The Issue of Translatability

Can an arbitrary SPARQL query be translated into SQL,

1. against a triple table
2. against a relational schema?

Souripriya affirms so. Orri seconds, but after the experience of doing both for some years, qualifies the statement with certain specifics.

Does such translation and its possibility require formal proof? Souripriya declares that such a proof is trivial. The grounds for this are that since SQL is accepted as expressive enough, the mapping itself consists of column renaming only, and column renaming is so simple that it does not require formal proof. Any schema alignment is expressed as SQL views, derived tables or scalar subqueries, putting all the complexity on the SQL side.

We note that as soon as there are expressions in the mapping, these expressions need defined formal semantics. But if the expressions are on the SQL side, then such semantics are not part of the mapping.

Eric points out that if one transforms an RDB into XML, then applies XSLT to that, and then runs XQuery on the result, it can be hard in practice, even though often possible in theory, to take the XQuery and map it into SQL over the original tables. So, by analogy, if the mapping is very oblique, passing through many SQL complexities, it will be hard to implement a SPARQL-to-SQL translation that can make good SQL.

Ahmed says that if all you do is query one database, you may just as well stay with SQL. The mapping will be needed in cases of integration between multiple, semantically similar but structurally different sources. Having this, one needs to again prove the semantics of the mapping.
Eric asks Richard if one can write queries in D2RQ that will not map into a single SQL statement. Richard says that this is easy; there will be cases where multiple SQL queries will be made and combined in the D2RQ layer. Orri notes that Virtuoso always makes one SQL statement for one SPARQL query, but that the SQL in question uses proprietary extensions for cases like variables in the predicate position. Leaving aside variables in the predicate position, SQL with no or little proprietary extension will result.

Lee asks what the difference between SQL style and RDF style is. Orri says that it is one of syntax flavor. Internally, things will be RDF style, but the RDF style can be inferred from an SQL-looking outward syntax, and an SQL-looking outward syntax is likely to be better accepted by an SQL public.

***

The following is commentary by Orri on the above excerpts of conversation.

The key issue seems to be whether the mapping reduces to column renaming, identifying columns as subject/predicate/object/graph, and indicating whether a string column is considered a string or an IRI. We note that the columns can be expressions and that the where clause may contain arbitrary SQL. This is pretty much Virtuoso RDF Views. As a shorthand, we can then make multicolumn selects and pick one column as subject and the other columns as values of fixed predicates. This gives us Triplify.

We could argue that we would need a proof of the possibility of transforming arbitrary SPARQL into SQL via such a mapping. We specially note that many views/select statements may produce the same subject/predicate/object. Thus any SQL generator needs to know when to make a union. For performance, it will need to know when not to make unions. Dealing with unions, and not generating union terms which are always empty at run time, is the principal difficulty of mapping in OpenLink's experience.
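To make the union point concrete, here is a sketch of the SQL a generator might emit when two sources produce the same predicate. All table and IRI names are hypothetical, not taken from any actual mapping:

```sql
-- Hypothetical mapping: both CUSTOMER and STAFF rows are exposed as
-- foaf:name triples.  A SPARQL pattern { ?s foaf:name ?n } must then
-- draw from both sources, so the generated SQL is a union:
SELECT 'http://example.com/customer/' || c.id AS s, c.full_name AS n
  FROM customer c
UNION ALL
SELECT 'http://example.com/staff/' || e.id AS s, e.name AS n
  FROM staff e;
-- If the query instead fixed ?s to a customer IRI, a generator that knows
-- the two IRI schemes are disjoint could drop the second union term.
```

The last comment is the optimization discussed above: knowing when a union term can never match is what keeps the generated SQL close to what one would write by hand.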
However, the matter of dealing with a triple potentially coming from multiple sources is, in the strict sense, outside the scope of the mapping WG and is an implementation optimization. On the other hand, we could argue that the standard being developed ought to give support for such optimization where possible, or at least ought not to make such optimization harder than it needs to be. We note that the primary use case is integration, i.e. similar-looking triples from many sources.

It has been pointed out that the WG's agenda is making a mapping language and that this mapping language does not need to be concerned with implementation, specifically not with the ways in which SPARQL is translated to SQL. This is technically true. We note that in our experience, some hints have proven necessary. These are necessary because there is information which is not expressed in the RDB schemas but is still relevant for optimizing away unions. The mapping author can know that domains are disjoint even though the RDB schemas do not say so, for example. Thus a mechanism for expressing such meta information is needed in practice. Whether it can be agreed upon in the WG is a different matter.

Souripriya's drastic simplification is to cast aside such considerations and to say that all semantics will be in SQL and the mapping is rename only, plus saying which column is the subject, which is an IRI and which a scalar. Under such a situation, putting said meta information in the XML (or other) glue remains possible. In triple-centric models, said information goes into the vocabulary.

If the SQL in the mappings is very complex, then the problem of SPARQL to SQL is similar to reuse of materialized views, for example: one needs to compare queries and determine if one subsumes the other. If the SQL in the mappings is simple, then a conversion into a triple-oriented declaration system like D2RQ or Virtuoso is almost trivial.
***

Let us consider what precisely we should prove in order to say something about the expressibility of SPARQL in SQL through a mapping. We will first consider a fairly powerful mapping, i.e. Virtuoso quad maps. We will then restrict this to make the matter more manageable.

With quad maps, a mapping is expressed as a collection of SQL select statements of exactly 4 columns each. The columns are the G, S, P and O of an RDF quad (triple plus graph). If there is a row in any of the select statements constituting the mapping, the quad is considered to exist and is thus retrievable by any SPARQL query that would find it if the quad physically existed in a quad store. The columns are SQL scalar expressions. We declare whether the O is a literal or an IRI. Of each IRI, we can specify an IRI class. Of the IRI classes we may assert subclass and disjointness. The where and from clauses of the selects are de facto unrestricted, since the quad map may refer to a view. Virtuoso expands the views, hence if the tables are in another DBMS, the other DBMS can be read-only.

We note that a Triplify- or Oracle-proposal-style syntax, where a constant predicate is assigned to columns of a multi-column select with an AS or similar, is trivially transformable into the above quad maps. This is why we do not particularly worry about syntax.

We are now to prove that any SPARQL 1.1 query may be transformed into SQL via such maps. We encounter the following issues.

We query

  select ?o from <people> where { <person1> foaf:name ?o }

Suppose that multiple quad maps match the <people> graph and the foaf:name predicate. Suppose the expression is an NVARCHAR in one of them and a VARCHAR in another. The single result column has two data types. Logic needs to exist to do the right cast when generating the SQL, or else the SQL engine evaluating the union must be typed at run time. The latter is not often the case, even though Virtuoso and probably MS SQL Server could do it.
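The cast logic mentioned above can be sketched as follows. The table and column names are hypothetical; the point is only the shape of the union:

```sql
-- Two quad maps yield foaf:name: person_a.name is NVARCHAR,
-- person_b.name is VARCHAR.  The generator casts the narrower type so
-- that both union terms produce one common result type:
SELECT p.name AS o
  FROM person_a p
 WHERE p.id = 1            -- subject restricted to <person1>
UNION ALL
SELECT CAST (q.name AS NVARCHAR (200)) AS o
  FROM person_b q
 WHERE q.id = 1;
```

Without the CAST, a strictly typed SQL engine would either reject the union or pick a type by its own rules, which may not match the SPARQL semantics intended by the mapping.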
Others could do it through the use of UDT's, which would complicate things and take space and time at run time. UDT's are also not very standardized, even though most DBMS's have them.

We query

  select ?p ?o from <people> where { <person1> ?p ?o }

Again, multiple quad maps match and we get a union. The union has single-column selects, one per mapped column of each table. The type is again arbitrary, requiring typing at run time. Since the same row is accessed for each column, the optimizer can merge these into a single multi-column select, but then the single row that is fetched from each table containing people will have to be presented as multiple rows in a result set. Again, this is not impossible, through the use of a table valued function. Virtuoso has a special SQL extension for this but could do it also with table valued functions. But table valued functions are a vendor-specific thing, thus we are outside of core SQL.

We suppose a gender column in two person tables. One has M and F and the other 0 and 1. Both map to an IRI representing gender, by a user defined function in each case. We query

  select count (*) from <people> where { ?p a <person> . ?p <gender> <male> }

The SQL becomes a count of a union, and the union terms select rows where UDF (gender) = constant. In the general case, there should be a function index, or else the SQL implementation or the SPARQL-to-SQL mapping should know the inverse function of the gender UDF. A function index on a low cardinality column like gender is not very practical, because with such cardinality one would do a table scan rather than an index lookup. So for efficiency the inverse should be declarable and supported. We could further reasonably expect a SQL implementation to transform a count of a union into a sum of the counts of the union terms.

Given our multiple tables with people, let us join to department and then group by the head of department, counting the people working under each department head.
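The inverse-function rewrite can be illustrated as below, with hypothetical UDF and table names:

```sql
-- Naive translation: the UDF runs on every row, so no index on gender
-- can be used and the scan cannot be avoided:
SELECT COUNT (*)
  FROM person_a
 WHERE gender_iri_a (gender) = 'http://example.com/male';

-- If the mapping declares the inverse (gender_iri_a maps 'M' to the male
-- IRI), the generator can emit a plain, index-friendly comparison:
SELECT COUNT (*)
  FROM person_a
 WHERE gender = 'M';
```

The second form is what any hand-written report query would look like, which is the standard the generated SQL has to meet.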
We suppose there are multiple EMP and DEPT tables and that each DEPT table has a foreign key into EMP for the department head.

  select ?head count (*) from <people> where { ?p a <person> . ?p <has_dept> ?dept . ?dept <has_dept_head> ?head } group by ?head

Each EMP table has employee ids potentially overlapping with the other ones. So we must not join between the wrong EMP and DEPT tables. This is accomplished by mapping each with a distinct IRI scheme, so that the intersection of the subject IRI's is empty. The SQL generator must know that the intersection of the EMP IRI's from the many EMP tables is empty, else it will generate a union where each EMP table is joined with each DEPT table. Of these, only one will be non-empty due to the distinct IRI schemes, but the implementation will end up doing prohibitive extra work unless it is smart about this.

Up to now, we have not found anything impossible. But we have found many cases which are easily very awkward to map to SQL and require features outside of the standard (UDT's, run time typing, table valued functions). We note that for mapping to be viable where it is most needed, i.e. data integration for supporting analytics, the SQL produced and executed cannot be substantially different from what any relational report generator would make. The DBMS's are written to deal fairly well with such SQL, not necessarily with SQL with UDT's in the place of base data types or table valued functions inside large joins. There is a big difference between feasible in theory and workable in practice.

Let us now look at what we ought to prohibit in order to stay within non-extended SQL. The variable in the predicate position has to go. Also, we must require that the types declared for O's of the same P all be castable into a common supertype within base SQL, e.g. varchar and nvarchar become nvarchar. With numbers we already have more of an issue with integer, decimal and double.
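A sketch of the join pruning, assuming two hypothetical source pairs emp1/dept1 and emp2/dept2, each mapped under its own IRI scheme:

```sql
-- Knowing the IRI schemes are disjoint, the generator emits only the
-- same-source joins; heads from different sources cannot coincide, so the
-- per-source groupings can simply be unioned:
SELECT d.head_emp_id AS head, COUNT (*) AS cnt
  FROM emp1 e JOIN dept1 d ON e.dept_id = d.id
 GROUP BY d.head_emp_id
UNION ALL
SELECT d.head_emp_id, COUNT (*)
  FROM emp2 e JOIN dept2 d ON e.dept_id = d.id
 GROUP BY d.head_emp_id;
-- Without the disjointness hint, the emp1 x dept2 and emp2 x dept1 terms
-- would also be generated: always empty at run time, but each still
-- costing a join.
```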
Depending on whether the application wants precision or range, the correct answer is double or decimal. Such things must be declarable in mappings, since the answer is application dependent.

Prohibiting a variable P, we would ipso facto always have a constant in the P position of the quad map. If a variable P were allowed there but prohibited in the queries, we would not be free of the type casting problems. Aside from this, we do not have any particular problem with the G being computed.

When we require the P of the quad maps to be constant, we end up with a mapping that is most readily expressible a la Triplify, i.e. a multicolumn select where exactly one column is declared to be the subject, one is declared as the graph (most often a constant) and the others are assigned constant predicates.

Thus we should prove that, given mappings of this sort, SPARQL 1.1 with no variables in predicate position transforms to single SQL statements. This seems to be the case, as experience demonstrates that even unrestricted SPARQL transforms into single SQL statements, albeit relying on run time types and other extensions. We have not undertaken a formal proof of this, but the WG could well produce such.
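The restricted, Triplify-style mapping shape described above can be sketched as a single annotated select. The table, column and IRI names are hypothetical:

```sql
-- One multicolumn select: one subject column, a constant graph, and a
-- constant predicate assigned to each remaining column via its alias:
SELECT 'http://example.com/people' AS g,          -- constant graph
       'http://example.com/emp/' || e.id AS s,    -- subject IRI
       e.name    AS "foaf:name",                  -- constant predicate
       e.dept_id AS "ex:has_dept"                 -- constant predicate
  FROM emp1 e;
```

Each non-subject column contributes one quad per row, with the alias naming the predicate; this is the form that is trivially transformable into the quad maps discussed earlier.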
Received on Friday, 26 March 2010 14:00:46 UTC