- From: Robert Scanlon <rscanlon@revelytix.com>
- Date: Mon, 7 Mar 2011 10:10:10 -0600
- To: public-rdb2rdf-wg@w3.org
- Message-ID: <AANLkTi=rL--7GV+eex4cZ-P3W3JUkDEMthHMHa0e3Tp2@mail.gmail.com>
Hi all,

We've started implementing some of the test cases. We very much like the general approach of having multiple unique databases, and attaching mapping test cases to those; and in general, the use of short unique identifiers for test cases ('tc0000a', etc). However, coming at it from a perspective of large-scale automation as well as browsing and understandability, we do have some concerns about how both the data sets and test cases are identified. Is there a chance that the WG might consider some minor tweaks?

The 2 main concerns are:

1. there's no explicit short, sortable unique identifier for databases (schemas or 'data sets')
2. the test case identifiers are not consistent and scalable

*Summary*

To address these, an identification system similar to the one below is proposed. Note that the organization is identical to what's there now, it is simply the identifiers (and associated file names using those identifiers) that are changed. Details are below for those who want more background.

*D000* - Simple 1 table, 1 column database; empty.
    TC000-G000 - Direct graph - map_g000.ttl, ...
    TC000-R000 - <description> - map_r000.ttl, ...
    TC000-R001 - <description> - map_r001.ttl, ...
    ...

*D001* - Simple 1 table, 1 column database; 1 record.
    TC001-G000 - Direct graph - map_g000.ttl, ...
    TC001-R000 - <description> - map_r000.ttl, ...
    TC001-R001 - <description> - map_r001.ttl, ...
    ...

*D002* - Simple 1 table, 2 columns database; 1 record.
    TC002-G000 - Direct graph - map_g000.ttl, ...
    TC002-R000 - <description> - map_r000.ttl, ...
    TC002-R001 - <description> - map_r001.ttl, ...
    TC002-R002 - <description> - map_r002.ttl, ...
    TC002-R003 - <description> - map_r003.ttl, ...
    ...

I show capital letters above, which is appropriate for use in docs, but in directories and file names, all identifiers should be lowercase.

*Background, Item 1*

For the first item, currently the names are 'munged' short descriptions, such as *1table0rows*, *1table1compositeprimarykey3columns1row*, etc.
These can get quite long, but more importantly, they don't sort nicely in directory/file listings (such as under dvcs.w3.org/hg/rdb2rdf-tests/file<https://dvcs.w3.org/hg/rdb2rdf-tests/file/92f2b4b0c9ca>). Besides file system sorting, the names begin with numbers, which keeps them from being used in any way as part of a database schema (for some/many systems).

I would suggest we use a unique identifier such as 'D000', 'D001', etc, for 'data' or 'database'. This addresses all the issues raised above.

[Note that while I say there's no *explicit* sortable identifier for data sets, there does seem to be an *implicit* one, bound into the names of all test cases under a data set. E.g., tc0000, tc0000a, tc0000b, etc are all tied to the first data set (which one could infer had an identifier along the lines of '0000'). Similarly for tc0001, etc. So really I'm just suggesting that the data set identifier be made explicit (and start with an alpha character). It can be 4 digits in length (as the test cases currently are), but 3 seems sufficient (maybe even 2; too many unique data sets will make test case management more difficult).]

*Background, Item 2*

For the second item, the test case (mapping) identifiers have a couple of variations depending on where you look, are not explicitly tied to their parent data set, and seem limited to 26 mappings/tests per database.

As for variations, in the wiki ( http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1), some of the test cases have an 'Id' property, with values such as R2RMLTC0001a, R2RMLTC0001b, etc. (Not all do, and some have a header with a space, such as 'Direct Graph TC0000'). The 'official' test case spec ( http://www.w3.org/2001/sw/rdb2rdf/test-cases/), by contrast, has similar names but no explicit 'Id', and its R2RML headers have a space like the Direct Graph ones. It would be nice to make these Ids consistent and short, e.g., by using a code letter instead of 'Direct Graph' and 'R2RML' in the Id.
As for links to the database/data set, because there are no explicitly defined data set Ids, one has to infer that the '0000', '0001', etc parts of these are tying them to the (not explicitly identified) data sets. That can be addressed by adding explicit data set Ids as suggested above.

As for scalability, the use of the letters 'a', 'b', 'c', etc for the individual mappings/test cases attached to each data set seems to limit the number of test cases to 26 per database. That seems insufficient.

To resolve these, I'd suggest we uniquely identify each mapping test case as follows:

    test case id = TC<data set #>-<mapping id>

where <mapping id> would be local to a data set, with values G000, G001, etc, for direct graph mappings and R000, R001, R002, etc for R2RML mappings. (I'm assuming that there *could* eventually be more than 1 direct mapping test case per data set, based on configuration; otherwise the 'G' and 'R' for the mapping part would not be needed, with '000' representing the direct graph, like it is now). As with the data set Id, the number of digits can probably safely be 3 (or perhaps 2, though that *could* come up short in some cases).

Ideally, the mapping files for any given data set would be named using the local mapping Id, e.g., map_g000.ttl, map_r000.ttl, map_r001.ttl, etc, so they can be unambiguously linked to spec definitions and so they (and output filenames generated from them) sort nicely within the context of a given data set's directory.

Thanks,
Bob Scanlon
Revelytix
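
For anyone automating against the proposed scheme, a minimal Python sketch of how the identifiers could be generated and parsed is shown here. The function names are hypothetical, and the 3-digit width and 'G'/'R' code letters are simply taken from the examples above; nothing here is an agreed WG convention.

```python
import re

# Assumption: 3-digit, zero-padded numbers, as in the examples in this message.
WIDTH = 3

# Matches ids such as "TC002-R003" (case-insensitive, so "tc002-r003" also works).
TC_ID_RE = re.compile(r"TC([0-9]{3})-([GR])([0-9]{3})", re.IGNORECASE)


def dataset_id(n: int) -> str:
    """Return the data set id, e.g. 0 -> 'D000'."""
    return f"D{n:0{WIDTH}d}"


def make_test_case_id(dataset: int, kind: str, n: int) -> str:
    """Return 'TC<data set #>-<mapping id>'; kind is 'G' (direct graph) or 'R' (R2RML)."""
    kind = kind.upper()
    if kind not in ("G", "R"):
        raise ValueError(f"unknown mapping kind: {kind!r}")
    return f"TC{dataset:0{WIDTH}d}-{kind}{n:0{WIDTH}d}"


def mapping_filename(kind: str, n: int) -> str:
    """Return the lowercase mapping file name, e.g. ('R', 1) -> 'map_r001.ttl'."""
    return f"map_{kind.lower()}{n:0{WIDTH}d}.ttl"


def parse_test_case_id(tc_id: str) -> tuple[int, str, int]:
    """Split a test case id into (data set number, mapping kind, mapping number)."""
    m = TC_ID_RE.fullmatch(tc_id)
    if m is None:
        raise ValueError(f"not a test case id: {tc_id!r}")
    return int(m.group(1)), m.group(2).upper(), int(m.group(3))
```

Because every number is zero-padded to a fixed width, a plain lexicographic sort of these ids (or of the file names) matches numeric order, and calling `.lower()` on an id gives the form suggested for directories and file names.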
Received on Monday, 7 March 2011 16:10:48 UTC