- From: Boris Villazón Terrazas <bvillazon@fi.upm.es>
- Date: Tue, 08 Mar 2011 02:46:46 +0100
- To: Robert Scanlon <rscanlon@revelytix.com>
- CC: public-rdb2rdf-wg@w3.org, aleon <aleon@delicias.dia.fi.upm.es>
- Message-ID: <4D758A86.1070101@fi.upm.es>
Hi Bob,

First of all, thank you very much for starting to implement the Test Cases ... ;-)
Second of all, apologies, but last week I was quite busy with some project deadlines.
Third of all, I would like to announce that Alex is back in Madrid, and he will join the team working on the test case definitions.
Finally, my comments are inline.

On 07/03/2011 17:10, Robert Scanlon wrote:
> Hi all,
>
> We've started implementing some of the test cases. We very much like
> the general approach of having multiple unique databases, and
> attaching mapping test cases to those; and in general, the use of
> short unique identifiers for test cases ('tc0000a', etc). However,
> coming at it from a perspective of large-scale automation as well as
> browsing and understandability, we do have some concerns about how
> both the data sets and test cases are identified. Is there a chance
> that the WG might consider some minor tweaks?

Yes, of course ... if everyone agrees.

> The 2 main concerns are:
>
> 1. there's no explicit short, sortable unique identifier for
>    databases (schemas or 'data sets')

You mean something like database1, database2, etc.?

> 2. the test case identifiers are not consistent and scalable

It is our first approach, and we know we can improve it ... thanks for pointing this out.

> _*Summary*_
>
> To address these, an identification system similar to the one below is
> proposed. Note that the organization is identical to what's there
> now, it is simply the identifiers (and associated file names using
> those identifiers) that are changed. Details are below for those who
> want more background.
>
> *D000* - Simple 1 table, 1 column database; empty.
>     TC000-G000 - Direct graph - map_g000.ttl, ...
>     TC000-R000 - <description> - map_r000.ttl, ...
>     TC000-R001 - <description> - map_r001.ttl, ...
>     ...
> *D001* - Simple 1 table, 1 column database; 1 record.
>     TC001-G000 - Direct graph - map_g000.ttl, ...
>     TC001-R000 - <description> - map_r000.ttl, ...
>     TC001-R001 - <description> - map_r001.ttl, ...
>     ...
> *D002* - Simple 1 table, 2 columns database; 1 record.
>     TC002-G000 - Direct graph - map_g000.ttl, ...
>     TC002-R000 - <description> - map_r000.ttl, ...
>     TC002-R001 - <description> - map_r001.ttl, ...
>     TC002-R002 - <description> - map_r002.ttl, ...
>     TC002-R003 - <description> - map_r003.ttl, ...
>     ...
>
> I show capital letters above, which is appropriate for use in docs,
> but in directories and file names, all identifiers should be lowercase.

Ok, I'll check it and I'll modify them.

> *_Background, Item 1_*
>
> For the first item, currently the names are 'munged' short
> descriptions, such as *1table0rows*,
> *1table1compositeprimarykey3columns1row*, etc. These can get quite
> long, but more importantly, they don't sort nicely in directory/file
> listings (such as under dvcs.w3.org/hg/rdb2rdf-tests/file
> <https://dvcs.w3.org/hg/rdb2rdf-tests/file/92f2b4b0c9ca>). Besides
> file system sorting, the names begin with numbers, which keeps them
> from being used in any way as part of a database schema (for some/many
> systems).

These short descriptions were suggested by Eric ;) ... but I see the problem ... Eric, do you want to say something about this?

> I would suggest we use a unique identifier such as 'D000', 'D001',
> etc, for 'data' or 'database'. This addresses all the issues raised
> above.
>
> [Note that while I say there's no _explicit_ sortable identifier for
> data sets, there does seem to be an /implicit/ one, bound into the
> names of all test cases under a data set. E.g., tc0000, tc0000a,
> tc0000b, etc are all tied to the first data set (which one could infer
> had an identifier along the lines of '0000'). Similarly for tc0001,
> etc. So really I'm just suggesting that the data set identifier be
> made explicit (and start with an alpha character).
> It can be 4 digits
> in length (as the test cases currently are), but 3 seems sufficient
> (maybe even 2; too many unique data sets will make test case
> management more difficult).]

Ok, I see ... if everyone agrees I'll update the test cases. Any comments?

> *_Background, Item 2_*
>
> For the second item, the test case (mapping) identifiers have a couple
> variations depending on where you look, are not explicitly tied to
> their parent data set, and seem limited to 26 mappings/tests per
> database.
>
> As far as variations, in the wiki
> (http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1), some of
> the test cases have an 'Id' property, with values such as
> R2RMLTC0001a, R2RMLTC0001b, etc. (Not all do, and some have a header
> with a space, such as 'Direct Graph TC0000'). Whereas the 'official'
> test case spec (http://www.w3.org/2001/sw/rdb2rdf/test-cases/) has
> similar names, but no explicit 'Id', and the R2RML headers have a
> space like the Direct Graph ones. It would be nice to make these Id's
> consistent, and short, e.g., use a code letter instead of 'Direct
> Graph' and 'R2RML' in the Id.
>
> As far as links to database/data set, because there are not
> explicitly-defined data set Id's, one has to infer that the '0000',
> '0001', etc parts of these are tying them to the (not explicitly
> identified) data sets. That can be addressed by adding explicit data
> set Id's as suggested above.
>
> As far as scalability, the use of the letters 'a', 'b', 'c', etc for
> the individual mappings/test cases attached to each data set seems to
> limit the # of test cases to 26 per database. That seems
> insufficient.
>
> To resolve these, I'd suggest we uniquely identify each mapping test
> case as follows:
>
>     test case id = TC<data set #>-<mapping id>
>
> where <mapping id> would be local to a data set, with values G000,
> G001, etc, for direct graph mappings and R000, R001, R002, etc for
> R2RML mappings.
> (I'm assuming that there *could* eventually be more
> than 1 direct mapping test case per data set, based on configuration;
> otherwise the 'G' and 'R' for the mapping part would not be needed,
> with '000' representing the direct graph, like it is now). As with
> the data set Id, the # of digits can probably safely be 3 (or perhaps
> 2, though that *could* come up short in some cases).
>
> Ideally, the mapping files for any given data set would be named using
> the local mapping ID, e.g., map_g000.ttl, map_r000.ttl, map_r001.ttl,
> etc, so they can be unambiguously linked to spec definitions and so
> they (and output filenames generated from them) sort nicely within the
> context of a given data set's directory.

Ok, I see ... if everyone agrees I'll update the test cases. Any comments?

Thank you very much for your suggestions, Bob.

Boris

> Thanks,
> Bob Scanlon
> Revelytix
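
For anyone automating against the proposed scheme, here is a minimal sketch of how the identifiers discussed above could be parsed, sorted, and turned into file names. Only the ID layout (D###, TC<data set #>-<mapping id>, map_g###.ttl / map_r###.ttl, 3 digits, lowercase on disk) comes from the proposal; the names `TestCaseId`, `parse_test_case_id`, `dataset_id`, and `mapping_file` are hypothetical and not part of any WG tooling.

```python
import re
from typing import NamedTuple

# Assumed layout from the proposal: TC<3 digits>-<G|R><3 digits>.
# Matching is case-insensitive because file and directory names are lowercase.
ID_PATTERN = re.compile(r"TC(\d{3})-([GR])(\d{3})", re.IGNORECASE)


class TestCaseId(NamedTuple):
    """Hypothetical container for one proposed test case identifier."""
    dataset: int   # data set number, e.g. 2 for D002
    kind: str      # "G" = direct graph mapping, "R" = R2RML mapping
    mapping: int   # mapping number, local to the data set

    def __str__(self) -> str:
        # Uppercase form, as suggested for use in documents.
        return f"TC{self.dataset:03d}-{self.kind}{self.mapping:03d}"

    def dataset_id(self) -> str:
        # Explicit data set identifier, e.g. "D002".
        return f"D{self.dataset:03d}"

    def mapping_file(self) -> str:
        # Lowercase mapping file name, e.g. "map_r003.ttl".
        return f"map_{self.kind.lower()}{self.mapping:03d}.ttl"


def parse_test_case_id(text: str) -> TestCaseId:
    """Parse an identifier such as 'TC002-R003' or 'tc002-r003'."""
    match = ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a test case id in the proposed format: {text!r}")
    dataset, kind, mapping = match.groups()
    return TestCaseId(int(dataset), kind.upper(), int(mapping))


if __name__ == "__main__":
    ids = [parse_test_case_id(s) for s in ("TC001-R001", "tc000-r000", "TC001-G000")]
    # Tuples sort by data set, then kind ("G" before "R"), then mapping number.
    for tc in sorted(ids):
        print(tc.dataset_id(), tc, tc.mapping_file())
```

Because the fields are zero-padded to a fixed width, a plain string sort of the formatted identifiers gives the same order as the tuple sort, which is the directory-listing behaviour the proposal is after. Current identifiers such as tc0000a are rejected by this sketch rather than converted.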
Received on Tuesday, 8 March 2011 01:47:18 UTC