- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Mon, 7 Mar 2011 22:43:49 -0500
- To: Boris Villazón Terrazas <bvillazon@fi.upm.es>
- Cc: Robert Scanlon <rscanlon@revelytix.com>, public-rdb2rdf-wg@w3.org, aleon <aleon@delicias.dia.fi.upm.es>
* Boris Villazón Terrazas <bvillazon@fi.upm.es> [2011-03-08 02:46+0100]
> Hi Bob
>
> First of all, thank you very much for starting to implement the Test
> Cases ... ;-)
>
> Second of all, apologies, but the previous week I was quite busy with
> some project deadlines.
>
> Third of all, I would like to announce that Alex is back in Madrid,
> and he will join the test case definition team.
>
> Finally, my comments inline
>
> On 07/03/2011 17:10, Robert Scanlon wrote:
> >Hi all,
> >
> >We've started implementing some of the test cases. We very much
> >like the general approach of having multiple unique databases, and
> >attaching mapping test cases to those; and in general, the use of
> >short unique identifiers for test cases ('tc0000a', etc).
> >However, coming at it from a perspective of large-scale automation
> >as well as browsing and understandability, we do have some
> >concerns about how both the data sets and test cases are
> >identified. Is there a chance that the WG might consider some
> >minor tweaks?
> Yes, of course ... if everyone agrees
> >The 2 main concerns are:
> >
> >   1. there's no explicit short, sortable unique identifier for
> >      databases (schemas or 'data sets')
> >
> You mean, something like database1, database2, etc?
> >
> >   2. the test case identifiers are not consistent and scalable
> >
> It is our first approach, and we know we can improve it ... thanks
> for pointing this out.
> >
> >
> >_*Summary*_
> >
> >To address these, an identification system similar to the one
> >below is proposed. Note that the organization is identical to
> >what's there now; it is simply the identifiers (and associated
> >file names using those identifiers) that are changed. Details are
> >below for those who want more background.
> >
> >*D000* - Simple 1 table, 1 column database; empty.
> >    TC000-G000 - Direct graph - map_g000.ttl, ...
> >    TC000-R000 - <description> - map_r000.ttl, ...
> >    TC000-R001 - <description> - map_r001.ttl, ...
> >    ...
> >*D001* - Simple 1 table, 1 column database; 1 record.
> >    TC001-G000 - Direct graph - map_g000.ttl, ...
> >    TC001-R000 - <description> - map_r000.ttl, ...
> >    TC001-R001 - <description> - map_r001.ttl, ...
> >    ...
> >*D002* - Simple 1 table, 2 columns database; 1 record.
> >    TC002-G000 - Direct graph - map_g000.ttl, ...
> >    TC002-R000 - <description> - map_r000.ttl, ...
> >    TC002-R001 - <description> - map_r001.ttl, ...
> >    TC002-R002 - <description> - map_r002.ttl, ...
> >    TC002-R003 - <description> - map_r003.ttl, ...
> >    ...
> >
> >I show capital letters above, which is appropriate for use in
> >docs, but in directories and file names, all identifiers should be
> >lowercase.
> >
> >
> Ok, I'll check it and I'll modify them.
> >*_Background, Item 1_*
> >
> >For the first item, currently the names are 'munged' short
> >descriptions, such as *1table0rows*,
> >*1table1compositeprimarykey3columns1row*, etc. These can get
> >quite long, but more importantly, they don't sort nicely in
> >directory/file listings (such as under
> >dvcs.w3.org/hg/rdb2rdf-tests/file
> ><https://dvcs.w3.org/hg/rdb2rdf-tests/file/92f2b4b0c9ca>).
> >Besides file system sorting, the names begin with numbers, which
> >keeps them from being used in any way as part of a database schema
> >(for some/many systems).
> These short descriptions were suggested by Eric ;) ... but I see the
> problem ...
> Eric, do you want to say something about this?

In my experience, descriptive test names save a human lookup and cut
down on test redundancy (which means you can focus more on coverage).
The names are never perfect, as they can only capture a subset of the
interesting characteristics, but in e.g. SPARQL, I found the
descriptively-named tests more useful than the ones given arbitrary
names. I find descriptive filenames even more helpful, as they
encourage us to re-use files for multiple tests. I won't lie down in
the road over this, but I do find descriptive names more helpful than
harmful.
The leading numbers are an interesting point. I continually wrestle
with XML's adoption of the lexical name restrictions in common
programming languages (most notably, no leading digits) and hadn't
even thought of it being a pain here.

> >I would suggest we use a unique identifier such as 'D000', 'D001',
> >etc, for 'data' or 'database'. This addresses all the issues
> >raised above.
> >
> >[Note that while I say there's no _explicit_ sortable identifier
> >for data sets, there does seem to be an /implicit/ one, bound into
> >the names of all test cases under a data set. E.g., tc0000,
> >tc0000a, tc0000b, etc are all tied to the first data set (which
> >one could infer had an identifier along the lines of '0000').
> >Similarly for tc0001, etc. So really I'm just suggesting that the
> >data set identifier be made explicit (and start with an alpha
> >character). It can be 4 digits in length (as the test cases
> >currently are), but 3 seems sufficient (maybe even 2; too many
> >unique data sets will make test case management more difficult).]
> >
> >
> Ok, I see ... if everyone agrees I'll update the test cases.
> Any comments?
>
> >*_Background, Item 2_*
> >
> >For the second item, the test case (mapping) identifiers have a
> >couple of variations depending on where you look, are not explicitly
> >tied to their parent data set, and seem limited to 26
> >mappings/tests per database.
> >
> >As far as variations, in the wiki
> >(http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1), some
> >of the test cases have an 'Id' property, with values such as
> >R2RMLTC0001a, R2RMLTC0001b, etc. (Not all do, and some have a
> >header with a space, such as 'Direct Graph TC0000'). Whereas the
> >'official' test case spec
> >(http://www.w3.org/2001/sw/rdb2rdf/test-cases/) has similar names,
> >but no explicit 'Id', and the R2RML headers have a space like the
> >Direct Graph ones.
> >It would be nice to make these Id's
> >consistent, and short, e.g., use a code letter instead of 'Direct
> >Graph' and 'R2RML' in the Id.
> >
> >As far as links to database/data set, because there are not
> >explicitly-defined data set Id's, one has to infer that the
> >'0000', '0001', etc parts of these are tying them to the (not
> >explicitly identified) data sets. That can be addressed by adding
> >explicit data set Id's as suggested above.
> >
> >As far as scalability, the use of the letters 'a', 'b', 'c', etc
> >for the individual mappings/test cases attached to each data set
> >seems to limit the # of test cases to 26 per database. That
> >seems insufficient.
> >
> >To resolve these, I'd suggest we uniquely identify each mapping
> >test case as follows:
> >
> >test case id = TC<data set #>-<mapping id>
> >
> >where <mapping id> would be local to a data set, with values G000,
> >G001, etc, for direct graph mappings and R000, R001, R002, etc for
> >R2RML mappings. (I'm assuming that there *could* eventually be
> >more than 1 direct mapping test case per data set, based on
> >configuration; otherwise the 'G' and 'R' for the mapping part
> >would not be needed, with '000' representing the direct graph,
> >like it is now). As with the data set Id, the # of digits can
> >probably safely be 3 (or perhaps 2, though that *could* come up
> >short in some cases).
> >
> >Ideally, the mapping files for any given data set would be named
> >using the local mapping ID, e.g., map_g000.ttl, map_r000.ttl,
> >map_r001.ttl, etc, so they can be unambiguously linked to spec
> >definitions and so they (and output filenames generated from them)
> >sort nicely within the context of a given data set's directory.
> >
> >
> Ok, I see ... if everyone agrees I'll update the test cases.
> Any comments?
>
> Thank you very much for your suggestions Bob
>
> Boris
>
> >Thanks,
> >Bob Scanlon
> >Revelytix
> >
>

-- 
-ericP
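For anyone automating against the scheme Bob proposes (test case id = TC<data set #>-<mapping id>), here is a rough Python sketch of how the identifiers and mapping file names might be built and parsed. It is only an illustration: the helper names (`data_set_id`, `format_test_case_id`, `parse_test_case_id`, `mapping_filename`) are hypothetical, and the 3-digit width is just one of the options discussed in the thread, not something the WG has agreed on.

```python
import re
from typing import Tuple

# Assumption: 3 digits for both the data set number and the mapping number.
WIDTH = 3
# Pattern for ids such as 'TC002-R001'; 'G' = direct graph, 'R' = R2RML.
TC_ID = re.compile(r"^TC(\d{3})-([GR])(\d{3})$")


def data_set_id(data_set: int) -> str:
    """Build a data set id such as 'D002'."""
    return f"D{data_set:0{WIDTH}d}"


def format_test_case_id(data_set: int, kind: str, mapping: int) -> str:
    """Build a test case id such as 'TC002-R001'."""
    if kind not in ("G", "R"):
        raise ValueError("kind must be 'G' (direct graph) or 'R' (R2RML)")
    return f"TC{data_set:0{WIDTH}d}-{kind}{mapping:0{WIDTH}d}"


def parse_test_case_id(tc_id: str) -> Tuple[int, str, int]:
    """Split an id into (data set number, kind, mapping number)."""
    match = TC_ID.match(tc_id.upper())  # accept the lowercase file-name form too
    if match is None:
        raise ValueError(f"not a test case id: {tc_id!r}")
    return int(match.group(1)), match.group(2), int(match.group(3))


def mapping_filename(tc_id: str) -> str:
    """Derive the lowercase mapping file name, e.g. 'map_r001.ttl'."""
    _, kind, mapping = parse_test_case_id(tc_id)
    return f"map_{kind.lower()}{mapping:0{WIDTH}d}.ttl"
```

Because every numeric part is zero-padded to a fixed width, a plain lexicographic sort of these ids (or of the derived file names within one data set's directory) matches numeric order, which is the sorting property the proposal is after.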
Received on Tuesday, 8 March 2011 03:44:34 UTC