Re: Test case identification from Boris Villazón Terrazas on 2011-03-08 (public-rdb2rdf-wg@w3.org from March 2011)

From: Boris Villazón Terrazas <bvillazon@fi.upm.es>
Date: Tue, 08 Mar 2011 02:46:46 +0100
To: Robert Scanlon <rscanlon@revelytix.com>
CC: public-rdb2rdf-wg@w3.org, aleon <aleon@delicias.dia.fi.upm.es>
Message-ID: <4D758A86.1070101@fi.upm.es>
Hi Bob

First of all, thank you very much for starting to implement the Test Cases 
... ;-)

Second of all, apologies, but last week I was quite busy with 
some project deadlines.

Third of all, I would like to announce that Alex is back in Madrid, and 
he will join the team defining the test cases.

Finally, my comments are inline.

On 07/03/2011 17:10, Robert Scanlon wrote:
> Hi all,
>
> We've started implementing some of the test cases.  We very much like 
> the general approach of having multiple unique databases, and 
> attaching mapping test cases to those; and in general, the use of 
> short unique identifiers for test cases ('tc0000a', etc).  However, 
> coming at it from a perspective of large-scale automation as well as 
> browsing and understandability, we do have some concerns about how 
> both the data sets and test cases are identified.  Is there a chance 
> that the WG might consider some minor tweaks? 
Yes, of course ... if everyone agrees.
> The 2 main concerns are:
>
>    1. there's no explicit short, sortable unique identifier for
>       databases (schemas or 'data sets')
>
You mean, something like database1, database2, etc?
>
>    2. the test case identifiers are not consistent and scalable
>
It is our first approach, and we know we can improve it ... thanks for 
pointing this out.
>
>
> _*Summary*_
>
> To address these, an identification system similar to the one below is 
> proposed.  Note that the organization is identical to what's there 
> now, it is simply the identifiers (and associated file names using 
> those identifiers) that are changed.  Details are below for those who 
> want more background.
>
> *D000* - Simple 1 table, 1 column database; empty.
>     TC000-G000 - Direct graph - map_g000.ttl, ...
>     TC000-R000 - <description> - map_r000.ttl, ...
>     TC000-R001 - <description> - map_r001.ttl, ...
>     ...
> *D001* - Simple 1 table, 1 column database; 1 record.
>     TC001-G000 - Direct graph - map_g000.ttl, ...
>     TC001-R000 - <description> - map_r000.ttl, ...
>     TC001-R001 - <description> - map_r001.ttl, ...
>     ...
> *D002* - Simple 1 table, 2 columns database; 1 record.
>     TC002-G000 - Direct graph - map_g000.ttl, ...
>     TC002-R000 - <description> - map_r000.ttl, ...
>     TC002-R001 - <description> - map_r001.ttl, ...
>     TC002-R002 - <description> - map_r002.ttl, ...
>     TC002-R003 - <description> - map_r003.ttl, ...
>     ...
>
> I show capital letters above, which is appropriate for use in docs, 
> but in directories and file names, all identifiers should be lowercase.
>
>
Ok, I'll check and modify them.
> *_Background, Item 1_*
>
> For the first item, currently the names are 'munged' short 
> descriptions, such as *1table0rows*, 
> *1table1compositeprimarykey3columns1row*, etc.  These can get quite 
> long, but more importantly, they don't sort nicely in directory/file 
> listings (such as under dvcs.w3.org/hg/rdb2rdf-tests/file 
> <https://dvcs.w3.org/hg/rdb2rdf-tests/file/92f2b4b0c9ca>).  Besides 
> file system sorting, the names begin with numbers, which keeps them 
> from being used in any way as part of a database schema (for some/many 
> systems).
These short descriptions were suggested by Eric ;) ... but I see the 
problem ...
Eric, do you want to say something about this?

>
> I would suggest we use a unique identifier such as 'D000', 'D001', 
> etc, for 'data' or 'database'.  This addresses all the issues raised 
> above.
>
> [Note that while I say there's no _explicit_ sortable identifier for 
> data sets, there does seem to be an /implicit/ one, bound into the 
> names of all test cases under a data set.  E.g., tc0000, tc0000a, 
> tc0000b, etc are all tied to the first data set (which one could infer 
> had an identifier along the lines of '0000').  Similarly for tc0001, 
> etc.  So really I'm just suggesting that the data set identifier be 
> made explicit (and start with an alpha character).  It can be 4 digits 
> in length (as the test cases currently are), but 3 seems sufficient 
> (maybe even 2; too many unique data sets will make test case 
> management more difficult).]
>
>
Ok, I see ... if everyone agrees I'll update the test cases.
Any comments?
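
To make the sorting point concrete, here is a minimal sketch in Python. 
The helper name dataset_id and the 3-digit width are only assumptions 
based on the proposal above; nothing here has been agreed by the WG.

```python
def dataset_id(number: int, width: int = 3) -> str:
    """Build a data set identifier such as 'D001' (hypothetical helper).

    The leading 'D' keeps the identifier from starting with a digit, and
    the zero padding makes plain string sorting match numeric order.
    """
    if number < 0 or number >= 10 ** width:
        raise ValueError(f"data set number {number} does not fit in {width} digits")
    return f"D{number:0{width}d}"


# Identifiers created in arbitrary order ...
ids = [dataset_id(n) for n in (10, 2, 0, 1)]

# ... sort correctly with an ordinary lexicographic sort.
sorted_ids = sorted(ids)

# Lowercase form for directory and file names, as suggested above.
dir_names = [i.lower() for i in sorted_ids]
```

With fixed-width padding, a directory listing shows the data sets in 
numeric order without any special sorting.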

> *_Background, Item 2_*
>
> For the second item, the test case (mapping) identifiers have a couple 
> variations depending on where you look, are not explicitly tied to 
> their parent data set, and seem limited to 26 mappings/tests per 
> database.
>
> As far as variations, in the wiki 
> (http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1), some of 
> the test cases have an 'Id' property, with values such as 
> R2RMLTC0001a, R2RMLTC0001b, etc.  (Not all do, and some have a header 
> with a space, such as 'Direct Graph TC0000').  Whereas the 'official' 
> test case spec (http://www.w3.org/2001/sw/rdb2rdf/test-cases/) has 
> similar names, but no explicit 'Id', and the R2RML headers have a 
> space like the Direct Graph ones.  It would be nice to make these Id's 
> consistent, and short, e.g., use a code letter instead of 'Direct 
> Graph' and 'R2RML' in the Id.
>
> As far as links to database/data set, because there are not 
> explicitly-defined data set Id's, one has to infer that the '0000', 
> '0001', etc parts of these are tying them to the (not explicitly 
> identified) data sets.  That can be addressed by adding explicit data 
> set Id's as suggested above.
>
> As far as scalability, the use of the letters 'a', 'b', 'c', etc for 
> the individual mappings/test cases attached to each data set seems to 
> limit the number of test cases to 26 per database.  That seems 
> insufficient.
>
> To resolve these, I'd suggest we uniquely identify each mapping test 
> case as follows:
>
> test case id = TC<data set #>-<mapping id>
>
> where <mapping id> would be local to a data set, with values G000, 
> G001, etc, for direct graph mappings and R000, R001, R002, etc for 
> R2RML mappings.  (I'm assuming that there *could* eventually be more 
> than 1 direct mapping test case per data set, based on configuration; 
> otherwise the 'G' and 'R' for the mapping part would not be needed, 
> with '000' representing the direct graph, like it is now).  As with 
> the data set Id, the # of digits can probably safely be 3 (or perhaps 
> 2, though that *could* come up short in some cases).
>
> Ideally, the mapping files for any given data set would be named using 
> the local mapping ID, e.g., map_g000.ttl, map_r000.ttl, map_r001.ttl, 
> etc, so they can be unambiguously linked to spec definitions and so 
> they (and output filenames generated from them) sort nicely within the 
> context of a given data set's directory.
>
>
Ok, I see ... if everyone agrees I'll update the test cases.
Any comments?
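
Just to check my understanding of the proposed scheme, a rough sketch 
in Python follows. The function names, the regular expression and the 
fixed 3-digit width are my own assumptions for illustration; only the 
TC<data set #>-<mapping id> pattern and the map_<id>.ttl file names come 
from the proposal.

```python
import re

# Pattern for identifiers such as 'TC002-R003' (3 digits assumed on both parts).
TC_PATTERN = re.compile(r"^TC(\d{3})-([GR])(\d{3})$")


def format_test_case_id(dataset: int, kind: str, mapping: int) -> str:
    """Return e.g. 'TC002-R003'; kind is 'G' (direct graph) or 'R' (R2RML)."""
    if kind not in ("G", "R"):
        raise ValueError("kind must be 'G' or 'R'")
    if not (0 <= dataset <= 999 and 0 <= mapping <= 999):
        raise ValueError("numbers must fit in 3 digits")
    return f"TC{dataset:03d}-{kind}{mapping:03d}"


def mapping_file_name(kind: str, mapping: int) -> str:
    """Return the lowercase mapping file name, e.g. 'map_r003.ttl'."""
    if kind not in ("G", "R"):
        raise ValueError("kind must be 'G' or 'R'")
    return f"map_{kind.lower()}{mapping:03d}.ttl"


def parse_test_case_id(text: str) -> tuple:
    """Split 'TC001-G000' into (data set number, kind, mapping number)."""
    match = TC_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid test case id: {text!r}")
    return int(match.group(1)), match.group(2), int(match.group(3))
```

With 3 digits, each data set can have up to 1000 mappings of each kind, 
instead of the 26 allowed by the letter suffixes.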

Thank you very much for your suggestions Bob

Boris

> Thanks,
> Bob Scanlon
> Revelytix
>
Received on Tuesday, 8 March 2011 01:47:18 UTC