Test case identification from Robert Scanlon on 2011-03-07 (public-rdb2rdf-wg@w3.org from March 2011)

From: Robert Scanlon <rscanlon@revelytix.com>
Date: Mon, 7 Mar 2011 10:10:10 -0600
To: public-rdb2rdf-wg@w3.org
Message-ID: <AANLkTi=rL--7GV+eex4cZ-P3W3JUkDEMthHMHa0e3Tp2@mail.gmail.com>

Hi all,

We've started implementing some of the test cases. We very much like the
general approach of having multiple unique databases, and attaching mapping
test cases to those; and in general, the use of short unique identifiers for
test cases ('tc0000a', etc). However, coming at it from a perspective of
large-scale automation as well as browsing and understandability, we do have
some concerns about how both the data sets and test cases are identified.
Is there a chance that the WG might consider some minor tweaks? The 2 main
concerns are:

1. there's no explicit short, sortable unique identifier for databases
(schemas or 'data sets')
2. the test case identifiers are not consistent and scalable

*Summary*

To address these, an identification system similar to the one below is
proposed. Note that the organization is identical to what's there now; it
is simply the identifiers (and associated file names using those
identifiers) that are changed. Details are below for those who want more
background.

*D000* - Simple 1 table, 1 column database; empty.
TC000-G000 - Direct graph - map_g000.ttl, ...
TC000-R000 - <description> - map_r000.ttl, ...
TC000-R001 - <description> - map_r001.ttl, ...
...
*D001* - Simple 1 table, 1 column database; 1 record.
TC001-G000 - Direct graph - map_g000.ttl, ...
TC001-R000 - <description> - map_r000.ttl, ...
TC001-R001 - <description> - map_r001.ttl, ...
...
*D002* - Simple 1 table, 2 columns database; 1 record.
TC002-G000 - Direct graph - map_g000.ttl, ...
TC002-R000 - <description> - map_r000.ttl, ...
TC002-R001 - <description> - map_r001.ttl, ...
TC002-R002 - <description> - map_r002.ttl, ...
TC002-R003 - <description> - map_r003.ttl, ...
...

I show capital letters above, which is appropriate for use in docs, but in
directories and file names, all identifiers should be lowercase.

*Background, Item 1*

For the first item, currently the names are 'munged' short descriptions,
such as *1table0rows*, *1table1compositeprimarykey3columns1row*, etc. These
can get quite long, but more importantly, they don't sort nicely in
directory/file listings (such as under
https://dvcs.w3.org/hg/rdb2rdf-tests/file/92f2b4b0c9ca).
Besides file system sorting, the names begin with digits, which keeps them
from being used as unquoted identifiers in a database schema (for some/many
systems).

I would suggest we use a unique identifier such as 'D000', 'D001', etc, for
'data' or 'database'. This addresses all the issues raised above.

[Note that while I say there's no *explicit* sortable identifier for data
sets, there does seem to be an *implicit* one, bound into the names of all
test cases under a data set. E.g., tc0000, tc0000a, tc0000b, etc are all
tied to the first data set (which one could infer had an identifier along
the lines of '0000'). Similarly for tc0001, etc. So really I'm just
suggesting that the data set identifier be made explicit (and start with an
alpha character). It can be 4 digits in length (as the test cases currently
are), but 3 seems sufficient (maybe even 2; too many unique data sets will
make test case management more difficult).]
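
A minimal sketch of the data set identifier idea in Python. The helper names
(dataset_id, usable_as_identifier) are hypothetical, and the identifier rule
is an assumption (an unquoted name must start with a letter or underscore)
that holds for many, but not all, database systems.

```python
import re


def dataset_id(index: int, digits: int = 3) -> str:
    """Format a zero-padded data set identifier such as 'D000' (hypothetical helper)."""
    if not 0 <= index < 10 ** digits:
        raise ValueError(f"index {index} does not fit in {digits} digits")
    return f"D{index:0{digits}d}"


# Assumption: unquoted identifiers start with a letter or underscore,
# followed by letters, digits, or underscores.
UNQUOTED_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def usable_as_identifier(name: str) -> bool:
    """Return True if the name fits the assumed unquoted-identifier rule."""
    return UNQUOTED_IDENTIFIER.fullmatch(name) is not None


# Zero padding makes lexicographic (file listing) order match numeric order.
ids = [dataset_id(i) for i in (10, 2, 0, 1)]
print(sorted(ids))  # ['D000', 'D001', 'D002', 'D010']

# A current-style name starts with a digit; the proposed one starts with a letter.
print(usable_as_identifier("1table0rows"))          # False
print(usable_as_identifier(dataset_id(0).lower()))  # True
```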

*Background, Item 2*

For the second item, the test case (mapping) identifiers have a couple of
variations depending on where you look, are not explicitly tied to their
parent data set, and seem limited to 26 mappings/tests per database.

As far as variations, in the wiki (
http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1), some of the
test cases have an 'Id' property, with values such as R2RMLTC0001a,
R2RMLTC0001b, etc. (Not all do, and some have a header with a space, such
as 'Direct Graph TC0000'). Whereas the 'official' test case spec (
http://www.w3.org/2001/sw/rdb2rdf/test-cases/) has similar names, but no
explicit 'Id', and the R2RML headers have a space like the Direct Graph
ones. It would be nice to make these Ids consistent and short, e.g., use
a code letter instead of 'Direct Graph' and 'R2RML' in the Id.

As far as links to database/data set, because there are no explicitly
defined data set Ids, one has to infer that the '0000', '0001', etc parts
of these are tying them to the (not explicitly identified) data sets. That
can be addressed by adding explicit data set Ids as suggested above.

As far as scalability, the use of the letters 'a', 'b', 'c', etc for the
individual mappings/test cases attached to each data set seems to limit the
number of test cases to 26 per database. That seems insufficient.
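
To put numbers on that limit, here is a small arithmetic sketch in Python
comparing single-letter suffixes with zero-padded numeric suffixes. The
function name numeric_capacity is hypothetical.

```python
import string

# Single lowercase letter suffixes: 'a' through 'z'.
letter_capacity = len(string.ascii_lowercase)


def numeric_capacity(digits: int) -> int:
    """Number of distinct zero-padded suffixes with the given digit count."""
    return 10 ** digits


print(letter_capacity)      # 26
print(numeric_capacity(2))  # 100
print(numeric_capacity(3))  # 1000
```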

To resolve these, I'd suggest we uniquely identify each mapping test case as
follows:

test case id = TC<data set #>-<mapping id>

where <mapping id> would be local to a data set, with values G000, G001,
etc, for direct graph mappings and R000, R001, R002, etc for R2RML
mappings. (I'm assuming that there *could* eventually be more than 1 direct
mapping test case per data set, based on configuration; otherwise the 'G'
and 'R' for the mapping part would not be needed, with '000' representing
the direct graph, like it is now). As with the data set Id, the # of digits
can probably safely be 3 (or perhaps 2, though that *could* come up short in
some cases).
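
The format above can be sketched as a formatter and parser in Python. This
assumes 3 digits for both parts and only the 'G' and 'R' code letters; the
names CaseId, format_test_case_id, and parse_test_case_id are hypothetical
and not part of any existing WG tooling.

```python
import re
from typing import NamedTuple


class CaseId(NamedTuple):
    dataset: int   # data set number, e.g. 2 for D002
    kind: str      # 'G' = direct graph mapping, 'R' = R2RML mapping
    mapping: int   # mapping number, local to the data set


# Assumption: exactly 3 digits in each numeric part.
PATTERN = re.compile(r"TC(\d{3})-([GR])(\d{3})")


def format_test_case_id(dataset: int, kind: str, mapping: int) -> str:
    """Build an id of the form TC<data set #>-<mapping id>."""
    if kind not in ("G", "R"):
        raise ValueError(f"unknown mapping kind: {kind!r}")
    if not (0 <= dataset < 1000 and 0 <= mapping < 1000):
        raise ValueError("data set and mapping numbers must fit in 3 digits")
    return f"TC{dataset:03d}-{kind}{mapping:03d}"


def parse_test_case_id(text: str) -> CaseId:
    """Parse an id; lowercase input (as used in file names) is accepted."""
    match = PATTERN.fullmatch(text.upper())
    if match is None:
        raise ValueError(f"not a valid test case id: {text!r}")
    dataset, kind, mapping = match.groups()
    return CaseId(int(dataset), kind, int(mapping))


print(format_test_case_id(2, "R", 3))    # TC002-R003
print(parse_test_case_id("tc001-g000"))  # CaseId(dataset=1, kind='G', mapping=0)
```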

Ideally, the mapping files for any given data set would be named using the
local mapping ID, e.g., map_g000.ttl, map_r000.ttl, map_r001.ttl, etc, so
they can be unambiguously linked to spec definitions and so they (and output
filenames generated from them) sort nicely within the context of a given
data set's directory.
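
A short Python sketch of that file naming, assuming the 'map_' prefix and
'.ttl' extension from the examples above and lowercase identifiers in file
names. The function names are hypothetical.

```python
def mapping_filename(mapping_id: str) -> str:
    """Derive a mapping file name from a local mapping id, e.g. 'R001'."""
    return f"map_{mapping_id.lower()}.ttl"


def mapping_filename_for(test_case_id: str) -> str:
    """Derive the file name from a full id such as 'TC002-R001'."""
    _, mapping_id = test_case_id.split("-", 1)
    return mapping_filename(mapping_id)


print(mapping_filename_for("TC002-R001"))  # map_r001.ttl

# Within one data set's directory the names sort predictably.
names = [mapping_filename(m) for m in ("R001", "G000", "R000")]
print(sorted(names))  # ['map_g000.ttl', 'map_r000.ttl', 'map_r001.ttl']
```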

Thanks,
Bob Scanlon
Revelytix

Received on Monday, 7 March 2011 16:10:48 UTC