- From: Boris Villazón Terrazas <bvillazon@fi.upm.es>
- Date: Tue, 08 Mar 2011 02:46:46 +0100
- To: Robert Scanlon <rscanlon@revelytix.com>
- CC: public-rdb2rdf-wg@w3.org, aleon <aleon@delicias.dia.fi.upm.es>
- Message-ID: <4D758A86.1070101@fi.upm.es>
Hi Bob
First of all, thank you very much for starting to implement the Test Cases
... ;-)
Second of all, apologies, but last week I was quite busy with
some project deadlines.
Third of all, I would like to announce that Alex is back in Madrid, and
he will join the team defining the test cases.
Finally, my comments are inline.
On 07/03/2011 17:10, Robert Scanlon wrote:
> Hi all,
>
> We've started implementing some of the test cases. We very much like
> the general approach of having multiple unique databases, and
> attaching mapping test cases to those; and in general, the use of
> short unique identifiers for test cases ('tc0000a', etc). However,
> coming at it from a perspective of large-scale automation as well as
> browsing and understandability, we do have some concerns about how
> both the data sets and test cases are identified. Is there a chance
> that the WG might consider some minor tweaks?
Yes, of course ... if everyone agrees.
> The 2 main concerns are:
>
> 1. there's no explicit short, sortable unique identifier for
> databases (schemas or 'data sets')
>
You mean, something like database1, database2, etc?
>
> 2. the test case identifiers are not consistent and scalable
>
It is our first approach, and we know we can improve it ... thanks for
pointing this out.
>
>
> _*Summary*_
>
> To address these, an identification system similar to the one below is
> proposed. Note that the organization is identical to what's there
> now, it is simply the identifiers (and associated file names using
> those identifiers) that are changed. Details are below for those who
> want more background.
>
> *D000* - Simple 1 table, 1 column database; empty.
> TC000-G000 - Direct graph - map_g000.ttl, ...
> TC000-R000 - <description> - map_r000.ttl, ...
> TC000-R001 - <description> - map_r001.ttl, ...
> ...
> *D001* - Simple 1 table, 1 column database; 1 record.
> TC001-G000 - Direct graph - map_g000.ttl, ...
> TC001-R000 - <description> - map_r000.ttl, ...
> TC001-R001 - <description> - map_r001.ttl, ...
> ...
> *D002* - Simple 1 table, 2 columns database; 1 record.
> TC002-G000 - Direct graph - map_g000.ttl, ...
> TC002-R000 - <description> - map_r000.ttl, ...
> TC002-R001 - <description> - map_r001.ttl, ...
> TC002-R002 - <description> - map_r002.ttl, ...
> TC002-R003 - <description> - map_r003.ttl, ...
> ...
>
> I show capital letters above, which is appropriate for use in docs,
> but in directories and file names, all identifiers should be lowercase.
>
>
Ok, I'll check it and I'll modify them.
> *_Background, Item 1_*
>
> For the first item, currently the names are 'munged' short
> descriptions, such as *1table0rows*,
> *1table1compositeprimarykey3columns1row*, etc. These can get quite
> long, but more importantly, they don't sort nicely in directory/file
> listings (such as under dvcs.w3.org/hg/rdb2rdf-tests/file
> <https://dvcs.w3.org/hg/rdb2rdf-tests/file/92f2b4b0c9ca>). Besides
> file system sorting, the names begin with numbers, which keeps them
> from being used in any way as part of a database schema (for some/many
> systems).
These short descriptions were suggested by Eric ;) ... but I see the
problem ...
Eric, do you want to say something about this?
>
> I would suggest we use a unique identifier such as 'D000', 'D001',
> etc, for 'data' or 'database'. This addresses all the issues raised
> above.
>
> [Note that while I say there's no _explicit_ sortable identifier for
> data sets, there does seem to be an /implicit/ one, bound into the
> names of all test cases under a data set. E.g., tc0000, tc0000a,
> tc0000b, etc are all tied to the first data set (which one could infer
> had an identifier along the lines of '0000'). Similarly for tc0001,
> etc. So really I'm just suggesting that the data set identifier be
> made explicit (and start with an alpha character). It can be 4 digits
> in length (as the test cases currently are), but 3 seems sufficient
> (maybe even 2; too many unique data sets will make test case
> management more difficult).]
>
>
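To illustrate the two points raised in Item 1 (sorting and use as part of a schema name), here is a minimal Python sketch. The helper names dataset_id and is_schema_safe are hypothetical, and the naming rule in SCHEMA_NAME is only an assumption about what "some/many systems" accept for unquoted names; it is not taken from any particular database.

```python
import re

# Assumption: a conservative rule for unquoted schema names
# (letter or underscore first). Real systems differ in the details.
SCHEMA_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def dataset_id(number: int, width: int = 3) -> str:
    """Return a zero-padded data set identifier such as 'D002'."""
    if not 0 <= number < 10 ** width:
        raise ValueError(f"data set number {number} does not fit in {width} digits")
    return f"D{number:0{width}d}"


def is_schema_safe(name: str) -> bool:
    """Check a name against the assumed schema-name rule above."""
    return SCHEMA_NAME.fullmatch(name) is not None


# Zero padding makes plain lexicographic order match numeric order.
ids = [dataset_id(n) for n in (10, 2, 0, 1)]
print(sorted(ids))                             # ['D000', 'D001', 'D002', 'D010']

# A name that begins with a digit fails the assumed rule; 'd000' passes.
print(is_schema_safe("1table0rows"))           # False
print(is_schema_safe(dataset_id(0).lower()))   # True
```

The lowercase form in the last line follows the suggestion that directories and file names use lowercase identifiers.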
Ok, I see ... if everyone agrees, I'll update the test cases.
Any comments?
> *_Background, Item 2_*
>
> For the second item, the test case (mapping) identifiers have a couple
> variations depending on where you look, are not explicitly tied to
> their parent data set, and seem limited to 26 mappings/tests per
> database.
>
> As far as variations, in the wiki
> (http://www.w3.org/2001/sw/rdb2rdf/wiki/R2RML_Test_Cases_v1), some of
> the test cases have an 'Id' property, with values such as
> R2RMLTC0001a, R2RMLTC0001b, etc. (Not all do, and some have a header
> with a space, such as 'Direct Graph TC0000'). Whereas the 'official'
> test case spec (http://www.w3.org/2001/sw/rdb2rdf/test-cases/) has
> similar names, but no explicit 'Id', and the R2RML headers have a
> space like the Direct Graph ones. It would be nice to make these Id's
> consistent, and short, e.g., use a code letter instead of 'Direct
> Graph' and 'R2RML' in the Id.
>
> As far as links to database/data set, because there are not
> explicitly-defined data set Id's, one has to infer that the '0000',
> '0001', etc parts of these are tying them to the (not explicitly
> identified) data sets. That can be addressed by adding explicit data
> set Id's as suggested above.
>
> As far as scalability, the use of the letters 'a', 'b', 'c', etc for
> the individual mappings/test cases attached to each data set seems to
> limit the # of test cases to 26 per database. That seems
> insufficient.
>
> To resolve these, I'd suggest we uniquely identify each mapping test
> case as follows:
>
> test case id = TC<data set #>-<mapping id>
>
> where <mapping id> would be local to a data set, with values G000,
> G001, etc, for direct graph mappings and R000, R001, R002, etc for
> R2RML mappings. (I'm assuming that there *could* eventually be more
> than 1 direct mapping test case per data set, based on configuration;
> otherwise the 'G' and 'R' for the mapping part would not be needed,
> with '000' representing the direct graph, like it is now). As with
> the data set Id, the # of digits can probably safely be 3 (or perhaps
> 2, though that *could* come up short in some cases).
>
> Ideally, the mapping files for any given data set would be named using
> the local mapping ID, e.g., map_g000.ttl, map_r000.ttl, map_r001.ttl,
> etc, so they can be unambiguously linked to spec definitions and so
> they (and output filenames generated from them) sort nicely within the
> context of a given data set's directory.
>
>
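The proposed form "TC<data set #>-<mapping id>" and the matching file names can be sketched in Python as follows. This is only an illustration of the proposal, not an agreed format: the class CaseId, the function parse_test_case_id, and the choice of exactly 3 digits for both parts are assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for TC<data set #>-<mapping id>,
# assuming 3 digits for both the data set and the mapping number.
TEST_CASE_ID = re.compile(r"TC(\d{3})-([GR])(\d{3})")


@dataclass(frozen=True, order=True)
class CaseId:
    dataset: int   # data set number, e.g. 2 for D002
    kind: str      # 'G' = direct graph mapping, 'R' = R2RML mapping
    mapping: int   # mapping number, local to the data set

    def __str__(self) -> str:
        # Uppercase form, as suggested for use in documents.
        return f"TC{self.dataset:03d}-{self.kind}{self.mapping:03d}"

    def mapping_file(self) -> str:
        # Lowercase form, as suggested for directories and file names.
        return f"map_{self.kind.lower()}{self.mapping:03d}.ttl"


def parse_test_case_id(text: str) -> CaseId:
    """Parse an id such as 'TC002-R003' (case-insensitive)."""
    match = TEST_CASE_ID.fullmatch(text.upper())
    if match is None:
        raise ValueError(f"not a test case id: {text!r}")
    dataset, kind, mapping = match.groups()
    return CaseId(int(dataset), kind, int(mapping))


case = parse_test_case_id("tc002-r003")
print(case)                  # TC002-R003
print(case.mapping_file())   # map_r003.ttl
```

With 3 digits, each data set has room for 1000 mappings of each kind, compared with 26 for the single-letter suffixes. The sketch does not check that numbers passed directly to CaseId fit in 3 digits.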
Ok, I see ... if everyone agrees, I'll update the test cases.
Any comments?
Thank you very much for your suggestions, Bob.
Boris
> Thanks,
> Bob Scanlon
> Revelytix
>
Received on Tuesday, 8 March 2011 01:47:18 UTC