- From: Eric Jain <Eric.Jain@isb-sib.ch>
- Date: Mon, 14 Oct 2002 10:08:30 +0200
- To: "xmlschema-dev" <xmlschema-dev@w3.org>
One of the better-known bioinformatics databases, SWISS-PROT, is going to be distributed in XML sometime soon. There has been a lot of demand, mostly because the data is pretty complex and the current flat text format is not quite trivial to parse, and of course because XML will solve everyone's problems.

The programmers who process our data are often, but not always, entry-level programmers, typically working with Perl, Java or C++. Tasks range from trying to integrate our data (or more likely parts of it) with other databases to simply reformatting and displaying single database entries in detail.

We decided to provide an XML Schema because it can be used as a detailed description of the data as well as for generating code for simple applications, not to mention web services... Also, it provides me with a pretext for asking people on this list for free XML advice :-)

There remain some open issues we haven't been able to decide on:

* Element naming: geneName vs. gene-name. Mixed-case names seem to be more fashionable at the moment, but I tend to prefer the second, less programming-language-like style. Certainly the Perl programmers wouldn't approve of Java-style names. And then there are the Python programmers... Similarly, should lists be explicitly named as such, e.g. <geneList> or <gene-list> vs. <genes>? What do you prefer to work with?

* Should we avoid Schema features not supported by JAXB? Are there any other features you would advise against using if we don't want to anger or frustrate anyone?

* Importance of tools. The general opinion is that it should be left up to users to write their own parsers. My view is that we should provide a set of tools for reading, writing and representing our data right from the beginning, as a sort of reference implementation (in Java and possibly Perl, both of which we are using internally anyway). From your point of view, how much help would such tools be? Would you use them or, being an experienced XML developer, simply write your own code anyway?

* Normalize data? A lot of the data is repeated. My current approach is to put the repeated elements into separate files when distributing large amounts of data, but to allow them to be inlined in situations such as users downloading small sets from the web (e.g. query results). Is this strategy clever, or simply confusing?

  In swissprot.xml:

    <keyword id="25"/>

  In keywords.xml OR swissprot.xml:

    <keyword id="25">
      <name>x</name>
      <category>y</category>
      ...
    </keyword>

* Go ahead vs. wait. Some people think: why wait with the first release in XML? Since it's XML, we can always change the format, right?

Any comments would be greatly appreciated! (A Schema and some example data are available at http://viralgenomics.org/xml/.)

--
Eric Jain
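To make the "Normalize data?" question more concrete, here is a minimal sketch of what a consumer would have to do to handle both forms of <keyword>. It is an illustration only, not part of any planned tooling: the wrapper elements <keywords> and <entry>, the second keyword (id 31) and the function names are assumptions made up for the example. Only <keyword>, <name>, <category> and the id attribute come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. The <keywords> and <entry> wrappers and the
# keyword with id 31 are assumptions for illustration only.
KEYWORDS_XML = """
<keywords>
  <keyword id="25"><name>x</name><category>y</category></keyword>
</keywords>
"""

ENTRY_XML = """
<entry>
  <keyword id="25"/>
  <keyword id="31"><name>z</name><category>w</category></keyword>
</entry>
"""


def load_keyword_table(xml_text):
    """Index fully specified <keyword> elements by their id attribute."""
    root = ET.fromstring(xml_text)
    return {kw.get("id"): kw for kw in root.iter("keyword") if len(kw) > 0}


def resolve_keywords(entry_xml, table):
    """Return (id, name, category) for each keyword in an entry.

    An inlined keyword (one with child elements) is used as-is; an empty
    reference is looked up in the table built from the separate file.
    """
    resolved = []
    for kw in ET.fromstring(entry_xml).iter("keyword"):
        full = kw if len(kw) > 0 else table.get(kw.get("id"))
        if full is None:
            raise KeyError(f"unresolved keyword id {kw.get('id')!r}")
        resolved.append(
            (kw.get("id"), full.findtext("name"), full.findtext("category"))
        )
    return resolved


table = load_keyword_table(KEYWORDS_XML)
print(resolve_keywords(ENTRY_XML, table))
```

The point of the sketch is that every consumer needs the "inlined or reference?" branch plus a lookup table, which is roughly the extra cost the mixed strategy imposes on users.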
Received on Monday, 14 October 2002 04:08:46 UTC