Re: [BIORDF] Re: Unstructured vs. Structured (was: HL7 and patient records in RDF/OWL?) from Eric Neumann on 2006-02-19 (public-semweb-lifesci@w3.org from February 2006)

From: Eric Neumann <eneumann@alum.mit.edu>
Date: Sun, 19 Feb 2006 18:54:18 -0500 (EST)
To: Alf Eaton <lists@hubmed.org>
Cc: public-semweb-lifesci@w3.org
Message-ID: <19756382.1140393258332.JavaMail.gbourne@brunch.mit.edu>

Alf,

I'm not sure about any existing XHTML standards for tables. Personally, though, I'm not that fond of list-of-lists structures, since they are representations of representations: a table is already a shorthand representation.

If I may, let me suggest going back a few steps and considering that cells in tables are defined by (at least) two sets of ordinates, arranged as rows and columns. These must always represent something that describes the cells, and each cell is uniquely identified by one column identifier (e.g., expt condition) and one row identifier (e.g., gene probe).

Using RDF, one obvious graph model is to make each cell a bnode of some type (e.g., "gene expression measurement") and link it to one column node and one row node. The result is not directly a list of lists, but a unique projection mapping of two ordinate nodes: a web of cells, to be exact.
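
A minimal sketch of that cell model, in plain Python rather than any particular RDF toolkit. The namespace, the property names (row, column, value), and the probe and condition identifiers are all made up for illustration; nothing here is taken from BioDash.

```python
from itertools import count

# Hypothetical vocabulary; none of these names come from an existing schema.
EX = "http://example.org/hypersheet#"

def table_to_triples(row_ids, col_ids, values):
    """Return (subject, predicate, object) triples, one blank node per cell."""
    fresh = count()
    triples = []
    for r, row_id in enumerate(row_ids):
        for c, col_id in enumerate(col_ids):
            cell = f"_:cell{next(fresh)}"  # blank node label for this cell
            triples += [
                (cell, "rdf:type", EX + "ExpressionMeasurement"),
                (cell, EX + "row", row_id),        # link to exactly one row node
                (cell, EX + "column", col_id),     # link to exactly one column node
                (cell, EX + "value", values[r][c]),
            ]
    return triples

triples = table_to_triples(
    ["probe:1001_at", "probe:1002_at"],   # made-up row identifiers
    ["cond:control", "cond:treated"],     # made-up column identifiers
    [[1.2, 3.4], [0.7, 0.9]],
)
```

Each cell ends up with four statements, and no cell refers to its position in a list; it only refers to its row node and its column node.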

In effect, each cell "knows" that it belongs to a row and a column identifier. This structure has the added advantage that any algorithm that processes cell values can also evaluate the set of values linked from each row and column identifier (e.g., GO, pathways, annotations, tissue states). Most clustering tools currently use non-standard ways of accessing such info; using RDF it could be standardized. For obvious reasons I've called these Hypersheets...
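
To illustrate the point about following links from the row identifiers, here is a rough sketch that averages cell values per row annotation. The cells, the pathway terms, and the function name are invented for the example; a real version would walk the RDF graph instead of Python tuples and dicts.

```python
from collections import defaultdict

# Cells as (row id, column id, value); in RDF each would be a blank node
# with one link to its row node and one to its column node.
cells = [
    ("probe:A", "cond:control", 1.0),
    ("probe:A", "cond:treated", 3.0),
    ("probe:B", "cond:control", 2.0),
    ("probe:B", "cond:treated", 4.0),
    ("probe:C", "cond:control", 5.0),
]

# Hypothetical annotations hanging off the row identifiers.
row_annotations = {
    "probe:A": {"pathway:glycolysis"},
    "probe:B": {"pathway:glycolysis", "pathway:apoptosis"},
    "probe:C": {"pathway:apoptosis"},
}

def mean_by_row_annotation(cells, row_annotations):
    """Group cell values by the annotations of their row and average them."""
    grouped = defaultdict(list)
    for row_id, _col_id, value in cells:
        for term in row_annotations.get(row_id, ()):
            grouped[term].append(value)
    return {term: sum(vals) / len(vals) for term, vals in grouped.items()}

means = mean_by_row_annotation(cells, row_annotations)
```

The same traversal would work from the column side (e.g., tissue state per condition), since each cell carries both links.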

BTW, the order usually seen in a table is now determined directly by the list ordering of the rows and columns. I've found this useful for quickly rearranging tables just by changing the row order, for example.
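
As a sketch of that last point, assuming the cells are held as (row, column, value) records and the orderings are kept as separate lists (the `render` helper and the identifiers are hypothetical):

```python
def render(cells, row_order, col_order):
    """Lay out cells as a list of rows, following the given orderings."""
    lookup = {(r, c): v for r, c, v in cells}
    # Missing cells come out as None rather than raising.
    return [[lookup.get((r, c)) for c in col_order] for r in row_order]

cells = [
    ("probe:A", "cond:control", 1.0),
    ("probe:A", "cond:treated", 3.0),
    ("probe:B", "cond:control", 2.0),
    ("probe:B", "cond:treated", 4.0),
]
cols = ["cond:control", "cond:treated"]

original = render(cells, ["probe:A", "probe:B"], cols)
reordered = render(cells, ["probe:B", "probe:A"], cols)  # only the row list changed
```

The cells themselves are untouched between the two calls; only the row list differs.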

Eric



--- Alf Eaton <lists@hubmed.org> wrote:

> 
> I've been trying to decide on a good way to provide
> tabular data in  
> papers using XHTML, for presentation online. The
> best options seem to  
> be either just embedding the data as an array using
> JSON, or using  
> tables with class and id markup and allowing them to
> be processed  
> with GRDDL or Javascript to transform the data. Has
> there been any  
> work on presenting spreadsheets in XHTML?
> 
> alf.
> 
> On 19 Feb 2006, at 12:17, Eric Neumann wrote:
> 
> >
> > Matt,
> >
> > Spreadsheets are indeed useful as formatted
> sources that can be  
> > readily converted into RDF. We've used them as the
> primary source  
> > of expression data for BioDash (see attached
> averages; full  
> > GeneLogic data at
> http://www.samsi.info/200304/dmml/web-internal/ 
> > bio/data/data_rsvd.xls ). It almost seems a
> mapping tool could be  
> > written to take any excel files, a GRDDL-like
> conversion of column  
> > headers, row-headers, and cells, to produce RDF
> from these (see the  
> > example).
> >
> > In our example, we wrote the conversion scripts
> directly into the  
> > excel file. The resulting (adenine/N3) file is
> shown as well, with  
> > symbol strings mapped to URIs. The cool thing
> here is that if you  
> > add a DB query using the symbol strings (we did
> this within  
> > BioDash), you can take the returned gene
> information, convert it to  
> > RDF, and connect it to the expression graph
> through the probes for  
> > each row (see resulting adenine file).
> >
> > Perhaps the BIORDF group should include using sdf
> sources as part  
> > of their overall strategy for producing RDF from
> current structured  
> > files (e.g.,  gene expression, screening, and
> clinical data in  
> > sdf). Many published papers have data tables, and
> this would be a  
> > great way to auto convert them to RDF!
> >
> > Eric
> >
> > --- Matthew Cockerill <matt@biomedcentral.com>
> wrote:
> >
> >>
> >> I couldn't agree more.
> >>
> >> Spreadsheets (and equivalently, CSV files) are a
> >> large fraction of
> >> the 'additional datafiles' that BioMed Central
> >> receives from authors.
> >>
> >> What would be great would be to be able to define
> >> some simple
> >> standards and/or templates which authors could
> >> follow in their
> >> spreadsheets, to allow the automatic recognition
> of
> >> key life science
> >> identifiers, and quantitative attributes,  and so
> >> the generation of RDF.
> >>
> >>  From my point of view, that's the most basic,
> >> practical and
> >> prevalent example of the whole semi-structured
> data,
> >> and so seems
> >> like a good starting point.
> >>
> >> Matt
> >>
> >> On 15 Feb 2006, at 5:42, Cutler, Roger
> (RogerCutler)
> >> wrote:
> >>
> >>>
> >>> That's too deep for me.  I'll be satisfied, at
> >> least in an immediate
> >>> sense, with a demonstration of how to generate
> RDF
> >> from an Excel
> >>> spreadsheet.  I think I'll just start saying
> >> "Excel spreadsheet" and
> >>> forget about the term that we use internally to
> >> categorize the
> >>> kinds of
> >>> problems we have.  Spreadsheets are pretty much
> >> the 80-20 of that
> >>> problem, so why not call a spade a spade.  I'm
> >> really not very good at
> >>> generalizing and categorizing.
> >>>
> >>> -----Original Message-----
> >>> From: public-semweb-lifesci-request@w3.org
> >>> [mailto:public-semweb-lifesci-request@w3.org] On
> >> Behalf Of Christopher
> >>> Cavnor
> >>> Sent: Tuesday, February 14, 2006 3:54 PM
> >>> To: public-semweb-lifesci@w3.org
> >>> Subject: Re: Unstructured vs. Structured (was:
> HL7
> >> and patient records
> >>> in RDF/OWL?)
> >>>
> >>>
> >>> I'd argue that most information resources are
> >> indeed semi-structured.
> >>> The human brain is only able to meta-categorize
> >> resources based on its
> >>> structured aspects (markup and structural
> >> metadata), its informational
> >>> content (its aboutness), and context
> >> (environmental metadata).
> >>>
> >>> "Structured" data is only structured once we
> have
> >> a common
> >>> understanding
> >>> of its meaning. In this regard, data is never
> >> "raw" (except for
> >>> randomly
> >>> generated data) - as even structured database
> >> tables have metadata to
> >>> add meaning. So the term "semi-structured" is
> >> always adequate as
> >>> far as
> >>> I am concerned. You'd have to prove that there
> is
> >> any other type of
> >>> data
> >>> to me ;)
> >>>
> >>>
> >>> --
> >>> Christopher Cavnor
> >>>
> >>>
> >>> On 2/14/06 10:54 AM, "Cutler, Roger
> (RogerCutler)"
> >>> <RogerCutler@chevron.com>
> >>> wrote:
> >>>
> >>>>
> >>>> OK, then is there a preferred term for what we
> >> call "semi-structured
> >>>> data"?  That is, information that is structured
> >> but where the
> >>> structure
> >>>> is not easily determined and perhaps has not
> been
> >> formalized at all,
> >>> but
> >>>> for which a formalized structure could be
> >> defined?  For example,
> >>> tables
> >>>> in a spreadsheet?  We really care about this
> kind
> >> of thing, but I
> >>> don't
> >>>> want to confuse the issue by using terms that
> >> most people understand
> >>>> differently.
> >>>>
> >>>> Incidentally, from my personal experience the
> >> usage of the term
> >>>> semi-structured, that is, binary blobs in
> >> structured databases, is
> >>>> not
> >>>> very common.  Frankly, this is the first I have
> >> heard the term
> >>>> used in
> >>>> that sense, but maybe I just don't run in the
> >> right circles.
> >>>>
> >>>> -----Original Message-----
> >>>> From: public-semweb-lifesci-request@w3.org
> >>>> [mailto:public-semweb-lifesci-request@w3.org]
> On
> >> Behalf Of Jim
> >>>> Hendler
> 
=== message truncated ===
Received on Sunday, 19 February 2006 23:54:28 UTC