[BIORDF] Re: Unstructured vs. Structured (was: HL7 and patient records in RDF/OWL?)

Matt, 

Spreadsheets are indeed useful as formatted sources that can be readily converted into RDF. We've used them as the primary source of expression data for BioDash (see attached averages; full GeneLogic data at http://www.samsi.info/200304/dmml/web-internal/bio/data/data_rsvd.xls ). It seems a mapping tool could be written to take any Excel file and apply a GRDDL-like conversion of column headers, row headers, and cells to produce RDF (see the example).
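
To make the header/row/cell mapping concrete, here is a minimal Python sketch. It is not the BioDash conversion: it assumes the sheet has been exported to CSV, that the first column holds the row header (e.g. a probe id), and that the first row holds the column headers. The namespace http://example.org/expression/ and the sample values are made up for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/expression/"  # hypothetical namespace, not from the original post

def uri(name):
    """Turn a row or column header into a URI under the assumed namespace."""
    return f"<{BASE}{quote(name.strip(), safe='')}>"

def literal(value):
    """Quote a cell value as a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def table_to_triples(csv_text):
    """Map each non-empty cell to one triple: row header, column header, cell value."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    triples = []
    for row in body:
        if not row:  # skip blank lines in the export
            continue
        subject = uri(row[0])
        for column, cell in zip(header[1:], row[1:]):
            if cell.strip():
                triples.append(f"{subject} {uri(column)} {literal(cell.strip())} .")
    return triples

# Made-up sample rows, only to show the shape of the output.
sample = "probe,symbol,mean_expression\n1001_at,TIE1,52.3\n1002_f_at,CYP2C19,\n"
for triple in table_to_triples(sample):
    print(triple)
```

Each printed line is one N3 statement. Empty cells are skipped, so a sparse sheet simply yields fewer triples. Typed literals and per-column vocabularies are left out of this sketch.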

In our example, we wrote the conversion scripts directly into the Excel file. The resulting (adenine/N3) file is shown as well, with symbol strings mapped to URIs. The cool thing here is that if you add a DB query using the symbol strings (we did this within BioDash), you can take the returned gene information, convert it to RDF, and connect it to the expression graph through the probes for each row (see resulting adenine file).
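
The linking step could look roughly like the Python sketch below. It is an illustration only, not the BioDash code. It assumes the probe-to-symbol pairs come from the spreadsheet and that the DB query result is already available as a dictionary keyed by symbol. The namespaces, the "measures" property, and the function name link_genes are all hypothetical.

```python
BASE = "http://example.org/expression/"  # hypothetical namespace for probes and properties
GENE = "http://example.org/gene/"        # hypothetical namespace for gene resources

def link_genes(probe_symbols, gene_records):
    """Connect each probe to its gene and attach the gene's looked-up fields.

    probe_symbols: {probe_id: symbol}, taken from the spreadsheet rows
    gene_records:  {symbol: {field: value}}, standing in for a DB query result
    Assumes ids, symbols, and field names are already URI-safe.
    """
    triples = []
    described = set()  # genes whose fields have already been emitted
    for probe, symbol in sorted(probe_symbols.items()):
        record = gene_records.get(symbol)
        if record is None:  # the lookup returned nothing for this symbol
            continue
        gene = f"<{GENE}{symbol}>"
        triples.append(f"<{BASE}{probe}> <{BASE}measures> {gene} .")
        if symbol not in described:
            described.add(symbol)
            for field, value in sorted(record.items()):
                triples.append(f'{gene} <{GENE}{field}> "{value}" .')
    return triples
```

Because the probe URI is the same one used as the subject of the expression triples, the gene description joins the expression graph without any extra merging step.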

Perhaps the BIORDF group should include sdf sources as part of its overall strategy for producing RDF from current structured files (e.g., gene expression, screening, and clinical data in sdf). Many published papers have data tables, and this would be a great way to convert them to RDF automatically!

Eric

--- Matthew Cockerill <matt@biomedcentral.com> wrote:

> 
> I couldn't agree more.
> 
> Spreadsheets (and equivalently, CSV files) are a
> large fraction of  
> the 'additional datafiles' that BioMed Central
> receives from authors.
> 
> What would be great would be to be able to define
> some simple  
> standards and/or templates which authors could
> follow in their  
> spreadsheets, to allow the automatic recognition of
> key life science  
> identifiers, and quantitative attributes,  and so
> the generation of RDF.
> 
>  From my point of view, that's the most basic,
> practical and  
> prevalent example of the whole semi-structured data,
> and so seems  
> like a good starting point.
> 
> Matt
> 
> On 15 Feb 2006, at 5:42, Cutler, Roger (RogerCutler)
> wrote:
> 
> >
> > That's too deep for me.  I'll be satisfied, at
> least in an immediate
> > sense, with a demonstration of how to generate RDF
> from an Excel
> > spreadsheet.  I think I'll just start saying
> "Excel spreadsheet" and
> > forget about the term that we use internally to
> categorize the  
> > kinds of
> > problems we have.  Spreadsheets are pretty much
> the 80-20 of that
> > problem, so why not call a spade a spade.  I'm
> really not very good at
> > generalizing and categorizing.
> >
> > -----Original Message-----
> > From: public-semweb-lifesci-request@w3.org
> > [mailto:public-semweb-lifesci-request@w3.org] On
> Behalf Of Christopher
> > Cavnor
> > Sent: Tuesday, February 14, 2006 3:54 PM
> > To: public-semweb-lifesci@w3.org
> > Subject: Re: Unstructured vs. Structured (was: HL7
> and patient records
> > in RDF/OWL?)
> >
> >
> > I'd argue that most information resources are
> indeed semi-structured.
> > The human brain is only able to meta-categorize
> resources based on its
> > structured aspects (markup and structural
> metadata), its informational
> > content (its aboutness), and context
> (environmental metadata).
> >
> > "Structured" data is only structured once we have
> a common  
> > understanding
> > of its meaning. In this regard, data is never
> "raw" (except for  
> > randomly
> > generated data) - as even structured database
> tables have metadata to
> > add meaning. So the term "semi-structured" is
> always adequate as  
> > far as
> > I am concerned. You'd have to prove that there is
> any other type of  
> > data
> > to me ;)
> >
> >
> > --
> > Christopher Cavnor
> >
> >
> > On 2/14/06 10:54 AM, "Cutler, Roger (RogerCutler)"
> > <RogerCutler@chevron.com>
> > wrote:
> >
> >>
> >> OK, then is there a preferred term for what we
> call "semi-structured
> >> data"?  That is, information that is structured
> but where the
> > structure
> >> is not easily determined and perhaps has not been
> formalized at all,
> > but
> >> for which a formalized structure could be
> defined?  For example,
> > tables
> >> in a spreadsheet?  We really care about this kind
> of thing, but I
> > don't
> >> want to confuse the issue by using terms that
> most people understand
> >> differently.
> >>
> >> Incidentally, from my personal experience the
> usage of the term
> >> semi-structured, that is, binary blobs in
> structured databases, is  
> >> not
> >> very common.  Frankly, this is the first I have
> heard the term  
> >> used in
> >> that sense, but maybe I just don't run in the
> right circles.
> >>
> >> -----Original Message-----
> >> From: public-semweb-lifesci-request@w3.org
> >> [mailto:public-semweb-lifesci-request@w3.org] On
> Behalf Of Jim  
> >> Hendler
> >> Sent: Monday, February 13, 2006 3:43 PM
> >> To: Pat Hayes; Gao, Yong
> >> Cc: public-semweb-lifesci@w3.org
> >> Subject: Re: Unstructured vs. Structured (was:
> HL7 and patient  
> >> records
> >> in RDF/OWL?)
> >>
> >>
> >> At 14:46 -0600 2/13/06, Pat Hayes wrote:
> >>>>
> >>>> The point I'm trying to make is this: The
> concept of
> > "structuredness"
> >>>> is relative and context-sensitive.
> >>>
> >>> Hear, hear. Well said.
> >>>
> >>> Pat Hayes
> >>>
> >>
> >>
> >> FWIW, Structured, unstructured and
> semi-structured, although
> > non-precise
> >> concepts in common language and (esp) philosophy,
> have well-defined
> > and
> >> precise meanings in database jargon -- most
> database books have
> > decent
> >> definitions that are consistent with:
> >>   unstructured - NL text
> >>   semi-structured - unstructured fields within a
> structured DB  
> >> context
> >>   structured - relational model (or similar)
> (those papers with
> >> technical definitions tend to get ugly and
> recourse to relational
> >> calculus, so these overly simplified definitions
> should suffice for
> > now)
> >> that said, in the spirit of this particular
> thread, I think we should
> > be
> >> careful and, if we mean to use it in a DB
> context, make it clear in
> > any
> >> document that uses the term (i.e. "structured
> database" v.
> >> "structured data" which are very different in
> some contexts)
> >>     -JH
> >

Received on Sunday, 19 February 2006 17:18:02 UTC