RE: Unstructured vs. Structured (was: HL7 and patient records in RDF/OWL?) from Cutler, Roger (RogerCutler) on 2006-02-13 (public-semweb-lifesci@w3.org from February 2006)

From: Cutler, Roger (RogerCutler) <RogerCutler@chevron.com>
Date: Mon, 13 Feb 2006 10:23:45 -0600
To: "Gao, Yong" <YGAO@PARTNERS.ORG>, public-semweb-lifesci@w3.org
Message-ID: <0C237C50B244FD44BE47B8DCE23A3052011C62DA@HOU150NTXC2MC.hou150.chevrontexaco.net>

Well ... maybe.  I see your point, but I think nonetheless that there
are some important distinctions to be made within what you are calling
non-RDF.  On one extreme one has highly structured data in relational
databases.  One key here is that the data definitions are contained in
machine readable, standardized schemas.  Another is that at least some
of the relationships and keying of the data are explicit.  Slightly less
structured are XML documents that have schemas. Intermediate are data
that have internal structure, where the definition of that structure is
not easily determined by a machine.  XML documents sans schema, HTML
documents and spreadsheets come to mind, probably in decreasing order of
"structuredness".  We in CVX call these "semi-structured data", but I'm
not sure whether this usage is widespread.  Then on the other end of the
spectrum is text, in which, as you point out, a structure certainly
exists, but even a human being may find it really hard to figure out and
formalize that structure.
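
As a rough illustration of that spectrum, here is a small sketch.  The
category names and the mapping are only my shorthand for this note, not
an established taxonomy, and the relative order inside the
"semi-structured" group is not modeled at all.

```python
from enum import IntEnum


class Structuredness(IntEnum):
    """Hypothetical labels for the spectrum; higher means more structured."""
    TEXT = 0             # structure exists, but is hard even for a human to formalize
    SEMI_STRUCTURED = 1  # internal structure, no machine-readable definition of it
    XML_WITH_SCHEMA = 2  # structure defined by a schema
    RELATIONAL = 3       # standardized schema plus explicit keys and relationships


# Illustrative mapping from a kind of source to its place on the spectrum.
EXAMPLES = {
    "relational database": Structuredness.RELATIONAL,
    "XML document with schema": Structuredness.XML_WITH_SCHEMA,
    "XML document without schema": Structuredness.SEMI_STRUCTURED,
    "HTML document": Structuredness.SEMI_STRUCTURED,
    "spreadsheet": Structuredness.SEMI_STRUCTURED,
    "free text": Structuredness.TEXT,
}


def most_to_least_structured(examples):
    """Return the source names ordered from most to least structured."""
    return sorted(examples, key=lambda name: examples[name], reverse=True)


if __name__ == "__main__":
    for name in most_to_least_structured(EXAMPLES):
        print(f"{EXAMPLES[name].name:16} {name}")
```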

We are pretty interested in the "semi-structured" realm, as defined
above, particularly because we have a lot of business-critical
information in spreadsheets, and I noted at the F2F that a number of
other representatives were interested, too.

-----Original Message-----
From: public-semweb-lifesci-request@w3.org
[mailto:public-semweb-lifesci-request@w3.org] On Behalf Of Gao, Yong
Sent: Friday, February 10, 2006 12:02 PM
To: public-semweb-lifesci@w3.org
Subject: Unstructured vs. Structured (was: HL7 and patient records in
RDF/OWL?)


Having trained as a computational linguist, one thing I remember vividly
is the debate among linguists on the issue of semantics vs. syntax. One
of the wisdoms I gained from that experience is the saying "One man's
semantics is another man's syntax." (I'll need to dig deeper to find its
origin.)

Having worked on building practical tools for data extraction and
integration, I've learned the importance of NOT getting too bogged
down in labeling what's "structured" and what's not. Here I quote
another saying: "One Man's Ceiling is Another Man's Floor."


The point I'm trying to make is this: The concept of "structuredness" is
relative and context-sensitive. For example, natural language texts are
highly structured; it's just that we still have a long way to go to fully
discover and understand their structures and use them to find meanings
mechanically. As another example, HTML pages are structured so that web
browsers can display them properly. XML and RDF data can just as well be
"unstructured" if you put a blob of text, say an abstract, between a pair
of tags.
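
To make the XML case concrete, here is a minimal sketch using Python's
standard xml.etree.ElementTree module. The record, its tag names, and
the describe() helper are all invented for illustration. The parser
recovers the tag structure, but the abstract comes back as one opaque
string with no further structure exposed to the machine.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: well-formed XML whose <abstract> is a single text blob.
RECORD = """<article>
  <title>A made-up title</title>
  <abstract>We studied twelve patients. Three improved after treatment.</abstract>
</article>"""


def describe(xml_text):
    """Return (child tag names, sub-element count of <abstract>, abstract text)."""
    root = ET.fromstring(xml_text)
    child_tags = [child.tag for child in root]
    abstract = root.find("abstract")
    # len() of an Element is the number of its direct child elements.
    return child_tags, len(abstract), abstract.text


if __name__ == "__main__":
    tags, sub_elements, text = describe(RECORD)
    print("tags the parser can see:", tags)
    print("sub-elements inside <abstract>:", sub_elements)
    print("abstract is just a string:", repr(text))
```

Anything inside the abstract (how many patients, what outcome) would
still have to be extracted by other means.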

I would almost suggest the term "non-RDF", rather than "unstructured",
be used in the context of transforming some data into RDF format.

---
Yong Gao, Ph.D.
MassGeneral Institute for Neurodegenerative Disease (MIND)

Received on Monday, 13 February 2006 16:24:24 UTC