Re: From strings to things: ClinicalTrials.gov from Alan Ruttenberg on 2013-02-19 (public-semweb-lifesci@w3.org from February 2013)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Tue, 19 Feb 2013 13:22:48 -0500
To: Oktie Hassanzadeh <oktie@cs.toronto.edu>
Cc: Kerstin Forsberg <kerstin.l.forsberg@gmail.com>, public-semweb-lifesci@w3.org, em@zepheira.com, cdsouthan@hotmail.com, brendan.kelleher@karmadata.com
Message-ID: <CAFKQJ8kT+Uhoc0xC4Z=52Lescgt-cAuYKjvLjRG6NdyvmfCtKA@mail.gmail.com>
Hello Oktie

With respect, I think you have it backwards. The easiest way to show
people things is on web pages, not custom queries. The usual pattern is
to first get them interested and then decide whether we might work
together on a project.

In order to implement the labeling I suggest the following strategy:
First, create an array of arrays, each of which (for the moment) contains
a single element, a property name. Start with three entries
(highest priority first):
[[linkedct:outcome_measure],[skos:prefLabel],[rdfs:label]].
In the interface code, getting a label for an object will amount to
walking down this list and taking the value of the first property
that has a value on the object. For property values that are entities
in LinkedCT, use the label of the entity. This is the strategy used by Protégé.
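
A minimal sketch of that lookup in Python, to make the walk concrete. It
is not LinkedCT's actual code: the dictionary-based `data` store and the
function name `get_label` are assumptions made for illustration, and a
real interface would read property values from the triple store instead.

```python
# Priority list from the message above, highest priority first.
LABEL_PROPERTIES = [
    ["linkedct:outcome_measure"],
    ["skos:prefLabel"],
    ["rdfs:label"],
]


def get_label(resource, data, seen=None):
    """Return the first label found for `resource`, or None.

    `data` is a hypothetical stand-in for the dataset:
    {resource: {property: [values]}}.
    """
    seen = set() if seen is None else seen
    if resource in seen:  # guard against cycles between entities
        return None
    seen.add(resource)
    props = data.get(resource, {})
    for entry in LABEL_PROPERTIES:
        prop = entry[0]  # first element of each entry is the property name
        for value in props.get(prop, []):
            if value in data:
                # The value is itself an entity: use that entity's label.
                label = get_label(value, data, seen)
                if label is not None:
                    return label
            else:
                return value
    return None
```

Adding a newly chosen property then means inserting one more entry at
the front of LABEL_PROPERTIES and refreshing the page.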

Then start browsing. Click on a numbered link, look at the data, and
pick the property that looks best. Add it to the front of your list. Refresh
the page, which should no longer show the link as numbered.
Repeat for a couple of hours.

I'll even offer to split the labor. You implement the label property
priority list I describe and give me a way to modify it. Then I'll
spend a couple of hours browsing and adding to the list.

Once you have an adequate list, a series of SPARQL update queries can
be used to assert all the found labels onto rdfs:label or
skos:prefLabel so that folks who write queries also have an easy time
finding a readable label.
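
As a rough sketch of what those updates could look like, the Python
below generates one SPARQL 1.1 INSERT ... WHERE request per property in
the list. The function name `label_updates` is hypothetical, and the
`linkedct:` namespace URI is a placeholder, not the real vocabulary URI.
The FILTER NOT EXISTS clause is also an assumption: it keeps the
priority order when the updates are run in sequence.

```python
PREFIXES = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX linkedct: <http://example.org/linkedct-vocab/>
"""  # linkedct: URI above is a placeholder (assumption)


def label_updates(properties, target="rdfs:label"):
    """Build one update string per label property, in priority order."""
    updates = []
    for prop in properties:
        if prop == target:
            continue  # nothing to copy from the target onto itself
        updates.append(
            PREFIXES
            + f"INSERT {{ ?s {target} ?label }}\n"
            + "WHERE {\n"
            + f"  ?s {prop} ?label .\n"
            + "  FILTER (isLiteral(?label))\n"
            # Skip subjects that already have a label on the target.
            + f"  FILTER NOT EXISTS {{ ?s {target} ?existing }}\n"
            + "}"
        )
    return updates
```

If the existing rdfs:label values are the unreadable ones, the filter
would block the new labels, so in that case pass target="skos:prefLabel"
or remove the old values first.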

There are easy additions to the procedure. For example, to enable
simple rewrites of property values, allow an arbitrary number of regex
pattern/rewrite expressions after each property in the structure above.
When constructing the label, apply them in order until the first
substitution whose result differs from the original label, and use that
result as the label. With this modification also include:

[linkedct:has_provenance,
s'http://data.linkedct.org/resource/provenance/httpclinicaltrialsgovshow(nct\d+)displayxmltrue'XML record for study $1 from clinicaltrials.gov']

That's not as good as including the study title, but a bit better. To
get the study title it would make most sense to extract and use it as
a label for http://data.linkedct.org/resource/provenance/httpclinicaltrialsgovshownct00003553displayxmltrue
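
The rewrite step could be sketched in Python as below. The entry layout
(property name followed by pattern/replacement pairs) and the name
`rewrite_label` are assumptions for illustration. The Perl-style $1 in
the expression above becomes \1 in Python's re module.

```python
import re

# One entry of the extended structure: a property name followed by any
# number of (pattern, replacement) pairs.
PROVENANCE_ENTRY = [
    "linkedct:has_provenance",
    (
        r"http://data\.linkedct\.org/resource/provenance/"
        r"httpclinicaltrialsgovshow(nct\d+)displayxmltrue",
        r"XML record for study \1 from clinicaltrials.gov",
    ),
]


def rewrite_label(value, rewrites):
    """Apply rewrites in order and return the first result that differs
    from the original value; return the value unchanged if none applies."""
    for pattern, replacement in rewrites:
        result = re.sub(pattern, replacement, value)
        if result != value:
            return result
    return value
```

For example, rewrite_label(url, PROVENANCE_ENTRY[1:]) applied to the
provenance URL above yields a label naming study nct00003553.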

We should get together some time. I'm not far away, in Buffalo.

Best,
Alan

On Tue, Feb 19, 2013 at 12:00 PM, Oktie Hassanzadeh
<oktie@cs.toronto.edu> wrote:
> Alan,
>
> One of the benefits of Linked Data is that you can always use an alternative
> browser or write your own app if you are not happy with the HTML
> presentation that the source provides. I agree that some of the labels are
> far from useful, but most of LinkedCT's data transformation process is
> automatic, and coming up with a single pattern for labels that can work for
> all the entities is not easy. I also admit that we can and should do a
> better job on the data browse interface in LinkedCT, but trying to engage
> your clinical colleagues by browsing through HTML pages doesn't seem like
> the right approach to me. For example, you could write a SPARQL query to
> construct the labels you suggest and ignore the labels in the source data.
>
> Cheers,
> Oktie
>
> ========================
> Oktie Hassanzadeh
> oktie@cs.toronto.edu
> http://www.cs.toronto.edu/~oktie
>
>
> On Sun, Feb 17, 2013 at 1:48 PM, Alan Ruttenberg <alanruttenberg@gmail.com>
> wrote:
>>
>> Oktie,
>>
>> One thing I think would be helpful is attending more to using human
>> readable labels for terms. For example, if we browse directly at
>> linkedct.org we see a lot of long strings of numbers. But for most of
>> these there is a reasonable label. For example, under outcomes, the
>> first element is printed as 92f8444723382d2b6f2c06f69f3fe6f8, but if
>> we browse to
>> http://linkedct.org/resource/outcome/92f8444723382d2b6f2c06f69f3fe6f8/
>> we see the property measure "Graft vs tumor effect as measured by CT
>> scan at days 30, 60, and 100 following transplant", which in this case
>> is a reasonable label. Similarly on this page we see provenances as
>> http://clinicaltrials.gov/show/NCT00003553?displayxml=true, whereas:
>> "ClinicalTrials.gov record for the study: Peripheral Stem Cell
>> Transplant in Treating Patients With Metastatic Kidney Cancer" is much
>> more inviting. That would link to the same place, but give the viewer
>> a reason to hit the link.
>>
>> One thing that's happened when I've tried to engage clinical
>> colleagues with linkedct is that it is hard for them to get past this
>> (and frankly for me too).
>>
>> Contact me off list if you want to understand the issue with the
>> licensing you've chosen.
>>
>> hth,
>> Alan
>>
>> On Sat, Feb 16, 2013 at 6:47 PM, Oktie Hassanzadeh <oktie@cs.toronto.edu>
>> wrote:
>> > Dear Kerstin,
>> >
>> > LinkedCT provides many external links including the seeAlso links you
>> > have
>> > pointed out, so the data is clearly 5-star Linked Data.
>> >
>> > Regarding the type of the links, there were long discussions at some
>> > point
>> > on this same list I believe on whether or not we should use sameAs to
>> > link
>> > to other resources, and the conclusion was that it's safer to use
>> > seeAlso
>> > since stating that an intervention on LinkedCT is the same as a drug on
>> > DBpedia for example, may be inaccurate.
>> >
>> > Regarding the quality and the quantity of the external links, we clearly
>> > can
>> > do better (and that's what we are planning to do), but existing links
>> > have
>> > already proven useful in a couple of use cases that take advantage of
>> > the
>> > links to PubMed, DrugBank, and DBpedia. One example is the LinkedSPLs
>> > work
>> > led by Rich Boyce:
>> >
>> > Dynamic enhancement of drug product labels to support drug safety,
>> > efficacy,
>> > and effectiveness R.D. Boyce et al. Journal of Biomedical Semantics
>> > 4(1), 5,
>> > BioMed Central Ltd, 2013
>> >
>> > Cheers,
>> > Oktie
>> >
>> >
>> >
>> > On Sat, Feb 16, 2013 at 11:36 AM, Kerstin Forsberg
>> > <kerstin.l.forsberg@gmail.com> wrote:
>> >>
>> >> Dear Oktie,
>> >>
>> >> Yes, and I'm also pointing colleagues to this great dataset part of
>> >> LODD
>> >> (http://linkedct.org).
>> >>
>> >> Two reflections:
>> >> 1) My understanding is that colleagues are more comfortable with going
>> >> directly to the source and using the XML download
>> >> (http://clinicaltrials.gov/ct2/resources/download )
>> >> 2) I meant 5-star linked data in terms of linking outwards to existing
>> >> identifiers instead of "internal" URIs like
>> >>
>> >> http://linkedct.org/resource/intervention/a0e0900a02a9fa5501b51b95c281e3f9/
>> >> for Atorvastatin (Intervention).
>> >>
>> >> Looks like you do a good job with your See also links (e.g.
>> >> http://dbpedia.org/resource/Atorvastatin and
>> >> http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01076).
>> >> However, my
>> >> understanding is that some of the types of things are quite
>> >> challenging, see
>> >> for example Drug Identification Links: Connecting Up,
>> >> http://www.citeulike.org/user/cdsouthan/article/10423875
>> >>
>> >> Kerstin
>> >>
>> >>
>> >>
>> >> 2013/2/16 Oktie Hassanzadeh <oktie@cs.toronto.edu>
>> >>>
>> >>> Dear Kerstin,
>> >>>
>> >>> Have you ever looked at http://linkedct.org ?
>> >>>
>> >>> LinkedCT uses a complex process to turn ClinicalTrials.gov into
>> >>> high-quality 5-star Linked Data. And yes it does provide HTTP URIs
>> >>> for all
>> >>> the "things" on ClinicalTrials.gov, provides HTML or RDF, SPARQL
>> >>> endpoint,
>> >>> etc.
>> >>>
>> >>> Please take a look at http://linkedct.org , http://linkedct.org/stats/
>> >>> ,
>> >>> and http://linkedct.org/faq/ , and the following articles for any
>> >>> questions
>> >>> you might have.
>> >>>
>> >>> Oktie Hassanzadeh, Soheil Hassas Yeganeh, Renée J. Miller: Linking
>> >>> Semistructured Data on the Web. WebDB 2011
>> >>> Oktie Hassanzadeh, Anastasios Kementsietsidis, Lipyeow Lim, Renée J.
>> >>> Miller, Min Wang: LinkedCT: A Linked Data Space for Clinical Trials.
>> >>> CoRR
>> >>> abs/0908.0567 2009
>> >>>
>> >>>
>> >>> Cheers,
>> >>> Oktie
>> >>>
>> >>> ========================
>> >>> Oktie Hassanzadeh
>> >>> oktie@cs.toronto.edu
>> >>> http://www.cs.toronto.edu/~oktie
>> >>>
>> >>>
>> >>> On Sat, Feb 16, 2013 at 7:58 AM, Kerstin Forsberg
>> >>> <kerstin.l.forsberg@gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>> a couple of tweets, blog post comments 1) and email exchanges during
>> >>>> the
>> >>>> week on moving ClinicalTrials.gov "from strings to things" made me
>> >>>> think
>> >>>> this could be a topic for discussion at the upcoming CSHALS. As I'll
>> >>>> not be
>> >>>> able to be there in person I'm using this email list to hear your
>> >>>> thoughts.
>> >>>>
>> >>>> Background:
>> >>>> We see many nice examples of curated/standardized feeds of CT.gov
>> >>>> data,
>> >>>> such as http://linkedct.org,
>> >>>> http://www.patientslikeme.com/clinical_trials
>> >>>> and http://www.clinicalcollections.org/trials/ etc. Most of them do
>> >>>> a good
>> >>>> job in turning “strings into things” and a few of them apply the
>> >>>> Linked Data
>> >>>> principles. However, I don’t think any of them use http-based URIs to
>> >>>> identify things such as sponsor organization, clinical sites,
>> >>>> clinical
>> >>>> investigators, geography, disease, drug, and time.
>> >>>>
>> >>>> I argue that we as a community caring for clinical trials data should
>> >>>> push back to FDA and NLM to get an official, standardized, linked
>> >>>> data
>> >>>> interface directly to the CT.gov at source. And yes, also for FDA and
>> >>>> NLM to
>> >>>> push back to pharma companies to provide standardized data about our
>> >>>> trials
>> >>>> with URIs to identify things instead of all these text strings. And
>> >>>> also if
>> >>>> pharma company websites such as
>> >>>> http://www.gsk-clinicalstudyregister.com/
>> >>>> and http://www.astrazenecaclinicaltrials.com/ did the same.
>> >>>>
>> >>>> Given the current movement for clinical trial data transparency 2) I
>> >>>> think the timing may be good, but potentially challenging both for FDA,
>> >>>> NLM and
>> >>>> for the pharma companies. They (we) will all look for practical
>> >>>> advice on
>> >>>> what URIs to use for things such as drugs and organizations.
>> >>>>
>> >>>> Thoughts?
>> >>>> Kerstin
>> >>>>
>> >>>>
>> >>>> 1)
>> >>>>
>> >>>> http://blog.karmadata.com/2013/02/11/loading-clinical-trials-data-in-ten-minutes-flat/comment-page-1/#comment-20
>> >>>> 2)
>> >>>>
>> >>>> http://www.placebocontrol.com/2013/02/our-new-glass-house-gsks-commitment-to.html
>> >>>
>> >>>
>> >>
>> >
>
>
Received on Tuesday, 19 February 2013 18:23:53 UTC