RE: OpenJurist, semantic web/linked data from Sam Deskin on 2009-11-25 (public-egov-ig@w3.org from November 2009)

From: Sam Deskin <sam@openjurist.org>
Date: Wed, 25 Nov 2009 07:36:39 -0800
To: "'Sam Deskin'" <sam@openjurist.org>, <public-egov-ig@w3.org>
Message-ID: <084b01ca6de5$14ecc4c0$3ec64e40$@org>
Thank you for your time on the call today.  This is the email that I sent
out about OpenJurist and the semantic web/linked data.

 

Your guidance is appreciated.

 

Sam Deskin

OpenJurist.org

 

 

From: public-egov-ig-request@w3.org [mailto:public-egov-ig-request@w3.org]
On Behalf Of Sam Deskin
Sent: Thursday, November 05, 2009 10:31 AM
To: public-egov-ig@w3.org
Cc: 'Sam Deskin'
Subject: OpenJurist, semantic web/linked data

 


Hello Participants in the eGovernment Interest Group,


Glad to be a (new) member of the eGovernment Interest Group. I was invited
because of the project that I am working on. I could use some guidance to
learn best practices and to benefit from your experience.


The project I am working on is called OpenJurist.org
<http://openjurist.org/>. It is a website with 647,000+ US Supreme Court
and Appellate Court cases that we obtained from resource.org.  We currently
offer the cases for public consumption, as several other websites do. Our
website is a source of legal information useful to attorneys and laypeople
looking to understand a legal issue.  We are starting with case law and are
working on getting more information organized as time goes by.

 

Our next major initiative is to use semantic tags / linked data to organize
the cases, making them accessible in new and different ways than ever
before. Right now it is quite crude.  But we are just beginning.  We have a
lot of work ahead of us cleaning up the semantic data.

 

We have spent the past several months doing automated tagging of each case.
We just finished our first run at this process and currently have 14,628,730
tags for these cases, 2M+ of them unique.  To give you a sense of the scale
we are working at and the vastness of the data we are working with, within
the cases we have identified discussion of:

- 900 different medical treatments for
- 3,300 different medical conditions;
- 3,000 different terms describing industries;
- 41,400 city names;
- 504,000 company names; and
- 1.3M individual people's names.

 

Now, the data requires A LOT of scrubbing. We have made headway on the easy
ones: Continents, Countries, Presidents, and a few more. But the big ones
need work. This could be a work in progress for some time and will require
the help of a devoted volunteer army or paid staff to make it happen. Or,
maybe, as researchers want to determine certain correlations, they will
need/want to scrub the data to make it useful for them - in the process,
making it more useful for others.

 


In the near future, we plan on making it possible for people to use and
organize the data in simple ways. For example, a site about the Presidents
of the United States could list all cases that involve each president
during/after his tenure (by date). Soon after, we plan on making it easy for
people to link to the cases (en masse) on our site or take the data off our
site and apply it as they would like on their own site using a widget or an
API.
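
As a rough illustration of the Presidents example, here is a minimal Python
sketch. It assumes each case already carries a decision date and a set of
tagged person names; the Case record, the cases_for_president helper, and
all sample values are hypothetical placeholders, not our actual data model
or real cases.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    title: str
    decided: date        # date the case was decided
    people: frozenset    # person names tagged in the case text


def cases_for_president(cases, name, tenure_start):
    """Cases that mention `name` and were decided on or after
    `tenure_start`, ordered oldest first."""
    hits = [c for c in cases if name in c.people and c.decided >= tenure_start]
    return sorted(hits, key=lambda c: c.decided)


# Placeholder records only; not real cases or real tagging output.
sample = [
    Case("Case C", date(1975, 3, 1), frozenset({"President X"})),
    Case("Case A", date(1960, 5, 9), frozenset({"President X"})),
    Case("Case B", date(1971, 8, 2), frozenset({"President X", "Someone Else"})),
    Case("Case D", date(1972, 1, 5), frozenset({"Someone Else"})),
]

listing = cases_for_president(sample, "President X", date(1969, 1, 20))
print([c.title for c in listing])
```

A name match only shows that the president is mentioned in the case, so a
listing like this would still need the scrubbing described above before it
could claim that a case "involves" him.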

 

In the slightly more distant future we would like people to be able to
manipulate the data against itself and in relation to other data. For
example, one day people will be able to determine the following on the fly:

a. In the 1960s,
b. The American Civil Liberties Union brought cases to
c. Secure the release of inmates in overcrowded prisons,
d. And won those cases
e. In certain states,
f. How did these cases affect crime in
   i.  those states?
   ii. other states?
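
Steps a through e of this example amount to a filter over tagged cases. The
Python sketch below shows one possible shape for it, assuming every facet
(year, organizations, topics, outcome, states) has already been extracted.
The TaggedCase fields, the select_cases helper, and the sample records are
hypothetical placeholders. Step f is not covered, since it depends on
outside crime data.

```python
from dataclasses import dataclass


@dataclass
class TaggedCase:
    title: str
    year: int              # year the case was decided
    organizations: set     # organizations tagged in the case
    topics: set            # subject tags, e.g. "prison overcrowding"
    won: bool              # True if the tagged organization prevailed
    states: set            # states the case is about


def select_cases(cases, *, decade_start, organization, topic, states):
    """Keep cases from the given decade, involving the organization,
    on the topic, that were won, in any of the given states."""
    return [
        c for c in cases
        if decade_start <= c.year < decade_start + 10
        and organization in c.organizations
        and topic in c.topics
        and c.won
        and c.states & states   # non-empty intersection
    ]


# Placeholder records only; not real cases or real tagging output.
topic = "prison overcrowding"
sample_cases = [
    TaggedCase("Case 1", 1964, {"ACLU"}, {topic}, True, {"State P"}),
    TaggedCase("Case 2", 1964, {"ACLU"}, {topic}, False, {"State P"}),
    TaggedCase("Case 3", 1972, {"ACLU"}, {topic}, True, {"State P"}),
    TaggedCase("Case 4", 1967, {"ACLU"}, {topic}, True, {"State R"}),
    TaggedCase("Case 5", 1969, {"Other Org"}, {topic}, True, {"State Q"}),
    TaggedCase("Case 6", 1961, {"ACLU"}, {topic}, True, {"State Q"}),
]

selected = select_cases(
    sample_cases,
    decade_start=1960,
    organization="ACLU",
    topic=topic,
    states={"State P", "State Q"},
)
print([c.title for c in selected])
```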

 

We are pretty close to being able to allow people to organize the data, at
least on our own site (the API will need to be built), but have LOTS of work
to do before we can let people truly manipulate the data.  To give you an
idea of where we stand: we are working right now on extracting the dates
that cases were heard/decided; we have identified which cases the ACLU is
mentioned in; we still need to determine which cases are about prison
overcrowding and whether those cases resulted in the court ordering a
release of prisoners; we know which states are mentioned in each case (but
not yet whether the case is specifically about that state); and we have not
yet incorporated any outside data. We have a lot of work to do to make this
information truly useful in the way we envision and to make the data useful
in ways we cannot imagine.

 

Your guidance/ideas are appreciated.  Feel free to ask me questions and I
will try my best to answer them. 

 

Sincerely,

 

Sam Deskin

OpenJurist.org
Received on Wednesday, 25 November 2009 15:37:09 UTC