Re: OpenJurist, semantic web/linked data

Ah, now I remember the other thing I wanted to mention during the call today.
Everything was looking a bit more complicated until you mentioned that you were using Drupal.
Drupal has facilities for publishing RDF.
So I think it would be a good idea to look into that to see if you can do it all using Drupal.
Best
Hugh


On 25/11/2009 15:36, "Sam Deskin" <sam@openjurist.org> wrote:

Thank you for your time on the call today.  This is the email that I sent out about OpenJurist and the semantic web/linked data.

Your guidance is appreciated.


Sam Deskin
OpenJurist.org



From: public-egov-ig-request@w3.org [mailto:public-egov-ig-request@w3.org] On Behalf Of Sam Deskin
Sent: Thursday, November 05, 2009 10:31 AM
To: public-egov-ig@w3.org
Cc: 'Sam Deskin'
Subject: OpenJurist, semantic web/linked data

Hello Participants in the eGovernment Interest Group,
Glad to be a (new) member of the eGovernment Interest Group. I was invited because of the project that I am working on. I could use some guidance to learn best practices and to benefit from your experience.
The project I am working on is called OpenJurist.org <http://openjurist.org/>. It is a website with 647,000+ US Supreme Court and Appellate Court cases that we have obtained from resource.org. We currently offer the cases for public consumption, like several other websites. Our website is a source of legal information useful to attorneys and laypeople looking to understand a legal issue.  We are starting with case law and are working on getting more information organized as time goes by.

Our next major initiative is to use semantic tags / linked data to organize the cases, making them accessible in new and different ways than ever before. Right now it is quite crude.  But we are just beginning.  We have a lot of work ahead of us cleaning up the semantic data.

We have spent the past several months doing automated tagging of each case. We just finished our first run at this process and currently have 14,628,730 tags for these cases; 2M+ unique tags.  To give you a sense of the scale we are working at and the vastness of the data we are working with, within the cases we have identified discussion of:
*        900 different medical treatments for
*        3300 different medical conditions;
*        3000 different terms describing industries;
*        41,400 city names;
*        504,000 company names; and
*        1.3M individual people's names.

Now, the data requires A LOT of scrubbing. We have made headway on the easy ones: Continents, Countries, Presidents, and a few more. But the big ones need work. This could be a work in progress for some time and will require the help of a devoted volunteer army or paid staff to make it happen. Or, maybe, as researchers want to determine certain correlations, they will need/want to scrub the data to make it useful for them and, in the process, make it more useful for others.

In the near future, we plan on making it possible for people to use and organize the data in simple ways. For example, a site about the Presidents of the United States could list all cases that involve each president during/after his tenure (by date). Soon after, we plan on making it easy for people to link to the cases (en masse) on our site, or to take the data off our site and apply it as they would like on their own site using a widget or an API.
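
As a rough illustration of the Presidents example, the lookup could be sketched in Python along these lines. This assumes each case already carries a clean decision date and a set of tags; the Case class, its field names, the cases_for_president helper, and the sample records are all hypothetical and are not OpenJurist's actual schema or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Case:
    # Hypothetical field names, not OpenJurist's actual schema.
    title: str
    decided: date
    tags: set


def cases_for_president(cases, president, tenure_start):
    """Return cases tagged with `president` and decided on or after
    `tenure_start`, ordered oldest first."""
    matches = [
        c for c in cases
        if president in c.tags and c.decided >= tenure_start
    ]
    return sorted(matches, key=lambda c: c.decided)


# Made-up placeholder records, for illustration only.
sample_cases = [
    Case("Placeholder Case A", date(1975, 3, 1), {"Richard Nixon"}),
    Case("Placeholder Case B", date(1960, 6, 15), {"Richard Nixon"}),
    Case("Placeholder Case C", date(1971, 9, 30), {"Richard Nixon", "Ohio"}),
    Case("Placeholder Case D", date(1972, 1, 10), {"Ohio"}),
]

for case in cases_for_president(sample_cases, "Richard Nixon", date(1969, 1, 20)):
    print(case.decided.isoformat(), case.title)
```

A sketch like this only works once the tags and dates have been scrubbed, so that a tag reliably refers to the president and not to another person of the same name.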

In the slightly more distant future we would like people to be able to manipulate the data against itself and in relation to other data. For example, one day people will be able to determine the following on the fly:
a.      In the 1960's,
b.     The American Civil Liberties Union brought cases to
c.      Secure the release of inmates in overcrowded prisons,
d.     And won those cases
e.     In certain states,
f.       How did these cases affect crime in
         i.     those states?
         ii.    other states?
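
Steps a through e amount to a filter over tagged cases, which might be sketched in Python as follows. This is only an illustration under strong assumptions: it supposes every case already has a clean year, plaintiff, topic, outcome, and state, which we do not have yet. The TaggedCase class, its fields, the aclu_overcrowding_wins helper, and the sample records are hypothetical. Step f is not covered, since it would need outside crime data.

```python
from dataclasses import dataclass


@dataclass
class TaggedCase:
    # Hypothetical fields; none of these exist in clean form yet.
    title: str
    year: int
    plaintiff: str
    topic: str
    plaintiff_won: bool
    state: str


def aclu_overcrowding_wins(cases, states):
    """Filter for steps a-e: 1960s cases brought by the ACLU about
    prison overcrowding, won by the plaintiff, in the given states."""
    return [
        c for c in cases
        if 1960 <= c.year <= 1969                               # step a
        and c.plaintiff == "American Civil Liberties Union"     # step b
        and c.topic == "prison overcrowding"                    # step c
        and c.plaintiff_won                                     # step d
        and c.state in states                                   # step e
    ]


# Made-up placeholder records, for illustration only.
sample = [
    TaggedCase("Placeholder 1", 1964, "American Civil Liberties Union",
               "prison overcrowding", True, "Ohio"),
    TaggedCase("Placeholder 2", 1972, "American Civil Liberties Union",
               "prison overcrowding", True, "Ohio"),
    TaggedCase("Placeholder 3", 1966, "American Civil Liberties Union",
               "prison overcrowding", False, "Ohio"),
    TaggedCase("Placeholder 4", 1968, "American Civil Liberties Union",
               "prison overcrowding", True, "Texas"),
]

for case in aclu_overcrowding_wins(sample, {"Ohio"}):
    print(case.year, case.state, case.title)
```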

We are pretty close to being able to let people organize the data, at least on our own site (the API will still need to be built), but we have LOTS of work to do before people can truly manipulate the data.  To give you an idea of where things stand: we are currently extracting the dates on which cases were heard/decided; we have identified which cases mention the ACLU; we still need to determine which cases are about prison overcrowding and whether those cases resulted in the court ordering a release of prisoners; we know which states are mentioned in each case (but not yet whether the case is specifically about that state); and we have not yet incorporated any outside data. We have a lot of work to do to make this information truly useful in the way we envision and to make the data useful in ways we cannot imagine.

Your guidance/ideas are appreciated.  Feel free to ask me questions and I will try my best to answer them.

Sincerely,

Sam Deskin
OpenJurist.org

Received on Wednesday, 25 November 2009 22:53:15 UTC