U.S. Census/Congress datasets: 1 billion triples

Hi, everyone. (This is a revised/combined reannouncement for what was 
originally posted on the Linking Open Data list.)

Last November, Chris Bizer wrote, "[T]he DBLP server increases the size 
of the Semantic Web by around 10 percent ;-)" [1] Based on the same 
logic, I have recently increased the size of the semantic web by 200%! 
(That's in terms of the number of triples, and of course I'm also just 
joking about the size of the semantic web.)

I'm announcing here a new U.S. 2000 Census dataset of 1 billion triples, 
accessible over SPARQL and browsable by linked data [2] principles, and 
re-announcing my U.S. Congress dataset which is newly browsable with 
linked data principles. These two datasets are interconnected, and the 
Census dataset is linked up via owl:sameAs to Geonames [3].

I like the Census data set a lot for three reasons. First, if you live 
in the U.S. it has something for you, since it has detailed statistics 
on geographic entities down to the level of small towns/villages, and 
everyone lives somewhere. Second, it meshes with two other data sets. 
Third, it's rich enough on its own to support a wide array of 
interesting and real-world useful queries (if, say, you were doing 
research).

The OpenLink guys were kind enough to host the data set previously, but 
I wanted to push the limits of my own semweb C# library [4] and to be 
able to revise the data set as needed, so I've wanted to host it 
myself. Only recently have I been able to do that (even though I've had 
the triples lying around for nearly a year).

A complete description of the data set and how it was constructed and 
exposed is here:

    http://www.rdfabout.com/demo/census/

Some features of the data set:

Data on 3,200 U.S. counties, 36,000 "towns", 16,000 "villages", 33,000 
ZCTAs (something like zip-codes), and 435 congressional districts.

Each of those locations contains around 10 thousand population 
statistics, as well as a dc:title, a basic hierarchical structure 
between regions, and latitude/longitude.

Very basic geographic/name/lat-lng data (1 million triples) can be 
downloaded in N3.

All of the 1 billion triples are accessible via SPARQL. See: 
http://www.rdfabout.com/demo/census/sparql.xpd which has a few sample 
queries. An example query is "List the states in the United States that 
have more students in dorms than prisoners."
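
As a rough illustration of querying over the SPARQL protocol, here is a 
Python sketch that builds a GET request URL for a query listing a few 
resources with their dc:title (the one predicate named above). The 
endpoint address is an assumption: the page linked above hosts the 
sample queries, and the actual protocol endpoint may differ. The helper 
name build_query_url is made up for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint; check the sample-query page for the real address.
ENDPOINT = "http://www.rdfabout.com/demo/census/sparql.xpd"

# dc:title is mentioned in the data set description; the rest is generic SPARQL.
QUERY = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?region ?title
WHERE { ?region dc:title ?title }
LIMIT 10
""".strip()


def build_query_url(endpoint: str, query: str) -> str:
    """Return a SPARQL-protocol GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
# Sanity check: the query survives URL encoding unchanged.
assert parse_qs(urlsplit(url).query)["query"] == [QUERY]
print(url)
```

The resulting URL can be fetched with any HTTP client. The LIMIT keeps 
the result small, which matters on a store of this size.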

The URIs for the geographic regions are dereferencable http: URIs. (The 
URIs for the predicates in the data set will be updated to be 
dereferencable in the future.) For example, you can visit the URI for 
New York State:

     http://www.rdfabout.com/rdf/usgov/geo/us/ny

(Some URIs return very large pages that take Firefox quite a while to 
render. That one's OK.)

The dereferencable URIs return 303 redirects to SPARQL DESCRIBE pages 
describing those URIs.
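
For anyone writing a client against this, a minimal Python sketch of 
handling that redirect follows. It assumes nothing about the server 
beyond the 303 behavior described above. The function names are 
hypothetical, and the Accept header value is a guess at what the server 
might honor.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urljoin, urlsplit


def description_url(uri: str, status: int, location: Optional[str]) -> Optional[str]:
    """Given the response to a GET on `uri`, return where its description lives.

    A 303 (See Other) points at a separate document about the resource.
    A 200 means the URI itself served a document. Anything else gives None.
    """
    if status == 303 and location:
        return urljoin(uri, location)  # the Location header may be relative
    if status == 200:
        return uri
    return None


def fetch_status(uri: str) -> Tuple[int, Optional[str]]:
    """Issue one GET without following redirects (needs network access)."""
    parts = urlsplit(uri)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # Accept value is an assumption, not something the announcement states.
        conn.request("GET", parts.path or "/",
                     headers={"Accept": "application/rdf+xml"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling fetch_status on a region URI and passing the result to 
description_url would give the address of the DESCRIBE page to fetch 
next.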

There is a sitemap.xml file based on the latest draft circulated [5], 
referenced from robots.txt: http://rdfabout.com/robots.txt

And the source code to generate the triples from the Census download 
files is posted. The full RDF is too large for me to provide as a whole 
myself, for now at least.


The U.S. Congress data set, which I originally made SPARQL-accessible in 
December 2005 and have now revised to follow the new linked data 
principles, has 12 million triples containing brief biographical data 
for all members of Congress, and mainly data on federal legislation and 
voting records going back a number of years. Here are two example 
dereferencable URIs:

     http://www.rdfabout.com/rdf/usgov/congress/people/M000303
     (= Senator John McCain)

     http://www.rdfabout.com/rdf/usgov/congress/109/bills/h867
     (= a bill in Congress)

Some example Congress-related queries are posted here:
     http://www.govtrack.us/sparql.xpd
And dump files are here:
     http://www.govtrack.us/data/rdf/

An example I like to use is that one could fairly easily create a table 
using SPARQL aligning votes on a particular bill by congressmen with, 
for instance, the median commuting time to work of their constituents, 
as reported by the Census.
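
The join behind such a table can be sketched in a few lines of Python, 
assuming the votes and the Census statistic have already been fetched 
(for example with two SPARQL queries). All identifiers and numbers 
below are placeholders invented for the sketch, not values from either 
data set.

```python
from typing import Dict, List, Tuple


def align_votes_with_stat(
    votes: Dict[str, str],        # member id -> vote cast on one bill
    district_of: Dict[str, str],  # member id -> congressional district id
    stat: Dict[str, float],       # district id -> Census statistic
) -> List[Tuple[str, str, float]]:
    """Join votes to a per-district statistic, sorted by the statistic."""
    rows = []
    for member, vote in votes.items():
        district = district_of.get(member)
        if district in stat:  # skip members with no matching district data
            rows.append((member, vote, stat[district]))
    return sorted(rows, key=lambda row: row[2])


# Placeholder inputs, purely illustrative.
votes = {"member-1": "Aye", "member-2": "No", "member-3": "Aye"}
district_of = {"member-1": "district-A", "member-2": "district-B"}
commute_minutes = {"district-A": 30.0, "district-B": 20.0}

for member, vote, minutes in align_votes_with_stat(votes, district_of, commute_minutes):
    print(member, vote, minutes)
```

In practice the join could be done inside a single SPARQL query, since 
the two data sets are interconnected. Doing it client-side just makes 
the shape of the result easier to see.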


Thanks to those who gave feedback on the LOD list. I haven't been able 
to address all of it yet (like how to deal with backlinks on the 
dereferenced pages).

[1] http://lists.w3.org/Archives/Public/semantic-web/2006Nov/0008.html
[2] http://linkeddata.org/
[3] http://www.geonames.org/
[4] http://razor.occams.info/code/semweb
[5] http://sw.deri.org/2007/07/sitemapextension/

-- 
- Josh Tauberer

http://razor.occams.info

"Yields falsehood when preceded by its quotation!  Yields
falsehood when preceded by its quotation!" Achilles to
Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)

Received on Sunday, 30 September 2007 16:41:32 UTC