- From: Joshua Tauberer <jt@occams.info>
- Date: Sun, 30 Sep 2007 12:41:21 -0400
- To: semantic-web at W3C <semantic-web@w3c.org>
Hi, everyone.

(This is a revised/combined reannouncement of what was originally posted on the Linking Open Data list.)

Last November, Chris Bizer wrote, "[T]he DBLP server increases the size of the Semantic Web by around 10 percent ;-)" [1]

Based on the same logic, I have recently increased the size of the semantic web by 200%! (That's in terms of the number of triples, and of course I'm also just joking here w.r.t. the size of the semantic web.)

I'm announcing here a new U.S. 2000 Census dataset of 1 billion triples, accessible over SPARQL and browsable by linked data [2] principles, and re-announcing my U.S. Congress dataset, which is newly browsable by linked data principles. These two datasets are interconnected, and the Census dataset is linked via owl:sameAs to Geonames [3].

I like the Census data set a lot for three reasons. First, if you live in the U.S. it has something for you, since it has detailed statistics on geographic entities down to the level of small towns/villages, and everyone lives somewhere. Second, it meshes with two other data sets. Third, it's rich enough on its own to support a wide array of interesting and real-world useful queries (if, say, you were doing research).

The OpenLink guys were kind enough to host the data set previously, but I wanted to push the limits of my own semweb C# library [4] and to be able to revise the data set as needed, so I've wanted to host it myself, which I was only recently able to do (even though I've had the triples lying around for nearly a year).

A complete description of the data set and how it was constructed and exposed is here:

http://www.rdfabout.com/demo/census/

Some features of the data set:

- Data on 3,200 U.S. counties, 36,000 "towns", 16,000 "villages", 33,000 ZCTAs (something like zip codes), and 435 congressional districts.
- Each of those locations has around 10 thousand population statistics, as well as a dc:title, a basic hierarchical structure between regions, and latitude/longitude.

- Very basic geographic/name/lat-lng data (1 million triples) can be downloaded in N3.

- All of the 1 billion triples are accessible via SPARQL. See:
  http://www.rdfabout.com/demo/census/sparql.xpd
  which has a few sample queries. An example query is "List the states in the United States that have more students in dorms than prisoners."

- The URIs for the geographic regions are dereferenceable http: URIs. (The URIs for the predicates in the data set will be updated to be dereferenceable in the future.) For example, you can visit the URI for New York State:
  http://www.rdfabout.com/rdf/usgov/geo/us/ny
  (Some URIs return very large pages that take Firefox quite a while to render. That one's OK.)

- The dereferenceable URIs return 303s to SPARQL DESCRIBE pages describing those URIs.

- There is a sitemap.xml file based on the latest draft circulated [5], referenced from robots.txt:
  http://rdfabout.com/robots.txt

- And, source code to generate the triples from the Census download files is posted.

It's too large for me to provide the whole RDF myself, for now at least.

The U.S. Congress data set, which I originally made SPARQL-accessible in December 2005 and have now revised to follow the new linked data principles, has 12 million triples containing brief biographical data for all members of Congress, but mainly data on federal legislation and voting records going back a number of years.
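As a rough sketch of what the 303 behavior mentioned above looks like from a client's side, the Python below sends a single GET without following redirects and then reads the Location header. The function names are hypothetical, and the Accept header value is an assumption: nothing above says the server does content negotiation, so treat this as an illustration of the general linked data pattern rather than a description of how rdfabout.com is implemented.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def fetch_once(uri, accept="application/rdf+xml", timeout=10.0):
    """Send one GET for an http: URI without following redirects.

    Returns (status, headers), with header names lower-cased.
    The Accept value is an assumption, not something the server documents.
    """
    parts = urlsplit(uri)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", target, headers={"Accept": accept})
        response = conn.getresponse()
        headers = {name.lower(): value for name, value in response.getheaders()}
        return response.status, headers
    finally:
        conn.close()


def see_other_target(uri, status, headers):
    """Return the absolute URL a 303 response points at, or None.

    `headers` must use lower-cased names, as returned by fetch_once.
    A relative Location value is resolved against the requested URI.
    """
    if status != 303:
        return None
    location = headers.get("location")
    if location is None:
        return None
    return urljoin(uri, location)
```

Calling fetch_once("http://www.rdfabout.com/rdf/usgov/geo/us/ny") and passing the URI, status, and headers to see_other_target should yield the address of the DESCRIBE page, provided the server still behaves as described.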
Here are two example dereferenceable URIs:

http://www.rdfabout.com/rdf/usgov/congress/people/M000303
(= Senator John McCain)

http://www.rdfabout.com/rdf/usgov/congress/109/bills/h867
(= a bill in Congress)

Some example Congress-related queries are posted here:

http://www.govtrack.us/sparql.xpd

And dump files are here:

http://www.govtrack.us/data/rdf/

An example I like to use is that one could fairly easily create a table using SPARQL aligning votes on a particular bill by congressmen with, for instance, the median commuting time to work of their constituents, as reported by the Census.

Thanks to those who gave feedback on the LOD list. I haven't been able to address all of it yet (like how to deal with backlinks on the dereferenced pages).

[1] http://lists.w3.org/Archives/Public/semantic-web/2006Nov/0008.html
[2] http://linkeddata.org/
[3] http://www.geonames.org/
[4] http://razor.occams.info/code/semweb
[5] http://sw.deri.org/2007/07/sitemapextension/

--
- Josh Tauberer

http://razor.occams.info

"Yields falsehood when preceded by its quotation! Yields falsehood when preceded by its quotation!" Achilles to Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)
Received on Sunday, 30 September 2007 16:41:32 UTC