Re: DBpedia citations & references challenge

Data update for
http://wiki.dbpedia.org/ideas/idea/261/dbpedia-citations-reference-challenge/

Thanks to your feedback (especially from the WikiCite community), we
managed to fix a few bugs and extend the coverage of the extracted
citations.
The new citation dumps come from the upcoming 2016-04 release and provide
roughly *14x more citation data* (from 7.1M triples to 97.5M triples).

We are sharing the results early for the DBpedia challenge here:
http://downloads.dbpedia.org/temporary/citations/

For those still not sure what they can do with our data, here's what we
managed to calculate at the airport while travelling. Imagine what you can
do with more time and a normal desk ;)

Did you know that the most cited Wikipedia...

books are about Football, WW2 and British songs?:
 * (4853 articles) SEN Encyclopedia of AFL Footballers: Every AFL/VFL
Player Since 1897 -> http://books.google.com/books?vid=ISBN978-1-921496-32-5
 * (3191 articles) Die Ritterkreuzträger: 1939 - 1945 ->
http://books.google.com/books?vid=ISBN978-3-938845-17-2
 * (2927 articles) Die Träger des Ritterkreuzes des Eisernen Kreuzes ->
http://books.google.com/books?vid=ISBN978-3-7909-0284-6
 * (1958 articles) British Hit Singles & Albums ->
http://books.google.com/books?vid=ISBN1-904994-10-5
 * (1694 articles) Das Deutsche Kreuz ->
http://books.google.com/books?vid=ISBN978-3-931533-45-8

scientific articles are about biology & astronomy?:
 * 5210 http://doi.org/10.1073/pnas.242603899
 * 3757 http://doi.org/10.1101/gr.2596504
 * 2449 http://doi.org/10.1038/ng1285
 * 1667 http://doi.org/10.1051/0004-6361:20078357
 * 1445 http://doi.org/10.1007/bf00171763

websites mostly about census?:
 * 51328 http://www.stat.gov.pl/broker/access/prefile/listPreFiles.jspa
 * 21758 http://www.census.gov/geo/www/gazetteer/gazette.html
 * 21741 http://www.census.gov/prod/www/decennial.html
 * 11954 http://www.census.gov/popest/data/cities/totals/2014/SUB-EST2014.html
 * 10680 http://globiz.pyraloidea.org/Pages/Reports/TaxonReport.aspx


Dates (citations with only dates and a reference needed):
 * February 2007, 5463 times
 * October 2010, 5245 times
 * July 2015, 3919 times
 * October 2015, 3916 times
 * August 2015, 3885 times
(comes from http://citation.dbpedia.org/hash/* IRIs)

See the following files for the complete lists:
http://downloads.dbpedia.org/temporary/citations/results.same-citations.different-articles-no-hash.count
(we count only references from different pages)
http://downloads.dbpedia.org/temporary/citations/results.same-citations.all-articles-no-hash.count
(we count all references, even from same page)
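
The difference between the two counts can be illustrated with a minimal
Python sketch. It assumes each line of the citation links dump is an
N-Triples statement whose subject is the citation IRI and whose object is
the citing article; the predicate and the sample IRIs are made up for
illustration, and treating "no-hash" as "skip citation.dbpedia.org/hash
IRIs" is only a guess from the file names. The real dump layout may differ.

```python
import re
from collections import Counter

# Matches a triple made of three IRIs: <subject> <predicate> <object> .
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.$')

# Assumption: "no-hash" in the result file names means these IRIs are skipped.
HASH_PREFIX = "http://citation.dbpedia.org/hash/"


def count_citations(lines, skip_hash=True):
    """Return (all_refs, distinct_refs), both keyed by citation IRI.

    all_refs counts every reference, even repeated ones on the same page;
    distinct_refs counts each (citation, article) pair only once.
    """
    all_refs = Counter()
    pairs = set()
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, triples with literal objects
        citation, _predicate, article = match.groups()
        if skip_hash and citation.startswith(HASH_PREFIX):
            continue
        all_refs[citation] += 1
        pairs.add((citation, article))
    distinct_refs = Counter(citation for citation, _article in pairs)
    return all_refs, distinct_refs


# Hypothetical sample data, not taken from the real dump.
PRED = "<http://example.org/isCitedBy>"
SAMPLE = [
    f"<http://doi.org/10.1000/example> {PRED} <http://dbpedia.org/resource/A> .",
    f"<http://doi.org/10.1000/example> {PRED} <http://dbpedia.org/resource/A> .",
    f"<http://doi.org/10.1000/example> {PRED} <http://dbpedia.org/resource/B> .",
    f"<http://example.org/other> {PRED} <http://dbpedia.org/resource/A> .",
    f"<http://citation.dbpedia.org/hash/abc> {PRED} <http://dbpedia.org/resource/A> .",
    "# a comment line",
]

all_refs, distinct_refs = count_citations(SAMPLE)
print(all_refs.most_common())
print(distinct_refs.most_common())
```

For large dumps the same function can be fed a lazily read file object
(for example one opened with bz2.open in text mode) instead of a list.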


The top 10 domains in Wikipedia references are:
 * 1561315 books.google.com
 * 1540250 citation.dbpedia.org
 *  836371 doi.org
 *  154664 news.bbc.co.uk
 *  132997 nytimes.com
 *  129410 bbc.co.uk
 *  101807 census.gov
 *  101125 worldcat.org
 *   89082 news.google.com
 *   76503 ncbi.nlm.nih.gov
see a complete list in:
http://downloads.dbpedia.org/temporary/citations/results.domains.count
http://downloads.dbpedia.org/temporary/citations/results.domains-distinct.count
(counts distinct citations)
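
A domain count of this kind can be approximated with the Python standard
library. This is only a sketch, not the script used for the numbers above:
it assumes the input is a plain list of citation URLs, and stripping a
leading "www." is a guess based on entries such as census.gov in the list
(other subdomains, e.g. news.bbc.co.uk, are kept as they are).

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def count_domains(urls):
    """Return (total, distinct) domain counters.

    total counts every URL occurrence; distinct counts each URL once.
    URLs without a recognisable host are ignored.
    """
    urls = list(urls)
    total = Counter(d for d in map(domain_of, urls) if d)
    distinct = Counter(d for d in map(domain_of, set(urls)) if d)
    return total, distinct


# Sample URLs taken from the lists in this mail; the repetition is made up.
SAMPLE_URLS = [
    "http://www.census.gov/geo/www/gazetteer/gazette.html",
    "http://www.census.gov/geo/www/gazetteer/gazette.html",
    "http://www.census.gov/prod/www/decennial.html",
    "http://books.google.com/books?vid=ISBN1-904994-10-5",
    "http://doi.org/10.1038/ng1285",
]

total, distinct = count_domains(SAMPLE_URLS)
print(total.most_common())
print(distinct.most_common())
```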

Articles with the most "citation needed" tags are:
 * Football_records_in_Spain (41 citations needed)
 * Ahmed_Belbachir_Haskouri (29 citations needed)
 * Tree_model (24 citations needed)
 * Immigration_to_Chile (21 citations needed)
 * Larry_Ryckman (18 citations needed)
see here for a full list:
http://downloads.dbpedia.org/temporary/citations/results.articles-with-citations-neededd.count


We extract data from many templates. Here are the top 10; the complete list
can be found here:
http://downloads.dbpedia.org/temporary/citations/results.template.count
 * 9348109 Cite_web
 * 2821628 Cite_news
 * 1958270 Cite_book
 * 1294760 Cite_journal
 *  467933 Citation
 *  317309 Citation_needed
 *   46264 Cite_press_release
 *   37315 Cn
 *   36258 Cite_encyclopedia
 *   33754 Cite_episode
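
For readers who want to reproduce a template count on a piece of raw
wikitext, here is a rough Python sketch. It is not the DBpedia extractor:
the regular expression is a simplification that only looks at the name
after "{{", and the sample wikitext is invented. The name normalisation
(spaces to underscores, first letter upper-cased) follows MediaWiki's page
title rules.

```python
import re
from collections import Counter

# Captures the template name that follows "{{", up to the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^{}|]+?)\s*(?=\||\}\})")


def normalise(name):
    """Normalise a template name: 'cite  web' -> 'Cite_web'."""
    name = "_".join(name.split())
    return name[:1].upper() + name[1:]


def count_templates(wikitext, wanted=None):
    """Count template usages; optionally keep only the names in `wanted`."""
    counts = Counter(
        normalise(m.group(1)) for m in TEMPLATE_NAME.finditer(wikitext)
    )
    if wanted is not None:
        counts = Counter({k: v for k, v in counts.items() if k in wanted})
    return counts


# Invented sample wikitext, for illustration only.
SAMPLE_WIKITEXT = (
    "Text.<ref>{{cite web |url=http://example.org |title=Example}}</ref> "
    "More text.{{cn}} "
    "<ref>{{Cite web|url=http://example.org/2}}</ref> "
    "A claim.{{citation needed|date=July 2015}}"
)

template_counts = count_templates(SAMPLE_WIKITEXT)
print(template_counts.most_common())
```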

We also have some basic statistics for templates with properties and for
properties alone:
http://downloads.dbpedia.org/temporary/citations/results.template.count
http://downloads.dbpedia.org/temporary/citations/results.template-property.count

Note that the statistics we provide are meant only as a proof of concept
and are based on the enwiki-20160305 dump.
You can regenerate them using this shell script:
http://downloads.dbpedia.org/temporary/citations/generate-basic-citation-stats.bash


Cheers,
Dimitris on behalf of the OC



On Tue, Jun 7, 2016 at 10:51 AM, Dimitris Kontokostas <
kontokostas@informatik.uni-leipzig.de> wrote:

> In the latest release (2015-10) DBpedia started exploring the citation and
> reference data from Wikipedia and we were pleasantly surprised by the
> rich data
> <http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_data_en.ttl.bz2>
> we managed to extract.
>
>    - citation_data_en.ttl.bz2
>      <http://downloads.dbpedia.org/2015-10/core-i18n/en/citation_data_en.ttl.bz2>
>      (sample
>      <http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_data_en.ttl.bz2>)
>    - citation_links_en.ttl.bz2
>      <http://downloads.dbpedia.org/2015-10/core-i18n/en/citation_links_en.ttl.bz2>
>      (sample
>      <http://downloads.dbpedia.org/preview.php?file=2015-10_sl_core-i18n_sl_en_sl_citation_links_en.ttl.bz2>)
>
>
> This data holds huge potential, especially for the Wikidata challenge of providing
> a reference source for every statement. It describes not only a lot of
> bibliographical data, but also a lot of web pages and many other sources
> around the web.
>
> The data we extract at the moment is quite raw and can be improved in many
> different ways. Some of the potential improvements are:
>
>    - Extend the citation extractor to handle other Wikipedia language
>      editions <https://github.com/dbpedia/extraction-framework/issues/451>;
>      currently only English Wikipedia is supported.
>    - Map the data to a relevant Bibliographic ontology
>      <https://github.com/dbpedia/mappings-tracker/issues/79> (there are
>      many candidates and, although BIBO got most votes, we are open to other
>      ontologies)
>    - Map the data to existing Bibliographic LOD (eg TEL has 100M records,
>      Worldcat 300M) or online books (eg Google Books). See the citationIri
>      issue <https://github.com/dbpedia/extraction-framework/issues/452>.
>    - Ways to merge / fuse identical citations from multiple articles
>    - Use the citation data in the Wikidata primary sources tool
>      <https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool>
>    - Surprise us with your ideas!
>
>
> We welcome contributions that improve the existing citation dataset in any
> way; and we are open to collaboration and helping. Results will be
> presented at the next DBpedia meeting: 15 September 2016 in Leipzig,
> co-located with SEMANTiCS 2016. Each participant should submit a short
> description of his/her contribution by Monday 12 September 2016 and present
> his/her work at the meeting. Comments and questions can be posted on the
> DBpedia discussion & developer lists or on our new DBpedia ideas page
> <http://wiki.dbpedia.org/ideas/idea/261/dbpedia-citations-reference-challenge/>
> .
>
> Submissions will be judged by the Organizing Committee and the best two
> will receive a prize.
>
> Organizing Committee
>
>    - Vladimir Alexiev, Ontotext and DBpedia BG
>    - Anastasia Dimou, Ghent University, iMinds
>    - Dimitris Kontokostas, KILT/AKSW, DBpedia Association
>
>
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig & DBpedia
> Association
> Projects: http://dbpedia.org, http://rdfunit.aksw.org,
> http://aligned-project.eu
> Homepage: http://aksw.org/DimitrisKontokostas
> Research Group: AKSW/KILT http://aksw.org/Groups/KILT
>
>


-- 
Kontokostas Dimitris

Received on Sunday, 26 June 2016 10:37:45 UTC