Project to benchmark SPARQL engines on Wikidata

TL;DR:  Please contact me if you want your SPARQL implementation evaluated on 
Wikidata.


I have started work on a small funded project whose goal is to help determine 
how well QLever could work as a SPARQL query service for Wikidata as Wikidata 
grows.

The current methodology of the project is to:
- gather queries from both existing benchmarks and users of Wikidata query 
services;
- run queries against the current full Wikidata RDF dump, concentrating on 
hard queries and queries that involve parts of the graph of varying size 
(illustrative examples follow this list);
- create synthetic extensions of the current full Wikidata RDF dump and run 
queries on them; and
- analyze results to determine how well QLever would work as Wikidata grows.
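
To make "parts of the graph of varying size" concrete, here is a sketch of 
two illustrative queries.  These are my own examples, not drawn from the 
benchmark's actual query set: one counts a large region of the graph 
(instances of human, wd:Q5) and one a small region (instances of chemical 
element, wd:Q11344).

    # Illustrative only: two queries over regions of the Wikidata graph
    # of very different sizes.  Not part of the benchmark's actual set.
    PREFIXES = """\
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    """

    # Large region: instances of human (wd:Q5), on the order of 10^7 items.
    LARGE_QUERY = PREFIXES + "SELECT (COUNT(?x) AS ?n) WHERE { ?x wdt:P31 wd:Q5 . }"

    # Small region: instances of chemical element (wd:Q11344), ~10^2 items.
    SMALL_QUERY = PREFIXES + "SELECT (COUNT(?x) AS ?n) WHERE { ?x wdt:P31 wd:Q11344 . }"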

The project will run queries against both the public QLever Wikidata service 
and a local version of the service running on high-end hardware with looser 
resource limits.  The current hardware for the local service is a Ryzen 9 
9950X with 192GB memory, 10TB of fast SSDs, and an 8TB hard drive.
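
For reference, queries can be sent to the public service over the standard 
SPARQL 1.1 protocol.  The sketch below assumes the endpoint URL 
https://qlever.cs.uni-freiburg.de/api/wikidata; please check the QLever 
documentation for the current address and any usage limits.

    # Minimal sketch: POST a SPARQL query to a public endpoint and time it.
    # The endpoint URL is an assumption; consult the service for the
    # current address and usage policy.
    import json
    import time
    import urllib.request

    ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"  # assumed URL

    def run_query(query: str) -> tuple[dict, float]:
        """Send `query` via the SPARQL 1.1 protocol; return (results, seconds)."""
        req = urllib.request.Request(
            ENDPOINT,
            data=query.encode("utf-8"),
            headers={
                "Content-Type": "application/sparql-query",
                "Accept": "application/sparql-results+json",
            },
        )
        start = time.monotonic()
        with urllib.request.urlopen(req, timeout=600) as resp:
            body = resp.read()
        elapsed = time.monotonic() - start
        return json.loads(body), elapsed

For example, run_query(LARGE_QUERY) with the query from the earlier sketch 
returns the JSON results together with the client-side wall-clock time.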

Information on the progress of the project is kept at 
https://www.wikidata.org/wiki/Wikidata:Scaling_Wikidata/Benchmarking


I want to expand this project to other SPARQL implementations, so I am 
soliciting information on suitable implementations.  A suitable 
implementation must either be open source or be used in a public service 
running queries against the full Wikidata RDF dump.  For any public service 
there must be information available on the hardware the service runs on and 
any resource limitations on it.  If there is no public service there must be 
good instructions for building the system and loading the current full 
Wikidata RDF dump into it in under 3 days on a high-end desktop.

There should be, both for public services and local builds, a way to run 
queries without interference from other queries (including clearing any 
caches) and a way to extract resource consumption of evaluating a query 
(including both compute time and memory needed).  There should also be 
information on the best parameters to use for graphs of roughly 20 billion 
triples, or for Wikidata specifically.
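
As a sketch of the kind of measurement harness I have in mind (the 
cache-clearing hook below is hypothetical and necessarily engine-specific, 
and per-query memory accounting generally needs engine support or OS-level 
tooling such as /usr/bin/time -v or cgroup statistics around a local server 
process):

    # Sketch of an isolated measurement loop.  clear_caches() is a
    # hypothetical, engine-specific hook: each engine needs its own way
    # to drop query caches (and, for local builds, OS page caches).
    # Per-query memory accounting is not shown; it generally needs
    # engine support or OS-level tooling around a local server process.
    import statistics

    def clear_caches() -> None:
        raise NotImplementedError("engine-specific cache clearing")

    def measure(query: str, runs: int = 5) -> None:
        cold_times, warm_times = [], []
        for _ in range(runs):
            clear_caches()
            _, cold = run_query(query)   # run_query() from the sketch above
            cold_times.append(cold)
            _, warm = run_query(query)   # immediate re-run, caches warm
            warm_times.append(warm)
        print(f"cold median: {statistics.median(cold_times):.2f}s, "
              f"warm median: {statistics.median(warm_times):.2f}s")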

I believe that Blazegraph, QLever, Virtuoso Open Source, and MillenniumDB 
currently satisfy (or nearly satisfy) these requirements.


If you are interested in having your system evaluated in this expanded 
project, I would like to hear from you.  I will work with you to set up your 
system so that it performs as well as possible on data of the size and scope 
of Wikidata.  I welcome input on setting up the benchmarks and on other 
useful activities that would fit within the expanded project.


Peter F. Patel-Schneider
