DBpedia SPARQL Benchmark: a pure RDF benchmark based on actually posed queries

DBPSB is a general SPARQL benchmark procedure, which we apply to the DBpedia knowledge base. The benchmark is based on query-log mining, clustering and SPARQL feature analysis. In contrast to other benchmarks, we perform measurements on actually posed queries against existing RDF data.


The project has been deprecated and work on a successor is ongoing: https://github.com/AKSW/mosquito.

Triple stores are the backbone of increasingly many Data Web applications. It is evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Web in general. Assessing the performance of current triple stores is, therefore, important to observe the weaknesses and strengths of current implementations. Previous approaches often compared relational and triple stores and, thus, settled on measuring performance against a relational database, which had been converted to RDF, using SQL-like queries. We argue that a pure RDF benchmark is more useful to compare existing triple stores, and we provide results for Virtuoso, Sesame, Jena-TDB, and BigOWLIM.

Here we provide an overview of the steps required to create the benchmark. The methodology can in principle be applied to any RDF knowledge base. It allows the benchmark to be updated as the knowledge bases, and the queries posed against them, evolve.

Dataset Generation

Base Data: DBpedia 3.5.1; all data sets mentioned are available here.

In order to generate a dataset of a specific size, perform the following steps:

  1. In the benchmark.xml file, set the corresponding tag to the required dataset size, e.g. 10, 20, ....
  2. In the benchmark.xml file, set the corresponding tag to either 'RandomInstance' or 'RandomTriple'.
  3. In the benchmark.xml file, set the corresponding tag to the name of the required output file.
  4. In the benchmark.xml file, set the corresponding tag to the address of your endpoint.
  5. Call the function generateData() from main, and it will generate the data.

Generating data using the Random Triple method is much faster than generating it using the Random Instance method.
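The difference between the two methods can be sketched as follows. This is a hypothetical illustration only — the class, the method names, and the triple representation are invented here and are not the actual generateData() implementation:

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of the two DBPSB sampling strategies; the real
// generateData() implementation may differ.
public class DatasetSampler {
    // A triple as a simple subject/predicate/object record.
    public record Triple(String s, String p, String o) {}

    // RandomTriple: sample individual triples uniformly until the
    // requested fraction of the source data is reached.
    public static List<Triple> randomTriple(List<Triple> source, double fraction, Random rnd) {
        List<Triple> copy = new ArrayList<>(source);
        Collections.shuffle(copy, rnd);
        int target = (int) Math.round(source.size() * fraction);
        return copy.subList(0, target);
    }

    // RandomInstance: sample whole instances (subjects) and keep every
    // triple describing a selected instance. This preserves complete
    // resource descriptions but requires grouping all triples by
    // subject first, which is why it is slower.
    public static List<Triple> randomInstance(List<Triple> source, double fraction, Random rnd) {
        Map<String, List<Triple>> bySubject = source.stream()
                .collect(Collectors.groupingBy(Triple::s));
        List<String> subjects = new ArrayList<>(bySubject.keySet());
        Collections.shuffle(subjects, rnd);
        int target = (int) Math.round(source.size() * fraction);
        List<Triple> result = new ArrayList<>();
        for (String subj : subjects) {
            if (result.size() >= target) break;
            result.addAll(bySubject.get(subj));
        }
        return result;
    }
}
```

The sketch also shows why Random Triple is faster: it only shuffles and truncates the triple list, while Random Instance must first index every triple by its subject.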

Datasets are available for download here.

Query Generation

Query Log: here.

In order to sort the query log by query frequency, perform the following steps:

  1. In the benchmark.xml file, set the corresponding tag to the folder containing the query log files. Log files are assumed to have the .log extension.
  2. In the benchmark.xml file, set the corresponding tag to the output file to which the sorted queries, along with the frequency of each query, will be written.
  3. In the benchmark.xml file, set the corresponding tag to the frequency threshold below which a query is discarded.
  4. Call the function 'sortQueries' in the 'ClusterProcessor' class to sort the queries and write the output to the file specified in benchmark.xml; it also discards queries whose frequencies are below the specified threshold.
  5. (Optional) This step is recommended to accelerate the clustering step. Call the function 'makeIDs' in the 'ClusterProcessor' class to give each query a unique ID and to remove common keywords, e.g. FROM and WHERE, that do not affect the clustering process; the shorter strings also make clustering faster.

This step takes approximately 2.5 hours.
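The core of the sorting step can be sketched as follows. This is a minimal, hypothetical illustration of the 'sortQueries' logic — the class and method names here are invented, and the real ClusterProcessor implementation may differ:

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of the 'sortQueries' step: count how often each
// query string occurs in the log, drop queries below the frequency
// threshold, and sort the rest by descending frequency.
public class QuerySorter {
    public static List<Map.Entry<String, Integer>> sortByFrequency(List<String> log, int minFrequency) {
        Map<String, Integer> counts = new HashMap<>();
        for (String q : log) counts.merge(q, 1, Integer::sum);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= minFrequency)       // discard rare queries
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .collect(Collectors.toList());
    }
}
```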

Clustered Query Log:

  1. Feed the file containing the queries along with their IDs to the 'BorderFlow' application. More information about BorderFlow can be found on its project page.
  2. 'BorderFlow' produces a file containing each 'clusterID' and the queries belonging to that cluster.
  3. Clustering yields several clusters, each containing a similar set of queries, i.e. queries belonging to the same cluster should, to some extent, share similar SPARQL features.

List of Benchmark Queries:

To avoid caching of query results, we introduce a small variation into each run of a query in order to force the triplestore not to fetch the query result from its cache. The required steps to obtain our set of queries are as follows:

  1. Identify SPARQL-features of interest e.g. UNION, DISTINCT, OPTIONAL, ...
  2. Select one of the queries satisfying the feature of interest using the following methodology:
  • Sort the clusters in descending order of their total frequency, i.e. the cumulative sum of the frequencies of all queries belonging to a cluster.
  • Traverse the clusters in that order and, for each cluster, select the query with the highest frequency within it, on the condition that this query satisfies one of the SPARQL features of interest or a combination of them.
  • Inspect each selected query in order to identify one or more static terms, i.e. URIs or literals, that can be used as content placeholders.
  • Replace each selected static term with a variable and build a SELECT query for these variables only, with DISTINCT to avoid repetition and with LIMIT to obtain a specific number of allowed values; this query is called the 'Auxiliary Query'.
  • Execute the auxiliary query to obtain the list of allowed values for the placeholder(s). During each run of the original query, replace its placeholder(s) with a random value from this list.

Selecting the queries and their auxiliary queries takes approximately 3 days.
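The per-run substitution described in the last step can be sketched as follows. This is a minimal illustration using the benchmark's %%v%% placeholder notation; the class and method names are hypothetical, not part of the DBPSB code:

```java
import java.util.*;

// Hypothetical sketch of the placeholder mechanism: each run replaces
// the %%v%% placeholder in the query template with a random value drawn
// from the list of allowed values returned by the auxiliary query.
public class PlaceholderFiller {
    public static String instantiate(String template, List<String> allowedValues, Random rnd) {
        String value = allowedValues.get(rnd.nextInt(allowedValues.size()));
        return template.replace("%%v%%", value);
    }
}
```

Because a different allowed value is substituted on each run, the triplestore cannot simply serve the result of a previous run from its cache.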

Sample Query

Following is a sample query with a variable part that will be used as a placeholder during the hot-run phase. The placeholder is indicated by %%v%%.

SELECT * WHERE {
  { ?v2 a dbp-owl:Settlement ;
        rdfs:label %%v%% .
    ?v6 a dbp-owl:Airport . }
  { ?v6 dbp-owl:city ?v2 . }
  UNION
  { ?v6 dbp-owl:location ?v2 . }
  { ?v6 dbp-prop:iata ?v5 . }
  UNION
  { ?v6 dbp-owl:iataLocationIdentifier ?v5 . }
  OPTIONAL { ?v6 foaf:homepage ?v7 . }
  OPTIONAL { ?v6 dbp-prop:nativename ?v8 . }
}

We use another query called the auxiliary query in order to get a list of possible values for that placeholder. During the hot-run phase, the application selects a random value out of the list of possible values of the placeholder.

The auxiliary query used to fill that list is as follows:

SELECT DISTINCT ?v WHERE {
  { ?v2 a dbp-owl:Settlement ;
        rdfs:label ?v .
    ?v6 a dbp-owl:Airport . }
  { ?v6 dbp-owl:city ?v2 . }
  UNION
  { ?v6 dbp-owl:location ?v2 . }
  { ?v6 dbp-prop:iata ?v5 . }
  UNION
  { ?v6 dbp-owl:iataLocationIdentifier ?v5 . }
  OPTIONAL { ?v6 foaf:homepage ?v7 . }
  OPTIONAL { ?v6 dbp-prop:nativename ?v8 . }
}
LIMIT 1000

Benchmark Execution

Loading Procedures:

In order to upload data into the different triplestores, perform the corresponding step below:

  1. For loading data into Virtuoso, there is a shell script called 'virtload.sh'. Call it as './virtload.sh inputfile.nt http://dbpedia.org 1111 dba pwd'.
  2. For loading data into Sesame, set the 'sesameInputTriplesFile' tag in the benchmark.xml file to the name of the N-TRIPLES file to read from, set the 'sesameServerAddress' tag to the address of the Sesame server, and set the 'sesameRepositoryID' tag to the ID of the repository to which you want to upload the data. Call SesameDataLoader.loadData(), and it will load the data into the specified store.
  3. For loading data into Jena-TDB, set the 'jenaTDBInputTriplesFile' tag in the benchmark.xml file to the name of the N-TRIPLES file to read from, and set the 'jenaTDBDatasetGraph' tag to the path of the Jena-TDB dataset. Call JenaTDBDataLoader.loadData(), and it will load the data into the specified dataset.
  4. Since BigOWLIM uses Tomcat and the OpenRDF Workbench like Sesame, the same settings can be used for loading data into the BigOWLIM store.
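Using the tag names from the steps above, the loading-related part of benchmark.xml might look like the following. The tag names come from the steps above; the values are placeholders only:

```xml
<!-- Sesame loading settings (example values only) -->
<sesameInputTriplesFile>dataset.nt</sesameInputTriplesFile>
<sesameServerAddress>http://localhost:8080/openrdf-sesame</sesameServerAddress>
<sesameRepositoryID>dbpsb</sesameRepositoryID>

<!-- Jena-TDB loading settings (example values only) -->
<jenaTDBInputTriplesFile>dataset.nt</jenaTDBInputTriplesFile>
<jenaTDBDatasetGraph>/data/jenatdb/dbpsb</jenaTDBDatasetGraph>
```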

Loading the dataset of size 100% takes approximately 10 hours for Virtuoso, 8 hours for Jena-TDB, 14 hours for Sesame, and 8 hours for BigOWLIM.

Benchmark Procedures

There are four classes, 'VirtuosoQueryExecutor', 'JenaTDBQueryExecutor', 'SesameQueryExecutor', and 'BigOWLIMQueryExecutor', one for each type of triplestore. Each of them contains a function called 'executeQuery' that takes a SPARQL query as a parameter and returns the execution time of that query against the triplestore of interest in microseconds. This function is called within a loop that runs for 20 minutes as a warm-up, and then for 60 minutes for the actual measurement.
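The warm-up/hot-run loop can be sketched as follows, assuming an executeQuery function that returns the execution time in microseconds. The runner class and its signature are invented here for illustration and are not the actual DBPSB code:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the benchmark loop: the executor (standing in
// for VirtuosoQueryExecutor etc.) returns the per-query execution time
// in microseconds; the loop repeats the full query mix until the phase
// duration elapses and counts the completed mixes.
public class BenchmarkRunner {
    public static int runPhase(Function<String, Long> executeQuery,
                               List<String> queryMix, long phaseMillis) {
        long end = System.currentTimeMillis() + phaseMillis;
        int completedMixes = 0;
        while (System.currentTimeMillis() < end) {
            for (String query : queryMix) {
                executeQuery.apply(query); // elapsed microseconds; could be recorded per query
            }
            completedMixes++;
        }
        return completedMixes;
    }
}
```

In this sketch, the warm-up would be a runPhase call of 20 * 60 * 1000 ms whose result is discarded, followed by a 60-minute phase whose counts feed the metrics below.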

Benchmark Metrics

The main metrics used in DBPSB for performance measurement are:

  1. Query Mixes per Hour (QMpH), which denotes the number of query mixes posed to the test store in one hour.
  2. Queries per Second (QpS), which is the number of queries (query variations of a specific query) the test store can answer in one second.
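Given the definitions above, both metrics reduce to simple arithmetic over the raw counts from the hot-run phase. A hypothetical helper (not part of DBPSB) might compute them as:

```java
// Hypothetical helper computing the two DBPSB metrics from raw counts.
public class Metrics {
    // QMpH: query mixes completed during the run, scaled to one hour.
    public static double qmph(int completedMixes, double runtimeSeconds) {
        return completedMixes * 3600.0 / runtimeSeconds;
    }

    // QpS: queries (variations of one specific query) answered per second.
    public static double qps(int answeredQueries, double runtimeSeconds) {
        return answeredQueries / runtimeSeconds;
    }
}
```

For example, 50 query mixes completed in a 30-minute run correspond to a QMpH of 100.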

