DBpedia SPARQL Benchmark (DBPSB)


a pure RDF benchmark based on actually posed queries

Triple stores are the backbone of increasingly many Data Web applications. The performance of those stores is mission critical for individual projects as well as for data integration on the Web in general. Assessing the performance of current triple stores is therefore important for observing the weaknesses and strengths of current implementations. DBPSB is a general SPARQL benchmark procedure, which we apply to the DBpedia knowledge base. The benchmark is based on query-log mining, clustering and SPARQL feature analysis. In contrast to other benchmarks, we perform measurements on actually posed queries against existing RDF data. Previous approaches often compared relational and triple stores and thus settled on measuring performance against a relational database that had been converted to RDF, using SQL-like queries. We argue that a pure RDF benchmark is more useful for comparing existing triple stores, and we provide results for Virtuoso, Sesame, Jena TDB, and BigOWLIM.



Below, we provide an overview of steps required to create the benchmark. The methodology can in principle be applied to all RDF knowledge bases. It allows the benchmark to be updated as the knowledge bases and queries to it evolve.



Source Code

The source code of DBpedia SPARQL Benchmark can be found at https://akswbenchmark.svn.sourceforge.net

Dataset Generation

Base Data: DBpedia 3.5.1 with all data sets mentioned at http://wiki.dbpedia.org/DatasetsLoaded
To generate a dataset of a specific size, perform the following steps:

  1. In the benchmark.xml file, set the tag <percentageOfDataRequired> to the required dataset size as a percentage, e.g. 10, 20, ...
  2. In the benchmark.xml file, set the tag <extractionMethod> to either 'RandomInstance' or 'RandomTriple'.
  3. In the benchmark.xml file, set the tag <outputFile type="N-TRIPLE"> to the name of the required output file.
  4. In the benchmark.xml file, set the tag <sparqlEndpoint> to the address of your endpoint.
  5. Call the function generateData() from the main function to generate the data.

Generating data with the RandomTriple method is much faster than with the RandomInstance method.
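
This page does not define the two extraction methods in detail. The following Python sketch illustrates one plausible reading and is not taken from the benchmark's source code: 'RandomTriple' is assumed to keep each triple independently with the requested probability, and 'RandomInstance' is assumed to pick the requested percentage of distinct subjects and keep all of their triples. The function names and the in-memory list of triples are hypothetical; the real tool reads from a SPARQL endpoint.

```python
import random
from collections import defaultdict


def random_triple_sample(triples, percentage, seed=0):
    """Assumed reading of 'RandomTriple': keep each triple with probability percentage/100."""
    rng = random.Random(seed)
    p = percentage / 100.0
    return [t for t in triples if rng.random() < p]


def random_instance_sample(triples, percentage, seed=0):
    """Assumed reading of 'RandomInstance': pick percentage% of the subjects, keep all their triples."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for triple in triples:
        by_subject[triple[0]].append(triple)  # group triples by subject
    subjects = sorted(by_subject)
    k = round(len(subjects) * percentage / 100.0)
    chosen = rng.sample(subjects, k)
    return [t for s in chosen for t in by_subject[s]]
```

Under these assumptions, the instance-based variant has to group the data by subject before it can sample, whereas the triple-based variant works in a single pass.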


Datasets are available for download here

Query Generation


Query Log: ftp://download.openlinksw.com/support/dbpedia/


To sort the query log by query frequency, perform the following steps:

  1. In the benchmark.xml file, set the tag <queryLogFolder> to the folder that contains the query log files. Log files are assumed to have the .log extension.
  2. In the benchmark.xml file, set the tag <sortedQueriesOutputFile> to the output file to which the sorted queries are written, along with the frequency of each query.
  3. In the benchmark.xml file, set the tag <leastFrequencyForQuery> to the frequency below which a query is discarded.
  4. Call the function 'sortQueries' in the 'ClusterProcessor' class. It sorts the queries, discards those whose frequency is below the specified value, and writes the result to the output file specified in benchmark.xml.
  5. (Optional, but recommended to accelerate the clustering step.) Call the function 'makeIDs' in the 'ClusterProcessor' class. It gives each query a unique ID and removes common keywords, e.g. 'from' and 'where', which do not affect the clustering result; the shorter strings also make clustering faster.

This step takes approximately 2.5 hours.
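
The frequency sorting and ID assignment described above can be sketched in Python as follows. This is an illustration only, not the code of 'ClusterProcessor': the function names, the whitespace-based tokenization, and the keyword set (limited to the two keywords named above) are assumptions.

```python
from collections import Counter

# Keywords stripped before clustering; only the two examples named in the text.
COMMON_KEYWORDS = {"from", "where"}


def sort_queries(queries, least_frequency):
    """Count identical queries, drop those below least_frequency, sort by frequency (descending)."""
    counts = Counter(q.strip() for q in queries if q.strip())
    kept = [(q, n) for q, n in counts.items() if n >= least_frequency]
    kept.sort(key=lambda item: (-item[1], item[0]))  # ties broken alphabetically
    return kept


def make_ids(sorted_queries):
    """Give each query a numeric ID and remove the common keywords from its text."""
    result = []
    for query_id, (query, _freq) in enumerate(sorted_queries, start=1):
        tokens = [t for t in query.split() if t.lower() not in COMMON_KEYWORDS]
        result.append((query_id, " ".join(tokens)))
    return result
```

For example, a log in which one SELECT query appears three times and one ASK query twice, with a threshold of 2, yields those two queries in that order; any query seen only once is discarded.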

Clustered Query Log:


  1. Feed the file containing the queries along with their IDs to the 'BorderFlow' application.
  2. 'BorderFlow' produces a file containing each 'clusterID' and the queries belonging to that cluster.
  3. Clustering yields several clusters, each containing a set of similar queries, i.e. queries belonging to the same cluster should, to some extent, use similar SPARQL features.

The running time of this step is not recorded here; Axel can probably provide it.


List of Benchmark Queries:


To avoid caching of query results, each run of a query should differ slightly from the previous one, so that the triple store cannot fetch the result from its cache.
The steps required to obtain our set of queries are as follows:

  1. Identify the SPARQL features of interest, e.g. UNION, DISTINCT, OPTIONAL, ...
  2. Select one query for each feature of interest using the following methodology:
    1. Sort the clusters in descending order of their total frequency, i.e. the sum of the frequencies of all queries belonging to the cluster.
    2. Traverse the clusters in that order. In each cluster, select the query with the highest frequency, on condition that this query uses one of the SPARQL features of interest or a combination of them.
    3. Inspect each of the selected queries in order to identify one or more static terms, i.e. URIs or literals, that can serve as content placeholders.
    4. Replace the selected static term(s) in each query with variable(s) and build a SELECT query for those variable(s) only, with DISTINCT to avoid repetition and with LIMIT to obtain a specific number of allowed values. This query is called the 'Auxiliary Query'.
    5. Execute the auxiliary query to obtain the list of allowed values for the placeholder(s).
    6. During each run of the original query, replace its placeholder(s) with a random value from the list of allowed values.
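
The cluster traversal in the first two sub-steps can be sketched in Python as follows. This is a simplified illustration rather than the procedure actually used: the feature detection by keyword matching, the data layout (a dict from cluster ID to a non-empty list of (query, frequency) pairs), and the rule of keeping only the first query found per feature combination are all assumptions.

```python
import re

# Features of interest; only the three examples named in the text.
FEATURES = ("UNION", "DISTINCT", "OPTIONAL")


def features_of(query):
    """Return the set of features whose keyword occurs in the query (naive keyword match)."""
    upper = query.upper()
    return frozenset(f for f in FEATURES if re.search(r"\b%s\b" % f, upper))


def select_benchmark_queries(clusters):
    """clusters: dict mapping cluster ID -> non-empty list of (query, frequency) pairs."""
    # Sort clusters by total frequency, highest first.
    ordered = sorted(clusters.values(),
                     key=lambda qs: sum(freq for _, freq in qs),
                     reverse=True)
    selected = {}
    for queries in ordered:
        # Most frequent query of this cluster.
        query, _freq = max(queries, key=lambda item: item[1])
        feats = features_of(query)
        # Keep it only if it uses a feature combination not covered yet.
        if feats and feats not in selected:
            selected[feats] = query
    return selected
```

The result maps each feature combination to the most frequent query of the highest-ranked cluster in which that combination appears.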

Selecting the queries and their auxiliary queries takes approx. 3 days.
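
The placeholder mechanism (auxiliary query plus random substitution on each run) can be sketched in Python as follows. The placeholder token %%v%%, the function names, and the example query are hypothetical and chosen only for illustration; the allowed values would in practice come from running the auxiliary query against the endpoint.

```python
import random


def build_auxiliary_query(where_clause, variable, limit):
    """Build a SELECT DISTINCT ... LIMIT query that lists allowed placeholder values."""
    return "SELECT DISTINCT ?%s WHERE { %s } LIMIT %d" % (variable, where_clause, limit)


def instantiate(template, allowed_values, rng):
    """Replace the (hypothetical) placeholder token with a randomly chosen allowed value."""
    return template.replace("%%v%%", rng.choice(allowed_values))


# Usage: the static subject of the original query has been turned into a placeholder.
template = "SELECT ?p ?o WHERE { %%v%% ?p ?o }"
auxiliary = build_auxiliary_query("?v ?p ?o", "v", 1000)
# Stand-in for the auxiliary query's result set.
allowed = ["<http://dbpedia.org/resource/Berlin>", "<http://dbpedia.org/resource/Paris>"]
query_for_this_run = instantiate(template, allowed, random.Random())
```

Because each run draws a different value, consecutive executions of the same benchmark query are unlikely to be answered from the result cache.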


The list of queries can be found in file Queries.txt, available here


Benchmark Execution


Loading Procedures:


To load data into the triple stores, follow the step for the store in question:

  1. For loading data into Virtuoso, there is a shell script called 'virtload.sh'. Call this shell script as './virtload.sh inputfile.nt http://dbpedia.org 1111 dba pwd'.
  2. For loading data into Sesame, set the 'sesameInputTriplesFile' tag in the benchmark.xml file to the name of the N-TRIPLES file to read from, set the 'sesameServerAddress' tag to the address of the Sesame server, and set the 'sesameRepositoryID' tag to the ID of the repository to which you want to upload data. Call SesameDataLoader.loadData(), and it will load the data into the specified store.
  3. For loading data into Jena TDB, set the 'jenaTDBInputTriplesFile' tag in the benchmark.xml file to the name of the N-TRIPLES file to read from, and set the 'jenaTDBDatasetGraph' tag to the path of the 'JenaTDB' dataset. Call JenaTDBDataLoader.loadData(), and it will load the data into the specified dataset.
  4. Since BigOWLIM uses Tomcat and the OpenRDF Workbench like Sesame, the same settings can be used for loading data into the BigOWLIM store.

Loading the 100% dataset takes approx. 10 hours for Virtuoso, 8 hours for Jena TDB, 14 hours for Sesame, and 8 hours for BigOWLIM.


Benchmark Procedures:


There are four classes, 'VirtuosoQueryExecutor', 'JenaTDBQueryExecutor', 'SesameQueryExecutor', and 'BigOWLIMQueryExecutor', one for each triple store. Each of them contains a function called 'executeQuery' that takes the SPARQL query as a parameter and returns the execution time of that query against the respective triple store in microseconds.
This function is called within a loop that runs for 20 minutes for warm-up, and then for 60 minutes for the actual measurement.
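
A minimal Python sketch of such a warm-up and measurement loop is shown below. It is not the benchmark's own driver: the function names, the round-robin order over the queries, and the injectable clock are assumptions, and 'execute_query' stands in for any callable that runs a query and returns its execution time in microseconds.

```python
import itertools
import time


def run_phase(execute_query, queries, duration_s, clock=time.monotonic):
    """Run the queries round-robin until duration_s seconds have elapsed.

    Returns the list of execution times (microseconds) reported by execute_query.
    """
    timings = []
    deadline = clock() + duration_s
    for query in itertools.cycle(queries):
        if clock() >= deadline:
            break
        timings.append(execute_query(query))
    return timings


def benchmark(execute_query, queries, warmup_s=20 * 60, measure_s=60 * 60,
              clock=time.monotonic):
    """Warm up for warmup_s seconds (results discarded), then measure for measure_s seconds."""
    run_phase(execute_query, queries, warmup_s, clock)
    return run_phase(execute_query, queries, measure_s, clock)
```

Passing the clock as a parameter makes it possible to exercise the loop with a simulated clock instead of waiting 80 minutes.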


Results

June 2011


10%
Query #  Virtuoso (QpS SD GM)  Sesame (QpS SD GM)  Jena TDB (QpS SD GM)  BigOWLIM (QpS SD GM)
1 261.6 45.3 250.1 466.3 136.2 428.8 330.4 155.5 258.9 63 8.9 61.9
2 450.9 59 445.6 427.7 15.5 427.4 255.1 80.4 236.5 64.7 4.1 64.4
3 82.8 16.2 81.3 348.3 97.3 320.7 1.4 1.9 0.6 55.3 14.8 52.4
4 138.1 48.9 122.6 10 60 0.2 71.3 61.1 52.7 20.6 21.4 11.6
5 67.7 10.9 67 287.9 65.3 269.6 116.1 70.5 93.3 46.6 17.5 42.1
6 60.5 17.9 58 49.4 5.5 48.2 82.5 58.1 65.2 19.2 5.1 18.5
7 28.5 8.5 26.7 207.1 79 183.7 1.6 2.5 0.5 26.5 13.9 22.7
8 52.8 67.4 24.4 65.7 112.9 23.6 134 75.1 108 18 21.9 9.1
9 22.9 3.9 22.7 226.9 86.8 197.8 0.6 0.6 0.4 48.9 9.1 47.4
10 8.1 0.4 8.1 1.4 0.4 1.4 0.1 0.04 0.1 2.8 0.1 2.8
11 176 36.2 171 289.7 80 265.8 125.3 67 104.9 51.5 12.6 49.3
12 124.8 20.2 123.1 309.9 118 264.8 1 3 0.1 59.6 12.3 57.2
13 129.3 16.5 128.3 367.2 101.4 337.3 190.1 77.3 157.9 46.7 16.4 43
14 83.1 29.8 74.1 179.2 134.6 116 96.2 69 74 25.8 20.5 16.8
15 128.1 67.6 90.3 162.6 148.5 72.6 97.3 100.2 43.9 43.1 23.9 28.4
16 121.5 23.6 118.7 249.9 67.6 236.4 28.8 30.5 19.9 39.8 15.7 35.5
17 102.2 29.8 95.7 186.2 109.5 135.8 115.6 63.6 97.6 42.8 18.9 36.1
18 182.8 33.1 179.1 0.5 0.1 0.5 178.3 52.6 168.1 23.3 15.5 18.3
19 199.8 47.9 191.1 302.5 69.9 286.7 200.8 88.9 174.5 62.2 4.7 61.9
20 18.9 4.1 18.6 221 63.3 203.7 0.1 0.3 0.02 50.6 8.1 49.5
21 483 48 480.6 459.4 16.7 459.1 289.6 92.5 224.9 66.1 1.8 66.1
22 206.1 60.6 190.2 241.4 173.8 140.6 38 86 8.4 52.8 19.3 44.7
23 140.3 27.4 137.1 354.7 116.9 315.3 23.9 39.7 12.6 64.1 6.1 63.6
24 1.2 0.7 0.8 32.4 100.2 3.7 173 88.1 146.4 0.3 0.2 0.2
25 62.8 19 59.2 259.8 120.6 209.5 110.5 99.8 68 52.3 12.5 49.7
Average number of queries per second (QpS), standard deviation of queries per second (SD), and geometric mean of queries per second (GM) for all triple stores at 10% dataset size.


50%
Query #  Virtuoso (QpS SD GM)  Sesame (QpS SD GM)  Jena TDB (QpS SD GM)  BigOWLIM (QpS SD GM)
1 264.5 76.1 242 153.5 136.3 116.9 64.8 13.6 63.2 56.7 10.1 55.5
2 22.4 3.2 22.2 137 98 109.9 87.6 87.9 68.6 27.7 12.7 25.6
3 55.4 19.7 51.8 81.1 92 52.6 0.1 0.1 0.1 29.4 18.9 23.3
4 44 62.7 17.6 0 0 0 41.7 19.5 38.4 3.9 3.3 3
5 60.3 14 58 46.6 70 24.8 40.2 32 30.6 21.5 15.4 16.9
6 14 7.3 12.3 6.2 2 5 162.5 69.8 141.3 5 2.3 4.3
7 23.1 7.9 21.8 62.2 52.3 42.3 0.2 0.5 0.02 10.7 5.8 9.2
8 32.5 46.9 5 15.6 39.4 2.7 88.2 67.3 67.8 8.2 13.9 1.7
9 20.4 5.3 19.9 42.3 35.3 29.1 2.3 3.8 0.8 27.2 15.1 22.4
10 1 0.1 1 0.2 0 0.2 4.3 3 1 0.28 0.02 0.28
11 97.7 56.9 73.9 45.2 48.1 36.3 37 7.1 36.3 27 10.1 24
12 62.3 22.2 57.6 56.3 58.2 42.4 0.02 0.02 0.02 34.1 18.9 28.7
13 105.1 52.1 91.7 97.5 60.4 86.8 105 94.3 44.9 19.9 10.4 17.9
14 53.9 40 38 20.4 17.6 12.3 42.8 13.2 40.9 7.9 10.6 3.3
15 51.9 32.7 33.4 33.5 66.1 11.5 43.1 38.9 32.8 8.6 13.7 2.2
16 106.7 25.4 103.4 87.7 75.4 64.8 73.2 48.3 63 38.7 12.9 36.7
17 33 11.2 30.6 20.9 19.9 13.3 31.9 9.5 30.5 10.3 8.5 6
18 203.9 57.9 190.4 0.1 0 0.1 57.6 9.6 56.9 29 12.7 26.4
19 106.6 53.6 93.5 46.1 38.1 39.7 26.8 10.2 25.3 33.7 16.9 29.9
20 15 4 14.5 37.7 30.6 28.9 0.01 0 0.01 20.5 16.2 15.2
21 189.1 138.3 135 105.6 87.3 84.3 50 16.2 46.4 28.1 13.4 25.7
22 109.1 43.9 90.3 29.5 41.4 14 1.2 1.5 0.7 15.7 15.9 6.1
23 81.2 31.4 74.4 78.3 97.4 47 1.5 2.4 0.4 32.4 18.2 27.9
24 0.2 0.1 0.2 0.9 0.4 0.7 53.7 39.3 43.6 0.06 0.05 0.04
25 35.6 10.9 33.3 45.2 57.4 26.4 37.1 16 32.4 11.5 7.6 8.6
Average number of queries per second (QpS), standard deviation of queries per second (SD), and geometric mean of queries per second (GM) for all triple stores at 50% dataset size.


100%
Query #  Virtuoso (QpS SD GM)  Sesame (QpS SD GM)  Jena TDB (QpS SD GM)  BigOWLIM (QpS SD GM)
1 245.9 30.9 240.9 112.7 47.1 103.9 54.5 5.9 54.2 58.3 6.1 57.9
2 3.6 0.1 3.6 81.1 45.9 69.8 67.1 40 60 31 11.8 28.7
3 42.8 21.8 37.8 32.1 14.5 28.9 0.02 0.03 0.01 23.9 7.7 22.8
4 34 50.2 14.4 0.03 0.01 0.02 4.6 9.1 0.3 29.4 22.4 18.8
5 47.9 19.7 41.7 10.7 10.6 7.6 12.5 4.9 11.5 16.1 14.5 11.1
6 8.6 2.3 8.3 4.2 2.4 2.8 45.5 46.9 9.6 4.9 2.3 4.1
7 21 12.1 18.2 21.5 37.4 13.8 0.04 0.03 0.03 7.4 4.5 6.1
8 38.3 47.7 4.8 17.8 33.7 2.7 27.1 36.7 1.3 8.3 14.5 1
9 17.8 5.7 17.1 23.3 15.9 19.5 0.01 0 0.01 15.9 6.3 14.9
10 1 0.1 0.9 0.1 0 0.1 0.01 0 0.01 0.16 0.01 0.16
11 115.7 31.7 100.2 48.5 41.9 39.4 2.9 5.7 0.2 36.2 14.4 33.2
12 47.1 25.4 41.7 25.4 12.9 21.7 0.01 0 0.01 24.6 9.7 22.9
13 89.5 73.6 64.6 43.1 41.4 28.6 0.2 0.2 0.1 38.1 17.9 33.6
14 25.1 37.3 6 3.4 5.2 1 26.3 15.7 23 28.1 26.8 8.5
15 48.4 33.2 31.1 3.1 4.7 1.3 0.04 0.04 0.02 7.9 12.2 2.5
16 137.3 14.4 136.7 98.8 49.4 89.7 29.3 30.8 6.2 48.5 10.4 47.1
17 32.8 9.3 31.2 6.3 6.5 2.7 21.7 8.1 20.3 21.5 14.8 12.3
18 208.3 27.1 205.6 0.1 0 0 44.8 34.8 9.2 47 8.3 46.1
19 99.2 67.3 81.8 40.3 20.6 35.8 46.5 24.6 40.9 34.9 7 34.2
20 14.1 4.1 13.5 8.2 6.6 6.8 0.01 0 0.01 15.1 8.3 13.5
21 98.8 76.2 78.3 85.7 69 69.2 0.1 0.1 0.1 31.3 10.5 29.7
22 115.1 52.9 93.6 5.5 8.5 2 9.5 13.3 0.6 9.5 10.6 4.6
23 76.7 38.9 67.8 41 17.8 37.6 20.2 34.4 0.8 29.1 8.4 28.2
24 0.2 0.1 0.1 0.7 0.4 0.4 17.4 15 3.7 0.05 0.05 0.03
25 30.1 11 27.1 16.5 14.5 10.3 18.1 22.5 1.2 14.3 9.4 11
Average number of queries per second (QpS), standard deviation of queries per second (SD), and geometric mean of queries per second (GM) for all triple stores at 100% dataset size.


200%
Query #  Virtuoso (QpS SD GM)  Sesame (QpS SD GM)  Jena TDB (QpS SD GM)  BigOWLIM (QpS SD GM)
1 247.1 44.4 229.9 93.2 39.2 84.4 226.9 465.5 36.6 54.5 13.4 47.7
2 1.8 0.1 1.8 45.7 27.4 39.9 17.6 6.4 16.6 26.8 11.2 24.5
3 42.8 21.1 36.6 19.8 9.1 17.7 0.03 0.02 0.02 18.1 8.2 16.7
4 26.8 46.1 12.1 0 0 0 45.9 33.6 34.4 6.8 9 4.1
5 52.7 19.8 47.4 5.5 2.7 4.7 5.7 4.3 3 11.8 10.4 8.7
6 9.4 6.1 7.6 2.4 2 1.4 34.6 47 15.1 2.9 1.3 2.6
7 14.3 9.3 11.7 7.1 2.4 6.6 0.01 0 0.01 8.7 5.3 7
8 33.6 42.9 2.8 3.8 6.7 0.4 21.8 19.9 13.9 8.8 15.7 1
9 16.1 6 15.1 9.9 3.2 9.5 0.02 0.04 0.01 14.5 5.1 13.6
10 0.5 0 0.5 0.1 0 0 0.01 0 0.01 0.16 0.02 0.16
11 114.7 57.5 73 34.3 17.2 30.5 15.9 2.7 15.6 21 7.9 19.2
12 43.8 31.8 36.1 18 8.1 15.7 0.01 0 0.01 21.1 5.4 20.4
13 64.7 51.8 48.6 21.7 13.2 18.5 13.6 9.5 11.1 20.1 10.3 18.2
14 23.9 29.5 11.4 1.7 2.8 0.4 7.2 2.9 6.5 6.3 6.5 3.4
15 65.1 51.5 36.9 3.3 4.5 1.4 1.8 1.6 1.2 3.8 8.8 0.8
16 191.2 24.1 189.8 83.6 40.3 74.9 11.1 3.1 10.6 45.4 10.6 44
17 24.1 12.2 19.4 1.6 1.8 0.8 6.9 1.7 6.8 9.1 6.7 6.8
18 212 35.8 207.3 0 0 0 19.7 7.9 17.5 44 8.3 43.1
19 84.8 62.5 69 15.9 4.3 15.3 22.4 13 19.6 34 8.9 32.7
20 13.6 4.4 12.9 8.6 17.5 4.5 0.01 0 0.01 13.1 8.2 11.5
21 93.8 77 73.7 53.6 20.8 50 15.9 4.9 15.2 26.5 6.3 25.7
22 120 80.5 74.3 20.7 66.2 2.1 0.6 0.6 0.4 11.1 11.5 5.9
23 67.6 33.3 60.1 30.3 16.7 26.9 9.2 5.9 6.9 24.1 5.5 23.6
24 0.1 0 0.1 0.3 0.2 0.2 8.3 4.8 6.4 0.05 0.05 0.02
25 23.5 15.2 19 8.1 7.8 5.2 9.2 1.6 9 11.6 9.5 8.6

Average number of queries per second (QpS), standard deviation of queries per second (SD), and geometric mean of queries per second (GM) for all triple stores at 200% dataset size.
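
The three statistics reported in the tables can be computed from a series of queries-per-second samples as in the following Python sketch. How the samples were collected and whether the sample or population standard deviation was used is not stated on this page; the sketch assumes strictly positive samples and the population standard deviation, and the function name is hypothetical.

```python
import statistics


def summarize_qps(samples):
    """Return (arithmetic mean, population standard deviation, geometric mean).

    Assumes all samples are strictly positive, as required by the geometric mean.
    """
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    gm = statistics.geometric_mean(samples)
    return mean, sd, gm


# Usage: two samples of 2 and 8 queries per second.
mean, sd, gm = summarize_qps([2.0, 8.0])
```

For positive samples the geometric mean never exceeds the arithmetic mean, and it is less sensitive to a few unusually fast intervals, which is why it is reported alongside the average.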


 

Information

Last Modification: 2012-04-12 17:11:10 by Mohamed Mabrouk