FAIR Data Digest #22
How I solved data integration performance problems by rewriting SPARQL queries or changing the approach. Food for thought that might help you too!
Hi everyone,
in today’s slightly more technical edition I will focus on my recent encounters with the Linked Data query language SPARQL. Many of you may know it as a language to retrieve data (possibly from Wikidata). I mainly use it for a project where I recently found some performance issues that could be resolved by either rewriting the query or changing the approach.
What is SPARQL again?
In short: a declarative query language for the Resource Description Framework (RDF).
SPARQL (Q54871) can be used to retrieve (subsets of) data stored in RDF format by specifying a search query. This can be done without actual programming: you don’t have to specify how the data should be retrieved, only what you want, which is why SPARQL is called a declarative query language.
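To make this concrete, here is a minimal sketch of a SELECT query. The schema.org vocabulary is just an example here, not a requirement of SPARQL or a reflection of any particular dataset:

```sparql
# We only state WHAT we want (books and their titles),
# not HOW the engine should find them.
PREFIX schema: <http://schema.org/>

SELECT ?book ?title
WHERE {
  ?book a schema:Book ;
        schema:name ?title .
}
```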
After you have specified what you want, the SPARQL engine creates an internal query plan consisting of several steps and executes it. The query planner uses your query and some statistics about the data to create the most efficient plan to compute the results. It is still possible, though, that the way you have written your query does not leave much room for the planner to optimize. Understanding your query and your data is important, because with a few changes to your query you can achieve better plans and faster results.
Using SPARQL in a book translation project
Let’s have a look at two use cases where I could significantly speed up our data processing by rewriting a SPARQL query.
Both cases are about speeding up the data integration of the BELTRANS project (Q114789456). In this project we are studying the history of book translations between Dutch and French in Belgium between 1970 and 2020. To get the complete picture, we had to collect book metadata from different national libraries and other data sources. We decided to store the data using RDF so that we can make use of powerful SPARQL features to explore the data.
Besides exploring the finalized data with SPARQL, we also use it to perform the initial linking and data integration. I explained how we do this in a presentation for the Semantic Web in Libraries conference in 2022 (DOI: 10.5281/zenodo.7372985). I still think it is a good solution, but the way it is implemented has a huge negative impact on performance. In the following I therefore provide an example of switching from our SPARQL-only implementation to an implementation with in-memory clustering (that still uses SPARQL for some parts).
Divide and conquer for performance optimization
Our data integration consists of several steps, and the current implementation runs for roughly one hour. We are particularly likely to run into problems when processing new data whose quality has not been checked, or after implementing new features. A problem in any of the steps requires that we start the process again, so eventually the whole process takes way more than one hour and needs our attention. This is inefficient and causes stress.
The main issue: we use a few SPARQL queries to do many things simultaneously, which eventually takes a lot of time. Remember: SPARQL is declarative, so you can easily write very complex queries, but at some point a piece of software has to execute the SPARQL query you came up with. Our queries had to check whether data from a local data source already exist in the project database; if not, create them, and if so, update the project database with (links to) the local data. Without going too much into detail: we had to use many OPTIONAL statements and some FILTER NOT EXISTS expressions, which cause the whole process to take more than one hour (if nothing goes wrong).
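To illustrate the pattern, here is a hypothetical sketch of such an all-in-one update. The graph names and predicates are made up for illustration; this is not our actual query:

```sparql
# Hypothetical all-in-one update: link, enrich and guard against
# re-linking in a single query. Every OPTIONAL and the
# FILTER NOT EXISTS add work for the query engine.
PREFIX schema: <http://schema.org/>

INSERT {
  GRAPH <http://example.org/integrated> {
    ?manifestation schema:sameAs ?localRecord .
    ?manifestation schema:name ?title .
    ?manifestation schema:datePublished ?date .
  }
}
WHERE {
  GRAPH <http://example.org/local-source> {
    ?localRecord schema:isbn ?isbn .
    # Metadata that may or may not exist in the local record.
    OPTIONAL { ?localRecord schema:name ?title . }
    OPTIONAL { ?localRecord schema:datePublished ?date . }
  }
  GRAPH <http://example.org/integrated> {
    ?manifestation schema:isbn ?isbn .
    # Only act on records that are not linked yet.
    FILTER NOT EXISTS { ?manifestation schema:sameAs ?localRecord . }
  }
}
```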
It can be done more quickly by dividing the process into different steps that each on their own can be executed (very) fast. In Computer Science, the Single Responsibility Principle (Q197857) states that every module of a program should be responsible for a single part of its functionality. Our implementation, where one SPARQL query does several things, violates that principle, which in our case causes performance issues.
Instead of one SPARQL query that performs both the linking and the data updates, we changed our approach to use two software components: one for the linking and one for the data updates. To quickly link the data, we can reuse a Python script for clustering data in-memory which I described in an earlier blog post (DOI: 10.59350/4hd4r-1tk44). Afterwards we can still use SPARQL to perform the updates. But again, instead of one update query with many OPTIONAL statements, we split this process into several updates that are each quick on their own and together achieve the same result.
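As a sketch of the idea, again with hypothetical graph and predicate names, one of these smaller follow-up updates could copy a single property for already-linked records and nothing else:

```sparql
# Hypothetical small follow-up update with a single responsibility:
# copy the ISBNs of local records that are already linked.
PREFIX schema: <http://schema.org/>

INSERT {
  GRAPH <http://example.org/integrated> {
    ?manifestation schema:isbn ?isbn .
  }
}
WHERE {
  GRAPH <http://example.org/integrated> {
    ?manifestation schema:sameAs ?localRecord .
  }
  GRAPH <http://example.org/local-source> {
    ?localRecord schema:isbn ?isbn .
  }
}
```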
The new approach is not only faster but also solves a data quality problem we used to have: records are now clustered together, so we can avoid duplicates within the same data source as well. With the previous SPARQL-only approach we were only able to avoid duplicates between data sources.
Query rewriting for performance optimization
The second case covers the rewriting of a SPARQL query that did not seem to finish … Query optimization is a broad topic and I will not cover it completely. I only want to provide one example of a query I changed to be a lot faster (even though the query now looks way more complicated).
The clustering I mentioned above needs a list of all book records together with their different identifiers, such as ISBN numbers. This data can be provided via a small SPARQL query. The query shown in the image below looks straightforward, but was anything but performant!
The query returns the identifier of a book (?mID), a related identifier such as an ISBN (?identifier) and the type of identifier we need as context for the identifier (?identifierName). Because we execute this query against all our data, we additionally select the source of the data: its context-giving named graph (?g). In an earlier blog post I described what named graphs are and how you can query them in different ways with SPARQL (DOI: 10.59350/n84f2-8q140).
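Since the query is only shown as an image, here is a rough reconstruction of its shape. The ex: predicates are placeholders, not our project’s actual vocabulary, and only the <http://kbr-syracuse> graph name is taken from this post:

```sparql
# Sketch of the identifier-listing query; predicates are placeholders.
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/ontology#>

SELECT ?g ?mID ?identifierName ?identifier
FROM NAMED <http://kbr-syracuse>
# ... further FROM NAMED clauses for the other data sources omitted
WHERE {
  GRAPH ?g {
    ?manifestation ex:id ?mID ;
                   ex:identifiedBy ?identifierEntity .
    ?identifierEntity rdfs:label ?identifierName ;  # name of the identifier
                      rdf:value ?identifier .       # the identifier literal
  }
}
```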
I don’t know exactly why, but querying the identifier literal in line 18 makes the query super slow. Slow as in “I stopped it after a minute myself”. The query is fast if I only fetch the name of the identifier in line 17 (omitting line 18). Each identifierEntity has a label and a value, and I didn’t see anything odd in the data when checking it separately. So I analyzed the query plan for the version where I query only the name of the identifier and for the version where I also query the value of the identifier.
Below is an image of the query plan of the long-running query. For test purposes I only kept FROM NAMED <http://kbr-syracuse>, such that the graph variable can never be bound to another named graph. The steps (operators) of the two queries are the same, but one of the steps comes up with many “solutions”: too many comparisons, which cause a bottleneck. The other version of the query, which only queries the rdfs:label (also a literal), has a join ratio of 2.8 instead of more than 60k and finishes in a few seconds. There seems to be an issue, even though I cannot explain why.
I experimented with different ways to query the data and noticed that the issue is somehow related to using a variable for the named graph. The solution shown below is longer, but because I explicitly indicate the named graphs by using a UNION over the triple patterns, the query takes only 5 seconds (instead of I-don’t-know-but-at-least-several-minutes-or-even-hours).
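The following sketch shows the shape of that rewrite, using the same placeholder vocabulary as above. Our real query lists all four project graphs; only <http://kbr-syracuse> is named in this post, so the second graph name below is invented for illustration:

```sparql
# Sketch of the rewrite: explicit GRAPH blocks combined via UNION
# instead of a free graph variable bound by FROM NAMED.
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/ontology#>

SELECT ?g ?mID ?identifierName ?identifier
WHERE {
  {
    GRAPH <http://kbr-syracuse> {
      ?manifestation ex:id ?mID ;
                     ex:identifiedBy ?identifierEntity .
      ?identifierEntity rdfs:label ?identifierName ;
                        rdf:value ?identifier .
    }
    # Bind the graph name explicitly so the result shape stays the same.
    BIND(<http://kbr-syracuse> AS ?g)
  }
  UNION
  {
    # One block like this per data source; this graph name is a
    # placeholder for one of the three graphs not named in the post.
    GRAPH <http://example.org/source-2> {
      ?manifestation ex:id ?mID ;
                     ex:identifiedBy ?identifierEntity .
      ?identifierEntity rdfs:label ?identifierName ;
                        rdf:value ?identifier .
    }
    BIND(<http://example.org/source-2> AS ?g)
  }
}
```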
I also tried to keep the named graph variable ?g, but instead of using FROM NAMED I used a FILTER expression specifying that ?g can only have one of four values. This had the same bad performance, so the FROM NAMED statements were not the problem; the variable ?g was.
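For completeness, that FILTER variant looked roughly like the sketch below (again, three of the four graph names are placeholders):

```sparql
# Sketch of the equally slow FILTER variant: ?g stays a free variable
# and is only restricted to four values after matching.
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/ontology#>

SELECT ?g ?mID ?identifierName ?identifier
WHERE {
  GRAPH ?g {
    ?manifestation ex:id ?mID ;
                   ex:identifiedBy ?identifierEntity .
    ?identifierEntity rdfs:label ?identifierName ;
                      rdf:value ?identifier .
  }
  FILTER (?g IN (<http://kbr-syracuse>,
                 <http://example.org/source-2>,
                 <http://example.org/source-3>,
                 <http://example.org/source-4>))
}
```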
Did you ever have performance issues with SPARQL queries? Leave a comment! Do you know someone who is struggling with performance issues? Forward this post!
That’s it for this week of the FAIR Data Digest. If you found the content interesting, please share and subscribe. See you in two weeks!
Sven