FAIR Data Digest #2

How to deal with author pseudonyms in data integration? How does fundamental research benefit Knowledge Graphs, and why do Wikidata identifiers actually start with a Q? Here you get the answers!

Jun 13, 2023

Dear subscriber,

Welcome to the second edition of the newsletter, and a warm welcome to all new subscribers. It has been an interesting week. In this edition I will talk about some work updates and a one-day workshop I attended last week, and I have a video recommendation.

🏢 As you may know, the F in FAIR stands for findability of data, something that can usually be achieved with unique identifiers for data. At the Royal Library we have around 900,000 records about persons such as authors or illustrators, each with an identifier, e.g. Willy Vandersteen. In my work update I will tell you about a recent issue: cases where persons publish under several pseudonyms.

📅 I have also been on the road again! Last week I attended a one-day workshop about Knowledge Graphs and Data Integration at the University of Hasselt, Belgium.

🎥 Last but not least a video recommendation: did you ever wonder how Wikidata was created or why the identifier of each entity starts with a Q? Find out in a short video about the history of Wikidata.

🏢 Work updates

In the BELTRANS project we create a data corpus of NL-FR and FR-NL book translations between 1970 and 2020 in which Belgians were involved. Especially for the last criterion we need to know the nationality of the book contributors. But what if a contributor occurs in the data under different names?

We collect data from various data sources: our own catalog at the Royal Library of Belgium (KBR), but also other data sources such as the Royal Library of the Netherlands or the National Library of France. Whereas some of these data sources have dedicated identifiers for each pseudonym (and therefore link a book to the correct identity), at KBR we usually have just one record for a person or pseudonym (depending on which of the two is more famous). Possible alternate spellings of the name or pseudonyms are only recorded as an additional text field in the KBR person record.

Usually we link person records from different data sources automatically via third-party identifiers such as VIAF, ISNI or Wikidata. If there are no identifiers, or at least no match, we have to rely on other methods to link the records. One such method is to try to link person records by the name of the person. Yet some data sources may use different spellings of the name or even a pseudonym.
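To make the identifier-based linking a bit more concrete, here is a minimal Python sketch. It is not the actual BELTRANS code: the record layout, the function name link_by_identifier and all identifier values are made up for illustration. It only shows the general idea of matching on a shared third-party identifier and collecting the records that are left over.

```python
# Hypothetical sketch, not the BELTRANS implementation.
# Field names and identifier values below are invented placeholders.
ID_SCHEMES = ("viaf", "isni", "wikidata")


def link_by_identifier(records_a, records_b):
    """Link records of two sources that share a third-party identifier.

    Returns a list of (id_a, id_b, scheme) links and a list of the
    record IDs from records_a that could not be linked this way.
    """
    # Build a lookup from (scheme, value) to the record ID in source B.
    lookup = {}
    for record in records_b:
        for scheme in ID_SCHEMES:
            value = record.get(scheme)
            if value:
                lookup[(scheme, value)] = record["id"]

    links, unmatched = [], []
    for record in records_a:
        for scheme in ID_SCHEMES:
            value = record.get(scheme)
            if value and (scheme, value) in lookup:
                links.append((record["id"], lookup[(scheme, value)], scheme))
                break
        else:
            # No shared identifier: this record needs another method,
            # for example a comparison by name.
            unmatched.append(record["id"])
    return links, unmatched


source_a = [
    {"id": "a1", "viaf": "V-0001"},
    {"id": "a2", "isni": "I-0002"},
    {"id": "a3"},
]
source_b = [
    {"id": "b1", "viaf": "V-0001", "wikidata": "Q-0001"},
    {"id": "b2", "isni": "I-9999"},
]
links, unmatched = link_by_identifier(source_a, source_b)
```

The unmatched list is exactly the group of records for which a name-based comparison becomes necessary.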

To solve this problem in BELTRANS, I extracted the different name spellings and pseudonym information from the KBR records and created a unique identifier for each such identity. This way we can compare persons from different data sources to all possible name spellings or pseudonyms of a person. Of course, the candidates found should still be checked by a human. But the technical solution at least does the heavy lifting by proposing far fewer candidates to check compared to the whole corpus. I also documented this progress with some examples in a GitHub issue.
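The idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the code from the GitHub issue: the record layout, the identifier scheme (record ID plus position) and the example names are invented, and the normalization step (lowercasing, stripping accents and punctuation, ignoring token order) is just one plausible way to compare names.

```python
import unicodedata
from collections import defaultdict


def normalize(name):
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(sorted(cleaned.split()))


def build_identity_index(local_records):
    """Give every name variant its own identity ID and index it by name.

    The ID scheme "<record id>-<position>" is a made-up convention.
    """
    index = defaultdict(set)
    for record in local_records:
        names = [record["name"], *record["variants"]]
        for position, variant in enumerate(names):
            index[normalize(variant)].add(f"{record['id']}-{position}")
    return index


def find_candidates(external_name, index):
    """Return identity IDs whose normalized name equals the external name."""
    return sorted(index.get(normalize(external_name), set()))


# Invented example records: one person with a spelling variant and a pseudonym.
local_records = [
    {"id": "local-001", "name": "Example, Jane",
     "variants": ["Jane Éxample", "J. Sample"]},
    {"id": "local-002", "name": "Doe, John", "variants": []},
]
index = build_identity_index(local_records)

by_pseudonym = find_candidates("Sample, J.", index)
by_spelling = find_candidates("Jane Example", index)
no_match = find_candidates("Unknown Person", index)
```

Only the few returned identities need a human check, instead of the whole corpus. An exact match on normalized names is the simplest possible choice here; a fuzzy comparison would propose more candidates at the cost of more manual checking.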

Check out the issue on GitHub

📅 Events

Knowledge Graphs and Data Integration

When you hear Computer Science you probably think of zeros and ones, formulas and math. Fundamental research on algorithms and logics is indeed at the core of software systems and can make the difference between a fast and a slow system.

A lot of such fundamental research is performed in the database, programming language or knowledge representation communities, all of them smaller sub-communities of the general field of Computer Science.

The FWO-funded Knowledge Graphs For Data Integration research network aims to bridge the gaps between the mentioned communities and knowledge graph engineering. This allows a knowledge transfer within this interdisciplinary research network that helps to formalize concepts and optimize algorithms related to Knowledge Graphs.

In other words: it can make your SPARQL queries quicker or allow you to use something like SPARQL on some data in the first place! Last week I attended a one-day workshop of the mentioned research network. My main takeaway is that thorough fundamental research can improve both the performance of Knowledge Graphs and their usability.

Check out my new blog post to get a brief overview of the workshop.

Read the blog post

🎥 Videos not to miss

Wikidata history

Wikidata has become a crucial platform for different research disciplines. In the last edition of the newsletter I already mentioned why Knowledge Bases such as Wikidata are even more important in a world of large language models such as ChatGPT. I also mentioned that, from a Digital Humanities point of view, there are issues to be tackled.

The following 15-minute video by Denny Vrandečić, one of the founders of Wikidata, guides you through its history. How did it emerge into a world where we already had Wikipedia? And why does the identifier of each entity in Wikidata start with a Q? Grab a cup of coffee or tea and enjoy the explanations on YouTube.

That’s it for this week of the FAIR Data Digest. I hope you found the content interesting. Don’t forget to share or subscribe. See you next week!

Sven

PS: I have just seen on Twitter that a few days ago the International Semantic Web Research Summer School (ISWS) started in Bertinoro, Italy. I’d like to wish all students a pleasant stay! I participated myself in 2018. You can read how it was for me in my trip report from back then.
