FAIR Data Digest #17
FAIR success story: Institutional history of the EU translation service. Also reporting on recent endeavors to fill a gap in Wikidata.
Hi everyone,
welcome to today’s edition on the European Day of Languages (Q496423).
🔍 The European Union has 24 official languages, so a large number of official documents constantly need to be translated. What is the history of the EU translation service and how is it related to FAIR data? Find out in today’s insights section.
🌐 You probably know the situation: you start adding information to a Wikidata item to create some nice visualizations, and then you realize that the entity you would like to link to does not exist yet. Last week, while adding a few entities related to the conference I recently attended, I noticed that an important entity was missing. Check out today’s Wikidata spotlight to learn more about it.
🔍 Insights: EU translations
Did you know that today is the European Day of Languages (Q496423)? It was established in 2001 to alert the public to the importance of language learning, promote the linguistic diversity of Europe and encourage lifelong learning. I found a fascinating piece of institutional history, but before I share the link with you I would like to focus on how this relates to FAIR data.
As you can imagine, the beginning of translation services in the EU (1951 onwards) was accompanied by a lot of manual translation work, back then probably done by employees with typewriters. Even after computers became a commodity, human experts remained necessary. After all, we are not talking about internal communication, but about drafting high-quality translations for official EU documents such as legislation. All that labor, all those documents, they must have some value, right?
I am not an expert in Machine Learning, but I know that good training data is essential! The tremendous amount of translated texts from official EU bodies is the kind of FAIR data I am talking about. One early example is a research paper from 2006 that introduced a large corpus covering, by then, 20 EU languages, with parallel paragraph alignment for 190+ language pairs and documents classified with the EUROVOC vocabulary (Q1370467): The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages (DOI: 10.48550/arXiv.cs/0609058).
You can find EU legislation on the EUR-Lex platform (Q1276282), and via the N-Lex interface (Q16465232) you can search in national databases. Content on the EUR-Lex platform is identified, among other identifiers, by the CELEX identifier (P476). The licenses are open: the metadata on EUR-Lex has a CC0 license, while the editorial content, summaries of EU legislation and consolidated texts are under a CC BY 4.0 license. And additionally:
“The Commission’s document reuse policy is based on Decision 2011/833/EU. Unless otherwise specified, you can re-use the legal documents published in EUR-Lex for commercial or non-commercial purposes.”
Thus this data really is FAIR! Openly available corpora of high-quality translations. And I’m not only talking about the text data. There was also a substantial amount of work to build and align terminologies between all the diverse language pairs, covering not only languages like English, French or Spanish that are also spoken in other parts of the world, but the other official languages as well.
Curious and up for a bit of fascinating history? Check out the tools and workflows of the European Commission and learn about the work of translators in the 1950s, 1960s and so on in the open access publication Translation at the European Commission – A history (DOI: 10.2782/898949).
🌐 Wikidata spotlight: conference proceedings
Last week I spent some time on Wikidata to add some niche information. Then I realized that an important entity was missing.
As I mentioned last week, I have been at a conference organized at the Karlsruhe Institute of Technology (KIT) in Karlsruhe, Germany. Obviously, Wikidata has entities for the city of Karlsruhe (Q1040), the country Germany (Q183) and the KIT (Q309988). A while ago I had also already created an item for the conference series (Q120753642) and one for the 2023 edition of the conference (Q120753666). But what about the conference proceedings?
The image above (created by this query) shows conferences in Germany with and without proceedings. But how are proceedings actually represented in Wikidata? I looked at existing conferences and it turns out that the modeling is similar to that of conferences: there is a distinction between a proceedings series and particular proceedings.
A proceedings series usually links to a publisher. The CoRDI proceedings were published Open Access by the German National Library of Science and Technology (TIB) (Q2399120). However, the particular publishing entity, TIB Open Publishing, was not yet on Wikidata! I was surprised that there was no entity for it. I cross-checked to be sure and afterwards created the entity (Q122704468) with information found on their website.
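For readers who want to experiment, here is a minimal sketch of a query of this kind, written as a small Python helper that only builds the SPARQL text (it makes no network call). This is not the query linked above. The identifiers are my assumptions and should be checked on Wikidata before use: Q2020153 for "academic conference", P17 for "country" and P4745 for "is proceedings from". The function name build_conference_query is hypothetical.

```python
def build_conference_query(country_qid: str = "Q183", limit: int = 200) -> str:
    """Build a SPARQL query listing conferences in a country, with proceedings if any.

    Assumed identifiers (verify on Wikidata): Q2020153 = academic conference,
    P17 = country, P4745 = is proceedings from. Q183 = Germany (from the text).
    """
    return f"""
SELECT ?conference ?conferenceLabel ?proceedings WHERE {{
  ?conference wdt:P31 wd:Q2020153 ;        # assumed class: academic conference
              wdt:P17 wd:{country_qid} .   # located in the given country
  # OPTIONAL keeps conferences that have no proceedings item linked to them
  OPTIONAL {{ ?proceedings wdt:P4745 ?conference . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(build_conference_query())
```

The printed text can be pasted into the Wikidata Query Service. Rows where ?proceedings is empty would be the conferences without proceedings.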
Thus, job done and I am satisfied. The basic structure for the proceedings of the CoRDI conference is available:
The conference series: Q120753642 (and its proceedings series Q122704392)
The 2023 edition of the conference: Q120753666 (and its proceedings Q122704150)
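If you would like to look at these items yourself, the following sketch turns the identifiers from this issue into Wikidata links. The QIDs are copied from the text above; the helper name item_url and the dictionary labels are my own choices for illustration.

```python
import re

# Standard pattern for a Wikidata item page.
WIKIDATA_ITEM_URL = "https://www.wikidata.org/wiki/{qid}"

# QIDs as given in this issue; the labels are informal descriptions.
CORDI_ITEMS = {
    "conference series": "Q120753642",
    "proceedings series": "Q122704392",
    "2023 edition": "Q120753666",
    "2023 proceedings": "Q122704150",
    "TIB Open Publishing": "Q122704468",
}


def item_url(qid: str) -> str:
    """Return the Wikidata page URL for an item identifier such as 'Q183'."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return WIKIDATA_ITEM_URL.format(qid=qid)


if __name__ == "__main__":
    for label, qid in CORDI_ITEMS.items():
        print(f"{label}: {item_url(qid)}")
```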
That’s it for this week of the FAIR Data Digest. If you found the content interesting, please share and subscribe. See you next week!
Sven