Wikidata is arguably the world's best-known open structured knowledge base. Generally speaking, all things stored in Wikidata are entities, which are either real-world objects such as Sky Tower or more abstract topics such as JavaScript. Such "common" entities are known as concepts or Q-items, and account for the majority of data stored in Wikidata.
In May 2018, Wikidata added support for a new kind of entity called lexemes or L-items. An important concept in lexicography and linguistic analysis, lexemes are the units of language that are used to group together words that are related through inflection. For example, the English verb run is a lexeme that refers to the set of words which includes run, runs, ran, and running—all of which share the same meaning. Capturing such lexicographical information is important, and doing so using linked data greatly increases the utility of the resulting dataset.
Each L-item in Wikidata has a headword and is linked to its language, its lexical category, and its lexical forms and senses. For example, the English noun table has multiple senses, including "item of furniture" and "arrangement of data in rows and columns", and lists the following forms: table, tables, table's, and tables'.
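To make the shape of an L-item concrete, here is a minimal sketch in Python. The `Lexeme` class and its field names are hypothetical and chosen for illustration only; they are not Wikidata's actual data model, which is richer (for instance, forms also carry grammatical features and senses carry glosses in multiple languages). The identifiers Q1860 (English) and Q1084 (noun) are the same ones used in the queries in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lexeme:
    """Simplified, illustrative model of a Wikidata lexeme (not the real schema)."""
    lemma: str                 # headword
    language: str              # Wikidata item ID of the language
    lexical_category: str      # Wikidata item ID of the lexical category
    forms: List[str] = field(default_factory=list)   # written forms only
    senses: List[str] = field(default_factory=list)  # short English glosses only


# The English noun "table", using the forms and senses listed above.
table = Lexeme(
    lemma="table",
    language="Q1860",          # English
    lexical_category="Q1084",  # noun
    forms=["table", "tables", "table's", "tables'"],
    senses=["item of furniture", "arrangement of data in rows and columns"],
)

print(f"{table.lemma}: {len(table.forms)} forms, {len(table.senses)} senses")
```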
As with all Wikidata entities, the statements about lexemes can be queried using the Wikidata Query Service.
For example, this query returns the canonical forms (lemmas) of all English noun lexemes:
    SELECT ?lexeme ?lemma WHERE {
      ?lexeme dct:language wd:Q1860 ;
              wikibase:lexicalCategory wd:Q1084 ;
              wikibase:lemma ?lemma .
    }
Query results (truncated):
lexeme | lemma |
---|---|
wd:L11845 | gut |
wd:L12158 | pollution |
wd:L12190 | trap |
wd:L12204 | transit |
wd:L12212 | freight |
... | ... |
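The same query can also be sent to the Wikidata Query Service from a script. The following is a rough sketch using only the Python standard library, not a definitive client. The helper names (`build_request`, `parse_bindings`, `run_query`), the `LIMIT 5` clause, and the User-Agent string are assumptions added for illustration. The parsing step assumes the standard SPARQL JSON results layout, in which each row lives under `results.bindings`.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The English noun query from above; LIMIT 5 is added here only to keep the
# example response small.
ENGLISH_NOUN_LEMMAS = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q1860 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
}
LIMIT 5
"""


def build_request(query, user_agent="lexeme-example/0.1 (placeholder contact)"):
    """Build a GET request for the query service without sending it."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder value; replace with a string identifying your tool.
            "User-Agent": user_agent,
        },
    )


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return parse_bindings(json.load(response))
```

Calling `run_query(ENGLISH_NOUN_LEMMAS)` requires network access and returns one dictionary per row, with the lexeme given as a full entity URI rather than the `wd:` shorthand shown in the table.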
To get all the words in all languages that mean "(liquid) water", the following query can be used:
    SELECT ?languageLabel ?lemma WHERE {
      ?lexeme dct:language ?language ;
              wikibase:lemma ?lemma ;
              ontolex:sense [ wdt:P5137 wd:Q29053744 ] .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }
    ORDER BY ?languageLabel
Query results (truncated):
languageLabel | lemma |
---|---|
'Are'are | wai |
Abaza | дзы |
Abkhaz | аӡы |
Acehnese | ie |
Achi | yaʼ |
... | ... |
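Because a language may have more than one word for the same sense, it can be handy to group the result rows by language before using them. A small sketch, assuming the rows have already been fetched and flattened into dictionaries keyed by the query's variable names; the function name `group_lemmas_by_language` is hypothetical, and the sample rows are taken from the truncated table above.

```python
from collections import defaultdict


def group_lemmas_by_language(rows):
    """Map each language label to the list of lemmas returned for it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["languageLabel"]].append(row["lemma"])
    return dict(grouped)


# Sample rows copied from the first lines of the result table.
rows = [
    {"languageLabel": "'Are'are", "lemma": "wai"},
    {"languageLabel": "Abaza", "lemma": "дзы"},
    {"languageLabel": "Abkhaz", "lemma": "аӡы"},
]

words_for_water = group_lemmas_by_language(rows)
print(words_for_water["Abaza"])
```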
High-quality structured linguistic datasets are the sine qua non of computational linguistics and its applications. As the need for more comprehensive linguistic datasets keeps growing, Wikidata lexemes can offer an open, linked-data alternative for systems that tackle problems such as word-sense disambiguation, machine translation, and text classification and summarisation.
Since its first release in 2018, the Lexicographical Data on Wikidata has grown to include over 300,000 lexemes in over 700 languages.
There is little doubt that Wikidata lexemes will become the go-to source of structured lexicographical data for language researchers and technologists.