ONTOLOGY. Each node in this graph represents either a particular Wikidata language item or a Wikidata class that encompasses different languages in the ontology. The relations between languages and language classes in Wikidata are organized through three properties: P31 (instance of), P279 (subclass of), and P361 (part of). The relational structure of languages in the ontology is not always systematic (e.g. sometimes a language is both a P279 (subclass of) and a P361 (part of) of a language class or another language). The Language/Class tab in this Dashboard can help you inspect these relationships more closely and decide whether a change in the ontology needs to be introduced.
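The kind of inconsistency mentioned above (one language linked to the same target by both P279 and P361) can be checked mechanically. The following is a minimal sketch, not the dashboard's own code: the triple format, the helper name `mixed_relations`, and the `Q_...` identifiers are hypothetical placeholders rather than real Wikidata statements.

```python
from collections import defaultdict

# Hypothetical (subject, property, object) triples; the Q_... ids are
# placeholders, not real Wikidata items.
triples = [
    ("Q_lang_a", "P31", "Q_language"),
    ("Q_lang_b", "P279", "Q_family_x"),
    ("Q_lang_b", "P361", "Q_family_x"),
    ("Q_lang_c", "P279", "Q_family_x"),
    ("Q_lang_d", "P361", "Q_family_y"),
]

def mixed_relations(triples, props=("P279", "P361")):
    """Return {(subject, object): sorted properties} for pairs linked by every property in props."""
    seen = defaultdict(set)
    for subj, prop, obj in triples:
        seen[(subj, obj)].add(prop)
    return {pair: sorted(found) for pair, found in seen.items() if set(props) <= found}

inconsistent = mixed_relations(triples)
print(inconsistent)
```

Only the pair that carries both properties is reported; pairs linked by a single property are left out.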
Use the toolbox in the top-right corner of the graph to zoom or pan. Hovering over a particular node reveals its details: the respective Wikidata item alongside the number of relations (P31/P279/P361) that it enters in the graph. If a node represents a particular language, additional information is displayed: the number of labels for that language in Wikidata, the percent of the items that have a label in that language and are also reused across the wikis, the WDCM (Wikidata Concepts Monitor) reuse statistic, and the number of sitelinks for the respective language's item. For the definition of the WDCM reuse statistic see the Description tab.
The Fruchterman-Reingold algorithm in {igraph} is used to visualize the network.
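For readers unfamiliar with this layout, the basic force scheme can be sketched in a few lines. This is an illustrative pure-Python version written for this description, not the {igraph} implementation the dashboard uses; the function name, parameters, cooling rate, and the toy graph are all assumptions.

```python
import math
import random

def fruchterman_reingold(nodes, edges, width=1.0, height=1.0, iterations=50, seed=0):
    """Return {node: (x, y)} using the basic Fruchterman-Reingold force scheme."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(0, width), rng.uniform(0, height)] for v in nodes}
    k = math.sqrt(width * height / len(nodes))  # ideal distance between nodes
    temp = width / 10.0                         # maximum displacement per step
    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair of nodes, magnitude k^2 / d.
        for i, v in enumerate(nodes):
            for u in nodes[i + 1:]:
                dx = pos[v][0] - pos[u][0]
                dy = pos[v][1] - pos[u][1]
                d = max(math.hypot(dx, dy), 1e-9)
                f = k * k / d
                disp[v][0] += dx / d * f
                disp[v][1] += dy / d * f
                disp[u][0] -= dx / d * f
                disp[u][1] -= dy / d * f
        # Attraction along edges, magnitude d^2 / k.
        for v, u in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            d = max(math.hypot(dx, dy), 1e-9)
            f = d * d / k
            disp[v][0] -= dx / d * f
            disp[v][1] -= dy / d * f
            disp[u][0] += dx / d * f
            disp[u][1] += dy / d * f
        # Move each node, capped by the temperature, and keep it inside the frame.
        for v in nodes:
            dx, dy = disp[v]
            d = max(math.hypot(dx, dy), 1e-9)
            step = min(d, temp)
            pos[v][0] = min(width, max(0.0, pos[v][0] + dx / d * step))
            pos[v][1] = min(height, max(0.0, pos[v][1] + dy / d * step))
        temp *= 0.95  # cool down so that later moves get smaller
    return {v: tuple(p) for v, p in pos.items()}

# Toy chain graph with placeholder node names.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
layout = fruchterman_reingold(nodes, edges)
```

Connected nodes attract, all nodes repel, and the shrinking temperature lets the layout settle. Production implementations add refinements that this sketch omits.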
Note. This Dashboard focuses on the language items that have any items in Wikidata and whose items are reused in our wikis. All properties used across this Dashboard are obtained by searching Wikidata starting from the set of language items thus defined. This implies that the depiction of the language ontology presented here is not necessarily complete.





WD Languages Landscape :: Wikidata, WMDE 2019

Contact: Goran S. Milovanovic, Data Scientist, WMDE
e-mail: goran.milovanovic_ext@wikimedia.de
IRC: goransm




LANGUAGE/CLASS. This tab introduces a visual browser for Wikidata languages and related language classes. Upon selecting a particular entity (language or language class) from the drop-down menu on the left, the Dashboard will generate a graph of its immediate relational context, taking into account any of the P31 (instance of), P279 (subclass of), and P361 (part of) properties. While most of the organization of languages in Wikidata makes use of the P31 (instance of) and P279 (subclass of) properties, P361 (part of) is also used for some languages, and not always in a consistent way. By inspecting the Wikidata languages of interest in this visual browser you can decide whether their properties are consistently structured and, if you find it necessary or desirable, introduce a change in the Wikidata ontology later on. Hovering over a particular node will reveal the details: (a) the respective Wikidata item, (b) the number of labels for that language in Wikidata, (c) the percent of the items that have a label in that language and are also reused across the wikis, (d) the WDCM (Wikidata Concepts Monitor) reuse statistic, (e) the number of sitelinks for the respective language's item, and (f) the UNESCO/Ethnologue language status for the respective language (if the respective data are present in Wikidata). For the definition of the WDCM reuse statistic see the Description tab.
Note. This Dashboard focuses on the language items that have any items in Wikidata and whose items are reused in our wikis. All properties used across this Dashboard are obtained by searching Wikidata starting from the set of language items thus defined. This implies that the depiction of the language ontology presented here is not necessarily complete.



Select a language, a language class, or a linguistic category.
The dashboard will generate an interactive graph of the selected entity with its immediate relational context.
Note. The P361 Part of relations are represented by dashed links, while solid links represent P31 Instance of or P279 Subclass of relations.
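As an illustration of what "immediate relational context" means here, the sketch below collects every P31/P279/P361 statement in which the selected entity is either the subject or the object, and tags each link with the line style described in the note. It is a hypothetical sketch, not the dashboard's code: the triple format, the helper name `immediate_context`, and the `Q_...` identifiers are assumptions.

```python
# Line styles as described in the note: dashed for P361, solid otherwise.
EDGE_STYLE = {"P31": "solid", "P279": "solid", "P361": "dashed"}

# Hypothetical (subject, property, object) triples with placeholder ids.
triples = [
    ("Q_lang_a", "P31", "Q_language"),
    ("Q_lang_a", "P279", "Q_family_x"),
    ("Q_lang_b", "P361", "Q_family_x"),
    ("Q_family_x", "P279", "Q_language"),
    ("Q_lang_c", "P31", "Q_language"),
]

def immediate_context(triples, entity):
    """Return the one-hop edges around `entity`, each with its display style."""
    edges = []
    for subj, prop, obj in triples:
        if prop in EDGE_STYLE and entity in (subj, obj):
            edges.append({"from": subj, "to": obj,
                          "property": prop, "style": EDGE_STYLE[prop]})
    return edges

context = immediate_context(triples, "Q_family_x")
for edge in context:
    print(edge)
```

Statements that do not touch the selected entity are left out, so the resulting graph stays small enough to inspect by eye.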









LABEL SHARING. Each bubble in the graph represents a Wikidata language. We first look at each of the ~60M Wikidata items to see which languages label them. Then we look at the languages, pairwise, and determine the similarity of the way they are used in Wikidata by assessing the items that both languages in a pair refer to (the language overlap) and the items which one of them refers to but the other does not (the mismatch). From these data we compute a similarity index between any two Wikidata languages.
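The text does not name the similarity index. As one possible reading of "overlap versus mismatch", the sketch below uses the Jaccard index (shared items divided by all items labelled in either language). The choice of Jaccard is an assumption, not a statement about the dashboard's actual computation, and the toy items and helper names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: item -> set of language codes that label it.
labels = {
    "Q1": {"en", "de", "sr"},
    "Q2": {"en", "de"},
    "Q3": {"en"},
    "Q4": {"sr", "hr"},
}

def items_by_language(item_labels):
    """Invert item -> languages into language -> set of labelled items."""
    out = {}
    for item, langs in item_labels.items():
        for lang in langs:
            out.setdefault(lang, set()).add(item)
    return out

def jaccard(a, b):
    """Overlap divided by overlap plus mismatch (assumed index, see lead-in)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

by_lang = items_by_language(labels)
similarity = {
    (x, y): jaccard(by_lang[x], by_lang[y])
    for x, y in combinations(sorted(by_lang), 2)
}
for pair, value in sorted(similarity.items()):
    print(pair, round(value, 3))
```

In this toy data "hr" and "sr" share one of the two items that either of them labels, so their index is 0.5, while "de" and "hr" share nothing and score 0.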
Two visualizations of the similarity data are provided in this tab. The Static/Clusters graph represents each Wikidata language by its Wikimedia code, and each language points towards the language to which it is most similar in the sense described above. A clustering algorithm in {igraph} is used to group the languages according to their overall similarity, and the cluster boundaries are overlaid on the graph.
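The "each language points towards its most similar language" rule amounts to a nearest-neighbour graph. A minimal sketch follows, assuming a symmetric similarity table keyed by language pairs; the values and the helper name `nearest_neighbours` are hypothetical, and ties are resolved by whichever pair is encountered first. The clustering step is not reproduced here.

```python
# Hypothetical symmetric similarity values between language codes.
similarity = {
    ("de", "en"): 0.67, ("de", "sr"): 0.33, ("de", "hr"): 0.0,
    ("en", "sr"): 0.25, ("en", "hr"): 0.0, ("hr", "sr"): 0.5,
}

def nearest_neighbours(sim):
    """Return {language: most similar other language}."""
    best = {}
    for (a, b), score in sim.items():
        # Each pair is a candidate neighbour in both directions.
        for src, dst in ((a, b), (b, a)):
            if src not in best or score > best[src][1]:
                best[src] = (dst, score)
    return {src: dst for src, (dst, _) in best.items()}

edges = nearest_neighbours(similarity)
print(edges)
```

Every language gets exactly one outgoing link, which is why mutually most-similar pairs show up in the graph as two-way connections.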
The second visualization, Interactive/Clusters, presents exactly the same data in a different way: the languages are represented by bubbles whose size corresponds to the number of labels in Wikidata for the respective language. Hovering over any language in the graph will reveal more detailed information. Again, each language is connected to the one to which it is most similar in terms of its usage across the Wikidata entities. While the previous two Dashboard tabs (Ontology and Language/Class) focus on how the Wikidata language items are connected in Wikidata itself, this tab represents the empirical similarity relations between languages, based on the way they are used to refer to any Wikidata entities. By comparing the structure of languages in the ontology with the usage similarity patterns here we can study whether languages with similar properties also tend to refer to the same sets of entities, or whether they are used in a way that is independent of their properties.








LANGUAGE STATUS. We use the UNESCO language endangerment categories and the Ethnologue language status as (and when) reported in Wikidata to study several indicators of how a particular language stands in Wikidata and across the WMF's projects in general. Each chart in this tab is described in more detail immediately beneath it.
The horizontal axes in the following interactive charts always represent the language status category. Each bubble represents a Wikidata language, but different variables are mapped onto the size of the bubble and the charts' vertical axis in each panel. We focus on the following language usage indicators here: (a) the number of sitelinks for the respective Wikidata language, (b) the number of labels that it has (i.e. the number of entities to which it refers in Wikidata), and (c) the WDCM reuse statistic for all the items referred to from a particular language (for the definition of the WDCM reuse statistic see the Description tab).
By following the usage indicators (sitelinks, number of labels, and reuse across the WMF projects) for Wikidata languages of less favorable status we can recognize which languages we need to focus on in order to represent them in Wikidata and across the Wikimedia universe, and thus help their preservation. When combined with the structural and empirical similarity data provided on the previous tabs in this Dashboard, these data can help formulate strategies to improve the digital representation of underrepresented languages in general.



UNESCO status and sitelinks. The vertical axis and the size of the bubble represent the number of sitelinks for each particular language with the UNESCO language status data found in Wikidata.



Ethnologue status and sitelinks. The vertical axis and the size of the bubble represent the number of sitelinks for each particular language with the Ethnologue language status data found in Wikidata.



UNESCO status and number of labels. The vertical axis and the size of the bubble represent the number of labels (log scale) present in Wikidata for each particular language with the UNESCO language status data found in Wikidata.



Ethnologue status and number of labels. The vertical axis and the size of the bubble represent the number of labels (log scale) present in Wikidata for each particular language with the Ethnologue language status data found in Wikidata.



UNESCO status and the WDCM reuse statistic. The vertical axis and the size of the bubble represent the total WDCM reuse statistic (log scale) for all items referred to from each particular language with the UNESCO language status data found in Wikidata.



Ethnologue status and the WDCM reuse statistic. The vertical axis and the size of the bubble represent the total WDCM reuse statistic (log scale) for all items referred to from each particular language with the Ethnologue language status data found in Wikidata.








LANGUAGE USAGE. The charts in this Dashboard tab represent various indicators of language usage in Wikidata and across the Wikimedia projects. They are meant for a more general assessment of the Wikidata languages, in contrast to the detailed analytics provided in the previous tabs. Each chart in this tab is described in more detail immediately beneath it.
For the definition of the WDCM reuse statistic see the Description tab.



Number of labels and the total WDCM reuse statistic. The horizontal axis in this chart represents the number of items (i.e. labels) for each particular Wikidata language that are ever used across the Wikimedia projects. The vertical axis and the size of the bubble represent the total WDCM reuse statistic for all items referred to from each particular language.



Number of labels and the total WDCM reuse statistic (log scale). The horizontal axis in this chart represents the number of items (i.e. labels, log scale) for each particular Wikidata language that are ever used across the Wikimedia projects. The vertical axis and the size of the bubble represent the total WDCM reuse statistic (log scale) for all items referred to from each particular language.



Number of labels and the average WDCM reuse statistic. The horizontal axis in this chart represents the number of items (i.e. labels, log scale) for each particular Wikidata language that are ever used across the Wikimedia projects. The vertical axis and the size of the bubble represent the average WDCM reuse statistic (log scale) computed across all items referred to from each particular language.



Number of labels and the percent of reused items. The horizontal axis in this chart represents the number of items (i.e. labels, log scale) for each particular Wikidata language that are ever used across the Wikimedia projects. The vertical axis and the size of the bubble represent the percent of items referred to from each particular language that are ever reused across the Wikimedia projects.
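The percentage on the vertical axis can be illustrated with a small sketch. It assumes that an item counts as "reused" when its reuse statistic is greater than zero; that threshold, the toy numbers, and the helper name `percent_reused` are assumptions made for illustration only.

```python
# Hypothetical toy data: reuse statistic per item, and the languages labelling it.
item_reuse = {"Q1": 100, "Q2": 10, "Q3": 0, "Q4": 4}
item_labels = {
    "Q1": {"en", "de"},
    "Q2": {"en"},
    "Q3": {"en", "sr"},
    "Q4": {"sr"},
}

def percent_reused(item_reuse, item_labels):
    """Return {language: percent of its labelled items with reuse > 0}."""
    labelled, reused = {}, {}
    for item, langs in item_labels.items():
        for lang in langs:
            labelled[lang] = labelled.get(lang, 0) + 1
            if item_reuse.get(item, 0) > 0:
                reused[lang] = reused.get(lang, 0) + 1
    return {lang: 100.0 * reused.get(lang, 0) / n for lang, n in labelled.items()}

percentages = percent_reused(item_reuse, item_labels)
print(percentages)
```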







Wikidata Languages Landscape Dashboard

Dashboard description


Introduction


This dashboard was developed by WMDE in response to the Wikidata Languages Landscape Phabricator task and in the scope of preparations for WikidataCon 2019 (the main topic of the conference: languages and Wikidata). The WD Languages Landscape dashboard relies on different data sources to provide a comprehensive picture of how different languages are used in Wikidata and - via the entities that they refer to - how they are mapped across the universe of Wikimedia projects:

In addition, the List of All Wikimedia Language Codes - as maintained at Wikidata and periodically updated by a bot - is used to verify the language codes obtained from these data sources. The goal of this dashboard is to provide insights into the various aspects of language use in Wikidata. We study the Wikidata language items, and then the entities that any of the Wikidata languages refers to, to infer the total reuse of a language across the Wikimedia projects. We derive similarity metrics from language x language matrices of overlap in their labels across almost 60 million Wikidata items. We study the way Wikidata languages are represented in its ontology and compare this representation to the similarity across languages inferred from empirical data on their overlap in signification across the Wikidata entities. We also look at the UNESCO and Ethnologue language status categories and compare the status of a language against various indicators of its use and reuse in Wikidata and the Wikimedia projects.
Several means of data visualization are employed in this dashboard, relying on {plotly}, {ggplot2}, {igraph}, and {visNetwork} in R to visualize the complex relationships discovered in our study of the Wikidata languages. Apache Spark and PySpark are used for ETL purposes. All computation is, of course, done in R; the dashboard itself is developed in {shiny} and deployed on an open-source RStudio Shiny Server instance in CloudVPS.
N.B. All presented results are relative to the latest version of the WD Dump processed and copied to the WMF Data Lake (see: Phabricator).


Definitions


N.B. The current Wikidata item usage statistic (or WDCM reuse statistic) is defined as the number of pages in a particular client project where the respective Wikidata item is used. Thus, the current definition ignores the usage aspects completely. This definition is motivated by the current constraints in Wikidata usage tracking across the client projects (see Wikibase/Schema/wbc entity usage). With more mature Wikidata usage tracking systems, the definition will be subject to change.
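A minimal sketch of this definition, assuming usage records of the form (client project, page id, entity id, usage aspect): the aspect is dropped, so a page that uses the same item through several aspects is counted once. The rows, the helper name `reuse_statistic`, and the final summation across projects are illustrative assumptions, not the production ETL.

```python
# Hypothetical usage rows: (client project, page id, entity id, usage aspect).
usage_rows = [
    ("enwiki", 1, "Q42", "S"),
    ("enwiki", 1, "Q42", "L.en"),  # same page, different aspect: counted once
    ("enwiki", 2, "Q42", "S"),
    ("dewiki", 7, "Q42", "T"),
    ("dewiki", 7, "Q64", "S"),
]

def reuse_statistic(rows):
    """Return {(entity, project): number of distinct pages using the entity}."""
    # Dropping the aspect collapses multiple usages on the same page.
    distinct = {(project, page, entity) for project, page, entity, _aspect in rows}
    counts = {}
    for project, _page, entity in distinct:
        counts[(entity, project)] = counts.get((entity, project), 0) + 1
    return counts

per_project = reuse_statistic(usage_rows)

# Assumption: an item's overall statistic is the sum over client projects.
total = {}
for (entity, _project), n in per_project.items():
    total[entity] = total.get(entity, 0) + n
print(per_project, total)
```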
In this dashboard we use the WDCM reuse statistic to represent the extent to which a particular language is dominant in the Wikimedia projects. It is done in the following way: we look at all the Wikidata entities that have a label in a particular language, and we take the total (the sum) of their respective WDCM reuse statistics to represent the reuse statistic of a language. Some of the charts also use the average WDCM reuse statistic: that is the mean of the WDCM reuse statistics taken from all Wikidata entities that have a label in a particular language. The motivation for this measure is simple: it assigns a higher rank to languages that can refer to highly popular Wikidata entities. Of course, it is all relative to the Wikidata/Wikimedia universe: a language might be able to refer to some particular referent (i.e. have a word for something), but the respective sign (word) might not yet be present in Wikidata as a label.
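The per-language total and average described above can be sketched as follows. The item ids, reuse values, and the helper name `language_reuse` are hypothetical; only the sum and mean logic follows the definition in the text.

```python
# Hypothetical toy data: reuse statistic per item, and the languages labelling it.
item_reuse = {"Q1": 100, "Q2": 10, "Q3": 0, "Q4": 4}
item_labels = {
    "Q1": {"en", "de"},
    "Q2": {"en"},
    "Q3": {"en", "sr"},
    "Q4": {"sr"},
}

def language_reuse(item_reuse, item_labels):
    """Return {language: {"total", "mean", "n_labels"}} over its labelled items."""
    totals, counts = {}, {}
    for item, langs in item_labels.items():
        for lang in langs:
            totals[lang] = totals.get(lang, 0) + item_reuse.get(item, 0)
            counts[lang] = counts.get(lang, 0) + 1
    return {
        lang: {
            "total": totals[lang],
            "mean": totals[lang] / counts[lang],
            "n_labels": counts[lang],
        }
        for lang in totals
    }

stats = language_reuse(item_reuse, item_labels)
for lang in sorted(stats):
    print(lang, stats[lang])
```

In this toy data "de" labels a single, heavily reused item, so its mean is higher than that of "en", even though "en" has the larger total.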



