The Wikidata Identifiers Landscape tracks the usage of Wikidata (WD) external identifiers and the overlap in their usage across WD items.
To get a first look at the dataset, browse the Similarity Map tab, which gives a global overview of the overlap in the usage of WD identifiers, and the Tables tab, where the underlying data can be found. The Overlap Network tab visualizes all Wikidata external identifiers in a network of nearest neighbors. The Identifier Classes tab shows the relationships between WD external identifiers that belong to the same class of WD identifiers. The Particular Identifier tab provides the data for any WD external identifier of your choice.


WD External Identifiers Similarity Map. NOTE: The selected class of identifiers (drop-down menu to the left) is highlighted. Each bubble represents an external identifier. The size of the bubble (NOTE: a logarithmic scale is used) reflects how many WD items are described by the respective identifier. Identifiers with a large overlap across the WD items that they describe are grouped together. Use the tools next to the plot legend to explore the plot, and hover over bubbles for details: the identifier label and the number of WD items that use it. The similarity map is produced by a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) of the identifier x identifier matrix of Jaccard distances.
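
To make the distance behind the map concrete, here is a minimal Python sketch of an identifier x identifier Jaccard distance matrix. It is an illustration only: the dashboard itself is built in R, the identifier and item names below are made-up placeholders, and the t-SNE step that turns the matrix into two-dimensional coordinates is not shown.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A n B| / |A u B|; two empty sets get distance 0 by convention."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(usage: dict) -> dict:
    """Build a symmetric distance lookup keyed by (identifier, identifier)."""
    ids = sorted(usage)
    dist = {(i, i): 0.0 for i in ids}
    for i, j in combinations(ids, 2):
        d = jaccard_distance(usage[i], usage[j])
        dist[(i, j)] = dist[(j, i)] = d
    return dist


# Hypothetical toy data: identifier -> set of items that carry it.
usage = {
    "ID_A": {"item1", "item2", "item3", "item4"},
    "ID_B": {"item2", "item3", "item4", "item5"},
    "ID_C": {"item9"},
}
dist = distance_matrix(usage)
```

In this toy data ID_A and ID_B share three of five distinct items, so they would sit close together in the map, while ID_C shares nothing with either and gets the maximum distance of 1.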



Wikidata Identifier Landscape :: WMDE 2019.

Contact: Goran S. Milovanovic, Data Scientist, WMDE
e-mail: goran.milovanovic_ext@wikimedia.de
IRC: goransm



WD External Identifiers Overlap and Usage Data. In the table to the left (Overlap Data), select a WD external identifier and the dashboard shows how many items it shares with every other identifier. The table to the right (Usage Data) lists all WD external identifiers alongside the number of items that they describe.
N.B. Having usage data for a particular identifier (i.e. the number of WD items that use it) does not necessarily imply that the identifier overlaps with any other identifier.
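
The difference between the two tables can be sketched in a few lines of Python. This is a hypothetical illustration with made-up identifier and item names, not the dashboard's own code (which is written in R): one function counts shared items for a selected identifier, the other counts items per identifier.

```python
def overlap_row(selected: str, usage: dict) -> dict:
    """Number of items the selected identifier shares with each other identifier."""
    return {
        other: len(usage[selected] & items)
        for other, items in usage.items()
        if other != selected
    }


def usage_counts(usage: dict) -> dict:
    """Number of items described by each identifier."""
    return {ident: len(items) for ident, items in usage.items()}


# Hypothetical toy data: identifier -> set of items that carry it.
usage = {
    "ID_A": {"item1", "item2", "item3"},
    "ID_B": {"item2", "item3", "item4"},
    "ID_C": {"item9"},
}
row_a = overlap_row("ID_A", usage)
row_c = overlap_row("ID_C", usage)
counts = usage_counts(usage)
```

Here ID_C has a usage count of 1 but an overlap of 0 with every other identifier, which is the situation the note above describes.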


Overlap Data

Usage Data







WD External Identifiers Overlap Network. Each bubble in the network represents a WD external identifier. Nearest neighbors in terms of the number of shared items (i.e. overlap) are connected: each identifier points a link towards its own nearest neighbor. The size of the bubble corresponds to the total number of items across which the respective identifier overlaps with other identifiers. Hover over a bubble to see the details of the respective identifier (identifier ID and the measure of total overlap), and use the toolbox in the top-right corner of the network to zoom, pan, or download.
The Fruchterman-Reingold algorithm in {igraph} is used to lay out the identifier network. Please be patient: we are rendering a huge network of Wikidata identifiers.
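
The construction described above (one link per identifier towards its nearest neighbor, bubble size from total overlap) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the data is a made-up toy example, ties are broken by dictionary order, identifiers with no overlap get no link, and the Fruchterman-Reingold layout step is not shown. The dashboard's actual R implementation may differ in all of these details.

```python
def nearest_neighbor_network(usage: dict):
    """Return (edges, total_overlap).

    edges maps each identifier to the identifier it shares the most items with;
    total_overlap sums the shared-item counts over all other identifiers.
    """
    edges = {}
    total_overlap = {}
    for ident, items in usage.items():
        shared = {
            other: len(items & other_items)
            for other, other_items in usage.items()
            if other != ident
        }
        total_overlap[ident] = sum(shared.values())
        # max() keeps the first candidate on ties; default covers a single-identifier input.
        best = max(shared, key=shared.get, default=None)
        if best is not None and shared[best] > 0:
            edges[ident] = best
    return edges, total_overlap


# Hypothetical toy data: identifier -> set of items that carry it.
usage = {
    "ID_A": {"a", "b", "c", "d"},
    "ID_B": {"b", "c", "d", "e"},
    "ID_C": {"d", "e"},
    "ID_D": {"z"},
}
edges, total_overlap = nearest_neighbor_network(usage)
```

In this toy network ID_B has the largest total overlap and is pointed to by two identifiers, so it would appear as the hub, while ID_D stays isolated.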








External Identifiers Wikidata Classes. WD external identifiers belong (in the sense of P31, instance of) to one or more WD identifier classes. Here you can generate a network of identifiers from any WD identifier class and inspect its neighborhood structure. Neighbors are identifiers that share a large number of the items that they describe. Each bubble represents an identifier, and each identifier is connected to its nearest neighbors (note: these neighbors do not necessarily belong to the same class). Hover over the bubbles to reveal the respective identifiers. The Fruchterman-Reingold algorithm in {igraph} is used to lay out the identifier network. For identifier classes that encompass a large number of identifiers, you will most probably need to zoom into the dense region of the graph to discover its structure. The table to the right lists all WD external identifiers that belong (in the sense of P31, instance of) to the selected class.







Wikidata External Identifiers. Select a particular WD external identifier from the drop-down menu and the dashboard will generate a concise summary of the type of resources that it describes and an (approximate) set of exemplar WD items or classes. If there is enough overlap between the selected identifier and other WD identifiers, a network of its neighbors (and their neighbors, to provide some context) will be generated. The Fruchterman-Reingold algorithm in {igraph} is used to lay out the identifier network. The table on the right lists several examples of WD items that are described by the selected identifier, alongside the value assigned to them and a list of the classes (in the sense of P31, instance of) they belong to. N.B. It is possible that the current dashboard update does not have enough data to visualize the neighborhood structure for particular identifiers, especially the less frequently used ones.
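
The idea of showing an identifier's neighbors together with their neighbors can be sketched as a two-hop breadth-first search. This is only an assumption about how such a neighborhood could be collected: the adjacency data and names below are hypothetical, and the source does not state how the dashboard selects the neighborhood.

```python
from collections import deque


def neighborhood(adjacency: dict, start: str, max_hops: int = 2) -> dict:
    """Return {identifier: hop distance} for everything within max_hops of start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical undirected neighbor links between identifiers.
adjacency = {
    "ID_A": {"ID_B"},
    "ID_B": {"ID_A", "ID_C"},
    "ID_C": {"ID_B", "ID_D"},
    "ID_D": {"ID_C"},
}
hops = neighborhood(adjacency, "ID_A")
```

Starting from ID_A, the result contains ID_B (a neighbor) and ID_C (a neighbor's neighbor) but not ID_D, which is three hops away.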




Typical usage of this identifier (Examples)

The result is based on the first 100 triples fetched from WDQS (i.e. LIMIT 100).
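
For orientation, a query of this kind could look like the one built by the Python sketch below. The exact query used by the dashboard is not given here, so the shape of the query, the variable names, and the helper function are assumptions; only the LIMIT 100 and the use of P31 come from the text. P214 is used purely as an example property.

```python
def example_usage_query(property_id: str, limit: int = 100) -> str:
    """Build a SPARQL query for items that carry the given external identifier.

    Hypothetical helper: it only assembles the query text and does not contact WDQS.
    """
    if not (property_id.startswith("P") and property_id[1:].isdigit()):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return (
        "SELECT ?item ?value ?class WHERE {\n"
        f"  ?item wdt:{property_id} ?value .\n"
        "  OPTIONAL { ?item wdt:P31 ?class . }\n"
        "}\n"
        f"LIMIT {limit}"
    )


query = example_usage_query("P214")
print(query)
```

Because the query stops at the first 100 results, the examples it returns are a convenience sample and need not be representative of all items that use the identifier.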






Wikidata Identifier Landscape


Introduction


This dashboard is developed by WMDE in response to the Phabricator task "Analyze and visualize the identifier landscape of Wikidata".

The WD Identifier Landscape dashboard relies on a dataset obtained by ETL with Apache Spark (via PySpark) against the copy of the Wikidata dump in HDFS (WMF Data Lake), post-processed in R. All machine learning procedures are performed in R (the t-SNE dimensionality reduction is handled by {Rtsne}).

The goal of this dashboard is to provide insight into the structure of the overlap in usage of the various WD external identifiers. Several means of data visualization are employed to that end, relying on {plotly} to visualize semantic maps and networks.

N.B. All presented results that rely on overlap in identifier usage across the WD items are relative to the latest version of the WD dump processed and copied to the WMF Data Lake (see: Phabricator). The identifier usage statistics (i.e. the number of WD items that use a particular identifier) are fetched from WDQS.


Dashboard Functionality


Visualizations. Visualizing the overlap structure of the WD external identifiers is a challenging task. To give as thorough an insight as possible into the similarity in usage of the various identifiers across the WD items, we employ several different approaches to data visualization. The landing tab, Similarity Map, presents a two-dimensional map in which each WD external identifier (that is in use at all) is represented by a bubble. In this map, the distance between identifiers represents the similarity in their usage: the higher the overlap across the WD items described by a particular pair of identifiers, the closer the respective bubbles stand in the map. The size of each bubble corresponds to the number of items that the respective identifier describes. Since any WD external identifier can fall into more than one WD class of identifiers, we have avoided using a color scheme to mark identifiers belonging to the same WD class. Instead, we let the user select a particular class of identifiers of interest and then color the respective bubbles in the map to ease recognition.

A more straightforward (and probably more popular) approach to visualizing datasets such as this one is to employ graphs. On the Overlap Network tab each identifier is again represented by a bubble and points towards its nearest neighbor: the identifier with which it shares the highest number of items. The size of the bubble in this visualization does not represent the number of items described by a particular identifier, but the extent of its total overlap with other identifiers, which makes the hubs in the network easier to spot.

The Tables tab lets the user browse for specific identifiers and inspect the exact extent of their overlap with all other WD external identifiers. A second table under the same tab gives the counts of items described by all considered identifiers.

As already explained, WD external identifiers belong to one or more WD identifier classes. The Identifier Classes tab provides a browser of these classes, generates a local neighborhood similarity structure (based on overlap in identifier usage) for all the identifiers found in the selected class, and lists all of the identifiers in the class. Similarly, the Particular Identifier tab provides information about a particular WD external identifier: its local neighborhood structure and a set of exemplar uses of that identifier, represented by a selection of items that it describes, the WD classes (in the sense of P31, instance of) to which these items belong, and the values of the selected identifier associated with them.



