Semantic Models

This page provides an opportunity to study the WDCM semantic models. The WDCM organizes its knowledge of Wikidata usage into semantic categories and currently uses 14 of them. Each semantic category encompasses a set of Wikidata items that match a particular intuitive, natural concept (e.g. "Human", "Geographical Object", "Event", etc.).
The WDCM develops a semantic topic model (see: Topic Model) for each semantic category. Each semantic model encompasses a number of topics, or semantic themes. Each topic is characterized by the importance of Wikidata items from the respective semantic category in that topic. Here you can browse the semantic categories and inspect the structure of topics that are encompassed by the respective semantic model. You can also learn about the most important projects in a given category for a given topic from its semantic model.
The Dashboard will initialize with a random choice of Category and pick the first Topic from its semantic model. Use the drop-down menus to select a category and a topic from its semantic model. Three outputs will be generated on this page: the Top 50 items chart, the topic similarity network, and the Top 50 projects in this topic chart (scroll down).




Top 50 items in this topic

The chart represents the top 50 most important items in this topic. The importance of each item is given by its probability of being generated by this particular semantic topic (horizontal axis). The items are ranked; the rank numbers next to the labels on the vertical axis correspond to the rank numbers in parentheses next to the data labels that show the item Wikidata IDs. There's a game that you can play here: ask yourself what makes these 50 items go together, what makes them similar, what unifying principle holds them together in the same semantic topic? Do not forget: it is not only about what you know about the World, but also about how our communities use Wikidata on their respective projects.




Topic similarity network

Each bubble represents one of the top 50 most important items in this semantic topic. Each item points towards the item to which it is most similar. Similarity between items is derived not only from item importances (i.e. probabilities) in this topic, but from all topics that are encompassed by this category's semantic model. In interpreting the similarities, do not forget that the game is not only about what you know about the World, but also about how different communities use Wikidata. The more similarly the items are used across the sister projects, the more likely they are to group together in this network. You can drag the network and the nodes around, and zoom in and out with your mouse wheel.
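
The "each item points towards its most similar item" rule can be illustrated with a minimal sketch. The item IDs and probabilities below are made up, and cosine similarity is an assumption chosen for illustration; this page does not state which similarity measure the Dashboard actually uses.

```python
import math

# Hypothetical item-by-topic probabilities: each list holds one item's
# probability under every topic of a category's model (made-up numbers).
item_topic = {
    "Q1": [0.40, 0.05, 0.01],
    "Q2": [0.38, 0.06, 0.02],
    "Q3": [0.02, 0.50, 0.03],
    "Q4": [0.01, 0.45, 0.05],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors):
    """Map every item to the other item it is most similar to."""
    edges = {}
    for item, vec in vectors.items():
        candidates = [
            (cosine_similarity(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != item
        ]
        edges[item] = max(candidates)[1]
    return edges

edges = most_similar(item_topic)
for source, target in edges.items():
    print(source, "->", target)
```

Because the vectors span all topics of the model, two items can end up linked even when their probabilities in the currently selected topic differ.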




Top 50 projects in this topic

In a nutshell: here you can see which projects use the selected topic from the respective semantic category the most. The chart represents the top 50 projects with respect to the prominence of the selected topic. In the WDCM topic models, the usage pattern of any particular semantic category of Wikidata items, in a particular project, can be viewed as a mixture of semantic topics from the respective category's semantic model. Thus, each project's usage pattern in a particular semantic category can be expressed as a set of proportions, one per topic, describing how much each topic contributes to it. The horizontal axis represents the proportion (i.e. the probability) of the selected topic's presence in a particular project. Projects are found on the vertical axis, with the rank numbers corresponding to those near the data points in the chart.
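
As a rough illustration of the ranking described above, the sketch below treats each project as a mixture of topics and orders projects by the proportion of one selected topic. The mixtures and the helper name `top_projects` are hypothetical, not taken from the WDCM code.

```python
# Hypothetical topic mixtures: for one semantic category, each project's
# usage pattern is a list of topic proportions that sums to 1.
project_topic = {
    "enwiki": [0.70, 0.20, 0.10],
    "dewiki": [0.25, 0.60, 0.15],
    "frwiki": [0.50, 0.30, 0.20],
}

def top_projects(mixtures, topic_index, n=50):
    """Rank projects by the proportion of the selected topic, largest first."""
    ranked = sorted(
        mixtures.items(),
        key=lambda pair: pair[1][topic_index],
        reverse=True,
    )
    return [
        (rank, project, mix[topic_index])
        for rank, (project, mix) in enumerate(ranked[:n], start=1)
    ]

# Topic 1 is index 0 in this sketch.
for rank, project, proportion in top_projects(project_topic, topic_index=0):
    print(rank, project, proportion)
```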






WDCM Semantics :: Wikidata, WMDE 2019

Contact: Goran S. Milovanovic, Data Scientist, WMDE
e-mail: goran.milovanovic_ext@wikimedia.de
IRC: goransm



Project Semantics

Here you can make a selection of projects and learn about the importance of all available semantic topics from each semantic category in the project(s) of your choice. Note: You can search and add projects into the Search projects field by using (a) project names (e.g. enwiki, dewiki, sawikiquote, and similar) or (b) project types that start with "_" (underscore, e.g. _Wikipedia, _Wikisource, _Commons, and similar; try typing anything that starts with an underscore into the Search projects field). Please note that by selecting a project type (again: _Wikipedia, _Wikiquote, and similar) you are selecting all client projects of the respective type, and that is potentially a lot of data. The Dashboard will pick unique projects from whatever you have inserted into the Search projects field.
Note: The Dashboard will initialize with a choice of all Wikipedia projects. Then you can make a selection of projects of your own and hit Apply Selection to obtain the result.
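
The selection rules above (project names, underscore-prefixed project types, and unique projects only) can be sketched as follows. The project-to-type table and the function name `resolve_selection` are hypothetical; the Dashboard's own lookup is not shown on this page.

```python
# Hypothetical lookup of client projects and their project types.
project_types = {
    "enwiki": "Wikipedia",
    "dewiki": "Wikipedia",
    "sawikiquote": "Wikiquote",
    "enwikisource": "Wikisource",
}

def resolve_selection(entries, types):
    """Expand '_Type' entries to all projects of that type and drop duplicates."""
    selected = []
    for entry in entries:
        if entry.startswith("_"):
            wanted = entry[1:]
            matches = [p for p, t in types.items() if t == wanted]
        else:
            matches = [entry] if entry in types else []
        for project in matches:
            if project not in selected:  # keep each project only once
                selected.append(project)
    return selected

print(resolve_selection(["enwiki", "_Wikipedia", "sawikiquote"], project_types))
```

Here `enwiki` is named directly and also covered by `_Wikipedia`, so it appears once in the resolved selection.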



Semantic Topics in Wikimedia Projects

The vertical axes represent the % of topic engagement in this particular selection of Wikimedia projects.
Note: Please be reminded that semantic topics are category-specific: each category has its own semantic model, and each semantic model encompasses a number of topics. For example, Topic 1 is not the same thing in two different categories. You can learn about the content of any semantic topic in any of the semantic categories on the Semantic Models tab, and in fact that is what one should do before any attempt to interpret the data provided here.




Similarity Maps

Select a semantic category of Wikidata items to take a look at. A 2D map will be generated where each project is represented by a bubble, and where the distance between the projects corresponds to the similarity in their usage of Wikidata items from the selected category. Think of semantic categories as perspectives from which you can look at the structure of similarity that holds among the Wikimedia projects with respect to their usage of Wikidata items.



Similarity Map

Each bubble represents a client project. The size of the bubble reflects the volume of Wikidata usage in the respective project; a logarithmic scale is used in this plot.
Projects similar in respect to their usage of Wikidata items from the selected category are grouped together. Use the tools next to the plot legend to explore the plot and hover over bubbles for details.





WDCM Semantics Dashboard

Description


Introduction

This Dashboard is a part of the Wikidata Concepts Monitor (WDCM). The WDCM system provides analytics on Wikidata usage across the Wikimedia sister projects. The WDCM Semantics Dashboard is probably the central and analytically the most complicated of all WDCM Dashboards. Here we provide only the basics of distributional semantics needed to understand the results of semantic topic modeling presented on this dashboard. A user who needs to dive deep into the similarity structures between the Wikimedia sister projects with respect to their Wikidata usage patterns will most probably have to do some additional reading first. However, the Dashboard simplifies the presentation of the results as much as possible to make them accessible to any Wikidata user or Wikipedia editor who is not necessarily involved in Data or Cognitive Science. Reading through the WDCM Semantic Topic Models section on this page is highly advised for anyone who has never met semantic topic models or distributional semantics before. Before that, our next stop: Definitions.


Definitions

N.B. The current Wikidata item usage statistic is defined as the number of pages in a particular client project where the respective Wikidata item is used. Thus, the current definition ignores the usage aspects completely. This definition is motivated by the present constraints in Wikidata usage tracking across the client projects (see Wikibase/Schema/wbc entity usage). With more mature Wikidata usage tracking systems, the definition will become subject to change.
The term Wikidata usage volume is reserved for total Wikidata usage (i.e. the sum of usage statistics) in a particular client project, group of client projects, or semantic category.
By a Wikidata semantic category we mean a selection of Wikidata items that is operationally defined by a respective SPARQL query, which returns items that intuitively match a human, natural semantic category. The structure of Wikidata does not necessarily match any intuitive human semantics. In WDCM, an effort is made to select the semantic categories so as to match intuitive, everyday semantics as much as possible, in order to assist anyone involved in analytical work with this system. However, the choice of semantic categories in WDCM is not necessarily exhaustive (i.e. the categories do not necessarily cover all Wikidata items), nor are the categories necessarily mutually exclusive. The Wikidata ontology is very complex and the product of the work of many people, so there is an optimization price to be paid in every attempt to adapt or simplify its present structure to the needs of a statistical analytical system such as WDCM. The current set of WDCM semantic categories is thus not normative in any sense and is subject to change at any moment, depending upon the analytical needs of the community.
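
The two definitions above, the usage statistic (pages per item per project) and the usage volume (sum of usage statistics), can be made concrete with a small sketch. The record layout and the function names are assumptions made for illustration; they do not describe the actual usage tracking tables.

```python
from collections import defaultdict

# Hypothetical usage records: (client project, page id, Wikidata item).
records = [
    ("enwiki", 1, "Q42"),
    ("enwiki", 1, "Q42"),  # same item on the same page: still one page
    ("enwiki", 2, "Q42"),
    ("enwiki", 2, "Q64"),
    ("dewiki", 7, "Q64"),
]

def usage_statistics(rows):
    """Usage statistic: number of distinct pages per (project, item)."""
    pages = defaultdict(set)
    for project, page, item in rows:
        pages[(project, item)].add(page)
    return {key: len(page_set) for key, page_set in pages.items()}

def usage_volume(stats):
    """Usage volume: sum of usage statistics within each project."""
    volume = defaultdict(int)
    for (project, _item), count in stats.items():
        volume[project] += count
    return dict(volume)

stats = usage_statistics(records)
print(stats)
print(usage_volume(stats))
```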

The currently used WDCM Taxonomy of Wikidata items encompasses the following 14 semantic categories: Geographical Object, Organization, Architectural Structure, Human, Wikimedia, Work of Art, Book, Gene, Scientific Article, Chemical Entities, Astronomical Object, Thoroughfare, Event, and Taxon.


WDCM Semantic Topic Models

Suggested Readings

While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc.). WDCM thus employs various statistical approaches in an attempt to describe and provide insights from the observable Wikidata usage statistics (e.g. topic modeling, clustering, dimensionality reduction, all beyond providing elementary descriptive statistics of Wikidata usage, of course).
Wikidata Usage Patterns. The “golden line” that connects the reasoning behind all WDCM functions can be non-technically described in the following way. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: for example, if 200 projects are under analysis, then we have 200 counts for each of the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, representing the frequency of occurrence of all Wikidata entities under consideration in one of the 200 projects, is a vector that represents a particular Wikidata usage pattern. By inspecting and statistically modeling the usage pattern matrix (a matrix that encompasses all such usage patterns across the projects) or the derived covariance/correlation matrix, many insights into the similarities between Wikimedia projects (or, more precisely, the similarities between their usage patterns) can be found.
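
A toy version of the usage pattern matrix and its derived correlation matrix might look like the sketch below. The counts are invented, only three projects and four items are used instead of 200 and N, and the Pearson correlation is written out by hand to keep the example dependency-free.

```python
import math

# Hypothetical usage pattern matrix: one column of counts per project,
# rows ordered as in `items` (made-up numbers).
items = ["Q1", "Q2", "Q3", "Q4"]
usage = {
    "enwiki": [100, 50, 10, 5],
    "frwiki": [80, 45, 12, 4],
    "ruwiki": [5, 10, 60, 90],
}

def pearson(x, y):
    """Pearson correlation between two usage patterns of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Derived correlation matrix over projects, stored as a dict of pairs.
correlations = {
    (p, q): pearson(usage[p], usage[q]) for p in usage for q in usage
}

for (p, q), r in correlations.items():
    if p < q:
        print(p, q, round(r, 3))
```

In this toy data the enwiki and frwiki patterns are strongly positively correlated, while ruwiki is negatively correlated with both, which is the kind of structure the similarity analyses build on.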
In essence, the technology and mathematics behind WDCM relies on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns for tens of millions of Wikidata entities across its client projects.


Dashboard: Semantic Models

Each of the 14 currently used semantic categories in the WDCM Taxonomy of Wikidata items receives a separate topic model. Each topic model encompasses two or more topics, or semantic themes. Here you can select a semantic category (e.g. "Geographical Object", "Human") and a particular topic from its model. The page will produce three outputs: (1) the Top 50 items in this topic chart, which presents the 50 most important items in the selected topic of the selected category's topic model, (2) the Topic similarity network, which presents the similarity structure among the 50 most important items in the selected topic, and (3) the Top 50 projects in this topic chart, which presents the 50 Wikimedia projects in which the selected topic plays the most prominent role in the selected semantic category.


Dashboard: Project Semantics

Make a selection of Wikimedia projects here and hit Apply Selection. The Dashboard will produce a series of charts, one for each Wikidata semantic category that is present in your selection of projects, and compute the relative importance (%) of each topic in the given selection for each semantic category. Do not forget that category-specific semantic models do not necessarily encompass the same number of topics (in fact, they rarely do); also, Topic n in one category is not the same thing as Topic n in some other category.
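
One way to picture the "relative importance (%)" of topics in a selection is sketched below. This page does not say how the Dashboard aggregates over projects, so the rule used here (sum each topic's proportion over the unique selected projects, then normalize to 100%) is purely an assumption, as are the data and the name `topic_percentages`.

```python
# Hypothetical topic mixtures for one semantic category (made-up numbers).
project_topic = {
    "enwiki": [0.70, 0.20, 0.10],
    "dewiki": [0.25, 0.60, 0.15],
    "frwiki": [0.50, 0.30, 0.20],
}

def topic_percentages(mixtures, selection):
    """Percent share of each topic across the unique selected projects.

    Assumed aggregation: sum topic proportions over projects, then
    rescale so that the shares add up to 100.
    """
    unique = [p for p in dict.fromkeys(selection) if p in mixtures]
    chosen = [mixtures[p] for p in unique]
    n_topics = len(chosen[0])
    totals = [sum(mix[k] for mix in chosen) for k in range(n_topics)]
    grand_total = sum(totals)
    return [100.0 * total / grand_total for total in totals]

shares = topic_percentages(project_topic, ["enwiki", "dewiki", "enwiki"])
for k, share in enumerate(shares, start=1):
    print(f"Topic {k}: {share:.1f}%")
```

The topic numbers only have meaning within this one category; the same index in another category's model refers to a different topic.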


Dashboard: Similarity Maps

Upon selection of a semantic category, the Dashboard will present a 2D map which represents the similarities between the Wikimedia projects, computed from the selected category's semantic model only. Here you can learn how similar or dissimilar the sister projects are with respect to their usage of Wikidata items from a single semantic category.







WDCM Navigation

Your orientation in the WDCM Dashboards System


  • WDCM Portal
    The entry point to WDCM Dashboards.

  • WDCM Overview
    The big picture. Fundamental insights in how Wikidata is used across the client projects.

  • WDCM Semantics
    Detailed insights into the WDCM Taxonomy (a selection of semantic categories from Wikidata), its distributional semantics, and the way it is used across the client projects. If you are looking for Topic Models - that’s where they live.

  • WDCM Usage
    Fine-grained information on Wikidata usage across client projects and project types. Cross-tabulations and similar.

  • WDCM Geo
    Wikidata items interactive maps.

  • WDCM Structure
    A method to investigate the WDCM Taxonomy and improve the choice of items that undergo analyses.

  • WDCM Biases
    The WDCM gender bias and north-south divide statistics.

  • WDCM (S)itelinks
    The WDCM (S)itelinks usage aspect statistics.

  • WDCM (T)itles
    The WDCM (T)itles usage aspect statistics.


  • WDCM System Technical Documentation
    The WDCM Wikitech Page.

  • WDCM Wikidata Project Page
    The WDCM Wikidata Project Page.

  • The WDCM Journal
    A regularly updated selection of the most interesting empirical findings from wmdeanalytics.



