A large number of Wikipedia articles are geocoded. This means that when an article pertains to a location, its latitude and longitude are linked to the article. As you can imagine, this can be useful for generating insightful and eye-catching infographics. A while ago, a team at Oxford built this magnificent tool to illustrate the language boundaries in Wikipedia articles. This led me to wonder if it would be possible to extract the different topics in Wikipedia.
This is exactly what I managed to do in the past few days. I downloaded all of Wikipedia, extracted 300 different topics using a powerful clustering algorithm, projected all the geocoded articles on a map and highlighted the different clusters (or topics) in red. The results were much more interesting than I thought. For example, the map on the left shows all the articles related to mountains, peaks, summits, etc. in red on a blue base map. The highlighted articles from this topic match the main mountain ranges exactly.
Read on for more details, pretty pictures and slideshows.
A bit about the process
You can skip this section if you don’t really care about the nitty-gritty of the production of the maps. Scroll down to get to the slideshows.
Getting the data
The first step in creating these maps was to retrieve all Wikipedia articles. There are 1.5 million of them, and only a portion (400,000) are geocoded, but this doesn’t matter, because it’s an all-or-nothing deal: everything must be downloaded. I had to download the raw data from this page. It’s quite a large download at 9GB compressed, and it expands to about 40GB once uncompressed. I then parsed this very large file to extract the article content, links and geographical coordinates.
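To give an idea of what the coordinate extraction can look like, here is a minimal Python sketch. It assumes the coordinates appear in the article wikitext as a {{coord|...}} template, in either decimal or degrees/minutes/seconds form. This is not the parser used for this post; the function names are hypothetical, and real articles also carry coordinates in infoboxes and other templates that this sketch ignores.

```python
import re

# Matches the body of a {{coord|...}} template (template name is case-insensitive).
# Assumption: no nested templates inside the coord template.
COORD_RE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)


def _to_degrees(parts, hemisphere):
    """Convert [deg, min, sec] (min and sec optional) to signed decimal degrees."""
    value = sum(p / 60 ** i for i, p in enumerate(parts))
    return -value if hemisphere in ("S", "W") else value


def parse_coord(wikitext):
    """Return (lat, lon) from the first coord template found, or None."""
    match = COORD_RE.search(wikitext)
    if match is None:
        return None
    # Keep positional fields only; drop named ones such as display=title.
    fields = [f.strip() for f in match.group(1).split("|")]
    fields = [f for f in fields if f and "=" not in f]

    numbers, result = [], []
    for field in fields:
        if field.upper() in ("N", "S", "E", "W"):
            if not numbers:
                return None
            result.append(_to_degrees(numbers, field.upper()))
            numbers = []
        else:
            try:
                numbers.append(float(field))
            except ValueError:
                break  # trailing parameters such as type:city
        if len(result) == 2:
            break

    if len(result) == 2:
        return result[0], result[1]
    if not result and len(numbers) >= 2:
        # Plain decimal form: {{coord|lat|lon}}
        return numbers[0], numbers[1]
    return None
```

For example, parse_coord("{{coord|57|18|22|N|4|27|32|W|display=title}}") returns a latitude a little above 57.3 and a longitude a little below -4.45, while text without a coord template returns None.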
To extract topics from this huge corpus, I used Latent Dirichlet Allocation. This algorithm can extract a given number of topics from a large corpus. Usually the optimal number of topics can be inferred by comparing the likelihood values of several runs with different numbers of topics. However, in this case, since the corpus is very large and each run is very time-consuming (50 hours on the most powerful AWS cluster instance), I chose a number relying on an educated guess and my LDA experience.
I ran the LDA algorithm using Yahoo’s LDA implementation since it’s quite fast and can be parallelized. After 50 hours, I got 300 different topics linked to 1.5 million articles, but because only 400,000 of them are geocoded, the rest of this post only pertains to these 400,000. You can download the topic descriptions here. The topics are very varied and range from geographical regions, ethnic groups, science, sports (including both kinds of football!), historical sites and even archeological dig sites.
Drawing the maps
The maps were generated with an array of tools ranging from standard Unix utilities to custom-developed tools in Python and Java. The previous steps provided me with two datasets: the first was the geographical coordinates of all geocoded Wikipedia articles, and the second was a linking table between the topics and the articles.
From there, using a custom tool, I mapped all the points using the Robinson projection. A map with all the articles was rendered in shades of blue and would serve as a base map.
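For the curious, the projection step can be sketched in a few lines of Python. The Robinson projection is defined by a lookup table at 5-degree latitude steps rather than by a formula. The sketch below uses Robinson's published table values and interpolates linearly between rows, which is a simplifying assumption (smoother interpolation schemes are common). It is an illustration, not the custom tool used for these maps, and the function name is hypothetical.

```python
import math

# Robinson's table at 5-degree latitude steps, from 0 to 90 degrees:
# X = relative length of the parallel, Y = relative distance from the equator.
X = [1.0000, 0.9986, 0.9954, 0.9900, 0.9822, 0.9730, 0.9600, 0.9427,
     0.9216, 0.8962, 0.8679, 0.8350, 0.7986, 0.7597, 0.7186, 0.6732,
     0.6213, 0.5722, 0.5322]
Y = [0.0000, 0.0620, 0.1240, 0.1860, 0.2480, 0.3100, 0.3720, 0.4340,
     0.4958, 0.5571, 0.6176, 0.6769, 0.7346, 0.7903, 0.8435, 0.8936,
     0.9394, 0.9761, 1.0000]


def robinson(lat_deg, lon_deg, central_meridian_deg=0.0, radius=1.0):
    """Project (lat, lon) in degrees to Robinson (x, y) map coordinates."""
    abs_lat = min(abs(lat_deg), 90.0)
    i = min(int(abs_lat // 5), 17)      # index of the table row at or below abs_lat
    frac = (abs_lat - 5 * i) / 5.0      # position between row i and row i + 1
    x_len = X[i] + (X[i + 1] - X[i]) * frac
    y_dist = Y[i] + (Y[i + 1] - Y[i]) * frac

    # Assumption: the longitude difference already lies within [-180, 180].
    lon = math.radians(lon_deg - central_meridian_deg)
    x = 0.8487 * radius * x_len * lon
    y = 1.3523 * radius * y_dist
    return x, math.copysign(y, lat_deg)
```

The resulting x and y values still need to be scaled and shifted to pixel coordinates before plotting. With radius 1, x spans roughly ±2.67 and y spans ±1.3523.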
Next, I generated 300 different datasets and rendered the same number of maps where the articles were in red; these were the topic maps. I then overlaid all these maps onto the base map using ImageMagick and added a caption at the bottom of each map to identify each topic.
You can download all the maps in high resolution (18 megapixels, 3MB per map, 900MB total) here.
This slideshow illustrates the interesting topics that I found while checking the finalized maps. Most of these maps are not related to political boundaries, but to subjects that are geographically interesting.
For example, all articles relating to football, climate, music, royal dynasties, naval bases, religions, etc. are highlighted. You can click on the maps to enlarge them and read the captions describing the highlighted articles.
Geographical, Colonial and Ethnic Boundaries
In this slideshow, you can see all the maps with strong geographical topics. Since geography is never far from history, a lot of maps show the colonial past of many countries. As ethnic groups don’t always fall inside political borders, several maps reveal the presence of multiple ethnic or cultural groups within a country or of groups stretching across borders. Other maps show old empires like the Ottoman, Roman or Persian empires.
You can download the geocoded data here. This file includes the topic id, the probability that the article belongs to this topic, an internal id, the name of the article, its latitude and longitude, and the pagerank of the article.
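If you want to rebuild the per-topic datasets from this file, the following Python sketch shows one way to do it. It assumes the file is tab-separated, has no header row, and has its columns in the order listed above; check the actual file before relying on any of that. The column names, the function name and the sample rows are invented purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed column order, following the description above.
COLUMNS = ["topic_id", "probability", "internal_id", "name",
           "lat", "lon", "pagerank"]


def group_by_topic(lines, min_probability=0.0):
    """Return {topic_id: [(lat, lon, name), ...]} for rows at or above min_probability."""
    topics = defaultdict(list)
    reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        if float(row["probability"]) >= min_probability:
            point = (float(row["lat"]), float(row["lon"]), row["name"])
            topics[int(row["topic_id"])].append(point)
    return dict(topics)


# Made-up rows, only to show the assumed shape of the input.
SAMPLE_ROWS = (
    "12\t0.81\t1001\tMont Blanc\t45.8326\t6.8652\t0.0004\n"
    "12\t0.12\t1002\tSome Village\t46.0\t7.0\t0.0001\n"
    "47\t0.66\t1003\tSome Stadium\t51.5\t-0.2\t0.0002\n"
)
```

Calling group_by_topic(io.StringIO(SAMPLE_ROWS), min_probability=0.5) keeps one point for topic 12 and one for topic 47. With the real file, you would pass the open file object instead of the sample string.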
The raw LDA output (including non-geocoded articles) is really massive. If you want it, post a comment (or contact me by email) and I’ll upload it.