[caption id="attachment_685" align="alignleft" width="150"] Mountains, peaks, summits, etc.[/caption]
A large number of Wikipedia articles are geocoded. This means that when
an article pertains to a location, its latitude and longitude are linked
to the article. As you can imagine, this can be useful to generate
insightful and eye-catching infographics. A while ago, a team at Oxford
built this magnificent
illustrate the language boundaries in Wikipedia articles. This led me to
wonder if it would be possible to extract the different topics in
This is exactly what I managed to do in the past few days. I downloaded
all of Wikipedia, extracted 300 different topics using a powerful
clustering algorithm, projected all the geocoded articles on a map and
highlighted the different clusters (or topics) in red. The results were
much more interesting than I expected. For example, the map on the left
shows all the articles related to mountains, peaks, summits, etc. in red
on a blue base map. The highlighted articles from this topic match the
main mountain ranges exactly.
Read on for more details, pretty pictures and slideshows.
A bit about the process
You can skip this section if you don't really care about the
nitty-gritty of the production of the maps. Scroll down to get to the
Getting the data
[caption id="attachment_655" align="alignleft" width="150"] Trains, stations, platforms,
The first step in creating these maps was to retrieve all Wikipedia
articles. There are 1.5 million of them and only a portion (400,000) are
geocoded, but this doesn't matter, because it's an all-or-nothing deal:
everything must be downloaded. I had to download the raw data from this
It's quite a large download at 9GB compressed and it expands to about
40GB once it is uncompressed. I then parsed this very large file to
extract the article content, links and geographical coordinates.
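To give an idea of what this parsing step can look like, here is a minimal sketch in Python. It is not the tool used for this post. It assumes the dump is a MediaWiki XML export with `<page>`, `<title>` and `<text>` elements, and that coordinates are written as decimal `{{coord|lat|lon|...}}` templates; the function names `parse_coord` and `iter_articles` are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: coordinates are stored in a {{coord|...}} template.
COORD_TEMPLATE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)
# Matches the target of [[Target]] or [[Target|label]].
LINK = re.compile(r"\[\[([^\]|#]+)")


def parse_coord(wikitext):
    """Return (lat, lon) for a decimal-degree coord template, else None."""
    match = COORD_TEMPLATE.search(wikitext)
    if not match:
        return None
    parts = [part.strip() for part in match.group(1).split("|")]
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except (IndexError, ValueError):
        return None
    # A hemisphere letter or another number in third position means the
    # template uses degrees/minutes/seconds, which this sketch skips.
    if len(parts) > 2 and (parts[2].upper() in {"N", "S", "E", "W"}
                           or parts[2][:1].isdigit()):
        return None
    return lat, lon


def iter_articles(stream):
    """Stream the dump and yield (title, text, links, coords) per page."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            links = [target.strip() for target in LINK.findall(text)]
            yield title, text, links, parse_coord(text)
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory
```

Streaming with `iterparse` matters here because a 40GB file cannot be loaded into memory in one piece.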
[caption id="attachment_663" align="alignright" width="150"] Islands, coasts, beaches, oceans,
To extract topics from this huge corpus, I used Latent Dirichlet Allocation (LDA).
This algorithm can extract a given number of topics from a large corpus.
Usually the optimal number of topics can be inferred from the likelihood
values over several topic runs. However, in this case, since the corpus
is very large and each run is very time-consuming (50 hours on the most
powerful AWS cluster instance), I chose a number relying on an educated
guess and my LDA experience.
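For reference, the likelihood-based selection that was skipped here could look roughly like the sketch below. It assumes you already have one log-likelihood value per candidate topic count (ideally measured on held-out documents); the function name `pick_num_topics` and the `min_gain` threshold are illustrative choices, not part of any LDA library.

```python
def pick_num_topics(log_likelihoods, min_gain=0.01):
    """Pick the topic count after which the likelihood stops improving.

    log_likelihoods maps a candidate topic count to the log-likelihood of
    the model trained with that many topics. min_gain is the relative
    improvement below which a larger model is not considered worth it.
    """
    counts = sorted(log_likelihoods)
    if not counts:
        raise ValueError("need at least one run")
    best = counts[0]
    for smaller, larger in zip(counts, counts[1:]):
        gain = log_likelihoods[larger] - log_likelihoods[smaller]
        if gain < min_gain * abs(log_likelihoods[smaller]):
            break  # the curve has flattened out
        best = larger
    return best
```

With 50 hours per run, even four or five candidate values would mean more than a week of compute, which is why an educated guess was the pragmatic choice.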
I ran the LDA algorithm using Yahoo's LDA implementation since it's
quite fast and can be parallelized. After 50 hours, I got 300 different
topics linked to 1.5 million articles, but because only 400,000 of them
are geocoded, the rest of this post only pertains to these 400,000. You
can download the topic descriptions
The topics vary widely and include geographical regions, ethnic
groups, science, sports (including both kinds of football!), historical
sites and even archaeological dig sites.
Drawing the maps
[caption id="attachment_671" align="alignleft" width="150"] Archaeological, stone, site, ancient, remains, bronze,
The maps were generated with an array of tools ranging from standard
Unix utilities to custom-developed tools in Python and Java. The
previous steps provided me with two datasets: the first was the
geographical coordinates of all Wikipedia articles and the second was a
linking table between the topics and the articles.
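Combining the two datasets boils down to a join on the article id. The sketch below shows one way to do it in Python; the in-memory shapes (a dict of coordinates and a list of `(topic_id, article_id)` pairs) and the name `points_by_topic` are assumptions made for the example, not the actual file formats.

```python
from collections import defaultdict


def points_by_topic(coordinates, topic_links):
    """Group article coordinates by topic.

    coordinates: dict mapping article id -> (latitude, longitude)
    topic_links: iterable of (topic_id, article_id) pairs
    """
    grouped = defaultdict(list)
    for topic_id, article_id in topic_links:
        point = coordinates.get(article_id)
        if point is not None:  # articles without coordinates are skipped
            grouped[topic_id].append(point)
    return dict(grouped)
```

Each value in the result is the list of points to draw for one topic.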
From there, using a custom tool, I mapped all the points using the
Robinson projection. A map with all the articles was rendered in shades
of blue and would serve as a base map.
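The Robinson projection has no closed-form formula: it is defined by a lookup table with one row every 5 degrees of latitude. The sketch below is a hypothetical reimplementation, not the custom tool used for these maps. It uses Robinson's published table and, as a simplification, plain linear interpolation between rows.

```python
import math

# Robinson's table, one entry per 5 degrees of latitude from 0 to 90.
# PARALLEL_LENGTH scales the width of each parallel, PARALLEL_HEIGHT
# gives its distance from the equator.
PARALLEL_LENGTH = [1.0000, 0.9986, 0.9954, 0.9900, 0.9822, 0.9730, 0.9600,
                   0.9427, 0.9216, 0.8962, 0.8679, 0.8350, 0.7986, 0.7597,
                   0.7186, 0.6732, 0.6213, 0.5722, 0.5322]
PARALLEL_HEIGHT = [0.0000, 0.0620, 0.1240, 0.1860, 0.2480, 0.3100, 0.3720,
                   0.4340, 0.4958, 0.5571, 0.6176, 0.6769, 0.7346, 0.7903,
                   0.8435, 0.8936, 0.9394, 0.9761, 1.0000]


def robinson(lat_deg, lon_deg, radius=1.0, central_meridian_deg=0.0):
    """Project a latitude/longitude pair (in degrees) to Robinson x, y."""
    abs_lat = min(abs(lat_deg), 90.0)
    index = min(int(abs_lat // 5), 17)   # table row just below abs_lat
    frac = (abs_lat - 5 * index) / 5.0   # position between two rows

    def interpolate(table):
        return table[index] + frac * (table[index + 1] - table[index])

    x = (0.8487 * radius * interpolate(PARALLEL_LENGTH)
         * math.radians(lon_deg - central_meridian_deg))
    y = math.copysign(1.3523 * radius * interpolate(PARALLEL_HEIGHT), lat_deg)
    return x, y
```

The resulting x and y are in units of `radius`; turning them into pixel positions is a matter of scaling and shifting to the image size.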
Next, I generated 300 different datasets and rendered the same number of
maps where the articles were in red; these were the topic maps. I then
overlaid all these maps onto the base map using ImageMagick and added a
caption at the bottom of each map to identify each topic.
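The exact ImageMagick commands are not reproduced here, but the overlay and caption step can be sketched as follows. This assumes the legacy `convert` command (`magick` in ImageMagick 7) and topic maps rendered on a transparent background; the helper name `overlay_command`, the file names and the caption offset are all made up for the example.

```python
def overlay_command(base_map, topic_map, caption, output):
    """Build an ImageMagick command that stacks a topic map on the base
    map and writes a caption at the bottom of the result."""
    return [
        "convert", base_map, topic_map,
        "-composite",                     # draw topic_map over base_map
        "-gravity", "south",              # anchor the text at the bottom
        "-annotate", "+0+20", caption,    # 20 px above the bottom edge
        output,
    ]
```

The list can be passed directly to `subprocess.run(..., check=True)`, once per topic, to produce the 300 final images.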
You can download all the maps in high-resolution (18M pixels - 3MB per
map - 900MB total) here.
This slideshow illustrates the interesting topics that I found while
checking the finalized maps. Most of these maps are not related to
political boundaries, but to subjects that are geographically
For example, all articles relating to football, climate, music, royal
dynasties, naval bases, religions, etc. are highlighted. You can click
on the maps to enlarge them and read the captions describing the
Geographical, Colonial and Ethnic Boundaries
In this slideshow, you can see all the maps with strong geographical
topics. Since geography is never far from history, a lot of maps show
the colonial past of many countries. As ethnic groups don't always fall
inside political borders, several maps reveal the presence of multiple
ethnic or cultural groups within a country or of groups stretching
across borders. Other maps show old empires like the Ottoman, Roman or
You can download the geocoded data
here. This file includes the
topic id, the probability that the article belongs to this topic, an
internal id, the name of the article, its latitude and longitude, and
the PageRank of the article.
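If you want to play with the file, a reader along these lines should get you started. It is only a sketch: it assumes one record per line, tab-separated, with the seven fields in the order listed above and no header row. Check the actual file and adjust the delimiter or column order if they differ. The names `TopicRow` and `read_rows` are illustrative.

```python
import csv
from typing import NamedTuple


class TopicRow(NamedTuple):
    topic_id: int
    probability: float
    article_id: str
    title: str
    latitude: float
    longitude: float
    pagerank: float


def read_rows(lines, delimiter="\t"):
    """Yield one TopicRow per well-formed line, skipping the others."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 7:
            continue  # assumed layout has exactly seven columns
        topic, prob, article_id, title, lat, lon, rank = fields
        yield TopicRow(int(topic), float(prob), article_id, title,
                       float(lat), float(lon), float(rank))
```

Since `read_rows` accepts any iterable of lines, an open file object works as well as a list of strings.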
The raw LDA output (including non-geocoded articles) is really massive. If you
want it, post a comment (or contact me by email) and I'll upload it.