A Map of the Geographic Structure of Wikipedia Topics

Wikipedia Topic 260

Mountains, peaks, summits, etc.

A large number of Wikipedia articles are geocoded. This means that when an article pertains to a location, its latitude and longitude are linked to the article. As you can imagine, this can be useful to generate insightful and eye-catching infographics. A while ago, a team at Oxford built this magnificent tool to illustrate the language boundaries in Wikipedia articles. This led me to wonder if it would be possible to extract the different topics in Wikipedia.

This is exactly what I managed to do over the past few days. I downloaded all of Wikipedia, extracted 300 different topics using a powerful clustering algorithm, projected all the geocoded articles onto a map and highlighted each cluster (or topic) in red. The results were much more interesting than I expected. For example, the map on the left shows all the articles related to mountains, peaks, summits, etc. in red on a blue base map. The highlighted articles from this topic closely follow the main mountain ranges.

Read on for more details, pretty pictures and slideshows.

A bit about the process

You can skip this section if you don’t really care about the nitty-gritty of the production of the maps. Scroll down to get to the slideshows.

Getting the data

Trains, stations, platforms, railways, etc.

The first step in creating these maps was to retrieve all Wikipedia articles. There are 1.5 million of them and only a portion (400,000) are geocoded, but this doesn’t matter, because it’s an all-or-nothing deal: everything must be downloaded. I had to download the raw data from this page. It’s quite a large download at 9GB compressed, and it expands to about 40GB once uncompressed. I then parsed this very large file to extract the article content, links and geographical coordinates.
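As a rough illustration of the parsing step (not the parser actually used for this post), here is a minimal Python sketch that streams a MediaWiki XML export and keeps only the pages carrying a decimal {{coord}} template. The function names are made up for the example, and the regular expression is an assumption about how coordinates appear in the wikitext: it deliberately skips degree/minute/second templates, which a real extractor would also need to handle.

```python
import io
import re
import xml.etree.ElementTree as ET

# Decimal-degree templates only, e.g. {{Coord|46.8|-71.2|display=title}}.
# The lookahead rejects degree/minute/second forms such as {{coord|46|48|N|...}},
# where a third numeric field or a hemisphere letter follows the second number.
COORD_RE = re.compile(
    r"\{\{\s*coord\s*\|\s*(-?\d+(?:\.\d+)?)\s*\|\s*(-?\d+(?:\.\d+)?)\s*"
    r"(?:\}\}|\|(?!\s*(?:[\d.]+|[NSns])\s*[|}]))",
    re.IGNORECASE,
)


def extract_coord(wikitext):
    """Return (lat, lon) from the first decimal coord template, or None."""
    match = COORD_RE.search(wikitext)
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


def local_name(tag):
    """Strip the XML namespace that MediaWiki exports put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_geocoded(xml_stream):
    """Yield (title, lat, lon) for each page that has a decimal coord template."""
    title, text = None, ""
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            coord = extract_coord(text)
            if coord is not None:
                yield title, coord[0], coord[1]
            title, text = None, ""
            elem.clear()  # release the finished page so memory stays bounded
```

Because the file is read as a stream and each page is cleared once processed, the 40GB dump never has to fit in memory.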

Identifying topics

Islands, coasts, beaches, oceans, etc.

To extract topics from this huge corpus, I used Latent Dirichlet Allocation (LDA). This algorithm extracts a given number of topics from a large corpus. Usually, the optimal number of topics can be inferred by comparing the likelihood values across several runs with different topic counts. However, in this case, since the corpus is very large and each run is very time-consuming (50 hours on the most powerful AWS cluster instance), I chose the number based on an educated guess and my experience with LDA.
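For readers curious about the likelihood comparison that was skipped here, a common way to do it is to compute the perplexity of held-out documents for each candidate topic count and keep the count with the lowest value. The Python sketch below shows only that scoring step. It assumes the per-document topic mixtures (theta) and per-topic word distributions (phi) have already been produced by some LDA implementation; the function names are invented for the example and are not part of Yahoo’s LDA.

```python
import math


def perplexity(docs, theta, phi):
    """Held-out perplexity of a fitted topic model.

    docs  -- list of documents, each a list of integer word ids
    theta -- theta[d][k] = P(topic k | document d)
    phi   -- phi[k][w]   = P(word w | topic k)
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, tokens in enumerate(docs):
        for w in tokens:
            # Marginalise over topics: P(w | d) = sum_k theta[d][k] * phi[k][w]
            p_word = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_word)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)


def pick_num_topics(perplexity_by_k):
    """Return the topic count with the lowest held-out perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

Each entry passed to pick_num_topics requires a full training run, which is why this comparison is impractical when a single run takes 50 hours.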

I ran the LDA algorithm using Yahoo’s LDA implementation since it’s quite fast and can be parallelized. After 50 hours, I got 300 different topics linked to 1.5 million articles, but because only 400,000 of them are geocoded, the rest of this post only pertains to these 400,000. You can download the topic descriptions here. The topics are very varied and range from geographical regions and ethnic groups to science, sports (including both kinds of football!), historical sites and even archaeological dig sites.

Drawing the maps

Archaeological, stone, site, ancient, remains, bronze, etc.

The maps were generated with an array of tools ranging from standard Unix utilities to custom-developed tools in Python and Java. The previous steps provided me with two datasets: the first was the geographical coordinates of all geocoded Wikipedia articles and the second was a linking table between the topics and the articles.

From there, using a custom tool, I mapped all the points using the Robinson projection. A map with all the articles was rendered in shades of blue and would serve as a base map.
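To give an idea of what the projection step involves, here is a minimal Python sketch of the Robinson projection. It is not the custom tool used for these maps. It relies on the commonly published Robinson lookup table (values at 5-degree steps) and uses plain linear interpolation between table rows, which is a simplification; treat the table values as something to double-check against a reference before relying on them.

```python
import math

# (parallel length X, distance from equator Y) for latitudes 0, 5, ..., 90 degrees,
# as commonly published for the Robinson projection.
ROBINSON_TABLE = [
    (1.0000, 0.0000), (0.9986, 0.0620), (0.9954, 0.1240), (0.9900, 0.1860),
    (0.9822, 0.2480), (0.9730, 0.3100), (0.9600, 0.3720), (0.9427, 0.4340),
    (0.9216, 0.4958), (0.8962, 0.5571), (0.8679, 0.6176), (0.8350, 0.6769),
    (0.7986, 0.7346), (0.7597, 0.7903), (0.7186, 0.8435), (0.6732, 0.8936),
    (0.6213, 0.9394), (0.5722, 0.9761), (0.5322, 1.0000),
]


def robinson(lat_deg, lon_deg, radius=1.0, central_meridian_deg=0.0):
    """Project a latitude/longitude pair (degrees) to Robinson x, y coordinates."""
    abs_lat = min(abs(lat_deg), 90.0)
    i = min(int(abs_lat // 5), 17)       # index of the table row below abs_lat
    frac = (abs_lat - 5 * i) / 5.0       # position between row i and row i + 1
    x_lo, y_lo = ROBINSON_TABLE[i]
    x_hi, y_hi = ROBINSON_TABLE[i + 1]
    x_len = x_lo + frac * (x_hi - x_lo)
    y_dist = y_lo + frac * (y_hi - y_lo)
    lam = math.radians(lon_deg - central_meridian_deg)
    x = 0.8487 * radius * x_len * lam
    y = 1.3523 * radius * y_dist
    return x, math.copysign(y, lat_deg)
```

The returned x and y are in units of the chosen radius; scaling and shifting them to pixel coordinates is then a single linear transform per axis.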

Next, I generated 300 different datasets and rendered the same number of maps where the articles were in red; these were the topic maps. I then overlaid all these maps onto the base map using ImageMagick and added a caption at the bottom of each map to identify each topic.
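The overlay and caption step could look roughly like the Python sketch below, which builds an ImageMagick command line and runs it once per topic. The exact options used for the published maps are not recorded in this post, so the font size, colour and caption offset are assumptions, as are the file names in the usage note. The sketch also assumes each topic map is a PNG with a transparent background so that the blue base map shows through, and that the ImageMagick 6 `convert` binary is installed (ImageMagick 7 uses `magick` instead).

```python
import subprocess


def overlay_command(base_png, topic_png, caption, out_png):
    """Build the argument list that overlays one topic map and adds a caption."""
    return [
        "convert", base_png, topic_png, "-composite",  # topic layer on top of base
        "-gravity", "South",                           # anchor the caption at the bottom
        "-pointsize", "36",                            # assumed font size
        "-fill", "white",                              # assumed caption colour
        "-annotate", "+0+20", caption,                 # 20 px above the bottom edge
        out_png,
    ]


def render_topic(base_png, topic_png, caption, out_png):
    """Run ImageMagick; raises CalledProcessError if the command fails."""
    subprocess.run(overlay_command(base_png, topic_png, caption, out_png), check=True)
```

For example, render_topic("base.png", "topic_260.png", "Mountains, peaks, summits, etc.", "map_260.png") would produce one finished map; looping over all 300 topic ids covers the full set.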

You can download all the maps in high-resolution (18M pixels – 3MB per map – 900MB total) here.

Interesting Topics

This slideshow illustrates the interesting topics that I found while checking the finalized maps. Most of these maps are not related to political boundaries, but to subjects that are geographically interesting.

Wikipedia Topic 4
Wikipedia Topic 16
Wikipedia Topic 17
Wikipedia Topic 20
Wikipedia Topic 23
Wikipedia Topic 27
Wikipedia Topic 38
Wikipedia Topic 39
Wikipedia Topic 48
Wikipedia Topic 49
Wikipedia Topic 60
Wikipedia Topic 77
Wikipedia Topic 85
Wikipedia Topic 86
Wikipedia Topic 95
Wikipedia Topic 97
Wikipedia Topic 109
Wikipedia Topic 111
Wikipedia Topic 127
Wikipedia Topic 130
Wikipedia Topic 137
Wikipedia Topic 141
Wikipedia Topic 147
Wikipedia Topic 149
Wikipedia Topic 154
Wikipedia Topic 156
Wikipedia Topic 174
Wikipedia Topic 175
Wikipedia Topic 193
Wikipedia Topic 194
Wikipedia Topic 202
Wikipedia Topic 216
Wikipedia Topic 227
Wikipedia Topic 231
Wikipedia Topic 241
Wikipedia Topic 242
Wikipedia Topic 245
Wikipedia Topic 246
Wikipedia Topic 247
Wikipedia Topic 251
Wikipedia Topic 259
Wikipedia Topic 260
Wikipedia Topic 262
Wikipedia Topic 263
Wikipedia Topic 291
Wikipedia Topic 299

For example, all articles relating to football, climate, music, royal dynasties, naval bases, religions, etc. are highlighted. You can click on the maps to enlarge them and read the captions describing the highlighted articles.

Geographical, Colonial and Ethnic Boundaries

In this slideshow, you can see all the maps with strong geographical topics. Since geography is never far from history, a lot of maps show the colonial past of many countries. As ethnic groups don’t always fall inside political borders, several maps reveal the presence of multiple ethnic or cultural groups within a country or of groups stretching across borders. Other maps show old empires like the Ottoman, Roman or Persian empires.

Wikipedia Topic 3
Wikipedia Topic 6
Wikipedia Topic 8
Wikipedia Topic 14
Wikipedia Topic 15
Wikipedia Topic 28
Wikipedia Topic 31
Wikipedia Topic 42
Wikipedia Topic 46
Wikipedia Topic 53
Wikipedia Topic 56
Wikipedia Topic 62
Wikipedia Topic 69
Wikipedia Topic 70
Wikipedia Topic 73
Wikipedia Topic 82
Wikipedia Topic 83
Wikipedia Topic 90
Wikipedia Topic 91
Wikipedia Topic 93
Wikipedia Topic 98
Wikipedia Topic 113
Wikipedia Topic 121
Wikipedia Topic 125
Wikipedia Topic 134
Wikipedia Topic 140
Wikipedia Topic 142
Wikipedia Topic 148
Wikipedia Topic 155
Wikipedia Topic 161
Wikipedia Topic 170
Wikipedia Topic 178
Wikipedia Topic 184
Wikipedia Topic 187
Wikipedia Topic 191
Wikipedia Topic 195
Wikipedia Topic 196
Wikipedia Topic 211
Wikipedia Topic 239
Wikipedia Topic 228
Wikipedia Topic 252
Wikipedia Topic 255
Wikipedia Topic 264
Wikipedia Topic 274
Wikipedia Topic 282
Wikipedia Topic 283
Wikipedia Topic 286
Wikipedia Topic 289
Wikipedia Topic 294
Wikipedia Topic 296

Open Data

You can download the geocoded data here. This file includes the topic id, the probability that the article belongs to this topic, an internal id, the name of the article, its latitude and longitude, and the PageRank of the article.
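If you want to explore the file yourself, a short Python sketch like the one below can pull out the articles of a single topic. The file layout is an assumption: the sketch expects a tab-separated file with a header row whose columns are named topic, prob, id, name, lat, lon and pagerank. Only the “prob” column name is confirmed; adjust the other names and the delimiter to match the actual download.

```python
import csv


def load_topic(lines, topic_id, min_prob=0.0, delimiter="\t"):
    """Return (name, lat, lon, prob) tuples for one topic.

    lines    -- an open text file or any iterable of lines, header row first
    topic_id -- integer topic id to keep
    min_prob -- drop articles whose topic probability is below this value
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    rows = []
    for row in reader:
        prob = float(row["prob"])
        if int(row["topic"]) == topic_id and prob >= min_prob:
            rows.append((row["name"], float(row["lat"]), float(row["lon"]), prob))
    return rows
```

Since an article can belong to several topics, the same article name may show up under more than one topic id; raising min_prob keeps only the articles most strongly tied to the topic.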

The raw LDA output (including non-geocoded articles) is really massive. If you want it, post a comment (or contact me by email) and I’ll upload it.

15 Thoughts on “A Map of the Geographic Structure of Wikipedia Topics”

  3. I think you should and MUST publish this as a section on wikipedia. Besides this, there might be a business idea here with smaller and more reduced data sets. I am a designer and active in design politics. My natural interest would be to see where other people who are interested in design live to make my network better and streamline my advertising for my products. What do you think?

  4. Logan Rhyne on February 13, 2013 at 10:07 pm said:

    Hey Olivier! This is absolutely fantastic work; I especially appreciate the amount of effort you went to in programmatically determining the clustering of topics and the resulting maps are really striking as well as informative. If you are willing to share your data, I’d love to play around with it on my own. I’d be happy to share anything interesting that pops up. Any thoughts on the easiest way to share?
    Cheers!
    Logan

    • Hi Logan,
      I’ve updated the blog post with a link to the raw geocoded data. The data file should be pretty self-explanatory. The column named “prob” is the probability that an article belongs to a certain topic. The Wikipedia page about Latent Dirichlet Allocation will have more details about the clustering algorithm. Also, please note that an article can have many topics. The topics are described in a text file linked in the “Identifying topics” section.

      Best,
      Olivier

  13. This publication shows outline maps of each county with parishes and clerical districts in Norway. It also contains a list of regions (districts) of Norway and shows which parishes belong to which region. The names of these regions are historical. Their boundaries determined by geological features.
