A Map of the Geographical Structure of Wikipedia Links


There are a lot of Wikipedia visualizations. Some concentrate on article contents, others on the links between articles and some use the geocoded content (like in my previous blog post).
This new visualization is novel because it uses the geographical content of Wikipedia in conjunction with the links between articles. In other words, if a geocoded article (that is, an article associated with a location like a city) links to another geocoded article, a line will be drawn between these two points. The result can be found on the map on the left.

Read on for zoomed views, slideshows, browsable maps, etc.

Methodology

Scroll down to see the slideshows, pretty pictures and interactive maps.


The first thing I had to do was to extract the geographical data included in the articles and the links between the articles. Instead of parsing the very complicated Wikipedia markup, I chose to use the good work done by the folks at GeoNames. In the download section, there is a SQL file with the name of every geocoded Wikipedia article. Then, I downloaded all English articles in Wikipedia (9GB compressed, about 40GB uncompressed) and used a bit of regex magic to extract reentrant links (that is, hyperlinks that link to geocoded articles). After these steps, I was left with two datasets: a list of all geocoded articles and a list of all links between articles.
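To give an idea of what this extraction step can look like, here is a minimal Python sketch. It is not the code used for the map: the regular expression, the `normalize` and `geocoded_links` helpers, and the `coords` dictionary (title mapped to latitude and longitude) are all assumptions made for illustration, and real wikitext has many more edge cases (templates, redirects, nested links) than this handles.

```python
import re

# Hypothetical pattern: matches [[Target]], [[Target|label]] and
# [[Target#Section|label]], capturing only the target title.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def normalize(title):
    """Trim, turn underscores into spaces and capitalize the first letter."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def geocoded_links(source_title, wikitext, coords):
    """Return sorted (source, target) pairs where both ends are geocoded.

    `coords` is an assumed lookup table {title: (lat, lon)} built from the
    list of geocoded articles.
    """
    source = normalize(source_title)
    if source not in coords:
        return []
    targets = {normalize(m.group(1)) for m in WIKILINK.finditer(wikitext)}
    return sorted((source, t) for t in targets if t in coords and t != source)


# Toy data, made up for the example.
coords = {
    "Paris": (48.8567, 2.3508),
    "Berlin": (52.52, 13.405),
    "London": (51.5074, -0.1278),
}
text = (
    "[[Paris]] is twinned with [[berlin|the German capital]]; see also "
    "[[London#History|history]] and [[Croissant]]."
)
print(geocoded_links("Paris", text, coords))
# [('Paris', 'Berlin'), ('Paris', 'London')]
```

Only links whose target is in the geocoded list survive, so the link to the (non-geocoded) Croissant article is dropped, and so is the self-link.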
To draw the map, I used the same technology I developed for my map of scientific collaborations. I had to adjust the tool to add features like other geographical projections (the Mercator projection, while simple, makes Greenland seem as large as Africa), linear transformations, etc. The datasets computed in the previous steps were then parsed and drawn by my mapping tool. I then played with the colors in Photoshop to convert the outputted grayscale map to color. To build the browsable and overlay maps, I used the fantastic MapTiler tool. By the way, the input projection for this tool is Equidistant Cylindrical; knowing this earlier would have saved me a lot of time!
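For readers unfamiliar with these projections, here is a small Python sketch of the Equidistant Cylindrical mapping from latitude and longitude to pixel coordinates, plus the standard formula for how much Mercator inflates areas at a given latitude. The function names and the image size are assumptions for the example; this is not taken from the drawing tool.

```python
import math


def equirectangular(lat, lon, width, height):
    """Map (lat, lon) in degrees to (x, y) pixels, origin at the top left.

    Equidistant Cylindrical: x is linear in longitude, y is linear in latitude.
    """
    x = (lon + 180.0) / 360.0 * width
    y = (90.0 - lat) / 180.0 * height
    return x, y


def mercator_area_inflation(lat):
    """Area scale factor of the Mercator projection at a latitude (degrees).

    Mercator stretches both axes by 1 / cos(lat), so areas grow by the square.
    """
    return 1.0 / math.cos(math.radians(lat)) ** 2


# Hypothetical 3600 x 1800 canvas (10 pixels per degree).
print(equirectangular(0.0, 0.0, 3600, 1800))       # centre of the image
print(equirectangular(48.8567, 2.3508, 3600, 1800))  # roughly Paris
print(round(mercator_area_inflation(72.0), 1))     # central Greenland
```

At around 72° N the inflation factor is about 10, which is why Greenland looks comparable to Africa on a Mercator map even though Africa is far larger.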

Slideshow

This slideshow contains zoomed parts of the map of different countries, continents and regions. Click on a picture to enlarge it. Browse to the bottom of this blog post to download the full size map (200M pixels – 18MB JPEG file).

Browsable Map

This map is projected using a Robinson projection. It is a “compromise” projection, meaning that while it doesn’t resolve all the problems found in many projections, it minimizes most distortions.

Click here to open this map in a new window

Google Map Overlay

As the title suggests, this map is overlaid onto a Google Map so that cities, countries and other landmarks can be easily situated. Obviously, populated areas contain a lot of Wikipedia articles.

Click here to open this map in a new window

Data & High Resolution Files

There’s also a high resolution file, but Amazon was charging me a pretty penny to host and serve it, so I removed it. Let me know if you want it; I’ll send you the file. I also have a 1.7G pixels file, but it is too large to host here, so let me know if you need it. It uses the Equidistant Cylindrical projection, not the Robinson one like the other high resolution file.
The input dataset (30MB compressed, around 95MB uncompressed) can be downloaded here; the fields should be self-explanatory.
The drawing tool will eventually be open sourced, but I need time to clean it up.

14 thoughts on “A Map of the Geographical Structure of Wikipedia Links”

  1. Is it possible to extract the article names from the nodes?
    There’s a dot in northern Finland, N 65° 51′ 0” E 29° 54′ 0”, that has numerous connections radiating toward south-west, but… there’s nothing there.
    Only one article nearby, https://en.wikipedia.org/wiki/Joukamojärvi, but the article has only two outward links to Finnish domains, and no internal links apart from the infobox. The area is just empty taiga forest.
    I’m really intrigued about what can be so important up there.

  2. Amazing work, kudos!
    With regards to what Chris says, I was intrigued enough to try to find the culprit by following the links from nearby Bothnian Bay and studying the linked pages. I would bet that the point corresponds to the article for the Baltic Sea, which has loads of links and is not where one would expect:
    http://en.wikipedia.org/wiki/Baltic_Sea
    I have been perusing the article history, but can’t find any recent location-related vandalism. So it’s a mystery 😉

    1. Thanks.
      Yeah, the first render I did had a bit of weird stuff that I had to correct. For example, the FCC was located in Russia and all radio related pages would link to the FCC. I think the geodata in Wikipedia is less clean than we think.

  3. I’m intrigued by the big line across the south Atlantic. The only thing I can find there with a brief search is Bouvet Island but that’s uninhabited and glacial, so likely not a Tuvalu .tv domain situation.
    Perhaps that’s links that relate to the ocean itself?

  4. “A Map of the Geographical Structure of Wikipedia Links” -> “A Map of the Geographical Structure of English Wikipedia Links”
