This summer I was contacted by Goodby, Silverstein & Partners, a San Francisco advertising agency working with Adobe Systems. GS&P hired me to design an interactive visualization for their Museum of Digital Media to illustrate how people contribute to Wikipedia and how these contributors form communities.
This was a challenge I could not refuse.
To design this visualization, three steps were necessary: data collection, data analysis and visualization.
The data collection seemed simple: I just had to download all the edits from a couple of thousand Wikipedia pages. Alas, it was more complicated than I thought. The full Wikipedia database is massive, and downloading and decompressing all of it would have taken more time than I had. Instead, with a bit of bash scripting, curl and a lot of bandwidth (about 100 GB), I wrote a script to download all the edits through the Special:Export page. After a couple of days and nights of downloading, I had more than a thousand large XML files. I needed to parse these files to extract the names of the users, their edits and the times of these edits. Using a small custom Java program, I parsed this data and dumped it into a relational database.
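To make the parsing step concrete, here is a minimal sketch in Python (not the Java program I actually used). It assumes the usual layout of a Special:Export file, with `<page>` elements containing a `<title>` followed by `<revision>` elements, each holding a `<timestamp>` and a `<contributor>`. The sample document, its namespace version and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed export file; real files are far larger.
SAMPLE_EXPORT = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Physics</title>
    <revision>
      <id>1</id>
      <timestamp>2010-06-01T12:00:00Z</timestamp>
      <contributor><username>Alice</username><id>42</id></contributor>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2010-06-02T08:30:00Z</timestamp>
      <contributor><ip>192.0.2.7</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_edits(xml_text):
    """Return one (page title, contributor, timestamp) tuple per revision."""
    edits = []
    root = ET.fromstring(xml_text)
    for page in root:
        if local_name(page.tag) != "page":
            continue
        title = None
        for child in page:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                user, timestamp = None, None
                for field in child:
                    field_name = local_name(field.tag)
                    if field_name == "timestamp":
                        timestamp = field.text
                    elif field_name == "contributor":
                        # Registered users carry <username>; anonymous edits carry <ip>.
                        for who in field:
                            if local_name(who.tag) in ("username", "ip"):
                                user = who.text
                edits.append((title, user, timestamp))
    return edits


for edit in extract_edits(SAMPLE_EXPORT):
    print(edit)
```

Matching on local tag names keeps the sketch independent of the exact export schema version. For files as large as the ones I downloaded, a streaming parser would be a better fit than loading each file whole.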
Now that the data was in the database, the real work could begin. Because I wanted to analyze data from real and committed users, I got rid of edits made from IP addresses (there is no way to identify users from their IP addresses) and by bots. Bots are automated tools that maintain Wikipedia, for example by reverting bad edits or vandalism. The resulting data set was a list of Wikipedia pages, their edits, the users who made them, and when those edits occurred. The next step was to find a way to extract communities based on the edits that users had made.
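The filtering step can be sketched as follows. This is only an illustration in Python: anonymous edits are recognized by checking whether the contributor name parses as an IP address, and bots are recognized by a naming heuristic (the account name ends in "bot"). That heuristic is an assumption made for this example, not necessarily the criterion I used, and the function names are hypothetical.

```python
import ipaddress


def is_ip_address(name):
    """True when the contributor name is an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def looks_like_bot(name):
    """Heuristic assumption: treat accounts whose name ends in 'bot' as bots."""
    return name.lower().endswith("bot")


def keep_human_edits(edits):
    """Keep (page, user, timestamp) tuples made by registered, non-bot users."""
    return [
        edit for edit in edits
        if edit[1] and not is_ip_address(edit[1]) and not looks_like_bot(edit[1])
    ]


edits = [
    ("Physics", "Alice", "2010-06-01T12:00:00Z"),
    ("Physics", "192.0.2.7", "2010-06-02T08:30:00Z"),
    ("Physics", "ExampleBot", "2010-06-03T09:00:00Z"),
    ("Physics", "2001:db8::1", "2010-06-04T10:00:00Z"),
]
print(keep_human_edits(edits))
```

A name-based rule will miss bots with unusual names and may drop the odd human, so a list of accounts flagged as bots would be more reliable where one is available.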
A natural way to identify communities is to look at collaboration networks. Since collaboration is such a vague word, I defined it narrowly: two Wikipedia contributors collaborate when they have edited the same page. Based on this assumption, I built a large collaboration network in which each node was a contributor and each edge represented a collaboration between two editors. To extract communities from this network, I clustered the nodes using the Louvain clustering algorithm (Blondel et al.). The end result was a list of users, their communities and the pages they edited. Interestingly, though not surprisingly, the communities were centered on scientific disciplines. For example, editors who contributed to one physics article tended to edit other physics-related articles.
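The network construction can be sketched in a few lines of Python. Each pair of contributors who edited the same page gets an edge. Weighting an edge by the number of pages the two editors share is an assumption added for this example, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_collaboration_network(edits):
    """Map each (user_a, user_b) pair to the number of pages both edited.

    `edits` is an iterable of (page, user) pairs.
    """
    editors_by_page = defaultdict(set)
    for page, user in edits:
        editors_by_page[page].add(user)

    edge_weights = defaultdict(int)
    for editors in editors_by_page.values():
        # Sorting gives every pair one canonical orientation (a < b).
        for a, b in combinations(sorted(editors), 2):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)


edits = [
    ("Physics", "Alice"),
    ("Physics", "Bob"),
    ("Quantum mechanics", "Alice"),
    ("Quantum mechanics", "Bob"),
    ("Quantum mechanics", "Carol"),
    ("Botany", "Dave"),
]
print(build_collaboration_network(edits))
```

The resulting weighted edge list is the kind of input a Louvain implementation expects. For instance, it could be loaded into a NetworkX graph and passed to `louvain_communities`, although any implementation of the algorithm would do.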
The visualization of this data was a challenge since time was limited and the data set was multidimensional. Furthermore, since it was for a museum, it had to be attractive, so I could not just slap a couple of charts on a web page and call it a day. To stay within my time frame, I had to drop the time element of the data set and concentrate only on the communities and the articles they edited.
The design of this visualization is quite simple. On the left-hand side, all science-related articles are listed, grouped by category. On the right-hand side, the communities participating in the selected article are illustrated by particle flows, each proportional to that community's contribution to the article. A color is assigned to each community, and each article is colored according to the community that has contributed the most to it.
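The quantities behind this design are easy to compute. The sketch below, in Python, assumes that a community's contribution to an article is measured by its share of the article's edits, which is one plausible measure rather than a description of the exact one used. The function names and sample data are hypothetical.

```python
from collections import Counter


def community_shares(article_editors, community_of):
    """Return each community's fraction of an article's edits.

    `article_editors` lists one username per edit; `community_of` maps
    usernames to community ids. Edits by unknown users are ignored.
    """
    counts = Counter(
        community_of[user] for user in article_editors if user in community_of
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {community: n / total for community, n in counts.items()}


def dominant_community(shares):
    """Community with the largest share; it determines the article's color."""
    return max(shares, key=shares.get) if shares else None


community_of = {"Alice": 0, "Bob": 0, "Carol": 1}
article_editors = ["Alice", "Alice", "Bob", "Carol"]  # one entry per edit

shares = community_shares(article_editors, community_of)
print(shares, dominant_community(shares))
```

Each share can then drive the density of one particle flow, while the dominant community picks the article's color.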