A while back, I saw that the Internet Archive hosted an archive of CNN transcripts from 2000 to 2012. The first thing that came to my mind was that this was an amazing corpus to study: 12 years of news in textual form, all in one place. I felt that retrieving all the transcripts from 2000 to today would make an amazing project, and someone had already gone to the trouble of downloading this corpus.
Unfortunately, the data was basically a dump of the transcript pages from CNN. This isn’t a problem for archival purposes, but it makes analysis a bit difficult. For my new project, it meant that I would need to find a way to download all the transcripts from CNN, parse them and dump them to a database. To make things even more difficult, the HTML from the early 2000s was more about form than function. In other words, the CNN webmasters (in the 2000s, web designers and developers didn’t exist, they were webmasters!) would throw together something that rendered in Internet Explorer or Netscape Navigator and call it a day. No effort went into keeping the layout and content organized.
That meant that this part of the project would be boring as hell since I would need to write a parser, find some heuristics to make sure that the parsing was done correctly and then store the result in a database. Since I wasn’t building a business or a process around this project, and it would pretty much be a one-off, I went with the path of least resistance and scripted something specifically for this content source.
The script is written in Ruby since it’s a language, like Python, that has a lot of libraries and is pretty easy to learn. The script is split into two files: the first one is the script itself, while the other one is a configuration file that describes what parsing needs to be done to extract the relevant information. The configuration file declares a bunch of global constants, but also has a list of different parsing options that are invoked depending on the transcript’s date like so:
    baseurl: http://transcripts.cnn.com
    start_date: 2015-12-20
    end_date: 2015-12-31
    sleep_secs_between_transcripts: 1.5
    output_file_prefix: super_
    format_overides:
      - date: [2000-04-19, 2000-04-20]
        index_id: 0
        index_xpath: /html/body/table/tr/td
        transcript_id: 0
        transcript_xpath: /html/body/table/tr/td
    formats:
      - start_date: 2000-01-01 # inclusive
        end_date: 2000-03-15 # exclusive
        index_id: 0
        index_xpath: /html/body/table/tr/td
        transcript_id: 0
        transcript_xpath: /html/body/table/tr/td
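To make the date-based lookup concrete, here is a minimal sketch of how a script could pick the parsing options for a given transcript date. It is not the actual script: the helper names `load_config` and `format_for` are hypothetical, and it assumes that `format_overides` lists exact dates that take precedence over the ranges in `formats`.

```ruby
require "yaml"
require "date"

# Trimmed copy of the configuration shown above (only the date-dependent parts).
CONFIG_YAML = <<~YAML
  format_overides:
    - date: [2000-04-19, 2000-04-20]
      index_id: 0
      index_xpath: /html/body/table/tr/td
      transcript_id: 0
      transcript_xpath: /html/body/table/tr/td
  formats:
    - start_date: 2000-01-01 # inclusive
      end_date: 2000-03-15 # exclusive
      index_id: 0
      index_xpath: /html/body/table/tr/td
      transcript_id: 0
      transcript_xpath: /html/body/table/tr/td
YAML

# Unquoted YYYY-MM-DD values are parsed as Date objects,
# so Date has to be explicitly permitted for safe_load.
def load_config(text)
  YAML.safe_load(text, permitted_classes: [Date])
end

# Returns the parsing options for a transcript date, or nil when nothing matches.
# Assumption: an override on an exact date wins over the date ranges.
def format_for(config, date)
  override = config.fetch("format_overides", []).find { |o| o["date"].include?(date) }
  return override if override

  config.fetch("formats", []).find do |f|
    date >= f["start_date"] && date < f["end_date"] # start inclusive, end exclusive
  end
end

config = load_config(CONFIG_YAML)
fmt = format_for(config, Date.new(2000, 2, 1))
puts fmt ? fmt["transcript_xpath"] : "no format configured for this date"
```

Returning `nil` for an uncovered date gives the caller a clear signal that a new entry has to be added to the configuration, instead of silently parsing with the wrong XPath.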
The configuration was very laborious to build: every time the script failed, I had to look through the source code of the page to see how the layout had changed. The good news is that since mid-2005, CNN has kept pretty much the same page format, which made the parsing much easier.
Once the page had been retrieved and the text extracted using the appropriate configuration parameters, the script extracted the dialog lines so that the speaker and the content were separated like so:
| Speaker | Content |
|---|---|
| BILL CLINTON | My fellow Americans…. |
| ANCHOR-OF-THE-DAY | Breaking news in DC – |
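The speaker/content split can be sketched as follows. This is only an illustration under a loud assumption: that each turn in the extracted text starts with an uppercase speaker label followed by a colon, and that lines without such a label continue the previous turn. The `SPEAKER_LINE` pattern and the `split_dialog` helper are hypothetical, and the sample text is made up.

```ruby
# Assumption: a turn starts with an uppercase label followed by a colon.
# This is a heuristic; a line such as "AT 10:30 ..." would be misread as a speaker.
SPEAKER_LINE = /\A([A-Z][A-Z0-9 .,'()\-]*?):\s*(.*)\z/

# Splits raw transcript text into an array of { speaker:, content: } hashes.
def split_dialog(text)
  turns = []
  text.each_line do |raw|
    line = raw.strip
    next if line.empty?

    if (m = SPEAKER_LINE.match(line))
      turns << { speaker: m[1].strip, content: m[2] }
    elsif turns.any?
      # No speaker label: treat the line as a continuation of the previous turn.
      last = turns.last
      last[:content] = [last[:content], line].reject(&:empty?).join(" ")
    end
    # Lines that appear before the first speaker label are dropped.
  end
  turns
end

sample = <<~TEXT
  BILL CLINTON: My fellow Americans
  ANCHOR-OF-THE-DAY: Breaking news in DC
TEXT

split_dialog(sample).each { |turn| puts "#{turn[:speaker]} | #{turn[:content]}" }
```

A heuristic like this will misfire on some pages, which is why checking the output against a few transcripts from each layout period is worthwhile.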
The extracted text and associated metadata (the date, the name of the show, the link to the transcript and a unique ID) were inserted into an Elasticsearch instance for easy retrieval and querying.
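As a rough sketch of what one indexed record might look like, the snippet below builds a document per dialog line and serializes a batch in the newline-delimited JSON layout expected by Elasticsearch's bulk endpoint. Everything named here is an assumption rather than the original script's behavior: the index name `cnn_transcripts`, the field names, the helpers `build_document` and `bulk_payload`, and the choice of a SHA-1 of the transcript URL plus line position as the unique ID. The sample values are made up.

```ruby
require "json"
require "digest"

# Hypothetical index name; the post does not say which one was used.
INDEX_NAME = "cnn_transcripts"

# Builds one document per dialog line.
# Assumption: the unique ID is derived from the transcript URL and the line position,
# so re-running the script overwrites documents instead of duplicating them.
def build_document(date:, show:, url:, position:, speaker:, content:)
  {
    id: Digest::SHA1.hexdigest("#{url}:#{position}"),
    date: date.to_s,
    show: show,
    url: url,
    speaker: speaker,
    content: content
  }
end

# Serializes documents as bulk NDJSON: an action line followed by the document body.
def bulk_payload(documents, index: INDEX_NAME)
  lines = documents.flat_map do |doc|
    body = doc.reject { |key, _| key == :id }
    [
      JSON.generate({ index: { _index: index, _id: doc[:id] } }),
      JSON.generate(body)
    ]
  end
  lines.join("\n") + "\n" # the bulk body must end with a newline
end

doc = build_document(
  date: "2000-01-01",
  show: "Example Show",
  url: "http://transcripts.cnn.com/example.html",
  position: 0,
  speaker: "BILL CLINTON",
  content: "My fellow Americans"
)
puts bulk_payload([doc])
```

The resulting string could then be sent in a POST request to the `_bulk` endpoint with the `application/x-ndjson` content type, either through an HTTP client or through an Elasticsearch client library.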
The next articles will analyze this dataset. I’ve made the scripts available on GitHub here. The dataset itself is not available, even though it contains data readily available from CNN’s website. The data doesn’t belong to me, so I really can’t make that decision for them.