A while back, I saw that the Internet Archive hosted an archive of CNN
transcripts from 2000 onward. The first thing that came to my mind was
that this was an amazing corpus to study. It contained the last 12 years
of news in textual form in one place. I felt that it would be an amazing
project to retrieve all the transcripts from 2000 to today, and someone
had already gone to the trouble of downloading this corpus.
Unfortunately, the data was basically a dump of the transcript
pages from CNN. This isn’t
a problem for archival purposes, but for analysis, it would make things
a bit difficult. For my new project, it meant that I would need to find
a way to download all the transcripts from CNN, parse them and dump them
to a database. To make things even more difficult, the HTML from the
early 2000s was more about form than function. In other words, the CNN
webmasters (in 2000, web designers and developers didn't exist, they
were webmasters!) would throw together something that rendered in
Internet Explorer or Netscape Navigator and call it a day. No effort
went into keeping the layout and content organized.
That meant that this part of the project would be boring as hell since I
would need to write a parser, find some heuristics to make sure that the
parsing was done correctly and then store the result in a database.
Since I wasn't building a business or a process around this project, and
since it would pretty much be a one-off, I went with the path of least
resistance and scripted something specifically for this content source.
The script is written in Ruby since it's a language, like Python, that
has a lot of libraries and is pretty easy to learn. The script is split
into two files: the first one is the script itself, while the other one
is a configuration file that describes what parsing needs to be done to
extract the relevant information. The configuration file declares a
bunch of global constants, but also has a list of different parsing
options that are invoked depending on the transcript's date, like so:
[sourcecode language="plain" title="formats.yml"]
- date: [2000-04-19, 2000-04-20]
- start_date: 2000-01-01 # inclusive
  end_date: 2000-03-15 # exclusive
[/sourcecode]
The configuration was very laborious to build because every time the
script failed, I had to look through the source code of the page to find
out how the layout had changed. The good news is that since mid-2005,
CNN has kept pretty much the same page format, which made the parsing
much easier.
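
To illustrate the date-based lookup described above, here is a minimal sketch of how an entry from formats.yml could be picked for a given transcript date. It is not the code from my script: the `selector` key, the `FORMATS` constant and the `format_for` helper are hypothetical names used only for this example, and the sample configuration simply mirrors the two shapes of entry shown in the excerpt.

```ruby
require 'date'
require 'yaml'

# Hypothetical configuration in the same shape as the formats.yml excerpt.
# The `selector` key is an invented stand-in for the real parsing options.
FORMATS_YAML = <<~YAML
  - date: [2000-04-19, 2000-04-20]
    selector: "table td"
  - start_date: 2000-01-01 # inclusive
    end_date: 2000-03-15   # exclusive
    selector: "body"
YAML

# Unquoted YYYY-MM-DD values are loaded as Date objects, so Date has to be
# explicitly permitted when using safe_load.
FORMATS = YAML.safe_load(FORMATS_YAML, permitted_classes: [Date])

# Returns the first entry that applies to `date`, or nil if none does.
def format_for(date, formats = FORMATS)
  formats.find do |entry|
    if entry['date']
      # Entry lists explicit dates.
      entry['date'].include?(date)
    else
      # Entry covers a half-open range: start inclusive, end exclusive.
      date >= entry['start_date'] && date < entry['end_date']
    end
  end
end
```

Calling `format_for(Date.new(2000, 4, 19))` would return the first entry, while a date not covered by any entry returns `nil`, which is a convenient signal that a new format has to be described in the configuration.
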
Once a page had been retrieved and its text extracted using the
appropriate configuration parameters, the script extracted the dialog
lines so that the speaker and the content were separated like so:
||My fellow Americans….
||Breaking news in DC –
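
The speaker/content separation can be sketched roughly as follows. This is only an illustration under one assumption that I'm stating explicitly: that a new turn starts with an upper-case speaker label followed by a colon (for example `JOHN DOE, CNN ANCHOR: ...`). The `SPEAKER_LINE` pattern and the `split_dialog` helper are hypothetical names for this example, and the real pages needed more heuristics than this.

```ruby
# Assumed convention: a turn starts with an upper-case speaker label
# followed by a colon, e.g. "JOHN DOE, CNN ANCHOR: Good evening."
SPEAKER_LINE = /\A([A-Z][A-Z0-9 .,'()\-]*?):\s*(.*)\z/

# Splits raw transcript text into an array of { speaker:, content: } hashes.
def split_dialog(text)
  turns = []
  text.each_line do |raw|
    line = raw.strip
    next if line.empty?

    if (match = SPEAKER_LINE.match(line))
      # New turn: label before the first colon, content after it.
      turns << { speaker: match[1], content: match[2] }
    elsif turns.any?
      # A line without a speaker label continues the previous turn.
      turns.last[:content] = "#{turns.last[:content]} #{line}".strip
    else
      # Text before any speaker label is kept with an unknown speaker.
      turns << { speaker: nil, content: line }
    end
  end
  turns
end
```

Because the label has to be entirely upper-case, a sentence such as "We begin at 10:30" is treated as a continuation of the previous speaker and not as a new turn.
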
The extracted text and its associated metadata (the date, the name of
the show, the link to the transcript and a unique ID) were inserted into
an ElasticSearch instance for easy retrieval and querying.
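
As a rough sketch of that last step, the snippet below builds one document per dialog line and prepares the HTTP request that would index it. Everything specific here is an assumption made for the example: the field names, the way the unique ID is derived (a SHA-1 of the transcript URL and the line's position), the `transcripts` index name and the `http://localhost:9200` address. The `/_doc/` path follows the document API of recent Elasticsearch versions, which is not necessarily what my script talked to.

```ruby
require 'digest'
require 'json'
require 'net/http'
require 'uri'

# Builds the document for one dialog line. Field names are hypothetical.
def build_document(url:, show:, date:, speaker:, content:, position:)
  {
    # Deterministic ID: re-running the import overwrites instead of duplicating.
    'id'      => Digest::SHA1.hexdigest("#{url}:#{position}"),
    'date'    => date.to_s,
    'show'    => show,
    'url'     => url,
    'speaker' => speaker,
    'content' => content
  }
end

# Prepares (but does not send) a PUT request that indexes the document.
# The base URL and index name are assumptions for this sketch.
def index_request(doc, base_url: 'http://localhost:9200', index: 'transcripts')
  uri = URI.join(base_url, "/#{index}/_doc/#{doc['id']}")
  request = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(doc)
  request
end
```

Sending the request is then a matter of something like `Net::HTTP.start(request.uri.hostname, request.uri.port) { |http| http.request(request) }`. Keeping the request construction separate makes it easy to check what would be sent without having a running instance.
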
The next articles will analyze this dataset. I’ve made the scripts
available on GitHub. The dataset itself is not available even though it
contains data readily available from CNN’s website. The data doesn’t
belong to me, so I really can’t make that decision for them.