This page is currently not much more than an extended advertisement for doing content analysis in Python. In time it might expand to a full tutorial, should anyone express interest in reading one. In the meantime it'll hopefully just whet your appetite.
The scripts presented here are not intended to teach programming; I assume you have at least a vague idea about that already. Nor are they intended to exemplify fine coding style. The point is to show how easy things can be, if you pick the right tools. Now, to business...
Very often we want to see a keyword in context. With this script we will be able to type:
python kwic1.py idtext.txt identity 3
and get back all the instances of the word 'identity' in the document idtext.txt with three words of context either side. idtext.txt is a short text taken from the front page of the old Identity Project homepage at Harvard. We'll look at the script in a moment.
Output arrives in the console, and looks like this:
The concept of [identity] seems to be
is, broadly speaking, [identity] discourse; that is,
how to define [identity;] nor is there
and scope of [identity;] nor is there
for evidence that [identity] indeed affects knowledge,
agreement on how [identity] affects these components
how to treat [identity] as a variable.
conceptualize and study [identity.] We prefer to
this way: If [identity] is a key
develop conceptualizations of [identity] and, more importantly,
technologies for observing [identity] and identity change
observing identity and [identity] change that will
techniques for analyzing [identity] have consisted of
researchers to approach [identity] research with a
methods of analyzing [identity] as an independent
The code to generate this is reproduced below. We'll go through it line by line, since there aren't very many.
import sys, re

# command line arguments
filename = sys.argv[1]
target = sys.argv[2]
window = int( sys.argv[3] )

# read the whole file into one string
a = open(filename)
text = a.read()
a.close()

tokens = text.split() # split on whitespace
keyword = re.compile(target, re.IGNORECASE)

for index in range( len(tokens) ):
    if keyword.match( tokens[index] ):
        # clamp the context window to the ends of the token list
        start = max(0, index-window)
        finish = min(len(tokens), index+window+1)
        lhs = " ".join( tokens[start:index] )
        rhs = " ".join( tokens[index+1:finish] )
        print("%s [%s] %s" % (lhs, tokens[index], rhs))
First, we import some modules that provide useful functions. Next we get the command line arguments. (Any text after '#' is a comment.) sys.argv is a list containing everything on the command line. Thus sys.argv[0], which we ignore, is the script name (computers count from zero), sys.argv[1] is the filename, sys.argv[2] is the keyword, and sys.argv[3] is the context window size. Command line arguments always arrive as strings, so we convert sys.argv[3] to an integer with int().
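A slightly more defensive way to read the same three arguments might look like the sketch below. The function name parse_args and the usage message are illustrative inventions, not part of kwic1.py.

```python
def parse_args(argv):
    """Return (filename, target, window) from an argv-style list.

    argv[0] is the script name, so the real arguments are argv[1:4].
    """
    if len(argv) != 4:
        raise SystemExit("usage: python kwic1.py FILE KEYWORD WINDOW")
    filename, target = argv[1], argv[2]
    window = int(argv[3])  # arguments arrive as strings
    return filename, target, window

# Simulate the command line from the text rather than reading sys.argv.
print(parse_args(["kwic1.py", "idtext.txt", "identity", "3"]))
```

In a real script you would pass sys.argv to the function instead of the hand-written list.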
Having got the relevant information, we open the file and read its contents into the variable text. Next we split the text into words using the string's split method. split assumes that words are anything separated by whitespace. This won't work generally, since punctuation stays attached to the neighbouring word, but it'll do for now.
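To see why whitespace splitting is only a rough approximation, here is a small sketch comparing it with one possible alternative: pulling out runs of word characters with a regular expression. The sample sentence and the helper name words are made up for illustration, and the alternative has limits of its own (it would break "don't" into two pieces).

```python
import re

text = "We prefer to study identity. Is identity a variable?"

# Whitespace splitting keeps punctuation glued to the words.
print(text.split())

def words(s):
    """Return lowercased runs of letters, digits and underscores."""
    return re.findall(r"\w+", s.lower())

print(words(text))
```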
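We could simply look for exact copies of the keyword, but often a looser match is more useful: if we only require each word to begin with the keyword, trailing bits of punctuation won't spoil our match. Also, we don't care about case. To make this all happen we compile a regular expression from the target, ignoring case, and later test each word with its match method, which looks for the pattern at the start of the word.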
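The difference between matching at the start of a word and matching anywhere inside it can be seen in a short sketch. The token list is made up for illustration; match and search are the standard methods of a compiled pattern.

```python
import re

keyword = re.compile("identity", re.IGNORECASE)

results = {}
for token in ["identity;", "Identity.", "(identity)", "non-identity"]:
    at_start = bool(keyword.match(token))   # pattern must begin at position 0
    anywhere = bool(keyword.search(token))  # pattern may begin at any position
    results[token] = (at_start, anywhere)
    print(token, at_start, anywhere)
```

Trailing punctuation and capitals are harmless with match, but a leading bracket or prefix is enough to hide the word. Switching to search would catch those cases, at the price of also matching inside longer words.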
Finally we walk through the list of words, looking for our matches. If keyword matches the element at the current index, we want to print out the matching word, surrounded by its context. We compute the start and finish indices of the context explicitly to ensure we don't ask for a negative index or one past the end of our list. We then construct the left and right hand sides of the concordance, and print out the result using a simple template.
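The same loop can be wrapped in a function, which makes the window arithmetic easy to try on a short string. This is a sketch rather than a replacement for the script; the function name kwic and the sample sentence are made up.

```python
import re

def kwic(tokens, target, window):
    """Return a list of 'left [match] right' concordance lines."""
    keyword = re.compile(target, re.IGNORECASE)
    lines = []
    for index, token in enumerate(tokens):
        if keyword.match(token):
            start = max(0, index - window)                 # never below zero
            finish = min(len(tokens), index + window + 1)  # never past the end
            lhs = " ".join(tokens[start:index])
            rhs = " ".join(tokens[index + 1:finish])
            lines.append("%s [%s] %s" % (lhs, token, rhs))
    return lines

sample = "Identity matters because the concept of identity seems to be contested".split()
for line in kwic(sample, "identity", 3):
    print(line)
```

The max() is the important guard: a negative start index in a slice counts back from the end of the list, which would silently produce the wrong context. The min() is belt and braces, since a slice that runs past the end simply stops there.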
There are no doubt hundreds of ways to improve and extend this script, but it does what it is meant to. So, on to more interesting tasks.
The heart of most content analyses is a dictionary that assigns words to categories. In its simplest form a dictionary is just a set of words under different headings, e.g. egdict.txt. In this file each line starting with '>>' gives the name of a category, and every word beneath the category name is a category member. Simple, but adequate for basic dictionary-based content analysis.
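The contents of egdict.txt are not reproduced here, so the sketch below uses a made-up dictionary in the same '>>' format and shows one way of turning it into category word lists. The entries and the function name read_categories are assumptions for illustration only.

```python
# A made-up dictionary in the '>>' format; NOT the real egdict.txt.
EXAMPLE = """\
>>science
theor
method
>>self
identity
"""

def read_categories(source):
    """Map each category name to the list of words under its heading."""
    current = "Default"            # words before any heading land here
    categories = {current: []}
    for line in source.splitlines():
        line = line.strip()
        if line.startswith(">>"):
            current = line[2:].strip()
            categories.setdefault(current, [])
        elif line:                 # skip blank lines
            categories[current].append(line)
    return categories

print(read_categories(EXAMPLE))
```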
We'd like to be able to read in this dictionary file, and analyse a document with it by saying:
python dict1.py egdict.txt idtext.txt
dict1.py is the script, egdict.txt is the dictionary, and idtext.txt is the text. From this line we get the number of times words from each dictionary category appeared in the text.
Default : 0
science : 13
self : 7
group : 7
The Default category here holds any dictionary words that don't appear under a heading. The code for this is shown below:
import sys, re

# command line arguments
dictfile = sys.argv[1]
textfile = sys.argv[2]

# read the text and split it into words
a = open(textfile)
text = a.read().split()
a.close()

# read the dictionary file line by line
a = open(dictfile)
lines = a.readlines()
a.close()

dic = {}     # maps compiled word patterns to category names
scores = {}  # maps category names to counts

# a default category for simple word lists
current_category = "Default"
scores[current_category] = 0

# inhale the dictionary
for line in lines:
    if line[0:2] == '>>':
        current_category = line[2:].strip()
        scores[current_category] = 0
    else:
        line = line.strip()
        if len(line) > 0:
            pattern = re.compile(line, re.IGNORECASE)
            dic[pattern] = current_category

# examine the text
for token in text:
    for pattern in dic.keys():
        if pattern.match( token ):
            categ = dic[pattern]
            scores[categ] = scores[categ] + 1

for key in scores.keys():
    print("%s : %s" % (key, scores[key]))
Once again, we'll take it from the top. Much should be familiar from the previous script. We import some useful stuff, parse the command line arguments, and read in the text. Then we read in the dictionary file. This time we use readlines() rather than read() because we want to process it line by line.
Next we set up some data structures to represent the content dictionary. We shall make use of two hashtables (called dictionaries in python), dic and scores.
For those who have not met a hashtable before, it is a mapping from keys to values. Given a key, a hashtable returns the single object that is associated with it. Hashtables lie at the heart of most scripting languages such as perl and python.
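For anyone who would like to see one in action, here is a minimal sketch of the basic operations on a python dictionary. The keys and numbers are arbitrary.

```python
scores = {}                      # an empty dictionary
scores["science"] = 0            # store a value under a key
scores["science"] = scores["science"] + 1
scores["self"] = 7

print(scores["science"])         # look a value up by its key
print("group" in scores)         # test whether a key is present
print(scores.get("group", 0))    # look up with a fallback value
```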
The first hashtable, dic, will be a mapping from word patterns to category names. The second, scores, will map category names to the number of times a member of that category has been recognized in the text. The first thing to do is to initialize a working category name, here the default category, and set its count to zero. Then we start reading the dictionary file.
We work through the lines in the dictionary file, checking to see if we've met another category name (beginning with '>>'). If we haven't, and the line isn't blank, then we compile the current line into pattern (so we can do case-insensitive matching against the start of each word), and add it as a key to dic. The value that this key will retrieve is set to the working category name. When we meet another category name, we switch the working category name to that, and carry on filling dic.
With the most important hashtable constructed, we can run through the text computing frequency statistics. Each time we see a word we check which, if any, of dic's keys match it. Whenever a key matches, we find out which category dic maps the key to. We then add one to the count in scores indexed by the category's name. Finally, we cycle through the keys of scores (the category names), and print out their values.
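The counting step on its own can be sketched as a function over a ready-made dic. The patterns, categories and sample sentence below are hypothetical stand-ins for a parsed dictionary file and a real text.

```python
import re

# Hypothetical patterns and categories, standing in for a parsed dictionary.
dic = {
    re.compile("theor", re.IGNORECASE): "science",
    re.compile("method", re.IGNORECASE): "science",
    re.compile("identit", re.IGNORECASE): "self",
}

def score(tokens, dic):
    """Count, per category, how many (token, pattern) pairs match."""
    scores = {category: 0 for category in dic.values()}
    for token in tokens:
        for pattern, category in dic.items():
            if pattern.match(token):
                scores[category] += 1
    return scores

tokens = "Theories of identity need methods; identities need theory.".split()
print(score(tokens, dic))
```

Note that the inner loop does not stop at the first hit, so a word that matches patterns in two categories is counted in both. Whether that is what you want depends on the dictionary.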
There is certainly more to dictionary-based content analysis than this, but there's only so much we can show in a few lines of code. And there's certainly more to python than this, e.g. functions, modules, classes, and some great built-in libraries; we just didn't need them.
If these simple scripts have tempted you to try this at home, then you'll want to know how to install python, learn more of the language, and make use of the many excellent libraries available.
If you run Mac OS X, python is already installed. If not, you can download the latest version from the python homepage.
Naturally, everything mentioned above is free.
The python homepage has a tutorial and lots of documentation. Although we have made no use of it here, python has an interactive interpreter; just type python at your system prompt and do some exploring.
I found the best book for learning python is Mark Lutz and David Ascher's Learning Python, published by O'Reilly. (Avoid the similarly titled but much larger Programming Python by Mark Lutz.)
It's quite possible, and potentially rather fun, to roll your own text processing code in python. The language does a lot for you already, from downloading pages from websites to processing xml and dealing with databases. However, some things move faster with a good targeted library.
Many useful libraries are linked from the python homepage. Of particular relevance to text processing applications is the Natural Language Toolkit (NLTK). NLTK implements a wide range of models from the natural language processing literature. If this aspect of content analysis interests you, you may want to have Manning and Schütze's classic but very readable text Foundations of Statistical Natural Language Processing to hand.
Happy programming.