When you have a huge collection of documents, you can no longer read all of them yourself. You want to know what all these documents are about, and find out at a glance.

Once you have your documents up in DocumentCloud, you can make them useful in all sorts of ways. All text is extracted and downloadable, so it's the perfect set-up for a visualization!

We're going to need some serious tools here: requests, a full scientific Python stack, Gensim, and t-SNE.

import json, csv, requests, urllib, os.path
from itertools import islice, count
from gensim.models.word2vec import Word2Vec
import string
from glob import glob
import numpy as np
from tsne import bh_sne
from collections import Counter

In this post, I'll be visualizing documents that were made public in the Netherlands as a result of Freedom of Information requests. These documents were collected from government websites by OpenState.

project = '19706-wob-besluiten'
limit = None

Downloading all documents as plain text

For convenience, I'm downloading all the documents, but it should be possible to do this kind of visualization while keeping everything online.

def get_docs(project):
    """ Get metadata for all documents for this project """
    cloud_url = ("http://www.documentcloud.org/api/search.json"
            "?q=projectid:{project}&per_page=1000&page={page}")

    for i in count(start=1):
        url = cloud_url.format(page=i, project=project)
        documents_json = requests.get(url)
        wobs = json.loads(documents_json.text)
        if wobs['documents']:
            for doc in wobs['documents']:
                yield doc
        else:
            break

docs = list(get_docs(project))
print 'there are %s documents in this collection.' % len(docs)

def get_fname(doc):
    return '{proj}/{id}.txt'.format(proj=project, id=doc['id'])
docs = [(doc, get_fname(doc)) for doc in docs if not os.path.isfile(get_fname(doc))]
print 'downloading %s new documents.' % len(docs)

for doc, fname in islice(docs, limit):
    urllib.urlretrieve(doc['resources']['text'], fname)
    print 'downloaded', doc['id']
there are 11561 documents in this collection.
downloading 0 new documents.

Create word vectors

To capture the meaning of these documents, I'm building a word vector space. It creates a vector - which is nothing but a list of numbers - for every word that occurs more than 5 times in total. The vectors encode what kind of surroundings a word has. If two words, like 'chair' and 'sofa', occur with the same words (like 'sit' or 'rest'), their vectors will be similar.

%%time
# map every digit to '#' so that all numbers collapse into the same tokens
no_numbers = string.maketrans("0123456789", "##########")
def tokenize(s):
    """ Strip punctuation, mask digits, lowercase and split on whitespace """
    return s.translate(no_numbers, string.punctuation).lower().split()

fglob = '19706-wob-besluiten/*.txt'
def get_docs(n):
    """ Tokenize the first n downloaded documents (this replaces the API get_docs above) """
    return (tokenize(open(fname).read()) for fname in glob(fglob)[:n])

model = Word2Vec()
model.build_vocab(get_docs(limit))
model.train(get_docs(limit))
CPU times: user 5min 12s, sys: 8.15 s, total: 5min 21s
Wall time: 5min 32s

model.most_similar('kosten')
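`most_similar` returns the vocabulary words whose vectors have the highest cosine similarity to the query word's vector. Here is a minimal sketch of that ranking with hand-made toy 3-d vectors (the words and numbers are invented for illustration, not taken from the trained model):

```python
import numpy as np

# hypothetical word vectors; in reality these come from the trained model
vectors = {
    'kosten':     np.array([0.9, 0.1, 0.0]),
    'uitgaven':   np.array([0.8, 0.2, 0.1]),  # similar context -> similar vector
    'vergoeding': np.array([0.7, 0.3, 0.0]),
    'fiets':      np.array([0.0, 0.1, 0.9]),  # unrelated word
}

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    q = vectors[word] / np.linalg.norm(vectors[word])
    scores = {w: float(np.dot(q, v / np.linalg.norm(v)))
              for w, v in vectors.items() if w != word}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar('kosten'))
```

The related words come out on top, and 'fiets' ends up nowhere near.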

Create document vectors and make them 2D

Now we can add all the vectors of words in a document together, to get a document vector. That will make similar documents have similar vectors! Then we can reduce the size of those number lists (the vector dimensionality) down to 2, in a way that keeps nice properties like putting similar documents close together.

def doc_vecs(docs, model):
    """ Average the word vectors of each document into one document vector """
    for doc in docs:
        total, vec = 0, np.zeros( (model.layer1_size,) )
        for word in doc:
            if word in model:
                vec += model[word]
                total += 1
        if total:
            vec /= total
        yield np.array(vec, dtype='float64')

%%time
X = np.vstack(list(doc_vecs(get_docs(limit), model)))
X_2d = bh_sne(X)
X_2d_norm = X_2d/np.max(X_2d)
CPU times: user 4min 57s, sys: 5.75 s, total: 5min 3s
Wall time: 5min 12s
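To see why averaging makes similar documents get similar vectors: documents that share most of their words average out to nearby points. A toy check of the same idea as `doc_vecs` above, with hypothetical 2-d word vectors:

```python
import numpy as np

# hypothetical word vectors (2-d so the geometry is easy to see)
wv = {'chair': np.array([1.0, 0.0]),
      'sofa':  np.array([0.9, 0.1]),
      'bike':  np.array([0.0, 1.0])}

def doc_vec(words):
    """Average the vectors of the known words, like doc_vecs does."""
    return np.mean([wv[w] for w in words if w in wv], axis=0)

a = doc_vec(['chair', 'sofa'])           # a furniture document
b = doc_vec(['sofa', 'chair', 'chair'])  # another furniture document
c = doc_vec(['bike'])                    # an unrelated document

# the two furniture documents land closer to each other than to the bike one
print(np.linalg.norm(a - b) < np.linalg.norm(a - c))
```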

Extract some interesting words from each document

To get an idea of what a document is about, I thought it would be neat to look at some of the words that make it special. The words I choose here are the ones that also occur in its most similar documents.

%%time
index2doc = dict(enumerate(glob(fglob)))
doc2index = {v:k for k,v in index2doc.iteritems()}

def neighbors(fname):
    nearest = np.argsort(-np.dot(X, X[doc2index[fname]]))
    return set([fname]+[index2doc[i] for i in nearest[:2]])

def wordlist(fname, model):
    """ Find the most important word for this document and its neighbors """
    counters = [ Counter(tokenize(open(f).read())) for f in neighbors(fname) ]
    counts = reduce(Counter.__add__, counters)
    intersection = reduce(Counter.__and__, counters)
    for word in counts.keys():
        if not (word in model.vocab and word in intersection and len(word)>1):
            counts.pop(word)
        else:
            counts[word] /= float(model.vocab[word].count)        
    return counts.most_common(10)

fnames = glob(fglob)[:X.shape[0]]
important_words = {f: wordlist(f, model) for f in fnames}
CPU times: user 3min 45s, sys: 7.97 s, total: 3min 53s
Wall time: 3min 56s
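The Counter arithmetic in `wordlist` does two different things: `+` sums the counts across the neighborhood, while `&` keeps only the words present in every document (with their minimum count). A small standalone demo, using made-up word lists:

```python
from collections import Counter
from functools import reduce  # reduce is a builtin in Python 2, as used above

counters = [Counter(['wob', 'besluit', 'kosten']),
            Counter(['wob', 'kosten', 'kosten']),
            Counter(['wob', 'fiets'])]

counts = reduce(Counter.__add__, counters)        # total occurrences
intersection = reduce(Counter.__and__, counters)  # words present in every doc

print(counts['kosten'])       # -> 3: total across all three documents
print(sorted(intersection))   # -> ['wob']: the only word in every document
```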

Make a plot

Finally, some JavaScript magic plots the documents in an interactive diagram!

from IPython.display import Javascript
import json

def visual_data():
    for fname, (x,y) in zip(glob(fglob),X_2d_norm):
        words = [{'word':w, 'size':c} for w,c in important_words[fname]]
        title = fname.split('/')[-1].split('.')[0]
        # round the numbers to prevent js floating point errors
        yield {'x':round(x,5), 'y':round(y,5), 'title':title, 'words':words}
Javascript('window.data = ' + json.dumps( list(visual_data()) ))
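Each record handed to the JavaScript side is just a small dict that serializes cleanly to JSON. A sketch of what one serialized record looks like (the title and word here are made up, not taken from the real collection):

```python
import json

record = {
    'x': round(0.123456789, 5),   # rounded, like in visual_data
    'y': round(-0.987654321, 5),
    'title': '12345-example-besluit',  # hypothetical document id
    'words': [{'word': 'kosten', 'size': 0.002}],
}
print(json.dumps(record, sort_keys=True))
```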