Example: Biclustering text documents

This example is a slightly modified version of the scikit-learn example on biclustering documents with the spectral co-clustering algorithm.

It clusters documents and words from the 20 newsgroups dataset using the spectral co-clustering algorithm. The dataset comprises around 10,000 newsgroups posts on 20 topics (the noisy comp.os.ms-windows.misc group is excluded below), and the resulting document-word biclusters indicate subsets of words that are used more often in certain subsets of documents.
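To make the co-clustering idea concrete before running it on the full corpus, here is a minimal sketch on a hand-made matrix with an obvious block structure (the matrix, its values, and the variable names are purely illustrative; note that in older scikit-learn releases SpectralCoclustering is imported from sklearn.cluster.bicluster, as in the cell below):

import numpy as np
from sklearn.cluster import SpectralCoclustering

# Tiny matrix with an obvious block structure (values are made up):
# rows 0-2 only use columns 0-2, rows 3-5 only use columns 3-5.
X_toy = np.array([[5, 4, 5, 0, 0, 0],
                  [4, 5, 4, 0, 0, 0],
                  [5, 5, 5, 0, 0, 0],
                  [0, 0, 0, 4, 5, 4],
                  [0, 0, 0, 5, 4, 5],
                  [0, 0, 0, 4, 4, 5]])

toy_model = SpectralCoclustering(n_clusters=2, random_state=0)
toy_model.fit(X_toy)
print(toy_model.row_labels_)     # e.g. [0 0 0 1 1 1]: rows grouped by block
print(toy_model.column_labels_)  # columns co-clustered with the rows that use them

The newsgroups posts play the role of the rows and the vocabulary terms the role of the columns; each bicluster is then a group of documents together with the words they share.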

In [6]:
%matplotlib inline
from __future__ import print_function

from collections import defaultdict
import operator
import re
from time import time

import numpy as np

from sklearn.cluster.bicluster import SpectralCoclustering
from sklearn.cluster import MiniBatchKMeans
from sklearn.externals.six import iteritems
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.cluster import v_measure_score


def number_aware_tokenizer(doc):
    """ Tokenizer that maps all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    token_pattern = re.compile(r'(?u)\b\w\w+\b')
    tokens = token_pattern.findall(doc)
    tokens = ["#NUMBER" if token[0] in "0123456789_" else token
              for token in tokens]
    return tokens

# exclude 'comp.os.ms-windows.misc'
categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos',
              'rec.motorcycles', 'rec.sport.baseball',
              'rec.sport.hockey', 'sci.crypt', 'sci.electronics',
              'sci.med', 'sci.space', 'soc.religion.christian',
              'talk.politics.guns', 'talk.politics.mideast',
              'talk.politics.misc', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target

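# peek at one of the raw posts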
newsgroups.data[20]
Out[6]:
"From: steve@hcrlgw (Steven Collins)\nSubject: Sphere from 4 points\nOrganization: Central Research Lab. Hitachi, Ltd.\nLines: 24\nNntp-Posting-Host: hcrlgw\n\n>\n>Another method is to first find the center of the circle defined by 2 sets\n>of 3 points, and intersecting the normals from there.  This would also define\n>the circle center.  However, small numerical imprecisions would make the\n>lines not intersect.  Supposedly 3 planes HAVE to intersect in a unique\n>point if they are not parallel.\n>\n\nHaving thought about this, why don't you project the 2 lines onto the 2d\nplane formed by the lines.  Do an intersection calculation in the plane in\n2D, where you're guaranteed a unique solution (unless they're parallel which\nwon't happen in this case), and then use parametric distance along the lines\nfrom the circle centres to determine the exact point of interest.  This\nbypasses the messy error propogation required to do the calculation in 3d.\n\nHope I haven't put my foot in it again!\n\nsteve\n---\n-- \n+---------------------------------------+--------------------------------+\n| Steven Collins\t\t\t| email: steve@crl.hitachi.co.jp |\n| Visiting Computer Graphics Researcher\t| phone: (0423)-23-1111 \t |\n| Hitachi Central Research Lab. Tokyo.\t| fax:   (0423)-27-7742\t\t |\n"

Documents will be represented using tf-idf (term frequency–inverse document frequency) vectorization. The result is a 10723 × 22217 sparse matrix (number of documents by number of words in the dictionary).
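As a quick illustration of the vectorization step on its own, here is a self-contained sketch on a toy corpus (the three documents and the variable names are assumed, purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats chase dogs"]

toy_vectorizer = TfidfVectorizer(stop_words='english')
X_demo = toy_vectorizer.fit_transform(toy_docs)
print(X_demo.shape)                        # (3, n_terms): one row per document, one column per term
print(sorted(toy_vectorizer.vocabulary_))  # the terms kept after stop-word removal

The real run below does the same thing, with a custom tokenizer and a minimum document frequency, on the full set of posts.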

In [10]:
vectorizer = TfidfVectorizer(stop_words='english', min_df=5,
                             tokenizer=number_aware_tokenizer)

cocluster = SpectralCoclustering(n_clusters=len(categories),
                                 svd_method='arpack', random_state=0)
X = vectorizer.fit_transform(newsgroups.data)
np.shape(X)
Out[10]:
(10723, 22217)
In [11]:
print("Coclustering...")
start_time = time()
cocluster.fit(X)
y_cocluster = cocluster.row_labels_

feature_names = vectorizer.get_feature_names()
document_names = list(newsgroups.target_names[i] for i in newsgroups.target)


def bicluster_ncut(i):
    """Normalized cut of bicluster ``i``: the weight of edges leaving the
    bicluster divided by the weight of edges inside it (lower is better)."""
    rows, cols = cocluster.get_indices(i)
    if len(rows) == 0 or len(cols) == 0:
        # an empty bicluster cannot be scored; push it to the end of the ranking
        import sys
        return sys.float_info.max
    row_complement = np.nonzero(np.logical_not(cocluster.rows_[i]))[0]
    col_complement = np.nonzero(np.logical_not(cocluster.columns_[i]))[0]
    # Note: the following is identical to X[rows[:, np.newaxis], cols].sum() but
    # much faster in scipy <= 0.16
    weight = X[rows][:, cols].sum()
    cut = (X[row_complement][:, cols].sum() +
           X[rows][:, col_complement].sum())
    return cut / weight


def most_common(d):
    """Items of a defaultdict(int) with the highest values.

    Like Counter.most_common in Python >=2.7.
    """
    return sorted(iteritems(d), key=operator.itemgetter(1), reverse=True)


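# score every bicluster by its normalized cut and keep the five lowest (best) scores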
bicluster_ncuts = list(bicluster_ncut(i)
                       for i in range(len(newsgroups.target_names)))
best_idx = np.argsort(bicluster_ncuts)[:5]

print()
print("Best biclusters:")
print("----------------")
for idx, cluster in enumerate(best_idx):
    n_rows, n_cols = cocluster.get_shape(cluster)
    cluster_docs, cluster_words = cocluster.get_indices(cluster)
    if not len(cluster_docs) or not len(cluster_words):
        continue

    # categories
    counter = defaultdict(int)
    for i in cluster_docs:
        counter[document_names[i]] += 1
    cat_string = ", ".join("{:.0f}% {}".format(float(c) / n_rows * 100, name)
                           for name, c in most_common(counter)[:3])

    # words
    out_of_cluster_docs = cocluster.row_labels_ != cluster
    out_of_cluster_docs = np.where(out_of_cluster_docs)[0]
    word_col = X[:, cluster_words]
    word_scores = np.array(word_col[cluster_docs, :].sum(axis=0) -
                           word_col[out_of_cluster_docs, :].sum(axis=0))
    word_scores = word_scores.ravel()
    important_words = list(feature_names[cluster_words[i]]
                           for i in word_scores.argsort()[:-11:-1])

    print("bicluster {} : {} documents, {} words".format(
        idx, n_rows, n_cols))
    print("categories   : {}".format(cat_string))
    print("words        : {}\n".format(', '.join(important_words)))
Coclustering...

Best biclusters:
----------------
bicluster 0 : 1995 documents, 4423 words
categories   : 23% talk.politics.guns, 19% talk.politics.misc, 15% sci.med
words        : gun, guns, geb, banks, firearms, gordon, clinton, cdt, surrender, veal

bicluster 1 : 1183 documents, 3380 words
categories   : 28% talk.politics.mideast, 26% soc.religion.christian, 25% alt.atheism
words        : god, jesus, christians, atheists, kent, morality, sin, belief, objective, resurrection

bicluster 2 : 2239 documents, 2829 words
categories   : 18% comp.sys.mac.hardware, 16% comp.sys.ibm.pc.hardware, 16% comp.graphics
words        : voltage, shipping, circuit, receiver, compression, stereo, hardware, package, processing, umass

bicluster 3 : 1769 documents, 2661 words
categories   : 26% rec.motorcycles, 23% rec.autos, 13% misc.forsale
words        : bike, car, dod, ride, motorcycle, engine, bikes, bmw, honda, helmet

bicluster 4 : 12 documents, 152 words
categories   : 100% rec.sport.hockey
words        : scorer, unassisted, reichel, semak, sweeney, kovalenko, ricci, audette, momesso, nedved