Github | App

The goal of this project was to explore automatic summarisation and understand how a subset of text representing the core “information” can be extracted from a document in an unsupervised manner.

A popular unsupervised algorithm for keyphrase extraction is TextRank, which essentially runs PageRank on a graph built to represent the text. The vertices of the graph are units of text (sentences) and the edges are undirected, weighted by some measure of lexical similarity between the vertices they connect. In the original paper the similarity between two sentences is the count of their overlapping words, normalised by the lengths of the two sentences.
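A minimal sketch of that overlap similarity (here using the log-length normalisation from the TextRank paper; the function name is mine):

```python
import math

def textrank_similarity(s1, s2):
    """Overlap similarity between two tokenised sentences:
    shared words, normalised by the log lengths of the sentences."""
    overlap = len(set(s1) & set(s2))
    denom = math.log(len(s1)) + math.log(len(s2))
    if overlap == 0 or denom == 0:
        return 0.0
    return overlap / denom

a = "python is a programming language".split()
b = "ruby is a programming language".split()
print(textrank_similarity(a, b))  # 4 shared words out of 5 each
```

These pairwise scores become the edge weights of the sentence graph that PageRank is then run over.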

This method does not require any training, as it exploits the structure of the text itself to determine its “central” meaning. However, the similarity measure has no notion of the relationships between words: e.g. python and ruby are both programming languages but would produce no overlap. Word embeddings encode the meaning of words in their vector representation, so that words that are similar in meaning are close to each other in the vector space.
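The python/ruby case can be illustrated with cosine similarity over word vectors. The 3-d vectors below are made up purely for illustration (real word2vec vectors have hundreds of dimensions), but they show the behaviour embeddings give us that word overlap cannot:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, NOT real word2vec output: related words point in
# similar directions, unrelated words do not.
python = np.array([0.9, 0.8, 0.1])
ruby   = np.array([0.8, 0.9, 0.2])
banana = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(python, ruby))    # high: related words
print(cosine_similarity(python, banana))  # low: unrelated words
```

Under a word-overlap measure both pairs would score zero; in the embedding space the programming languages are clearly closer to each other than to the fruit.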

I was interested in testing whether projecting the text into a vector space using a model like word2vec could simplify the algorithm for automatic summarisation. The following steps work nicely for news, blog articles and short essays:

  1. Using word2vec trained on the Brown corpus (via nltk), represent each sentence in the text as the average of its individual term vectors.
  2. Using cosine distance, build the matrix of pairwise distances between sentences.
  3. For each sentence, calculate a score given by the sum of its distances to every other sentence in the text.
  4. Build the final summary by ranking the sentences according to their score and keeping the top 20%.
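The steps above can be sketched as follows. A toy embedding table stands in for the real word2vec model, and the “top 20%” is read here as the most central fifth, i.e. the sentences with the lowest total distance; both are my assumptions, not the project's actual code:

```python
import numpy as np

# Toy word vectors standing in for a word2vec model trained on the
# Brown corpus; values are made up purely for illustration.
EMBEDDINGS = {
    "python":   np.array([0.9, 0.8, 0.1]),
    "ruby":     np.array([0.8, 0.9, 0.2]),
    "language": np.array([0.7, 0.7, 0.3]),
    "banana":   np.array([0.1, 0.2, 0.9]),
    "fruit":    np.array([0.2, 0.1, 0.8]),
}

def sentence_vector(tokens, embeddings):
    """Step 1: a sentence is the average of its term vectors
    (out-of-vocabulary words are skipped)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def summarise(sentences, embeddings, ratio=0.2):
    vecs = [sentence_vector(s.lower().split(), embeddings) for s in sentences]
    n = len(vecs)
    # Step 2: matrix of pairwise cosine distances.
    dist = np.array([[cosine_distance(vecs[i], vecs[j]) for j in range(n)]
                     for i in range(n)])
    # Step 3: each sentence's score is the sum of its distances to all
    # other sentences; a small total marks a sentence near the "centre".
    scores = dist.sum(axis=1)
    # Step 4: keep the most central ~20%, restored to document order.
    k = max(1, int(round(n * ratio)))
    keep = sorted(np.argsort(scores)[:k])
    return [sentences[i] for i in keep]

sentences = [
    "python is a language",
    "ruby is a language",
    "a language like python",
    "banana is a fruit",
    "fruit like banana",
]
print(summarise(sentences, EMBEDDINGS))
```

With four of the five sentences about programming languages, the single sentence selected at a 20% ratio comes from the dominant topic, which is exactly the “centrality” the method relies on.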