The goal of this project was to explore automatic summarisation and understand how a subset of text representing the core “information” can be extracted from a document in an unsupervised manner.
A popular unsupervised algorithm for keyphrase extraction is TextRank, which essentially runs PageRank on a graph designed to represent the text. In this graph, the vertices are units of text (sentences) and the edges are undirected and weighted by some measure of lexical similarity between the vertices. In the original paper, the similarity between two sentences is the count of overlapping words, normalised by the lengths of the two sentences.
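As a sketch of that similarity measure (the TextRank paper normalises by the log of each sentence's length; the tokenisation here is a deliberately naive whitespace split):

```python
import math

def overlap_similarity(s1: str, s2: str) -> float:
    """Word-overlap similarity in the spirit of TextRank:
    number of shared words, normalised by the log lengths of the
    two sentences so long sentences are not unfairly favoured."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    # Guard against log(1) = 0 producing a zero denominator.
    if len(w1) <= 1 or len(w2) <= 1:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))
```

Scores like these become the edge weights of the sentence graph, on which PageRank is then run.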
This method requires no training, as it exploits the structure of the text itself to determine its “central” meaning. However, the similarity measure has no notion of the relationships between words: for example, “python” and “ruby” are both programming languages but share no overlapping words. Word embeddings encode the meaning of words in their vector representation, so that words similar in meaning are close to each other in the vector space.
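A minimal illustration of that property, using toy three-dimensional vectors (real word2vec embeddings have hundreds of dimensions learned from a corpus; these values are made up purely to show the geometry):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: "python" and "ruby" point in
# similar directions, "banana" in a different one.
vectors = {
    "python": [0.90, 0.80, 0.10],
    "ruby":   [0.85, 0.75, 0.15],
    "banana": [0.10, 0.20, 0.90],
}
```

Here `cosine(vectors["python"], vectors["ruby"])` is far higher than `cosine(vectors["python"], vectors["banana"])`, even though the surface strings share nothing, which is exactly what word overlap cannot capture.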
I was interested in testing whether projecting the text into a vector space using a model like word2vec could simplify the algorithm for automatic summarisation. The following steps work nicely for news, blog articles and short essays: