Shakespeare classifier

Github | Code Words | YouTube

The goal of this project was to come up with an unsupervised method for splitting Shakespeare’s plays into two groups, and to compare the results with the traditional classification into comedies and tragedies.

Unsupervised document classification addresses the problem of assigning categories to documents without the use of a training set or predefined categories. This is useful for information retrieval, the basic assumption being that documents with similar content are relevant to the same queries. A similar assumption is made in literature to define genres and sub-genres: works that share specific conventions of form and content are assigned to the same genre.

There are two main steps in the analysis:

  1. Content-feature extraction: represent each play as a mixture of topics using Latent Dirichlet Allocation.
  2. Clustering: split the plays into two groups using KMeans clustering.
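
The two steps above can be sketched with scikit-learn. This is only a minimal illustration, not the project's actual code: the function name `split_plays`, the number of topics, the stop-word choice and the toy texts are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def split_plays(texts, n_topics=10, seed=0):
    """Hypothetical helper: return (topic_mix, labels) for a list of play texts."""
    # Step 1a: bag-of-words counts, dropping common English function words.
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    # Step 1b: LDA turns each play into a mixture over n_topics topics
    # (each row of topic_mix is a probability distribution).
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    topic_mix = lda.fit_transform(counts)
    # Step 2: KMeans with two clusters assigns each play a label, 0 or 1.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(topic_mix)
    return topic_mix, labels


# Toy stand-ins for full play texts (made up for illustration only).
toy_texts = [
    "love marriage fool jest wedding",
    "blood murder crown ghost revenge",
    "wedding jest love disguise fool",
    "revenge crown murder dagger blood",
]
topic_mix, labels = split_plays(toy_texts, n_topics=3)
print(topic_mix.shape, labels)
```

Clustering on the topic mixtures rather than on raw word counts keeps the feature space small, which matters when there are only a few dozen documents. The cluster ids themselves are arbitrary: which group is called 0 and which 1 carries no meaning.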

The result is a vector of labels that divides the plays into two groups. The split doesn’t capture the difference between comedies and tragedies but rather the evolution of Shakespeare’s writing style over time: it identifies an Elizabethan Shakespeare, younger and more influenced by the classics, as opposed to a Jacobean one, more mature and writing under a different monarch.
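
One simple way to compare such a label vector with the traditional classification is to measure agreement up to a swap of the cluster ids. The sketch below is an assumed approach, not necessarily the one used in the project; the helper `agreement` and the labels passed to it are invented for illustration and are not results.

```python
def agreement(cluster_labels, genre_labels):
    """Fraction of plays on which a two-way clustering matches a two-way
    genre labelling, taking the better of the two cluster-to-genre mappings."""
    if len(cluster_labels) != len(genre_labels):
        raise ValueError("label sequences must have the same length")
    genres = sorted(set(genre_labels))
    if len(genres) != 2:
        raise ValueError("expected exactly two genres")
    # Encode the genres as 0/1 so they can be compared with cluster ids.
    encoded = [genres.index(g) for g in genre_labels]
    matches = sum(int(c) == g for c, g in zip(cluster_labels, encoded))
    n = len(encoded)
    # Cluster ids are arbitrary, so also consider the swapped assignment.
    return max(matches, n - matches) / n


# Made-up labels for six hypothetical plays.
clusters = [0, 0, 1, 1, 0, 1]
genres = ["comedy", "comedy", "tragedy", "tragedy", "tragedy", "comedy"]
score = agreement(clusters, genres)
print(round(score, 3))
```

A score close to 0.5 means the clusters carry essentially no information about genre, while a score close to 1.0 means the split reproduces the comedy/tragedy division.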