Language generation

Github

The goal of this project was to build a random generator of sentences based on a Markov chain.

A Markov chain is a stochastic model describing a sequence of possibile event in which the probability of each event depends only on the state attained at the previous step. As such, a Markov chain has a set of states and a matric of transition probabilities P where pij is the probability of the chain transitioning from state i to state j.

Given a text:

  • The set of possible states is the vocabulary in the document.
  • The probability of wordi is equal to the number of occurrences of wordi in the text, divided by the length.
  • The probability of wordj following wordi is equal to the number of times wordj occur after wordi, divided by the frequency of wordi in the document.

The algorithm for the generator is:

  1. Identify sentences in the document by separating them with <eos>.
  2. Compute the empirical probability distribution of the words in the text and the transition matrix.
  3. Select the first world of the sentence according to the empirical distribution of the vocabulary in the text.
  4. At each step, select the next word with according to the conditional probability given the previous word.
  5. Repeat 4 until <eos> is selected.