NLP Week 1

Basic Text Processing

Professors Jurafsky and Manning introduce regular expressions. You can play around with regexes in the site:
An example of using regexes for parsing phone numbers was discussed.

Word tokenization:
Involves text normalization – Concept of tokens and types

Analyzing tokens in a sample text file

I downloaded a copy of “What You Want”, a short story by O’Henry as a sample text to analyze. I will be using the less, tr and sort commands in Unix throughout.

less ohenry.txt | less

Now we want to extract all the words in this text file. This is equivalent to taking every non-alphabetic character (which is why there is an -sc denoting complement) and replacing it with a new line.

tr -sc A-Za-z '\n' < ohenry.txt | sort | less

But it would be more convenient to get the count of every word rather than a sorted list of words. The ‘uniq’ command helps us in this.

tr -sc A-Za-z '\n' < ohenry.txt | sort | uniq -c

It would be rather nice to see the list sorted according to frequency. We can use -n (number in first line of file) and -r (reverse the order) for this purpose.

tr -sc A-Za-z '\n' < ohenry.txt | sort | uniq -c | sort -n -r | less

One problem we notice is that the same two words having different cases are counted as two unique words. E.g. ‘The’ is treated differently from ‘the’. We don’t want this to be happening. So we can replace all upper cases with lower cases and pass this as input to the tr command to solve the problem.

tr A-Z a-z < ohenry.txt | tr -sc a-z '\n' | sort | uniq -c | sort -n -r | less

At this stage, part of the output looks something like this:

 96 the
 78 and
 58 of
 51 i
 50 a
 49 to
 41 his
 37 you
 31 in
 30 he
 21 that
 18 it
 16 was
 16 old
 15 t
 15 s
 15 james

We notice that we are getting stray ‘s’ and ‘t’. This may be due to apostrophe s that appear in the text. E.g. He’s or Jame’s. In these cases, we need to decide if we want to replace He’s with He is and so on.
Professor Jurafsky next moves onto discuss tokenizations in Chinese, Japanese, etc. Max-match algorithm works well in segmenting Chinese text, which generally doesn’t have spaces to distinguish between different words.

Word Normalization and Stemming

  • Deleting periods – e.g. U.S.A. is the same as USA
  • Sometimes case is important – e.g. US is different from us
  • Lemmatization – am/are/is –> be
    seeing/saw/seen –> see
  • Morphology
    • Stem – e.g. automat
    • Affix – e.g. ing, ed, es and so on

We can use an algorithm such as Porter’s algorithm for stemming, which consists of a series of rules for reducing a word to its stem. I have added a screenshot here of the basic algorithm that was introduced in the course.


source: Stanford NLP course on Coursera

It’s important to observe rule 1b which says that we have to strip an ‘ing’ only if it has one or more vowels before it. Otherwise we would end up reducing non-verb words such as ‘bling’ to ‘bl’. To implement this in our sample text file, we can use the grep command, which is a command to deal with strings and regexes. We ‘ll now extract words ending with an ‘ing’.

tr -sc A-Za-z '\n' < ohenry.txt | tr A-Z a-z | grep 'ing$' | sort | uniq -c | sort -n -r |less

But we notice that we end up with words such as ‘king’ which we don’t want. So we can modify the rule to this.

tr -sc A-Za-z '\n' < ohenry.txt | tr A-Z a-z | grep '[aeiou].*ing$' | sort | uniq -c | sort -n -r |less

That is, we are adding the added condition that we want at least one vowel followed by any number of characters. But we are still inevitably left with words such as ‘nothing’ and ‘something’.

 4 something
 2 smarting
 2 nothing
 2 making
 2 everything
 2 discerning
 2 cleaning
 2 anything
 1 yearning

Sentence Segmentation

We can use a decision tree to find out if we are at the end of sentence (eos).decision tree

Source: Stanford NLP course on Coursera

Edit Distance

Defining Minimum Edit Distance

Minimum edit distance is a way of solving the problem using string similarity. Given two strings, we are interested in knowing how similar they are. This concept is used in

  • spelling correction – given say ‘bllt’, is it closer to ball, bullet,ballet, etc
  • computational biology – how well do two sequences of nucleotides align with each other.
  • machine translation
  • information extraction
  • speech recognition

The minimum edit distance is the minimum number of operations (insertion, deletion, substitution) needed to transform one string to the other.

E.g. consider the strings ‘ball’ and ‘small’
b  *  a l l
s m a l l
This transformation can be achieved using one substitution (b and s) and one insertion (m).
If each operation costs 1, the distance is 2.
If substitution costs 2, the distance is 3. This is called the Levenshtein distance.

This can be used in machine translation systems to measure the how accurately the machine translates t by comparing to human translator.

It is also used in named entity extraction. E.g. IBM Inc. is the same is IBM.

For two strings

  • X of length n
  • Y of length m

We define D(i,j) as the distance between X[1..i] and Y[i..j].
The edit distance between X and Y is thus D(n,m).

Computing Minimum Edit Distance

We use Dynamic Programming, which is based on solving problems by combining solutions of sub-problems. We ‘ll use a bottom-up approach i.e. compute D(i,j) for small i,j and use that to compute D(I,j) for larger values of i,j.

In the following snapshot, the initialization conditions hold because it is just the cost of only deletions or insertions respectively. In the recurrence relation, there are 3 possible ways to reach a cell from a previous cell – deleting the last character in X, inserting the last character of Y into X (equivalent to deleting the last char of Y from our analysis), or substituting the last chars of X and Y.

min edit distSource : Stanford NLP course on Coursera

As an example, I have drawn the matrix for converting ‘ball’ into ‘small’.


So we get the Levenshtein distance as 3. The colored blocks show one possible path to get this value. The path is important because computing the minimum distance alone is not sufficient. We need to know how we align the character to obtain this optimal distance. This is done using a method called backtrace i.e. every time we enter a cell, we remember where we came from. When we reach the end, we can trace off the path backwards to know the alignment. The pointer to a particular cell can be any of the previous 3 cells in an ‘L’ shape – down (deletion), left (insertion) or diagonal (substitution).
backtraceSource : Stanford NLP course on Coursera

Performance of this algorithm
Time : O(n*m)
Space : O(n*m)
Backtrace : O(n+m) (worst case – n deletions and m insertions)

Weighted Edit Distance

We mat need to use weighted edit distances in

  • Spell correction : some letters are likely to get more mistyped than others. e.g. a with e, a with s (closer on keyboard), etc
  • Biology : certain kinds of deletions and insertions are more likely than others

weighted editSource : Stanford NLP course on Coursera

Minimum Edit Distance in Computational Biology

In computational biology, we look at maximizing similarity. Here we penalize deletions and insertions and reward matching characters with a substitution cost. Therefore, the best distance will be the maximum similarity rather than the minimum distance that we had been working with till now.

compu bio editSource : Stanford NLP course on Coursera

But we may have a few variants to the basic algorithm. Such cases arise when

  • Head of one sequence aligns with the tail of another. So we don’t want to penalize the gaps at the end. Our initialization case will now become, F(i,0)=0 and F(0,j)=0. And F opt  = max { max F(i,N), max (M,j) }. This is because only the ends would match, not the entire sequence.
  • Only parts of one sequence may match with parts of another sequence. The algorithm used in this case is the Smith-Waterman algorithm. Initialization conditions are same as above (i.e 0). But finally while maximizing, we also add a 0, because we want to start over once we get a certain number of mismatches that result in a negative score. A snapshot of the algorithm is given below.smith waterman
    Source : Stanford NLP course on Coursera

Assignment 1 : Building a bot to extract email ids and phone numbers

For the 1st assignment, we had to build be a bot using regexes that correctly extracts the email id and phone numbers from a list of expressions that don’t directly quote the them. My code for the same can be viewed here.

This completes week 1 of the course! I can’t wait to get started on the next one!

Previous                                                                                                                       Next

%d bloggers like this: