Readability measurements

In this section, I look at some of the readability measurements that I have used to analyze texts. Readability measurements give us a sense of how readable a text is and what type of audience it can cater to.

SMOG Index

The SMOG index estimates the number of years of education a reader needs in order to understand a text. It is most suitable for English, and it depends chiefly on the complexity of the words in the text: words of three or more syllables (called polysyllables) count as complex words. There is a simple formula for calculating the SMOG index. The steps are outlined below:

  1. Count the number of sentences in the sample (at least 30).
  2. Count the number of complex (polysyllabic) words in those sentences.
  3. Scale the polysyllable count to a 30-sentence sample, take the square root of that count, and add 3.

This gives you the index; a quick numeric check follows.
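
To make the arithmetic concrete, here is the formula applied to made-up counts (the 30 sentences and 42 polysyllables below are hypothetical numbers, purely for illustration):

import math

# Hypothetical sample: 30 sentences containing 42 polysyllabic words
sentence_count = 30.0
polysyllable_count = 42.0

# Scale to a 30-sentence sample, take the square root, add 3
smog = math.sqrt(polysyllable_count * (30 / sentence_count)) + 3
print(smog)   # ~9.48, i.e. roughly a ninth- or tenth-grade reading level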

To implement this in code, I used a Python package called 'readability'. Install it using

$ pip install https://github.com/andreasvc/readability/tarball/master

I also had to install a dependency called 'ucto', a tokenizer that does all the necessary preprocessing: separating punctuation marks, splitting text into sentences, and so on. Install it using

$ sudo apt-get install ucto
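
Once installed, you can try ucto from the command line to see what the preprocessed input looks like. The flags here mirror the subprocess call in 'smog.py' below: as I understand them, -L en selects English, -n emits one sentence per line, and -s sets the end-of-sentence marker.

$ ucto -L en -n -s "''" <inputfilename>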

I next extracted the code for calculating the SMOG index. The repository I referred to is https://github.com/mmautner/readability.

Download the 'utils.py' and 'syllables_en.py' files from the repository. 'utils.py' contains the helper functions for getting character counts, sentence counts, syllable counts, and so on, which the various readability formulas rely on. 'syllables_en.py' is a fallback syllable counter based on the algorithm in Greg Fast's Perl module Lingua::EN::Syllable. It bases its syllable counting on the vowel groups in the word, with the help of two pattern lists, fallback_subsyl and fallback_addsyl, as well as a dictionary, fallback_cache. A list of special-case words is initially added to fallback_cache; if the input word is one of these, the count is immediately retrieved and returned. Otherwise, the count starts from the number of vowel groups, and is then incremented for each match against fallback_addsyl and decremented for each match against fallback_subsyl.
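
To make the idea concrete, here is a minimal sketch of such a fallback counter. The pattern lists and special-case words below are illustrative stand-ins, not the actual contents of 'syllables_en.py', which carries much longer lists:

import re

# Illustrative stand-ins for the real lists in syllables_en.py
fallback_subsyl = ['cial', 'tia', 'cius', 'gui', 'ion']     # patterns that cause overcounting
fallback_addsyl = ['ia', 'riet', 'dien', 'iu', 'io', 'ii']  # patterns that cause undercounting
fallback_cache = {'said': 1, 'does': 1}                     # seeded with special-case words

def count_syllables_fallback(word):
    word = word.lower()
    # Special-case words come straight from the cache
    if word in fallback_cache:
        return fallback_cache[word]
    # Start from the number of vowel groups in the word...
    count = len(re.findall(r'[aeiouy]+', word))
    # ...then adjust for patterns known to over- or under-count
    for pattern in fallback_addsyl:
        if re.search(pattern, word):
            count += 1
    for pattern in fallback_subsyl:
        if re.search(pattern, word):
            count -= 1
    fallback_cache[word] = count
    return max(count, 1)

print(count_syllables_fallback('readability'))   # 5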

Also note that you may have to download the 'punkt' models from the nltk package for the code to run. From a Python shell:

>>> import nltk
>>> nltk.download('punkt')

The code for ‘smog.py’ is given below.

import math
import sys, getopt, os, subprocess

from utils import get_words
from utils import get_sentences
from utils import count_complex_words

class Smog:
    analyzedVars = {}

    def __init__(self, text):
        self.analyze_text(text)
    
    def analyze_text(self, text):
        words = get_words(text)
        word_count = len(words)
        sentence_count = len(get_sentences(text))
        complexwords_count = count_complex_words(text)
 
        self.analyzedVars = {
            'word_cnt': float(word_count),
            'sentence_cnt': float(sentence_count),
            'complex_word_cnt': float(complexwords_count),
        }

    def SMOGIndex(self):
        score = 0.0
        if self.analyzedVars['word_cnt'] > 0.0:
            # Scale the polysyllable count to a 30-sentence sample, then apply the formula
            score = math.sqrt(self.analyzedVars['complex_word_cnt'] * (30 / self.analyzedVars['sentence_cnt'])) + 3
        return score

if __name__ == "__main__":
    text = ""
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hi:",["ifile="])
    except getopt.GetoptError:
        print 'smog.py -i <inputfile>'
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print 'smog.py -i <inputfile>'
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
 
    # Preprocess the input file with ucto before analysis
    text = subprocess.check_output(["ucto","-L","en","-n","-s","''",inputfile])
 
    rd = Smog(text)
    print 'SMOGIndex: ', rd.SMOGIndex()

Run the code as:

$ python smog.py -i <inputfilename>

The output is the SMOG index for the text in the input file.

Flesch-Kincaid readability score

This score is more popular than the SMOG index. Strictly speaking, the formula below is the Flesch reading-ease score, where higher values mean easier text; the related Flesch-Kincaid grade-level formula converts the same counts into the level of education a reader requires in order to understand the text. The Wikipedia page is https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests.

It is calculated as

206.835 - 1.015*(total words/total sentences) - 84.6*(total syllables/total words)
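
As a quick sanity check of the formula, here it is on made-up counts (the 100 words, 5 sentences and 140 syllables are hypothetical numbers, just to show the arithmetic):

# Hypothetical passage: 100 words, 5 sentences, 140 syllables
words, sentences, syllables = 100.0, 5.0, 140.0
score = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
print(score)   # ~68.1; higher means easier, and 60-70 is roughly plain English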

I implemented this both for the entire text and on a sentence-by-sentence basis, to pick out the sentences that make a passage difficult to read. The code is below. Make sure you have utils.py in the same directory.

from utils import get_char_count
from utils import get_words
from utils import get_sentences
from utils import count_syllables
from utils import count_complex_words

if __name__=="__main__":
  
    f = open('quark.txt','r')
    text = f.read()

    # Flesch-Kincaid on the entire text
    words = get_words(text)
    syllableCount = count_syllables(text)
    sentences = get_sentences(text)
    print 'Words : ', len(words)
    print 'Syllables : ', syllableCount
    print 'Sentences : ', len(sentences)
    fk = 206.835 - 1.015*len(words)/len(sentences) - 84.6*(syllableCount)/len(words)
    print 'Flesch-Kincaid readability ease : ', format(fk,'.2f')

    # Calculating Flesch-Kincaid scores for each sentence in the text
    for s in sentences:
        w = get_words(s)
        sc = count_syllables(s)
        # For a single sentence, total words/total sentences is just len(w)
        score = 206.835 - 1.015*len(w) - 84.6*sc/len(w)
        print s, format(score, '.2f')

So any article can be retrieved from Wikipedia, and we can calculate FK scores for it to analyze complicated sentence constructs. However, you might bump into UnicodeDecodeError exceptions, since the articles may contain special symbols that the functions in utils.py cannot parse. This will have to be fixed before the program is fully functional, and it is the focus of the next topic; a quick stopgap is sketched below.
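
As a stopgap until then, one option (a sketch, assuming the input file is UTF-8; the 'ignore' flag simply drops anything that cannot be mapped) is to strip the problem characters while reading the file:

# Python 2 stopgap: decode as UTF-8, then drop any non-ASCII characters
f = open('quark.txt', 'r')
text = f.read().decode('utf-8', 'ignore').encode('ascii', 'ignore')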

 
