
Textatistic

Erin Hengel

Python package to calculate the Flesch Reading Ease, Flesch-Kincaid, Gunning Fog, Simple Measure of Gobbledygook (SMOG) and Dale-Chall readability indices. Textatistic also contains functions to count the number of sentences, characters, syllables, words, words with three or more syllables and words on an expanded Dale-Chall list of easy words.

Motivation

I recently investigated whether academic journals demand clearer, more concise writing from women than they do from men. To do so, I evaluated the readability of about 10,000 abstracts published in four of the top economics journals between 1950 and 2015.

The readability scores I use in my analysis correlate with reading difficulty but they are noisy (see, e.g., Begeny and Greene, 2014 or DuBay, 2004). Compounding that fact, many programs that calculate these scores rely on unclear, inconsistent and possibly inaccurate algorithms to count words, sentences and syllables and determine whether a word is on Dale-Chall's easy word list (for a discussion, see Sirco, 2007). Moreover, certain features of the text, particularly full stops used in abbreviations and decimal points in numbers, frequently cause these programs to underestimate average words per sentence and syllables per word.

To transparently handle these issues and eliminate ambiguity, I wrote Textatistic and released it as open source software subject to an Apache 2 license.

Textatistic is very much an alpha release—use at your own risk!—and has only been tested on Python 3.4. Additionally, in the process of preparing it for distribution, I may have made errors that cause the figures in the paper that motivated Textatistic not to match up precisely with those returned by Textatistic. (Especially a risk since I changed several function and parameter names during the process.) I would greatly appreciate any feedback you have or errors you find—please let me know on GitHub.com or via email.

Installation

Install Textatistic with pip (probably as root):

$ pip install textatistic

To install from source, download the latest version on GitHub.com and run the following command (again, probably as root):

$ python setup.py install

If you are on a Mac, Python 2.7 is pre-installed by default; upgrade at python.org and replace python and pip with python3 and pip3, respectively, in the commands above. Alternatively, alter the python command link to point to the new installation and reinstall pip by issuing the following commands (assumes you've installed the latest OS; paths may be different for earlier versions):

$ cd /usr/bin
$ sudo rm -f python
$ sudo ln -s /Library/Frameworks/Python.framework/Versions/3.4/bin/python3.4 python
$ curl https://bootstrap.pypa.io/ez_setup.py | sudo python
$ curl https://bootstrap.pypa.io/get-pip.py | sudo python

Textatistic uses the PyHyphen hyphenation library which is not included in Python's original installation. Although pip installs PyHyphen automatically with Textatistic, it may omit the actual dictionaries. If your Textatistic installation was successful but you get an error that no dictionaries are installed, either uninstall PyHyphen and then reinstall it from source or manually install the dictionaries yourself (instructions).

Quickstart

Begin by importing the Textatistic module:

>>> from textatistic import Textatistic

Textatistic() returns an object containing all text statistics and readability scores, called s in the following example:

sample_text = 'There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever.'

# Create a Textatistic object.
s = Textatistic(sample_text)

Call scores or counts to return dictionaries of only the readability scores and word/syllable/character counts, respectively:

>>> s.counts
{'notdalechall_count': 4, 'polysyblword_count': 0, 'sybl_count': 73, 'word_count': 67, 'sent_count': 2, 'char_count': 265}

s.dict() returns a dictionary of all statistics. To generate the number of sentences in the sample text and the Flesch Reading Ease score:

# Return sentence count.
>>> s.sent_count 
2

# Return Flesch Reading Ease score.
>>> s.flesch_score
80.65638059701494
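Other counts work the same way; the attribute names match the keys in s.counts shown above. For example:

# Return character count.
>>> s.char_count
265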

Every statistic contained in s can be printed this way; all are listed below. The following names end in _count and return the corresponding count:

  • char_count: characters
  • notdalechall_count: words not on the expanded Dale-Chall list of easy words
  • polysyblword_count: words with three or more syllables
  • sent_count: sentences
  • sybl_count: syllables
  • word_count: words

This final list shows each readability score and its formula:

Dale-Chall score (dalechall_score)

\[\scriptsize{\begin{cases} 3.6365 & \frac{\texttt{notdalechall_count}}{\texttt{word_count}} > 0.05\\ 0 & \text{otherwise}\end{cases}+ 15.79\times\frac{\texttt{notdalechall_count}}{\texttt{word_count}}+0.0496\times\frac{\texttt{word_count}}{\texttt{sent_count}}.}\]

Flesch Reading Ease (flesch_score)

\[\scriptsize{206.835 - 1.015\times\frac{\texttt{word_count}}{\texttt{sent_count}} - 84.6\times\frac{\texttt{sybl_count}}{\texttt{word_count}}.}\]

Flesch-Kincaid (fleschkincaid_score)

\[\scriptsize{-15.59+0.39\times\frac{\texttt{word_count}}{\texttt{sent_count}}+11.8\times\frac{\texttt{sybl_count}}{\texttt{word_count}}.}\]

Gunning Fog

\[\scriptsize{0.4\times\left(\frac{\texttt{word_count}}{\texttt{sent_count}}+100\times\frac{\texttt{polysyblword_count}}{\texttt{word_count}}\right).}\]

SMOG

\[\scriptsize{3.1291+1.0430\times\sqrt{30\times\frac{\texttt{polysyblword_count}}{\texttt{sent_count}}}.}\]
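To make the formulas concrete, here is a plain-Python check of my own (ordinary arithmetic, not a call into Textatistic) that plugs in the counts reported for the sample text above. The Dale-Chall, Flesch and Flesch-Kincaid results agree with the scores Textatistic returns elsewhere on this page; the Gunning Fog and SMOG values are simply what the formulas give for this sample.

# Plain-Python check of the formulas above, using the counts from s.counts.
word_count, sent_count, sybl_count = 67, 2, 73
polysyblword_count, notdalechall_count = 0, 4

dalechall = ((3.6365 if notdalechall_count / word_count > 0.05 else 0)
             + 15.79 * notdalechall_count / word_count
             + 0.0496 * word_count / sent_count)
flesch = 206.835 - 1.015 * word_count / sent_count - 84.6 * sybl_count / word_count
fleschkincaid = -15.59 + 0.39 * word_count / sent_count + 11.8 * sybl_count / word_count
gunningfog = 0.4 * (word_count / sent_count + 100 * polysyblword_count / word_count)
smog = 3.1291 + 1.0430 * (30 * polysyblword_count / sent_count) ** 0.5

print(round(dalechall, 4))      # 6.2408, matching dalechall_score below
print(round(flesch, 4))         # 80.6564, matching s.flesch_score above
print(round(fleschkincaid, 4))  # 10.3317
print(round(gunningfog, 4))     # 13.4
print(round(smog, 4))           # 3.1291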

Calling individual functions

Instead of returning an object containing every statistic, you may call each function separately with the text you wish to analyse as the argument.

If you need more than one or two different readability scores, calling Textatistic() is most efficient.

All statistics listed in Quickstart can be called as individual functions except sybl_count and polysyblword_count; these two are returned together in a dictionary by the sybl_counts function. For example, to find just the word count, syllable counts and Dale-Chall score:

>>> import textatistic

# Word count.
>>> textatistic.word_count(sample_text)
67

# Syllable counts.
>>> textatistic.sybl_counts(sample_text)
{'polysyblword_count': 0, 'sybl_count': 73}

# Dale-Chall score.
>>> textatistic.dalechall_score(sample_text)
6.240786567164179

Additionally, the five readability functions can take actual word, syllable, etc. counts as inputs. For example, if you knew a passage of text had 35 words, three sentences and 62 syllables, pass those values in a dictionary directly to flesch_score instead of the text itself using the vars keyword argument:

>>> params = {'word_count': 35, 'sent_count': 3, 'sybl_count': 62}
>>> textatistic.flesch_score(vars=params)
45.1304761904762
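The same dictionary works with the other score functions too; the example below is mine, with the expected value worked out by hand from the Flesch-Kincaid formula listed above.

# Flesch-Kincaid from the same counts; the formula above gives roughly 9.86.
fk = textatistic.fleschkincaid_score(vars=params)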

punct_clean and word_array are additional functions that return a modified version of the original text—useful to verify that the text was appropriately processed before readability scores were calculated. punct_clean removes punctuation that interferes with accurate sentence and word counts. It performs the following operations in the order given:

  1. Replaces em, en, etc. dashes with hyphens.
  2. Removes hyphens in hyphenated single words, e.g., co-author.
  3. Removes decimals, replacing them with a plus sign (+), so decimal points are not mistaken for sentence-ending full stops.
  4. Removes punctuation used in an obvious mid-sentence rhetorical manner, e.g. "[...] (must polluters pay?)." (Kolm, 1975).
  5. Replaces abbreviations with their full text.

punct_clean does not remove typesetting code (e.g., \(\LaTeX\)). You will need to inspect your text manually and ensure it doesn't include non-sentence-ending full stops, exclamation points or question marks not otherwise handled by punct_clean. Also, explicit changes made by punct_clean result from issues identified in my own text and may not apply to yours. If you identify other text components which impede accurate calculation of readability scores, please do let me know!
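As a quick check of my own (not an example from the original documentation), a sentence containing an abbreviation and a decimal should still be counted correctly, because punct_clean replaces U.S. and neutralises the decimal point before sentences are counted:

>>> textatistic.sent_count('The U.S. economy grew 2.4 percent in 2015. Growth was slower elsewhere.')
2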

word_array returns a list of all words in the text. To generate the list, it first applies punct_clean, then replaces hyphens with spaces and finally splits the text on the remaining whitespace.

>>> textatistic.word_array(sample_text)
['There', 'were', 'a', 'king', 'with', 'a', 'large', 'jaw', 'and', 'a', 'queen', 'with', 'a', 'plain', 'face', 'on', 'the', 'throne', 'of', 'England', 'there', 'were', 'a', 'king', 'with', 'a', 'large', 'jaw', 'and', 'a', 'queen', 'with', 'a', 'fair', 'face', 'on', 'the', 'throne', 'of', 'France', 'In', 'both', 'countries', 'it', 'was', 'clearer', 'than', 'crystal', 'to', 'the', 'lords', 'of', 'the', 'State', 'preserves', 'of', 'loaves', 'and', 'fishes', 'that', 'things', 'in', 'general', 'were', 'settled', 'for', 'ever']

Hyphenator

Textatistic counts syllables using the Python module PyHyphen, itself based on the C library libhyphen. libhyphen is based on \(\TeX\)'s hyphenation algorithm and is used in most open source text processing software, including OpenOffice. By default, Textatistic uses PyHyphen's American English hyphenator. To change the locale, manually import PyHyphen's Hyphenator class:

>>> from hyphen import Hyphenator

Next, create a new Hyphenator instance with the desired locale/language. Then, for any Textatistic function, simply indicate you want to use a different hyphenator with the hyphen argument:

# British English hyphenator.
>>> gb = Hyphenator('en_GB')
>>> textatistic.fleschkincaid_score(sample_text, hyphen=gb)
10.155597014925375

# Default hyphenator (American English).
>>> textatistic.fleschkincaid_score(sample_text)
10.331716417910451
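Although only the module-level functions are shown above, the Textatistic() object from Quickstart should accept the same hyphen keyword (an assumption on my part; check the source if in doubt), reproducing the British-English score:

# Assumes Textatistic() takes the same hyphen keyword as the functions above.
>>> s_gb = Textatistic(sample_text, hyphen=gb)
>>> s_gb.fleschkincaid_score
10.155597014925375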

PyHyphen automatically installs American and British English dictionaries and a dictionary of your local language. See the documentation for instructions on downloading and installing others.

Readability tests were created for and tested on American English. The components which determine sentence complexity in one language may be very different to those in another—thus reducing (or eliminating) scores' accuracy.

Abbreviations

To determine sentence count, Textatistic replaces common abbreviations with their full text—otherwise their periods skew sentence counts. The file abbreviations.txt, stored in the package contents, maintains an explicit list of all such text replacements.

The list is specific to my own work and may be irrelevant and/or incomplete for yours. If you are using Textatistic for serious analysis, scrutinise each replacement carefully and decide which elements to remove from or add to the list.

To view the contents of that file, import the Abbreviations class from Textatistic, create an Abbreviations object and list its contents:

>>> from textatistic import Abbreviations
>>> abbreviations = Abbreviations()
>>> abbreviations.list
[['i.e.', 'id est'], ['i. e.', 'id est'], ['e.g.', 'exempli gratia'], ['e. g.', 'exempli gratia'], ['i.i.d.', 'independently and identically distributed'], ['et al.', 'et alii'], ['etc.', 'etcetera'], ['St.', 'Saint'], ['U.S.', 'United States'], ['U. S.', 'United States'], ['U.K.', 'United Kingdom'], ['U. K.', 'United Kingdom'], ['U.N.', 'United Nations'], ['U. N.', 'United Nations'], ['Roe v. Wade', 'Roe versus Wade'], ['Inc.', 'Incorporated'], ['Sec.', 'Section'], ['Vol.', 'Volume'], ['cf.', 'confer'], [' pp.', ' pages'], [' ff.', ' folio'], ['Dr.', 'Doctor'], ['viz.', 'videlicet']]

Each element in the outermost list, e.g., ['e.g.', 'exempli gratia'], indicates a single abbreviation and its replacement. The first element in the inner list ('e.g.' in the example) is the abbreviation; the second element ('exempli gratia') will replace it.

Abbreviations are replaced with their entire text only if those abbreviations are marked with full stops. Thus, U.S. becomes United States but US does not. If you are using Textatistic in a relative analysis of text samples and they all use one or the other, this should be fine; if, on the other hand, your text samples sometimes use U.S. and other times US, add US to the list of replacements or manually change the samples to make them uniform.

To add abbreviations, simply provide them as a list of lists using the append keyword argument, making sure to follow the order described above: the first element of each inner list is the text to be replaced, the second the text that replaces it. As when using your own hyphenator, pass the new Abbreviations object to Textatistic functions, this time with the abbr keyword argument. The following example adds two pointless substitutions:

>>> adds = [['queen', 'princess'], ['jaw', 'elephant']]
>>> abbreviations = Abbreviations(append=adds)
>>> textatistic.word_array(sample_text, abbr=abbreviations)
['There', 'were', 'a', 'king', 'with', 'a', 'large', 'elephant', 'and', 'a', 'princess', 'with', 'a', 'plain', 'face', 'on', 'the', 'throne', 'of', 'England', 'there', 'were', 'a', 'king', 'with', 'a', 'large', 'elephant', 'and', 'a', 'princess', 'with', 'a', 'fair', 'face', 'on', 'the', 'throne', 'of', 'France', 'In', 'both', 'countries', 'it', 'was', 'clearer', 'than', 'crystal', 'to', 'the', 'lords', 'of', 'the', 'State', 'preserves', 'of', 'loaves', 'and', 'fishes', 'that', 'things', 'in', 'general', 'were', 'settled', 'for', 'ever']

Should you wish to modify an existing text substitution (for example, to replace e.g. with eg instead of 'exempli gratia'), use the modify keyword; to remove a substitution, use remove:

>>> mods = [['e.g.', 'eg']]
>>> dels = [['U. K.', 'United Kingdom']]
>>> abbreviations = Abbreviations(modify=mods, remove=dels)
>>> abbreviations.list
[['i.e.', 'id est'], ['i. e.', 'id est'], ['e.g.', 'eg'], ['e. g.', 'exempli gratia'], ['i.i.d.', 'independently and identically distributed'], ['et al.', 'et alii'], ['etc.', 'etcetera'], ['St.', 'Saint'], ['U.S.', 'United States'], ['U. S.', 'United States'], ['U.K.', 'United Kingdom'], ['U.N.', 'United Nations'], ['U. N.', 'United Nations'], ['Roe v. Wade', 'Roe versus Wade'], ['Inc.', 'Incorporated'], ['Sec.', 'Section'], ['Vol.', 'Volume'], ['cf.', 'confer'], [' pp.', ' pages'], [' ff.', ' folio'], ['Dr.', 'Doctor'], ['viz.', 'videlicet']]

When these replacements are actually made in the evaluated text, they are made in list order. Appended items go to the end of the list, but modified items keep their original position. Thus, simply appending ['e.g.', 'eg'] won't have any effect, because every e.g. will already have been replaced by 'exempli gratia'.

It is also possible to add regular expressions. Indicate them using precisely the same syntax employed by re.sub()—only make sure to preface the string with r (otherwise optional for re.sub()) and wrap the entire regular expression in quotes:

>>> regex_adds = [['r"([a-z]\.){4}"', "xxxx"]]
>>> abbreviations = Abbreviations(append=regex_adds)
>>> textatistic.word_array("An example a.b.c.d. initial.", abbr=abbreviations)
['An', 'example', 'xxxx', 'initial']

You may wish to throw out my entire abbreviation list and substitute your own. To do so, create a file with two comma-separated columns, the first of which contains the abbreviation and the second what replaces it. For example, assume I created such a file in my current working directory and called it my_abbrvs.txt. It would look something like this:

"king", "queen"
"fair", "ugly"
"face", "shoe"

To use this list instead of the default abbreviations, use the file keyword argument:

>>> abbreviations = Abbreviations(file='my_abbrvs.txt')
>>> textatistic.word_array(sample_text, abbr=abbreviations)
['There', 'were', 'a', 'queen', 'with', 'a', 'large', 'jaw', 'and', 'a', 'queen', 'with', 'a', 'plain', 'shoe', 'on', 'the', 'throne', 'of', 'England', 'there', 'were', 'a', 'queen', 'with', 'a', 'large', 'jaw', 'and', 'a', 'queen', 'with', 'a', 'ugly', 'shoe', 'on', 'the', 'throne', 'of', 'France', 'In', 'both', 'countries', 'it', 'was', 'clearer', 'than', 'crystal', 'to', 'the', 'lords', 'of', 'the', 'State', 'preserves', 'of', 'loaves', 'and', 'fishes', 'that', 'things', 'in', 'general', 'were', 'settled', 'for', 'ever']

When defining abbreviations, do so carefully. Consider the substitution [' pp.', ' pages']. Note that pp. is preceded by a space. This prevents inadvertent deletion of an actual full stop: without the space, the pp. in "This sentence ends with app." would be replaced, swallowing the sentence-ending full stop. Special care should be taken when using regular expressions, since their odd syntax may make it particularly easy to overlook such mistakes.

Dale-Chall expanded easy word list

The Dale-Chall easy word list consists of 3,000 words understood by 80 percent of fourth-grade readers (aged 9–10). Only singular nouns and verb infinitives are listed, but according to explicit instructions, the list encompasses any alternate form of these words: "eat" includes "ate", etc.

The Dale-Chall easy word list is in American English so the functions notdalechall_count and dalechall_score should not be used with text in any other language or locale.

I considered several algorithms that might identify alternate forms of words but in the end decided it would be simpler (and faster) to use a single, comprehensive list of all possible word forms from the original Dale-Chall list.

Since I couldn't find one already created, I had to make my own. To do so, I used Python's Pattern library to generate every conceivable alternate form of each word I could think of: verb tenses, comparative and superlative adjective forms, plural nouns, etc.

The resulting list had more than 14,000 words, but many were gibberish. To get rid of the nonsense, I matched the words on the expanded list against the text of 94 English novels and deleted any word that appeared in none of them.

If my Dale-Chall list omits some word (and it undoubtedly does) or you find more gibberish, please let me know!

You may, however, wish to use your own list. To do so, place each word (without quotes) on a separate line of a text file and create an EasyWords object with the file keyword argument. All Textatistic functions must explicitly reference your list with the keyword argument easy.

For example, create a file named easy_words.txt in your current working directory and insert "king" and "queen" into it (without quotes; on separate lines). Compare the count of words not on easy_words.txt to the corresponding count using the default list:

>>> from textatistic import EasyWords

# Use your own list of easy words.
>>> easy_words = EasyWords(file='easy_words.txt')
>>> textatistic.notdalechall_count(sample_text, easy=easy_words)
63

# Use default Dale-Chall expanded list of easy words.
>>> textatistic.notdalechall_count(sample_text)
4 

When speed is a concern...

Textatistic(), described in Quickstart, calculates all statistics and scores in the most efficient way possible: no function is run more than once on the same text sample. This probably isn't the case if you call each function individually, so don't do that unless you really only need one or two statistics.

And when you do, make use of the prepped keyword argument. Setting it to True indicates that the text has already been appropriately "prepared" by the punct_clean or word_array functions. (By default, prepped is False.)

prepped's purpose is to avoid repeatedly running these functions on the same text sample. For example, sent_count and char_count each invoke punct_clean unless prepped is True, so calling both runs punct_clean twice. It is faster to apply punct_clean to the text sample once yourself and pass the modified text directly to each function with prepped set to True:

>>> prepped_text = textatistic.punct_clean(sample_text)
>>> textatistic.sent_count(prepped_text, prepped=True)
2
>>> textatistic.char_count(prepped_text, prepped=True)
265

Of course, be sure you've appropriately processed the text before using the prepped argument—signalling text is prepped when it isn't can significantly screw up your analysis.

The lists below show which functions require text already processed by punct_clean and word_array, respectively.

  1. Functions that require text per punct_clean:

    • word_array
    • char_count
    • sent_count
  2. Functions that require text per word_array:

    • notdalechall_count
    • sybl_counts
    • word_count

None of the readability score functions accept prepped. Instead, construct a dictionary of count statistics and pass that to the function using the vars keyword argument described earlier. For example, if you wanted the most efficient means of calculating only the Flesch Reading Ease and Flesch-Kincaid scores:

# Prepare text.
>>> prepped_text = textatistic.punct_clean(sample_text)
>>> word_list = textatistic.word_array(prepped_text, prepped=True)

# Calculate sentence, syllable and word counts.
>>> params = {}
>>> params['sent_count'] = textatistic.sent_count(prepped_text, prepped=True)
>>> params['sybl_count'] = textatistic.sybl_counts(word_list, prepped=True)['sybl_count']
>>> params['word_count'] = textatistic.word_count(word_list, prepped=True)

# Calculate Flesch Reading Ease and Flesch-Kincaid scores.
>>> textatistic.flesch_score(vars=params)
80.65638059701494
>>> textatistic.fleschkincaid_score(vars=params)
10.331716417910451

Another way to improve speed is to remove as many regular expressions from the source code as your particular text samples will allow while still maintaining accurate counts. Make sure you haven't added any regular expressions to the list of abbreviation replacements, either.

Finally, you may save time by creating instances of PyHyphen's Hyphenator class, Abbreviations and EasyWords just once and referencing them explicitly in each function you use. In the time trials I conducted, however, this modification had very little impact.
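If you want to try it anyway, here is a minimal sketch. It assumes EasyWords() with no arguments loads the default expanded Dale-Chall list, mirroring Abbreviations(), and that 'en_US' is the locale code for the default American English hyphenator:

>>> from hyphen import Hyphenator
>>> from textatistic import Abbreviations, EasyWords

# Create each helper object once...
>>> hyphenator = Hyphenator('en_US')
>>> abbreviations = Abbreviations()
>>> easy_words = EasyWords()

# ...then reuse the same objects in every call.
>>> textatistic.flesch_score(sample_text, hyphen=hyphenator, abbr=abbreviations)
80.65638059701494
>>> textatistic.notdalechall_count(sample_text, abbr=abbreviations, easy=easy_words)
4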

License

Copyright 2015 Erin Hengel

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.