Natural Language Toolkit (NLTK)

NLTK was originally created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects. Table P-2 lists the most important NLTK modules.

Table P-2. Language processing tasks and corresponding NLTK modules with examples of functionality

Language processing task   | NLTK modules                | Functionality
Accessing corpora          | nltk.corpus                 | Standardized interfaces to corpora and lexicons
String processing          | nltk.tokenize, nltk.stem    | Tokenizers, sentence tokenizers, stemmers
Collocation discovery      | nltk.collocations           | t-test, chi-squared, point-wise mutual information
Part-of-speech tagging     | nltk.tag                    | n-gram, backoff, Brill, HMM, TnT
Classification             | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking                   | nltk.chunk                  | Regular expression, n-gram, named entity
Parsing                    | nltk.parse                  | Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation    | nltk.sem, nltk.inference    | Lambda calculus, first-order logic, model checking
Evaluation metrics         | nltk.metrics                | Precision, recall, agreement coefficients
Probability and estimation | nltk.probability            | Frequency distributions, smoothed probability distributions
Applications               | nltk.app, nltk.chat         | Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork       | nltk.toolbox                | Manipulate data in SIL Toolbox format
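As a small illustration of two of the modules listed above, the following sketch uses nltk.stem and nltk.probability. It assumes NLTK is installed (e.g. via `pip install nltk`); neither module requires downloading any corpus data.

```python
# A minimal sketch using two of the NLTK modules from Table P-2.
# Assumes NLTK is installed; no corpus downloads are needed for these modules.
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist

# nltk.stem: reduce inflected words to a common stem.
stemmer = PorterStemmer()
words = ["processing", "processes", "processed", "language"]
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['process', 'process', 'process', 'languag']

# nltk.probability: count stem frequencies with a frequency distribution.
fdist = FreqDist(stems)
print(fdist.most_common(1))  # [('process', 3)]
```

FreqDist behaves like a counter over its input, so the same idiom works directly on tokenized text as well.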

NLTK was designed with four primary goals in mind:

Simplicity
To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data.

Consistency
To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names.

Extensibility
To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task.

Modularity
To provide components that can be used independently without needing to understand the rest of the toolkit.

Contrasting with these goals are three non-requirements—potentially useful qualities that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not encyclopedic; it is a toolkit, not a system, and it will continue to evolve with the field of NLP. Second, while the toolkit is efficient enough to support meaningful tasks, it is not highly optimized for runtime performance; such optimizations often involve more complex algorithms, or implementations in lower-level programming languages such as C or C++. This would make the software less readable and more difficult to install. Third, we have tried to avoid clever programming tricks, since we believe that clear implementations are preferable to ingenious yet indecipherable ones.

Natural Language Processing is often taught within the confines of a single-semester course at the advanced undergraduate level or postgraduate level. Many instructors have found that it is difficult to cover both the theoretical and practical sides of the subject in such a short span of time. Some courses focus on theory to the exclusion of practical exercises, and deprive students of the challenge and excitement of writing programs to automatically process language. Other courses are simply designed to teach programming for linguists, and do not manage to cover any significant NLP content. NLTK was originally developed to address this problem, making it feasible to cover a substantial amount of theory and practice within a single-semester course, even if students have no prior programming experience.
