The Natural Language Toolkit (NLTK) is an open source Python library for Natural Language Processing. A free online book is available. (If you use the library for academic research, please cite the book.)

Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc.

@version: 3.2


accuracy(reference, test) Given a list of reference values and a corresponding list of test values, return the fraction of corresponding values that are equal.
add_logs(logx, logy) Given two numbers logx = log(x) and logy = log(y), return log(x+y).
alignment_error_rate(reference, hypothesis) Return the Alignment Error Rate (AER) of an alignment with respect to a “gold standard” reference alignment.
apply_features(feature_func, toks[, labeled]) Use the LazyMap class to construct a lazy list-like object that is analogous to map(feature_func, toks).
approxrand(a, b, **kwargs) Returns an approximate significance level between two lists of independently generated test values.
arity(rel) Check the arity of a relation.
bigrams(sequence, **kwargs) Return the bigrams generated from a sequence of items, as an iterator.
binary_distance(label1, label2) Simple equality test.
binary_search_file(file, key[, cache, ...]) Return the line from the file whose first word is key.
binding_ops() Binding operators
bleu(references, hypothesis[, weights, ...]) Calculate the BLEU score (Bilingual Evaluation Understudy), as described by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu.
boolean_ops() Boolean operators
bracket_parse(s) Use Tree.read(s, remove_empty_top_bracketing=True) instead.
breadth_first(tree[, children, maxdepth]) Traverse the nodes of a tree in breadth-first order.
build_opener(*handlers) Create an opener object from a list of handlers.
call_megam(args) Call the megam binary with the given arguments.
casual_tokenize(text[, preserve_case, ...]) Convenience function for wrapping the tokenizer.
choose(n, k) A fast way to calculate binomial coefficients, commonly known as nCk.
clause(reldict, relsym) Print the relation in clausal form.
config_java([bin, options, verbose]) Configure NLTK’s Java interface by telling NLTK where to find the Java binary and what extra options (if any) should be passed to Java when it is run.
config_megam([bin]) Configure NLTK’s interface to the megam maxent optimization package.
conflicts(fstruct1, fstruct2[, trace]) Return a list of the feature paths of all features which are assigned incompatible values by fstruct1 and fstruct2.
conllstr2tree(s[, chunk_types, root_label]) Return a chunk structure for a single sentence encoded in the given CoNLL 2000 style string.
conlltags2tree(sentence[, chunk_types, ...]) Convert the CoNLL IOB format to a tree.
decorator(caller) General purpose decorator factory: takes a caller function as input and returns a decorator with the same attributes.
edit_distance(s1, s2[, transpositions]) Calculate the Levenshtein edit-distance between two strings.
elementtree_indent(elem[, level]) Recursive function to indent an ElementTree._ElementInterface used for pretty printing.
equality_preds() Equality predicates
evaluate_sents(inputs, grammar, model, ...) Add the truth-in-a-model value to each semantic representation for each syntactic parse of each input sentence.
everygrams(sequence[, min_len, max_len]) Returns all possible ngrams generated from a sequence of items, as an iterator.
extract_rels(subjclass, objclass, doc[, ...]) Filter the output of semi_rel2reldict according to specified NE classes and a filler pattern.
extract_test_sentences(string[, ...]) Parses a string with one test sentence per line.
f_measure(reference, test[, alpha]) Given a set of reference values and a set of test values, return the f-measure of the test values, when compared against the reference values.
flatten(*args) Flatten a list.
getproxies() Return a dictionary of scheme -> proxy server URL mappings.
ghd(ref, hyp[, ins_cost, del_cost, ...]) Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.
guess_encoding(data) Given a byte string, attempt to decode it.
ieerstr2tree(s[, chunk_types, root_label]) Return a chunk structure containing the chunked tagged text that is encoded in the given IEER style string.
in_idle() Return True if this function is run within IDLE.
induce_pcfg(start, productions) Induce a PCFG grammar from a list of productions.
interpret_sents(inputs, grammar[, semkey, trace]) Add the semantic representation to each syntactic parse tree of each input sentence.
interval_distance(label1, label2) Krippendorff’s interval distance metric
invert_graph(graph) Inverts a directed graph.
is_rel(s) Check whether a set represents a relation (of any arity).
jaccard_distance(label1, label2) Distance metric comparing set-similarity.
line_tokenize(text[, blanklines])
load(resource_url[, format, cache, verbose, ...]) Load a given resource from the NLTK data package.
load_parser(grammar_url[, trace, parser, ...]) Load a grammar from a file, and build a parser based on that grammar.
log_likelihood(reference, test) Given a list of reference values and a corresponding list of test probability distributions, return the average log likelihood of the reference values, given the probability distributions.
map_tag(source, target, source_tag) Maps the tag from the source tagset to the target tagset.
masi_distance(label1, label2) Distance metric that takes into account partial agreement when multiple labels are assigned.
ne_chunk(tagged_tokens[, binary]) Use NLTK’s currently recommended named entity chunker to chunk the given list of tagged tokens.
ne_chunk_sents(tagged_sentences[, binary]) Use NLTK’s currently recommended named entity chunker to chunk the given list of tagged sentences, each consisting of a list of tagged tokens.
ngrams(sequence, n[, pad_left, pad_right, ...]) Return the ngrams generated from a sequence of items, as an iterator.
nonterminals(symbols) Given a string containing a list of symbol names, return a list of Nonterminals constructed from those symbols.
pad_sequence(sequence, n[, pad_left, ...]) Returns a padded sequence of items before ngram extraction.
parse_sents(inputs, grammar[, trace]) Convert input sentences into syntactic trees.
pk(ref, hyp[, k, boundary]) Compute the Pk metric for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items.
pos_tag(tokens[, tagset]) Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens.
pos_tag_sents(sentences[, tagset]) Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.
pprint(object[, stream, indent, width, depth]) Pretty-print a Python object to a stream [default is sys.stdout].
pr(data[, start, end]) Pretty print a sequence of data items
precision(reference, test) Given a set of reference values and a set of test values, return the fraction of test values that appear in the reference set.
presence(label) Higher-order function to test presence of a given label
print_string(s[, width]) Pretty print a string, breaking lines on whitespace
python_2_unicode_compatible(klass) This decorator defines __unicode__ method and fixes __repr__ and __str__ methods under Python 2.
raise_unorderable_types(ordering, a, b)
ranks_from_scores(scores[, rank_gap]) Given a sequence of (key, score) tuples, yields each key with an increasing rank, tying with previous key’s rank if the difference between their scores is less than rank_gap.
ranks_from_sequence(seq) Given a sequence, yields each element with an increasing rank, suitable for use as an argument to spearman_correlation.
re_show(regexp, string[, left, right]) Return a string with markers surrounding the matched substrings.
read_grammar(input, nonterm_parser[, ...]) Return a pair consisting of a starting category and a list of Productions.
read_logic(s[, logic_parser, encoding]) Convert a file of first-order formulas into a list of Expressions.
read_valuation(s[, encoding]) Convert a valuation string into a valuation.
recall(reference, test) Given a set of reference values and a set of test values, return the fraction of reference values that appear in the test set.
regexp_span_tokenize(s, regexp) Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.
regexp_tokenize(text, pattern[, gaps, ...]) Return a tokenized copy of text.
register_tag(cls) Decorates a class to register its JSON tag.
ribes(references, hypothesis[, alpha, beta]) Calculate the RIBES score (Rank-based Intuitive Bilingual Evaluation Score), as described by Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh and Hajime Tsukada.
root_semrep(syntree[, semkey]) Find the semantic representation at the root of a tree.
rte_classifier(trainer[, features]) Classify RTEPairs
rtuple(reldict[, lcon, rcon]) Pretty print the reldict as an rtuple.
sent_tokenize(text[, language]) Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
set2rel(s) Convert a set containing individuals (strings or numbers) into a set of unary tuples.
set_proxy(proxy[, user, password]) Set the HTTP proxy for Python to download through.
sinica_parse(s) Parse a Sinica Treebank string and return a tree.
skipgrams(sequence, n, k, **kwargs) Returns all possible skipgrams generated from a sequence of items, as an iterator.
skolemize(expression[, univ_scope, ...]) Skolemize the expression and convert to conjunctive normal form (CNF)
slice_bounds(sequence, slice_obj[, allow_step]) Given a slice, return the corresponding (start, stop) bounds, taking into account None indices and negative indices.
spearman_correlation(ranks1, ranks2) Returns the Spearman correlation coefficient for two rankings, which should be dicts or sequences of (key, rank).
str2tuple(s[, sep]) Given the string representation of a tagged token, return the corresponding tuple representation.
string_span_tokenize(s, sep) Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.
subsumes(fstruct1, fstruct2) Return True if fstruct1 subsumes fstruct2.
tagset_mapping(source, target) Retrieve the mapping dictionary between tagsets.
tagstr2tree(s[, chunk_label, root_label, ...]) Divide a string of bracketed tagged text into chunks and unchunked tokens, and produce a Tree.
tokenwrap(tokens[, separator, width]) Pretty print a list of text tokens, breaking lines on whitespace
total_ordering(cls) Class decorator that fills in missing ordering methods
transitive_closure(graph[, reflexive]) Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.
tree2conllstr(t) Return a multiline string where each line contains a word, tag and IOB tag.
tree2conlltags(t) Return a list of 3-tuples containing (word, tag, IOB-tag).
trigrams(sequence, **kwargs) Return the trigrams generated from a sequence of items, as an iterator.
tuple2str(tagged_token[, sep]) Given the tuple representation of a tagged token, return the corresponding string representation.
unify(fstruct1, fstruct2[, bindings, trace, ...]) Unify fstruct1 with fstruct2, and return the resulting feature structure.
untag(tagged_sentence) Given a tagged sentence, return an untagged version of that sentence.
usage(obj[, selfname])
windowdiff(seg1, seg2, k[, boundary, weighted]) Compute the windowdiff score for a pair of segmentations.
word_tokenize(text[, language]) Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
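The evaluation functions listed above (accuracy, precision, recall) are defined by simple counting. The following is a minimal pure-Python sketch of those definitions as worded in the summaries. It is not NLTK's implementation: the `_sketch` function names, the sample data, and the choice to return None for an empty set are assumptions made for illustration.

```python
def accuracy_sketch(reference, test):
    """Fraction of positions at which two equal-length lists agree."""
    if len(reference) != len(test):
        raise ValueError("reference and test must have the same length")
    if not reference:
        raise ValueError("cannot score empty lists")
    matches = sum(1 for r, t in zip(reference, test) if r == t)
    return matches / len(reference)


def precision_sketch(reference, test):
    """Fraction of test values that also appear in the reference set."""
    if not test:
        return None  # assumption: treated as undefined for an empty test set
    return len(reference & test) / len(test)


def recall_sketch(reference, test):
    """Fraction of reference values that also appear in the test set."""
    if not reference:
        return None  # assumption: treated as undefined for an empty reference set
    return len(reference & test) / len(reference)


# Hypothetical tagger output compared position by position.
gold_tags = ["DT", "NN", "VB", "NN"]
predicted_tags = ["DT", "NN", "NN", "NN"]
print(accuracy_sketch(gold_tags, predicted_tags))

# Hypothetical retrieval result compared as sets.
gold_set = {"cat", "dog", "fish"}
found_set = {"dog", "fish", "bird", "frog"}
print(precision_sketch(gold_set, found_set))
print(recall_sketch(gold_set, found_set))
```

Note that accuracy takes ordered lists, while precision and recall take sets, which matches the argument descriptions in the listing.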


AbstractLazySequence An abstract base class for read-only sequences whose values are computed as needed.
AffixTagger([train, model, affix_length, ...]) A tagger that chooses a token’s tag based on a leading or trailing substring of its word string.
AlignedSent(words, mots[, alignment]) Return an aligned sentence object, which encapsulates two sentences along with an Alignment between them.
Alignment A storage class for representing alignment between two sequences, s1, s2.
AnnotationTask([data, distance]) Represents an annotation task.
ApplicationExpression(function, argument) This class is used to represent two related types of logical expressions.
Assignment(domain[, assign]) A dictionary which represents an assignment of values to variables.
BigramAssocMeasures A collection of bigram association measures.
BigramCollocationFinder(word_fd, bigram_fd) A tool for the finding and ranking of bigram collocations or other association measures.
BigramTagger([train, model, backoff, ...]) A tagger that chooses a token’s tag based on its word string and on the preceding word’s tag.
BinaryMaxentFeatureEncoding(labels, mapping) A feature encoding that generates vectors containing binary joint-features.
BlanklineTokenizer() Tokenize a string, treating any sequence of blank lines as a delimiter.
BllipParser([parser_model, ...]) Interface for parsing with BLLIP Parser.
BottomUpChartParser(grammar, **parser_args) A ChartParser using a bottom-up parsing strategy.
BottomUpLeftCornerChartParser(grammar, ...) A ChartParser using a bottom-up left-corner parsing strategy.
BottomUpProbabilisticChartParser(grammar[, ...]) An abstract bottom-up parser for PCFG grammars that uses a Chart to record partial results.
Boxer([boxer_drs_interpreter, elimeq, ...]) This class is an interface to Johan Bos’s program Boxer, a wide-coverage semantic parser that produces Discourse Representation Structures (DRSs).
BrillTagger(initial_tagger, rules[, ...]) Brill’s transformational rule-based tagger.
BrillTaggerTrainer(initial_tagger, templates) A trainer for tbl taggers.
CFG(start, productions[, calculate_leftcorners]) A context-free grammar.
CRFTagger([feature_func, verbose, training_opt]) A module for POS tagging using CRFSuite
ChartParser(grammar[, strategy, trace, ...]) A generic chart parser.
ChunkParserI A processing interface for identifying non-overlapping groups in unrestricted text.
ChunkScore(**kwargs) A utility class for scoring chunk parsers.
ClassifierBasedPOSTagger([feature_detector, ...]) A classifier based part of speech tagger.
ClassifierBasedTagger([feature_detector, ...]) A sequential tagger that uses a classifier to choose the tag for each token in a sentence.
ClassifierI A processing interface for labeling tokens with a single category label (or “class”).
ConcordanceIndex(tokens[, key]) An index that can be used to look up the offset locations at which a given word occurs in a document.
ConditionalExponentialClassifier Alias for MaxentClassifier.
ConditionalFreqDist([cond_samples]) A collection of frequency distributions for a single experiment run under different conditions.
ConditionalProbDist(cfdist, ...) A conditional probability distribution modeling the experiments that were used to generate a conditional frequency distribution.
ConditionalProbDistI() A collection of probability distributions for a single experiment run under different conditions.
ConfusionMatrix(reference, test[, sort_by_count]) The confusion matrix between a list of reference values and a corresponding list of test values.
ContextIndex(tokens[, context_func, filter, key]) A bidirectional index between words and their ‘contexts’ in a text.
ContextTagger(context_to_tag[, backoff]) An abstract base class for sequential backoff taggers that choose a tag for a token based on the value of its “context”.
ContingencyMeasures(measures) Wraps NgramAssocMeasures classes such that the arguments of association measures are contingency table values rather than marginals.
CrossValidationProbDist(freqdists, bins) The cross-validation estimate for the probability distribution of the experiment used to generate a set of frequency distributions.
DRS(refs, conds[, consequent]) A Discourse Representation Structure.
DecisionTreeClassifier(label[, ...])
DefaultTagger(tag) A tagger that assigns the same tag to every token.
DependencyEvaluator(parsed_sents, gold_sents) Class for measuring labelled and unlabelled attachment score for dependency parsing.
DependencyGrammar(productions) A dependency grammar.
DependencyGraph([tree_str, cell_extractor, ...]) A container for the nodes and labelled edges of a dependency structure.
DependencyProduction(lhs, rhs) A dependency grammar production.
DictionaryConditionalProbDist(probdist_dict) An alternative ConditionalProbDist that simply wraps a dictionary of ProbDists rather than creating these from FreqDists.
DictionaryProbDist([prob_dict, log, normalize]) A probability distribution whose probabilities are directly specified by a given dictionary.
DiscourseTester(input[, reading_command, ...]) Check properties of an ongoing discourse.
DrtExpression The abstract base class that every DRT expression extends.
DrtGlueReadingCommand([semtype_file, ...])
ELEProbDist(freqdist[, bins]) The expected likelihood estimate for the probability distribution of the experiment used to generate a frequency distribution.
EarleyChartParser(grammar, **parser_args)
Expression This is the base abstract object for all logical expressions
FeatDict([features]) A feature structure that acts like a Python dictionary.
FeatList([features]) A list of feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure.
FeatStruct A mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure.
FeatStructReader([features, fdict_class, ...])
Feature(name[, default, display]) A feature identifier that’s specialized to put additional constraints, default values, etc.
FeatureBottomUpChartParser(grammar, ...)
FeatureChartParser(grammar[, strategy, ...])
FeatureEarleyChartParser(grammar, **parser_args)
FeatureIncrementalChartParser(grammar[, ...])
FeatureTopDownChartParser(grammar, **parser_args)
FreqDist([samples]) A frequency distribution for the outcomes of an experiment.
HeldoutProbDist(base_fdist, heldout_fdist[, ...]) The heldout estimate for the probability distribution of the experiment used to generate two frequency distributions.
HiddenMarkovModelTagger(symbols, states, ...) Hidden Markov model class, a generative model for labelling sequence data.
HiddenMarkovModelTrainer([states, symbols]) Algorithms for learning HMM parameters from training data.
HunposTagger(path_to_model[, path_to_bin, ...]) A class for pos tagging with HunPos.
IBMModel(sentence_aligned_corpus) Abstract base class for all IBM models
IBMModel1(sentence_aligned_corpus, iterations) Lexical translation model that ignores word order
IBMModel2(sentence_aligned_corpus, iterations) Lexical translation model that considers word order
IBMModel3(sentence_aligned_corpus, iterations) Translation model that considers how a word can be aligned to multiple words in another language
IBMModel4(sentence_aligned_corpus, ...[, ...]) Translation model that reorders output words based on their type and their distance from other related words in the output sentence
IBMModel5(sentence_aligned_corpus, ...[, ...]) Translation model that keeps track of vacant positions in the target sentence to decide where to place translated words
ISRIStemmer() ISRI Arabic stemmer based on algorithm: Arabic Stemming without a root dictionary.
ImmutableMultiParentedTree(node[, children])
ImmutableParentedTree(node[, children])
ImmutableProbabilisticTree(node[, children])
ImmutableTree(node[, children])
IncrementalBottomUpChartParser(grammar, ...)
IncrementalChartParser(grammar[, strategy, ...]) An incremental chart parser implementing Jay Earley’s parsing algorithm.
IncrementalLeftCornerChartParser(grammar, ...)
IncrementalTopDownChartParser(grammar, ...)
InsideChartParser(grammar[, beam_size, trace]) A bottom-up parser for PCFG grammars that tries edges in descending order of the inside probabilities of their trees.
JSONTaggedDecoder([encoding, object_hook, ...])
JSONTaggedEncoder([skipkeys, ensure_ascii, ...])
KneserNeyProbDist(freqdist[, bins, discount]) Kneser-Ney estimate of a probability distribution.
LancasterStemmer() Lancaster Stemmer
LaplaceProbDist(freqdist[, bins]) The Laplace estimate for the probability distribution of the experiment used to generate a frequency distribution.
LazyConcatenation(list_of_lists) A lazy sequence formed by concatenating a list of lists.
LazyEnumerate(lst) A lazy sequence whose elements are tuples, each containing a count (from zero) and a value yielded by the underlying sequence.
LazyMap(function, *lists, **config) A lazy sequence whose elements are formed by applying a given function to each element in one or more underlying lists.
LazySubsequence(source, start, stop) A subsequence produced by slicing a lazy sequence.
LazyZip(*lists) A lazy sequence whose elements are tuples, each containing the i-th element from each of the argument sequences.
LeftCornerChartParser(grammar, **parser_args)
LidstoneProbDist(freqdist, gamma[, bins]) The Lidstone estimate for the probability distribution of the experiment used to generate a frequency distribution.
LineTokenizer([blanklines]) Tokenize a string into its lines, optionally discarding blank lines.
LongestChartParser(grammar[, beam_size, trace]) A bottom-up parser for PCFG grammars that tries longer edges before shorter ones.
MLEProbDist(freqdist[, bins]) The maximum likelihood estimate for the probability distribution of the experiment used to generate a frequency distribution.
MWETokenizer([mwes, separator]) A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
MaceCommand([goal, assumptions, max_models, ...]) A MaceCommand specific to the Mace model builder.
MaltParser(parser_dirname[, model_filename, ...]) A class for dependency parsing with MaltParser.
MaxentClassifier(encoding, weights[, ...]) A maximum entropy classifier (also known as a “conditional exponential classifier”).
Model(domain, valuation) A first order model is a domain D of discourse and a valuation V.
MultiClassifierI A processing interface for labeling tokens with zero or more category labels (or “labels”).
MultiParentedTree(node[, children]) A Tree that automatically maintains parent pointers for multi-parented trees.
MutableProbDist(prob_dist, samples[, store_logs]) A mutable probdist where the probabilities may be easily modified.
NaiveBayesClassifier(label_probdist, ...) A Naive Bayes classifier.
NaiveBayesDependencyScorer() A dependency scorer built around a classifier; in this class, the classifier is a NaiveBayesClassifier.
NgramAssocMeasures An abstract class defining a collection of generic association measures.
NgramTagger(n[, train, model, backoff, ...]) A tagger that chooses a token’s tag based on its word string and on the preceding n words’ tags.
NonprojectiveDependencyParser(dependency_grammar) A non-projective, rule-based, dependency parser.
Nonterminal(symbol) A non-terminal symbol for a context free grammar.
PCFG(start, productions[, calculate_leftcorners]) A probabilistic context-free grammar.
Paice(lemmas, stems) Class for storing lemmas, stems and evaluation metrics.
ParallelProverBuilder(prover, modelbuilder) This class stores both a prover and a model builder and when either prove() or build_model() is called, then both theorem tools are run in parallel.
ParallelProverBuilderCommand(prover, ...[, ...]) This command stores both a prover and a model builder and when either prove() or build_model() is called, then both theorem tools are run in parallel.
ParentedTree(node[, children]) A Tree that automatically maintains parent pointers for single-parented trees.
ParserI A processing class for deriving trees that represent possible structures for a sequence of tokens.
PerceptronTagger([load]) Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal.
PhraseTable() In-memory store of translations for a given phrase, and the log probability of those translations
PorterStemmer() A word stemmer based on the Porter stemming algorithm.
PositiveNaiveBayesClassifier(label_probdist, ...)
ProbDistI() A probability distribution for the outcomes of an experiment.
ProbabilisticDependencyGrammar(productions, ...)
ProbabilisticMixIn(**kwargs) A mix-in class to associate probabilities with other classes (trees, rules, etc.).
ProbabilisticNonprojectiveParser() A probabilistic non-projective dependency parser.
ProbabilisticProduction(lhs, rhs, **prob) A probabilistic context free grammar production.
ProbabilisticProjectiveDependencyParser() A probabilistic, projective dependency parser.
ProbabilisticTree(node[, children])
Production(lhs, rhs) A grammar production.
ProjectiveDependencyParser(dependency_grammar) A projective, rule-based, dependency parser.
Prover9Command([goal, assumptions, timeout, ...]) A ProverCommand specific to the Prover9 prover.
PunktSentenceTokenizer([train_text, ...]) A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries.
QuadgramCollocationFinder(word_fd, ...) A tool for the finding and ranking of quadgram collocations or other association measures.
RSLPStemmer() A stemmer for Portuguese.
RTEFeatureExtractor(rtepair[, stop, lemmatize]) This builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.
RandomChartParser(grammar[, beam_size, trace]) A bottom-up parser for PCFG grammars that tries edges in random order.
RangeFeature(name[, default, display])
RecursiveDescentParser(grammar[, trace]) A simple top-down CFG parser that parses texts by recursively expanding the fringe of a Tree, and matching it against a text.
RegexpChunkParser(rules[, chunk_label, ...]) A regular expression based chunk parser.
RegexpParser(grammar[, root_label, loop, trace]) A grammar based chunk parser.
RegexpStemmer(regexp[, min]) A stemmer that uses regular expressions to identify morphological affixes.
RegexpTagger(regexps[, backoff]) Regular Expression Tagger
RegexpTokenizer(pattern[, gaps, ...]) A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
ResolutionProverCommand([goal, assumptions, ...])
SExprTokenizer([parens, strict]) A tokenizer that divides strings into s-expressions.
Senna(senna_path, operations[, encoding])
SennaChunkTagger(path[, encoding])
SennaNERTagger(path[, encoding])
SennaTagger(path[, encoding])
SequentialBackoffTagger([backoff]) An abstract base class for taggers that tag words sequentially, left to right.
ShiftReduceParser(grammar[, trace]) A simple bottom-up CFG parser that uses two operations, “shift” and “reduce”, to find a single parse for a text.
SimpleGoodTuringProbDist(freqdist[, bins]) The Simple Good-Turing estimate, which approximates the mapping from frequency to frequency of frequency as a straight line in log space, fitted by linear regression.
SklearnClassifier(estimator[, dtype, sparse]) Wrapper for scikit-learn classifiers.
SlashFeature(name[, default, display])
SnowballStemmer(language[, ignore_stopwords]) Snowball Stemmer
SpaceTokenizer Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').
StackDecoder(phrase_table, language_model) Phrase-based stack decoder for machine translation
StanfordNERTagger(*args, **kwargs) A class for Named-Entity Tagging with Stanford Tagger.
StanfordPOSTagger(*args, **kwargs) A class for pos tagging with Stanford Tagger.
StanfordTagger(model_filename[, ...]) An interface to Stanford taggers.
StanfordTokenizer([path_to_jar, encoding, ...]) Interface to the Stanford Tokenizer
StemmerI A processing interface for removing morphological affixes from words.
SteppingChartParser(grammar[, strategy, trace]) A ChartParser that allows you to step through the parsing process, adding a single edge at a time.
SteppingRecursiveDescentParser(grammar[, trace]) A RecursiveDescentParser that allows you to step through the parsing process, performing a single operation at a time.
SteppingShiftReduceParser(grammar[, trace]) A ShiftReduceParser that allows you to step through the parsing process, performing a single operation at a time.
TabTokenizer Tokenize a string using the tab character as a delimiter, which is the same as s.split('\t').
TableauProverCommand([goal, assumptions, prover])
TaggerI A processing interface for assigning a tag to each token in a list.
TestGrammar(grammar, suite[, accept, reject]) Unit tests for CFG.
Text(tokens[, name]) A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console).
TextCollection(source) A collection of texts, which can be loaded with a list of texts, or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc.
TextTilingTokenizer([w, k, ...]) Tokenize a document into topical sections using the TextTiling algorithm.
TnT([unk, Trained, N, C]) TnT - Statistical POS tagger
TokenSearcher(tokens) A class that makes it easier to use regular expressions to search over tokenized strings.
TopDownChartParser(grammar, **parser_args) A ChartParser using a top-down parsing strategy.
TransitionParser(algorithm) Class for transition based parser.
Tree(node[, children]) A Tree represents a hierarchical grouping of leaves and subtrees.
TreebankWordTokenizer The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
Trie([strings]) A Trie implementation for strings
TrigramAssocMeasures A collection of trigram association measures.
TrigramCollocationFinder(word_fd, bigram_fd, ...) A tool for the finding and ranking of trigram collocations or other association measures.
TrigramTagger([train, model, backoff, ...]) A tagger that chooses a token’s tag based on its word string and on the preceding two words’ tags.
TweetTokenizer([preserve_case, reduce_len, ...]) Tokenizer for tweets.
TypedMaxentFeatureEncoding(labels, mapping) A feature encoding that generates vectors containing integer, float and binary joint-features.
UniformProbDist(samples) A probability distribution that assigns equal probability to each sample in a given set; and a zero probability to all other samples.
UnigramTagger([train, model, backoff, ...]) Unigram Tagger
UnsortedChartParser(grammar[, beam_size, trace]) A bottom-up parser for PCFG grammars that tries edges in whatever order.
Valuation(xs) A dictionary which represents a model-theoretic Valuation of non-logical constants.
ViterbiParser(grammar[, trace]) A bottom-up PCFG parser that uses dynamic programming to find the single most likely parse for a text.
WekaClassifier(formatter, model_filename)
WhitespaceTokenizer() Tokenize a string on whitespace (space, tab, newline).
WittenBellProbDist(freqdist[, bins]) The Witten-Bell estimate of a probability distribution.
WordNetLemmatizer() WordNet Lemmatizer
WordPunctTokenizer() Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
chain chain(*iterables) --> chain object
combinations combinations(iterable, r) --> combinations object
defaultdict defaultdict(default_factory[, ...]) --> dict with default factory
deque deque([iterable[, maxlen]]) --> deque object
islice islice(iterable, [start,] stop [, step]) --> islice object
text_type alias of unicode
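Several of the probability classes listed above (MLEProbDist, LaplaceProbDist, ELEProbDist, LidstoneProbDist) estimate a distribution from the counts in a frequency distribution. The sketch below illustrates the standard Lidstone formula, P(sample) = (count + gamma) / (N + bins * gamma), of which the Laplace estimate (gamma = 1) and the expected likelihood estimate (gamma = 0.5) are special cases. It is a pure-Python illustration under stated assumptions, not NLTK's code: the function names are hypothetical, a collections.Counter stands in for FreqDist, and bins is assumed to default to the number of observed sample types.

```python
from collections import Counter


def lidstone_prob(counts, sample, gamma, bins=None):
    """Lidstone estimate: (count + gamma) / (N + bins * gamma)."""
    total = sum(counts.values())      # N: total number of observed outcomes
    if bins is None:
        bins = len(counts)            # assumption: default to observed sample types
    denominator = total + bins * gamma
    if denominator == 0:
        raise ValueError("cannot estimate from empty counts with gamma = 0")
    # Counter returns 0 for samples that were never observed.
    return (counts[sample] + gamma) / denominator


def mle_prob(counts, sample):
    """Maximum likelihood estimate: relative frequency (gamma = 0)."""
    return lidstone_prob(counts, sample, 0.0)


def laplace_prob(counts, sample, bins=None):
    """Laplace estimate: Lidstone with gamma = 1."""
    return lidstone_prob(counts, sample, 1.0, bins)


def ele_prob(counts, sample, bins=None):
    """Expected likelihood estimate: Lidstone with gamma = 0.5."""
    return lidstone_prob(counts, sample, 0.5, bins)


# Hypothetical toy corpus: "the" occurs twice, every other word once.
word_counts = Counter("the cat saw the dog".split())
print(mle_prob(word_counts, "the"))
print(laplace_prob(word_counts, "the"))
print(ele_prob(word_counts, "the"))
# Reserving a fifth bin gives an unseen word a non-zero probability.
print(laplace_prob(word_counts, "bird", bins=5))
```

The smoothed estimates move probability mass from observed samples to unobserved ones, which is why the maximum likelihood estimate assigns zero to any sample that never occurred while the others do not, provided bins exceeds the number of observed types.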


LogicalExpressionException(index, message)