This module contains classes and functions for statistics and information theory. It is imported as follows:
import pynlpl.statistics
Amongst others, the following generic statistical functions are available:
* ``mean(list)`` - Computes the mean of a given list of numbers
One of the most basic and widespread tasks in NLP is the creation of a frequency list. Counting is done by simply appending lists of tokens to the frequency list:
freqlist = pynlpl.statistics.FrequencyList()
freqlist.append(['to','be','or','not','to','be'])
Take care to append lists rather than strings, unless you mean to create a frequency list over characters rather than words. You may want to use pynlpl.textprocessors.crude_tokeniser first:
freqlist.append(pynlpl.textprocessors.crude_tokeniser("to be or not to be"))
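The difference between appending a string and appending a list of tokens can be sketched without the library. The example below uses the standard-library collections.Counter as a stand-in for a frequency list and a plain whitespace split as a crude tokeniser; both are assumptions made for illustration, not pynlpl's own implementation.

```python
from collections import Counter

text = "to be or not to be"

# Stand-in tokeniser (assumption): split on whitespace.
tokens = text.split()

# Counting over a list of tokens gives word frequencies.
word_counts = Counter(tokens)

# Counting over the raw string iterates character by character.
char_counts = Counter(text)

print(word_counts["be"])  # 2
print(char_counts["o"])   # 4
```

Passing the raw string counts letters and spaces, which is rarely what is wanted for a word frequency list.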
The count can also be incremented explicitly for a single item:
freqlist.count('shakespeare')
The FrequencyList offers dictionary-like access. For example, the following statement will be true for the frequency list just created:
freqlist['be'] == 2
Normalised counts (pseudo-probabilities) can be obtained using the p() method:
freqlist.p('be')
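As a rough illustration of what a normalised count is, the sketch below divides the count of a token by the total number of tokens. It uses collections.Counter rather than FrequencyList, and the helper name relative_frequency is hypothetical.

```python
from collections import Counter

counts = Counter(["to", "be", "or", "not", "to", "be"])
total = sum(counts.values())  # total number of tokens: 6


def relative_frequency(token):
    """Count of the token divided by the total number of tokens."""
    return counts[token] / total


print(relative_frequency("be"))  # 2 / 6
```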
Normalised counts can also be obtained by instantiating a Distribution from the frequency list:
dist = pynlpl.statistics.Distribution(freqlist)
This too offers a dictionary-like interface, where values are by definition normalised. The advantage of a Distribution class is that it offers information-theoretic methods such as entropy(), maxentropy(), perplexity() and poslog().
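To make the information-theoretic quantities concrete, here is a minimal sketch of entropy and perplexity over a normalised dictionary. The function names, the base parameter, and its default of 2 are assumptions for illustration; they are not taken from the Distribution class.

```python
import math


def normalise(counts):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def entropy(dist, base=2):
    """Shannon entropy: -sum(p * log(p)) over all types."""
    return -sum(p * math.log(p, base) for p in dist.values() if p > 0)


def perplexity(dist, base=2):
    """Perplexity is the base raised to the entropy."""
    return base ** entropy(dist, base)


dist = normalise({"to": 2, "be": 2, "or": 1, "not": 1})
print(entropy(dist))
print(perplexity(dist))
```

For a uniform distribution over four types, this gives an entropy of 2 bits and a perplexity of 4.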
A frequency list can be saved to file using the save(filename) method, and loaded back from file using the load(filename) method. The output() method is a generator yielding strings for each line of output, in ranked order.
This is a Python library containing classes for statistical and information-theoretic computations. It also contains some code from Peter Norvig, AI: A Modern Approach: http://aima.cs.berkeley.edu/python/utils.html
A distribution can be created over a FrequencyList or a plain dictionary with numeric values. It will be normalized automatically. This implementation uses dictionaries/hashing.
Compute the entropy of the distribution
Computes the information content of the specified type: -log_e(p(X))
Returns an unranked list of (type, prob) pairs. Use this only if you are not interested in the order.
Compute the maximum entropy of the distribution: log_e(N)
Returns the type that occurs the most frequently in the probability distribution
Generator yielding formatted strings expressing the type and probability for each item in the distribution
alias for information content
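The two formulas given above, -log_e(p(X)) for information content and log_e(N) for maximum entropy, can be checked against each other with a small sketch. The function names are hypothetical; only the formulas come from the text.

```python
import math


def information(p):
    """Information content of an outcome with probability p, in nats."""
    return -math.log(p)


def max_entropy(n_types):
    """Maximum entropy of a distribution over n_types types, in nats."""
    return math.log(n_types)


# A uniform distribution reaches the maximum: its entropy is the
# expected information content, which equals log_e(N).
uniform = [0.25, 0.25, 0.25, 0.25]
uniform_entropy = sum(p * information(p) for p in uniform)

print(uniform_entropy)
print(max_entropy(len(uniform)))
```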
A frequency list (implemented using dictionaries)
Add a list of tokens to the frequencylist. This method will count them for you.
Count a certain type. The counter will increase by the amount specified (defaults to one)
Returns an unranked list of (type, count) pairs. Use this only if you are not interested in the order.
Load a frequency list from file (in the format produced by the save method)
Returns the type that occurs the most frequently in the frequency list
Print a representation of the frequency list
Returns the probability (relative frequency) of the token
Save a frequency list to file, can be loaded later using the load method
Returns the total amount of tokens
Returns the total amount of tokens
Computes the type/token ratio
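The type/token ratio is the number of distinct types divided by the total number of tokens. A minimal sketch on a plain list, independent of FrequencyList:

```python
tokens = ["to", "be", "or", "not", "to", "be"]

n_tokens = len(tokens)      # every occurrence counts: 6
n_types = len(set(tokens))  # distinct words only: 4

type_token_ratio = n_types / n_tokens
print(type_token_ratio)
```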
Is state tonode directly accessible (in one step) from state fromnode, i.e. is there an edge between the nodes? If so, return the probability, else zero.
See if a node communicates (directly or indirectly) with another. Returns the probability of the shortest path (probably, but not necessarily the highest probability)
Returns the probability of the given sequence or subsequence (if subsequence=True, default).
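The relation between edge probabilities and sequence probability can be sketched as follows. This assumes a first-order Markov chain stored as a nested dictionary; the transitions table, its values, and both function names are hypothetical and do not reflect how the library stores its model.

```python
# Hypothetical edge table: transitions[a][b] is the probability of a -> b.
transitions = {
    "to": {"be": 1.0},
    "be": {"or": 0.5, "end": 0.5},
    "or": {"not": 1.0},
    "not": {"to": 1.0},
}


def accessible(fromnode, tonode):
    """Probability of the direct edge fromnode -> tonode, or 0.0 if absent."""
    return transitions.get(fromnode, {}).get(tonode, 0.0)


def sequence_probability(sequence):
    """Product of the edge probabilities along consecutive pairs.

    The sequence is treated as a subsequence: no start or end state is added.
    """
    prob = 1.0
    for fromnode, tonode in zip(sequence, sequence[1:]):
        prob *= accessible(fromnode, tonode)
    return prob


print(sequence_probability(["to", "be", "or", "not"]))  # 0.5
```

A single missing edge makes the whole product zero.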
Return the sum of the element-wise product of vectors x and y.

>>> dotproduct([1, 2, 3], [1000, 100, 10])
1230
Return a list of (value, count) pairs, summarizing the input values. Sorted by increasing value, or if mode=1, by decreasing count. If bin_function is given, map it over values first.
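The behaviour described above can be sketched with collections.Counter. The signature follows the parameter names mentioned in the description (mode, bin_function); the body is an illustrative reimplementation, not the module's code.

```python
from collections import Counter


def histogram(values, mode=0, bin_function=None):
    """Return (value, count) pairs summarising the input values."""
    if bin_function is not None:
        values = [bin_function(v) for v in values]
    counts = Counter(values)
    if mode == 1:
        # Sort by decreasing count.
        return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    # Default: sort by increasing value.
    return sorted(counts.items())


print(histogram([3, 1, 3, 2, 3, 1]))          # [(1, 2), (2, 1), (3, 3)]
print(histogram([3, 1, 3, 2, 3, 1], mode=1))  # [(3, 3), (1, 2), (2, 1)]
```

A bin_function such as lambda v: v // 10 groups values into bins of width ten before counting.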
Computes the Levenshtein distance between two strings. Adapted from: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
Base 2 logarithm.

>>> log2(1024)
10.0
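For readers unfamiliar with the measure, a common dynamic-programming formulation of the Levenshtein distance keeps only two rows of the edit matrix. The sketch below is one such formulation and is not necessarily identical to the code in this module.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string as the row

    # Distance from the empty prefix of s1 to each prefix of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            skip_s1 = previous[j] + 1      # leave c1 unmatched
            skip_s2 = current[j - 1] + 1   # leave c2 unmatched
            match = previous[j - 1] + (c1 != c2)  # match or substitute
            current.append(min(skip_s1, skip_s2, match))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```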
Return the arithmetic average of the values.
Return the middle value, when the values are sorted. If there is an even number of elements, try to average the middle two. If they can’t be averaged (e.g. they are strings), choose one at random.

>>> median([10, 100, 11])
11
>>> median([1, 2, 3, 4])
2.5
Return the most common value in the list of values.

>>> mode([1, 2, 3, 2])
2
Multiply each number by a constant such that the sum is 1.0 (or total).

>>> normalize([1,2,1])
[0.25, 0.5, 0.25]
Return the product of a sequence of numerical values.

>>> product([1,2,6])
12
The standard deviation of a set of values. Pass in the mean if you already know it.
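A sketch of a standard deviation function that accepts a precomputed mean is given below. It assumes the sample form (dividing by n - 1); the description above does not say which form the module uses, and the parameter name meanval is hypothetical.

```python
import math


def mean(values):
    """Arithmetic average of the values."""
    return sum(values) / len(values)


def stddev(values, meanval=None):
    """Sample standard deviation (n - 1 in the denominator, an assumption)."""
    if meanval is None:
        meanval = mean(values)
    squared_deviations = sum((x - meanval) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


values = [2, 4, 4, 4, 5, 5, 7, 9]
print(stddev(values))
print(stddev(values, meanval=5))  # same result, mean supplied by the caller
```

Passing the mean avoids a second pass over the data when it is already known.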
Component-wise addition of two vectors.

>>> vector_add((0, 1), (8, 9))
(8, 10)