qp Demo

Alex Malz, Phil Marshall, Eric Charles

In this notebook we use the qp module to approximate some simple, standard, 1D PDFs using sets of quantiles, samples, and histograms, and assess their relative accuracy. We also show how such analyses can be extended to use "composite" PDFs made up of mixtures of standard distributions.

Requirements

To run qp, you will need to first install the module by following the instructions here.

Background: the scipy.stats module

The scipy.stats module is the standard for manipulating distributions, so it is a natural place to start when implementing 1D PDF parameterizations.
It allows you to define a wide variety of distributions and uses numpy array broadcasting for efficiency.

Gaussian (Normal) example

Here are some examples of things you can do with the scipy.stats module, using a Gaussian or Normal distribution. loc and scale are the means and standard deviations of the underlying Gaussians.

Note the distinction between passing arguments to norm and passing arguments to pdf to access multiple distributions and their PDF values at multiple points.
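That distinction can be sketched as follows: passing arrays to norm freezes several distributions at once, and evaluating pdf on a grid of points broadcasts across all of them.

```python
import numpy as np
from scipy.stats import norm

# Three Gaussians with different means and widths, via broadcasting.
locs = np.array([[-1.0], [0.0], [1.0]])   # shape (3, 1)
scales = np.array([[0.5], [1.0], [2.0]])  # shape (3, 1)
x = np.linspace(-3.0, 3.0, 7)             # shape (7,)

# norm(loc, scale) freezes three distributions at once;
# pdf(x) then broadcasts to shape (3, 7): one row per distribution.
vals = norm(loc=locs, scale=scales).pdf(x)
print(vals.shape)  # (3, 7)
```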

The scipy.stats classes

In the scipy.stats module, all of the distributions are sub-classes of scipy.stats.rv_continuous.
You make an object of a particular sub-type, and then 'freeze' it by passing it shape parameters.
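A minimal illustration of freezing:

```python
from scipy.stats import norm

# Unfrozen: pass the shape parameters with every call.
p1 = norm.pdf(0.0, loc=1.0, scale=2.0)

# Frozen: bind loc and scale once, then call methods without them.
frozen = norm(loc=1.0, scale=2.0)
p2 = frozen.pdf(0.0)

assert abs(p1 - p2) < 1e-15
```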

Properties of distributions

scipy.stats lets you evaluate multiple properties of distributions. These include:

  1. pdf: Probability Density Function
  2. cdf: Cumulative Distribution Function
  3. ppf: Percent Point Function (Inverse of CDF)
  4. sf: Survival Function (1-CDF)
  5. isf: Inverse Survival Function (Inverse of SF)
  6. rvs: Random Variates (i.e., sampled values)
  7. stats: Return the mean, variance, and optionally the (Fisher's) skew and kurtosis
  8. moment: non-central moments of the distribution
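All of these can be sketched on a frozen standard normal:

```python
from scipy.stats import norm

dist = norm(loc=0.0, scale=1.0)

print(dist.pdf(0.0))   # density at x=0, ~0.3989
print(dist.cdf(0.0))   # 0.5
print(dist.ppf(0.5))   # inverse CDF: 0.0
print(dist.sf(0.0))    # survival function, 1 - CDF: 0.5
print(dist.isf(0.5))   # inverse survival function: 0.0

samples = dist.rvs(size=1000, random_state=42)  # random variates
mean, var = dist.stats(moments="mv")
print(mean, var)       # 0.0, 1.0
print(dist.moment(2))  # second non-central moment: 1.0
```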

qp parameterizations and visualization functionality

The next part of this notebook shows how we can extend the functionality of scipy.stats to implement distributions that are based on parameterizations of 1D PDFs, like histograms, interpolations, splines, or mixture models.

Parameterizations from scipy.stats

qp automatically generates classes for all of the scipy.stats.rv_continuous distributions, providing feed-through access to the scipy.stats.rv_continuous objects while adding attributes and methods specific to parameterization conversions.

Native plotting

If you have a single distribution, you can plot it: the qp.plotting.plot_native function will find a suitable way to represent the data used to construct the distribution.

qp histogram (piecewise constant) parameterization

This represents a set of distributions made by interpolating a set of histograms with shared binning. To construct this you need to give the bin edges (shape=(N,)) and the bin values (shape=(npdf, N-1)).
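The single-distribution analogue is scipy.stats.rv_histogram, which turns bin edges and bin contents into a full distribution object (qp's histogram class generalizes this idea to many PDFs sharing one set of edges); a minimal sketch:

```python
import numpy as np
from scipy.stats import rv_histogram

bins = np.linspace(-5.0, 5.0, 11)  # shared bin edges, shape (N,)
counts = np.histogram(np.random.default_rng(0).normal(size=10000), bins=bins)[0]

# rv_histogram normalizes the bin contents into a piecewise-constant PDF,
# with a full complement of pdf/cdf/ppf/rvs methods.
dist = rv_histogram((counts, bins))
print(dist.pdf(0.0), dist.cdf(0.0))
```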

Note that the native visual representation is different from the Normal distribution.

What if you want to evaluate a vector of input values, where each input value is different for each PDF? In that case the shape of the vector of input values needs to match the implicit shape of the PDFs, which in this case is (2, 1).

qp quantile parameterization

This represents a set of distributions made by interpolating the locations at which various distributions reach a given set of quantiles. To construct this you need to give the quantiles (shape=(N,)) and the location values (shape=(npdf, N)).
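The idea behind the quantile parameterization can be sketched with plain numpy: the CDF passes through the (location, quantile) pairs, so a finite-difference slope of the interpolated CDF recovers an approximate PDF between adjacent quantile locations.

```python
import numpy as np
from scipy.stats import norm

quants = np.linspace(0.05, 0.95, 19)          # quantiles, shape (N,)
locs = norm(0.0, 1.0).ppf(quants)             # locations where they are reached

# Slope of the piecewise-linear CDF approximates the PDF at the midpoints.
pdf_mid = np.diff(quants) / np.diff(locs)
x_mid = 0.5 * (locs[1:] + locs[:-1])
print(np.max(np.abs(pdf_mid - norm.pdf(x_mid))))  # small approximation error
```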

Note that the native visual representation is different.

qp interpolated parameterization

This represents a set of distributions made by interpolating a set of x and y values. To construct this you need to give the x and y values (both of shape=(npdf, N)).
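A single-PDF sketch of the idea, using linear interpolation renormalized to unit integral (qp's interpolated class handles the many-PDF bookkeeping):

```python
import numpy as np
from scipy.stats import norm
from scipy.interpolate import interp1d

xvals = np.linspace(-5.0, 5.0, 51)
yvals = norm.pdf(xvals)

# Linear interpolation of the (x, y) pairs, zero outside the support.
interp = interp1d(xvals, yvals, bounds_error=False, fill_value=0.0)

# Renormalize with a trapezoid rule so the result integrates to 1.
norm_const = np.sum(0.5 * (yvals[1:] + yvals[:-1]) * np.diff(xvals))
pdf = lambda x: interp(x) / norm_const
print(pdf(0.0))
```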

Note that the native visual representation is pretty similar to the original one for the Gaussian.

qp spline parameterization constructed from kernel density estimate (samples) parameterization

This represents a set of distributions made by producing a kernel density estimate from a set of samples.

To construct this you need to give the samples (shape=(npdf, Nsamples)).
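The single-PDF building block here is scipy.stats.gaussian_kde:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# A Gaussian kernel density estimate built from the samples; the
# bandwidth defaults to Scott's rule.
kde = gaussian_kde(samples)
print(kde(0.0))  # close to the true density norm.pdf(0.0) ~ 0.399
```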

Note again that the native visual representation is different.

qp spline parameterization

This represents a set of distributions made by building a set of splines. Though the parameterization is defined by the spline knots, you can construct it from x and y values (both of shape=(npdf, N)).
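The single-PDF analogue uses scipy.interpolate: splrep fits a cubic spline (returning the knots and coefficients) and splev evaluates it.

```python
import numpy as np
from scipy.stats import norm
from scipy.interpolate import splrep, splev

xvals = np.linspace(-5.0, 5.0, 51)
yvals = norm.pdf(xvals)

# Fit a cubic spline through the (x, y) values; tck holds the
# knots, coefficients, and spline degree.
tck = splrep(xvals, yvals)
print(splev(0.0, tck))  # ~0.3989
```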

Note that the native visual representation is pretty similar to the original one for the Gaussian.

Note also that the spline knots are stored.

Overplotting

You can visually compare the representations by plotting them all on the same figure.

The qp.Ensemble Class

This is the basic element of qp - an object representing a set of probability density functions. This class is stored in the module ensemble.py.

To create a qp.Ensemble you need to specify the class used to represent the PDFs, and provide the data for the specific set of PDFs.

Ensembles of distributions

qp no longer distinguishes between distributions and ensembles thereof -- a single distribution is just a special case of an ensemble with only one member, which takes advantage of computational efficiencies in scipy. The shape of the array returned by a call to the pdf function of a distribution depends on the shapes of the parameters and the evaluation points.

For distributions that take multiple input arrays, qp uses the convention that the rows are the individual distributions and the columns are the values of the parameters defining the distributions under a known parameterization.

Here we will create 100 Gaussians with means distributed between -1 and 1, and widths distributed between 0.9 and 1.1.
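Sketched with plain numpy/scipy (evenly spaced parameters are an assumption here; the notebook may draw them randomly), the parameter arrays and a broadcast evaluation look like this:

```python
import numpy as np
from scipy.stats import norm

npdf = 100
means = np.linspace(-1.0, 1.0, npdf).reshape(npdf, 1)   # one row per distribution
widths = np.linspace(0.9, 1.1, npdf).reshape(npdf, 1)

x = np.linspace(-3.0, 3.0, 50)
vals = norm(loc=means, scale=widths).pdf(x)  # broadcasts to shape (npdf, 50)
print(vals.shape)  # (100, 50)
```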

Using the ensemble

All of the methods of the distributions (pdf, cdf etc.) work the same way for an ensemble as for underlying classes.

To isolate a single distribution in the ensemble, use the square brackets operator [].

Converting the ensemble

The qp.Ensemble.convert_to function lets you convert ensembles to other representations. To do this you have to provide the original ensemble, the class you want to convert to, and any keyword arguments needed to specify details of the conversion. Here are some examples.

The qp.convert function works in more or less the same way, but with slightly different syntax: you can use the name of the class instead of the class object.

Comparing Parametrizations

qp supports quantitative comparisons between different distributions, across parametrizations.

Qualitative Comparisons: Plotting

Let's visualize the PDF object in order to compare the original and the other representations. The solid, black line shows the true PDF evaluated between the bounds. The green rugplot shows the locations of the 1000 samples we took. The vertical, dotted, blue lines show the percentiles we asked for, and the horizontal, dotted, red lines show the 10 equally spaced bins we asked for. Note that the quantiles refer to the probability distribution between the bounds, because we are not able to integrate numerically over an infinite range. Interpolations of each parametrization are given as dashed lines in their corresponding colors. Note that the interpolations of the quantile and histogram parametrizations are so close to each other that the difference is almost imperceptible!

We can also interpolate the function onto an evenly spaced grid and cache those values with the gridded function.

Quantitative Comparisons

Next, let's compare the different parametrizations to the truth using the Kullback-Leibler Divergence (KLD). The KLD is a measure of how close two probability distributions are to one another -- a smaller value indicates closer agreement. It is measured in units of bits of information, the information lost in going from the second distribution to the first distribution. The KLD calculator here takes in a shared grid upon which to evaluate the true distribution and the interpolated approximation of that distribution and returns the KLD of the approximation relative to the truth, which is not in general the same as the KLD of the truth relative to the approximation. Below, we'll calculate the KLD of the approximation relative to the truth over different ranges, showing that it increases as it includes areas where the true distribution and interpolated distributions diverge.
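A discrete sketch of such a KLD calculation on a shared grid (in bits, hence log base 2; the grid, the example distributions, and the helper name kld are illustrative, not qp's API):

```python
import numpy as np
from scipy.stats import norm

def kld(p_vals, q_vals, dx):
    """Approximate KLD of q relative to p on a grid with spacing dx, in bits."""
    # Mask zero bins to avoid log(0); both arrays are assumed normalized.
    mask = (p_vals > 0) & (q_vals > 0)
    return np.sum(p_vals[mask] * np.log2(p_vals[mask] / q_vals[mask])) * dx

x = np.linspace(-3.0, 3.0, 601)
dx = x[1] - x[0]
p = norm(0.0, 1.0).pdf(x)  # "truth"
q = norm(0.1, 1.0).pdf(x)  # approximation
print(kld(p, q, dx), kld(q, p, dx))  # asymmetric in general
```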

The progression of KLD values should follow that of the root mean square error (RMSE), another measure of how close two functions are to one another. The RMSE also increases as it includes areas where the true distribution and interpolated distribution diverge. Unlike the KLD, the RMSE is symmetric, meaning the distance measured is not that of one distribution from the other but of the symmetric distance between them.
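The RMSE counterpart can be sketched in the same way (again illustrative, not qp's metrics API):

```python
import numpy as np
from scipy.stats import norm

def rmse(p_vals, q_vals):
    """Root mean square difference between two functions on a shared grid."""
    return np.sqrt(np.mean((p_vals - q_vals) ** 2))

x = np.linspace(-3.0, 3.0, 601)
p = norm(0.0, 1.0).pdf(x)
q = norm(0.1, 1.0).pdf(x)
print(rmse(p, q))
assert np.isclose(rmse(p, q), rmse(q, p))  # symmetric by construction
```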

Both the KLD and RMSE metrics suggest that the quantile approximation is better in the high density region, but samples work better when the tails are included. We might expect the answer to the question of which approximation to use to depend on the application, and whether the tails need to be captured or not.

Storing and retrieving ensembles

You can store and retrieve ensembles from disk using the qp.Ensemble.write_to and qp.Ensemble.read_from methods.

These work in two steps: first they convert the Ensemble data to astropy.table objects, and then they write the tables. This means you can store the data in any format supported by astropy.

Here is a round-trip test showing that we get the same results before and after a write/read cycle.

Finally, we can compare the moments of each approximation and compare those to the moments of the true distribution.
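As a sketch of what such a comparison looks like (illustrative, not qp's metrics API), the sample moments of a draw can be checked against the true distribution's moments:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100000)

# True moments from the frozen distribution, sample moments from the draw.
true_mean, true_var = norm(0.0, 1.0).stats(moments="mv")
print(samples.mean(), float(true_mean))  # sample vs true mean
print(samples.var(), float(true_var))    # sample vs true variance
```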