6.8.7. statsmodels.sandbox.stats.stats_dhuard

Code from David Huard’s scipy sandbox; it was also attached to a ticket and posted to the matplotlib-user mailing list (links ???)

6.8.7.1. Notes

Interpolating outside the range of the data raises an exception; the result would not be completely defined there anyway.

>>> scoreatpercentile(x, [0,25,50,100])
Traceback (most recent call last):
...
    raise ValueError("A value in x_new is below the interpolation "
ValueError: A value in x_new is below the interpolation range.
>>> percentileofscore(x, [-50, 50])
Traceback (most recent call last):
...
    raise ValueError("A value in x_new is below the interpolation "
ValueError: A value in x_new is below the interpolation range.
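The failure mode above can be illustrated with a small pure-Python sketch. It assumes Hazen-style plotting positions (i - 0.5)/n for the empirical cdf, so the cdf grid never reaches 0 or 1, and a strict linear interpolator. `hazen_positions` and `interp_strict` are hypothetical helpers written for this illustration, not the module's functions.

```python
def hazen_positions(n):
    """Plotting positions (i - 0.5) / n for i = 1..n (assumed for illustration)."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def interp_strict(x_new, xp, fp):
    """Linear interpolation that refuses to extrapolate outside [xp[0], xp[-1]]."""
    if x_new < xp[0]:
        raise ValueError("x_new is below the interpolation range")
    if x_new > xp[-1]:
        raise ValueError("x_new is above the interpolation range")
    for k in range(len(xp) - 1):
        if xp[k] <= x_new <= xp[k + 1]:
            w = (x_new - xp[k]) / (xp[k + 1] - xp[k])
            return fp[k] + w * (fp[k + 1] - fp[k])
    return fp[-1]


data = sorted([3.0, 1.0, 4.0, 1.5, 9.0])
cdf = hazen_positions(len(data))  # roughly 0.1, 0.3, 0.5, 0.7, 0.9

# Score at the 50th percentile: 0.5 is inside the cdf grid, so this works.
median = interp_strict(0.5, cdf, data)

# Score at percentile 0: 0.0 lies below cdf[0], so interpolation fails.
try:
    interp_strict(0.0, cdf, data)
    failed = False
except ValueError:
    failed = True
```

Under these assumptions, percentiles 0 and 100 can never be interpolated, which is one way to read the first traceback.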

6.8.7.2. idea

6.8.7.2.1. histogram and empirical interpolated distribution

dual constructor

  • empirical cdf : cdf on all observations through linear interpolation
  • binned cdf : based on histogram

both should work essentially the same, although pdf of empirical has many spikes, fluctuates a lot

  • alternative: binning based on interpolated cdf : example in script
  • ppf: quantileatscore based on interpolated cdf
  • rvs : generic from ppf
  • stats, expectation ? how does integration wrt cdf work - theory?

Problems

  • limits, lower and upper bound of support: does not work or is undefined with empirical cdf and interpolation
  • extending bounds ? matlab has pareto tails for empirical distribution, breaks linearity
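A minimal sketch of the empirical-cdf variant, in pure Python. The class name `LinearEcdf`, the plotting positions (i - 1)/(n - 1), the requirement of distinct observations, and the choice to clamp outside the observed range are all assumptions made for illustration; this is not the `HistDist` implementation.

```python
import bisect
import random


class LinearEcdf:
    """Empirical distribution with a piecewise linear cdf on [min, max].

    Assumes distinct observations. The cdf grid (i - 1) / (n - 1) is one
    possible choice, picked so that cdf(min) = 0 and cdf(max) = 1.
    """

    def __init__(self, data):
        self.x = sorted(data)
        n = len(self.x)
        if n < 2 or len(set(self.x)) != n:
            raise ValueError("need at least two distinct observations")
        self.p = [i / (n - 1) for i in range(n)]

    @staticmethod
    def _interp(v, xp, fp):
        # Clamp outside the grid instead of raising.
        if v <= xp[0]:
            return fp[0]
        if v >= xp[-1]:
            return fp[-1]
        # Index k of the interval with xp[k] <= v < xp[k + 1].
        k = bisect.bisect_right(xp, v) - 1
        w = (v - xp[k]) / (xp[k + 1] - xp[k])
        return fp[k] + w * (fp[k + 1] - fp[k])

    def cdf(self, value):
        return self._interp(value, self.x, self.p)

    def ppf(self, q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        return self._interp(q, self.p, self.x)

    def rvs(self, size, seed=None):
        # Generic sampling: push uniform draws through the ppf.
        rng = random.Random(seed)
        return [self.ppf(rng.random()) for _ in range(size)]


dist = LinearEcdf([2.0, 0.0, 1.0, 4.0])
draws = dist.rvs(100, seed=0)
```

Clamping keeps cdf and ppf defined everywhere, but it puts zero density outside the sample range; tails such as the pareto tails mentioned above would need a different construction.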

6.8.7.2.2. empirical distribution with higher order interpolation

  • should work easily enough with interpolating splines
  • not piecewise linear
  • can use pareto (or other) tails
  • ppf: how do I get the inverse function of a higher order spline? Chuck: resample and fit a spline to the inverse function; this will have an approximation error in the inverse function
  • -> doesn’t work: a higher order spline doesn’t preserve monotonicity; see mailing list for response to my question
  • pmf from derivative available in spline

-> forget this and use kernel density estimator instead
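As a rough sketch of the kernel alternative: averaging Gaussian cdfs centred on the observations gives a smooth cdf that is monotone by construction, since each term is itself a cdf. The helper name `kde_cdf` and the normal-reference bandwidth rule 1.06 * std * n**(-1/5) are assumptions for illustration, not something this module provides.

```python
import math
import statistics


def kde_cdf(data, bandwidth=None):
    """Return a smooth cdf: the average of Gaussian cdfs centred on the data."""
    n = len(data)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not a tuned choice.
        bandwidth = 1.06 * statistics.stdev(data) * n ** (-0.2)

    def cdf(x):
        total = sum(
            0.5 * (1.0 + math.erf((x - xi) / (bandwidth * math.sqrt(2.0))))
            for xi in data
        )
        return total / n

    return cdf


sample = [1.2, 0.4, 2.5, 1.9, 0.8, 3.1]
F = kde_cdf(sample)
values = [F(x) for x in (-5.0, 0.0, 1.0, 2.0, 3.0, 10.0)]
```

Unlike the interpolated cdf, this estimate is defined on the whole real line, so the bounds problem does not arise; a ppf would still need numerical inversion.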

6.8.7.2.3. bootstrap/empirical distribution:

discrete distribution on real line given observations

what’s defined?

  • cdf : step function
  • pmf : points with equal weight 1/nobs
  • rvs : resampling
  • ppf : quantileatscore on sample?
  • moments : from data ?
  • expectation ? sum_{all observations x} [func(x) * pmf(x)]
  • similar for discrete distribution on real line
  • References : ?
  • what’s the point? most of it is trivial, just for the record ?
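The items listed for the bootstrap distribution can be sketched in a few lines of pure Python. The class name `BootstrapDist` and its method names are hypothetical, chosen for illustration only.

```python
import bisect
import random


class BootstrapDist:
    """Discrete distribution putting weight 1/nobs on each observation."""

    def __init__(self, data):
        self.x = sorted(data)
        self.nobs = len(self.x)

    def pmf(self, value):
        # Ties add up: k identical observations carry weight k/nobs.
        return self.x.count(value) / self.nobs

    def cdf(self, value):
        # Step function: share of observations <= value.
        return bisect.bisect_right(self.x, value) / self.nobs

    def rvs(self, size, seed=None):
        # Resampling with replacement.
        rng = random.Random(seed)
        return [rng.choice(self.x) for _ in range(size)]

    def expect(self, func):
        # Sum over all observations of func(x) * 1/nobs.
        return sum(func(v) for v in self.x) / self.nobs


boot = BootstrapDist([2.0, 1.0, 4.0, 1.0])
mean = boot.expect(lambda v: v)
second_moment = boot.expect(lambda v: v * v)
resample = boot.rvs(10, seed=1)
```

Moments then follow from `expect`, for example the variance as `second_moment - mean ** 2`.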

Created on Monday, May 03, 2010, 11:47:03 AM
Author: josef-pktd, parts based on David Huard
License: BSD

6.8.7.2.4. Functions

empiricalcdf(data[, method])          Return the empirical cdf.
percentileofscore(data, score)        Return the percentile-position of score relative to data.
scoreatpercentile(data, percentile)   Return the score at the given percentile of the data.

6.8.7.2.5. Classes

HistDist(data) Distribution with piecewise linear cdf, pdf is step function