idq.calibration

This module provides robust estimation of probability density functions. That estimation underpins calibration (required by all idq.classifiers.SupervisedClassifier and idq.classifiers.IncrementalSupervisedClassifier subclasses) and can be re-used within idq.classifiers.NaiveBayes and idq.classifiers.IncrementalNaiveBayes.

We declare a dedicated class to handle the calibration mapping: idq.classifiers.SupervisedClassifier instances retain a pointer to this object and delegate to it when they want to generate calibrated probabilities from ranks.

class idq.calibration.CalibrationMap(dataset, num_quantiles=101, num_mc=10000, num_points=101, gch_num_points=101, gch_b=0.1, cln_num_points=101, cln_b=0.1, compute=True, clean_segs=None, rate_estimation='livetime', min_livetime=86400, min_clean_livetime=43200, model_id=None, **kwargs)[source]

a helper object that manages two FixedBandwidth1DKDE objects, one for cleans and one for glitches. It builds and maintains these for you based on input samples, and can then compute quantities like likelihood ratios, efficiency, and false-alarm probability (FAP). It can even generate a full ROC curve automatically.
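
The following is a minimal, self-contained sketch of the idea (not the idq implementation): two KDEs over classifier ranks, one per class, combined into a log likelihood ratio. The Beta-distributed samples and scipy's gaussian_kde (which picks its own bandwidth) are stand-ins for the managed FixedBandwidth1DKDE pair.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    gch_ranks = rng.beta(5, 2, size=1000)  # glitches: ranks skew high
    cln_ranks = rng.beta(2, 5, size=1000)  # cleans: ranks skew low

    pdf_gch = gaussian_kde(gch_ranks)
    pdf_cln = gaussian_kde(cln_ranks)

    ranks = np.array([0.2, 0.5, 0.9])
    loglike = np.log(pdf_gch(ranks)) - np.log(pdf_cln(ranks))  # log likelihood ratio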

auto_optimize(**kwargs)[source]

delegates to the individual optimization functions for the glitch (gch) and clean (cln) KDEs

compute_loglike_cdf(num_quantiles=101, num_mc=10000)[source]

compute the CDF(loglike|rank) interpolation object via Monte Carlo sampling
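
A schematic of the Monte Carlo construction, under assumed inputs: the Gaussian scatter below stands in for the sampling uncertainty of the KDEs, and the central loglike curve is made up; the real code derives both from its fitted error models.

    import numpy as np

    rng = np.random.default_rng(1)
    ranks = np.linspace(0, 1, 101)     # rank grid
    loglike_mean = 4 * (ranks - 0.5)   # assumed central loglike curve
    sigma = 0.3                        # assumed per-rank scatter

    num_mc, num_quantiles = 10000, 101
    quantiles = np.linspace(0, 1, num_quantiles)

    # draw Monte Carlo realizations of loglike at every rank, then tabulate
    # the quantiles to get an interpolable CDF(loglike|rank)
    samples = loglike_mean + sigma * rng.standard_normal((num_mc, ranks.size))
    loglike_quantiles = np.quantile(samples, quantiles, axis=0)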

compute_rates()[source]

compute the approximate rates of glitches and cleans based on the datasets

property hash

the identifier used to locate this model.

is_healthy()[source]

Determine the health of the pipeline, based on the calculation of dfap/fap given in the map. Similar results are shown in the calibration accuracy plot.

loglike(ranks)[source]

NOTE: returns log(E[p(r|g)]) - log(E[p(r|c)])

loglike_mean(ranks)[source]

NOTE: returns E[log(p(r|g)) - log(p(r|c))]
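
A toy illustration of why these two quantities differ: by Jensen's inequality E[log p] never exceeds log(E[p]), so taking the expectation before or after the log matters. The pdf estimates below are made up, and a single pdf is used for simplicity.

    import numpy as np

    p = np.array([0.5, 1.0, 2.0])  # hypothetical ensemble of pdf estimates at one rank
    print(np.log(p.mean()))        # log(E[p]) ~ 0.154 (the loglike convention)
    print(np.log(p).mean())        # E[log p] = 0.0    (the loglike_mean convention)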

loglike_quantile(ranks, q)[source]

interpolates between the requested quantile and the stored quantiles first, then interpolates over the resulting function of rank to get the values at the requested ranks. Makes use of self._ranks and self._interp_loglike_cdf to do this, with self._interp_loglike_cdf computed via a Monte Carlo integration as part of self.compute().

NOTE: this function assumes “q” is a float, not an array
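
A sketch of the two-stage interpolation under assumed names: table[i, j] plays the role of self._interp_loglike_cdf evaluated at quantile_grid[i] and rank_grid[j].

    import numpy as np

    def loglike_quantile(ranks_out, q, rank_grid, quantile_grid, table):
        # stage 1: at every grid rank, interpolate between the stored quantiles
        curve = np.array([
            np.interp(q, quantile_grid, table[:, j])
            for j in range(rank_grid.size)
        ])
        # stage 2: interpolate that curve over rank at the requested ranks;
        # note q must be a float, matching the restriction above
        return np.interp(ranks_out, rank_grid, curve)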

optimize(**kwargs)[source]

delegates to the individual optimization functions for the glitch (gch) and clean (cln) KDEs

pglitch(ranks)[source]

compute p(glitch) at the requested ranks:

pglitch = 1 / (1 + np.exp(-loglike(ranks)) / prior_odds)

NOTE: this includes the prior_odds based on approximate rates of glitches and cleans. If that is not desired, users can call loglike directly and apply their own prior_odds
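
Spelling the map out numerically, with illustrative values:

    import numpy as np

    loglike = np.array([-2.0, 0.0, 2.0])
    prior_odds = 0.25  # glitches assumed four times rarer than cleans

    pglitch = 1.0 / (1.0 + np.exp(-loglike) / prior_odds)
    # -> [0.0327..., 0.2, 0.6488...]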

property prior_cln

an estimate of the prior that any given sample is a clean, based on the approximate rate derived from self._cln_dataset

property prior_gch

an estimate of the prior that any given sample is a glitch, based on the approximate rate derived from self._gch_dataset

property prior_odds

the prior odds that a sample is a glitch vs a clean (self.prior_gch/self.prior_cln)
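
How the priors and prior odds follow from the approximate rates, with made-up counts and livetime:

    num_gch, num_cln = 500, 2000
    livetime = 86400.0  # seconds

    rate_gch = num_gch / livetime
    rate_cln = num_cln / livetime

    prior_gch = rate_gch / (rate_gch + rate_cln)  # 0.2
    prior_cln = rate_cln / (rate_gch + rate_cln)  # 0.8
    prior_odds = prior_gch / prior_cln            # 0.25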

roc(ranks=None)[source]

returns fap(ranks), eff(ranks)
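
A self-contained sketch of how an ROC curve falls out of the two distributions (not the internal implementation): efficiency is the fraction of glitch ranks above each threshold, FAP the fraction of clean ranks.

    import numpy as np

    rng = np.random.default_rng(2)
    gch_ranks = rng.beta(5, 2, size=1000)
    cln_ranks = rng.beta(2, 5, size=1000)

    thresholds = np.linspace(0, 1, 101)
    eff = (gch_ranks[None, :] >= thresholds[:, None]).mean(axis=1)
    fap = (cln_ranks[None, :] >= thresholds[:, None]).mean(axis=1)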

class idq.calibration.Discrete1DKDE(obs, **kwargs)[source]

an object similar to FixedBandwidth1DKDE but assuming ranks only occur at discrete values instead of being drawn from a continuous distribution
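
A minimal sketch of the discrete analogue: probability mass sits only on the observed values, so counting replaces kernel smoothing.

    from collections import Counter
    import numpy as np

    obs = [0.0, 0.5, 0.5, 1.0, 0.5, 0.0]  # ranks at discrete values
    counts = Counter(obs)
    values = np.array(sorted(counts))
    pmf = np.array([counts[v] for v in values]) / len(obs)
    cdf = np.cumsum(pmf)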

coverage()[source]

return ranks, CDF(ranks), and the fraction of ranks with CDF(rank) <= X

flush(max_samples=inf)[source]

remove the observations from the front of the list until the total number of observations is <= max_samples.

class idq.calibration.DiscreteCalibrationMap(dataset, num_quantiles=101, num_mc=10000, compute=True, clean_segs=None, rate_estimation='livetime', min_livetime=86400, min_clean_livetime=43200, model_id=None, **kwargs)[source]

a helper object that manages two Discrete1DKDE objects, one for cleans and one for glitches.

It builds and maintains these for you based on input samples, and can then compute quantities like likelihood ratios, efficiency, and FAP. It can even generate a full ROC curve automatically.

compute_loglike_cdf(num_quantiles=101, num_mc=10000)[source]

compute the CDF(loglike|rank) interpolation object via Monte Carlo sampling

loglike(ranks)[source]

NOTE: returns log(E[p(r|g)]) - log(E[p(r|c)])

loglike_quantile(ranks, q)[source]

interpolates between the requested quantile and the stored quantiles first, then interpolates over the resulting function of rank to get the values at the requested ranks. Makes use of self._ranks and self._interp_loglike_cdf to do this, with self._interp_loglike_cdf computed via a Monte Carlo integration as part of self.compute().

NOTE: this function assumes “q” is a float, not an array

class idq.calibration.FixedBandwidth1DKDE(observations, num_points=101, b=0.1, **kwargs)[source]

an object that represents the calibration mapping between SupervisedClassifier output (ranks in [0, 1]) and probabilistic statements; essentially a fancy interpolation device.

SupervisedClassifier and IncrementalSupervisedClassifier objects retain a reference to one of these, so this functionality should be universal for all classifiers!

We accomplish this with a standard fixed-bandwidth Gaussian KDE, represented internally as a vector that’s referenced via interpolation for rapid execution.
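
A minimal sketch of that representation, with made-up observations; the grid size and bandwidth mirror the defaults above. The KDE is evaluated once on a grid, and queries are answered with np.interp.

    import numpy as np

    rng = np.random.default_rng(3)
    observations = rng.beta(2, 5, size=500)

    num_points, b = 101, 0.1
    grid = np.linspace(0, 1, num_points)

    # evaluate the Gaussian KDE once on the grid ...
    diffs = (grid[:, None] - observations[None, :]) / b
    pdf_grid = np.exp(-0.5 * diffs**2).sum(axis=1) / (
        observations.size * b * np.sqrt(2 * np.pi)
    )

    # ... then answer queries by interpolation
    def pdf(ranks):
        return np.interp(ranks, grid, pdf_grid)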

add_observations(observations)[source]

update the local data structures using the new observations and compute the interpolation arrays used for fast execution later. This includes error estimates, which we compute by fitting Beta distributions to the first two moments of the PDF and CDF, respectively.
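
A sketch of the moment-matching step behind those error estimates: given the first two moments of a quantity confined to [0, 1], solve for the Beta parameters. The helper name is hypothetical.

    def beta_from_moments(mean, variance):
        # for Beta(a, b): mean = a/(a+b), variance = mean*(1-mean)/(a+b+1)
        nu = mean * (1 - mean) / variance - 1  # nu = a + b
        return mean * nu, (1 - mean) * nu

    alpha, beta = beta_from_moments(0.3, 0.01)  # -> (6.0, 14.0)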

cdf_alpha(ranks)[source]

NOTE: because we force the CDF to go through (0, 0) and (1, 1), the error estimates are messed up near the end points, since that constraint cannot be represented with a Beta distribution. This turns out to affect the second-to-last point as well because of how np.interp works (the NaN from the last point gets mixed in).

cdf_beta(ranks)[source]

NOTE: because we force the CDF to go through (0, 0) and (1, 1), the error estimates are messed up near the end points, since that constraint cannot be represented with a Beta distribution. This turns out to affect the second-to-last point as well because of how np.interp works (the NaN from the last point gets mixed in).

cdf_quantile(ranks, q)[source]

NOTE: because we force the CDF to go through (0, 0) and (1, 1), the error estimates are messed up near the end points, since that constraint cannot be represented with a Beta distribution. This turns out to affect the second-to-last point as well because of how np.interp works (the NaN from the last point gets mixed in).

compute()[source]

this works just like add_observations, except it zeros all the starting arrays first and then delegates to add_observations, handing it all the current observations

coverage()[source]

return ranks, CDF(ranks), and the fraction of ranks with CDF(rank) <= X
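
A sketch of what a coverage check looks like, with the empirical CDF standing in for the KDE's CDF: for well-calibrated estimates, the fraction of ranks with CDF(rank) <= X should track X itself.

    import numpy as np

    rng = np.random.default_rng(4)
    ranks = np.sort(rng.beta(2, 5, size=1000))
    cdf = np.arange(1, ranks.size + 1) / ranks.size  # CDF evaluated at each rank

    nominal = np.linspace(0, 1, 11)
    observed = np.array([(cdf <= x).mean() for x in nominal])
    # well-calibrated: observed ~ nominal; with the model CDF in place of the
    # empirical one, deviations flag miscalibration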

flush(max_samples=inf)[source]

remove the observations from the front of the list until the total number of observations is <= max_samples.

grad(b)[source]

computes dloglike/dlogb at b

dloglike/dlogb = (1/N) sum_i [ sum_{j != i} ((x_i - x_j)**2/b**2 - 1) * exp(-0.5*(x_i - x_j)**2/b**2) ] / [ sum_{j != i} exp(-0.5*(x_i - x_j)**2/b**2) ]
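
A direct numpy transcription of this gradient (an O(N^2) pairwise version for clarity; the function name is mine):

    import numpy as np

    def grad_dloglike_dlogb(x, b):
        z2 = ((x[:, None] - x[None, :]) / b) ** 2  # (x_i - x_j)**2 / b**2
        k = np.exp(-0.5 * z2)
        np.fill_diagonal(k, 0.0)                   # exclude j == i
        return (((z2 - 1) * k).sum(axis=1) / k.sum(axis=1)).mean()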

loglike(b)[source]

compute the loglikelihood at this value of b

loglike = (1/N) sum_i log( (1/(N-1)) sum_{j != i} exp(-0.5*(x_i - x_j)**2/b**2) / (b*sqrt(2*pi)) )
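
The same leave-one-out quantity in numpy, matching the formula above (again O(N^2) for clarity):

    import numpy as np

    def loglike(x, b):
        z2 = ((x[:, None] - x[None, :]) / b) ** 2
        k = np.exp(-0.5 * z2) / (b * np.sqrt(2 * np.pi))  # Gaussian kernel
        np.fill_diagonal(k, 0.0)                          # exclude j == i
        return np.log(k.sum(axis=1) / (x.size - 1)).mean()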

logpdf(ranks)[source]

NOTE: this returns log(E[pdf]), not E[log(pdf)]

logpdf_mean(ranks)[source]

NOTE: this returns E[log(pdf)] whereas logpdf returns log(E[pdf])

optimize(minb=0.0001, maxb=1.0, tol=0.0001, bounded=False, b_factor=0.08333333333333333, **kwargs)[source]

looks for the maximum of logL via Newton's method for the zeros of dlogL/dlogb. Expects dlogL/dlogb to be monotonic in logb, which is likely to be true; however, if it is not, the logic in this loop may fail.
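
A sketch of the search under the same monotonicity assumption, using plain bisection on the gradient in place of the Newton iteration; it reuses the grad_dloglike_dlogb helper sketched above, and the defaults mirror the signature.

    import numpy as np

    def optimize_b(x, minb=1e-4, maxb=1.0, tol=1e-4):
        lo, hi = np.log(minb), np.log(maxb)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if grad_dloglike_dlogb(x, np.exp(mid)) > 0:
                lo = mid  # logL still increasing: optimum lies at larger b
            else:
                hi = mid
        return np.exp(0.5 * (lo + hi))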

remove_observations(observations)[source]

works just like add_observations, except we subtract out the contributions of the observations instead of adding them in