idq.calibration¶
This module provides robust estimates of probability density functions (pdfs). This is useful within calibration (required by all idq.classifiers.SupervisedClassifier and idq.classifiers.IncrementalSupervisedClassifier subclasses) and can be re-used within idq.classifiers.NaiveBayes and idq.classifiers.IncrementalNaiveBayes.
We declare a dedicated class to handle the calibration mapping: idq.classifiers.SupervisedClassifier instances retain a pointer to this object and delegate to it when they want to generate calibrated probabilities from ranks.
- class idq.calibration.CalibrationMap(dataset, num_quantiles=101, num_mc=10000, num_points=101, gch_num_points=101, gch_b=0.1, cln_num_points=101, cln_b=0.1, compute=True, clean_segs=None, rate_estimation='livetime', min_livetime=86400, min_clean_livetime=43200, model_id=None, **kwargs)[source]¶
a helper object that represents 2 FixedBandwidth1DKDE objects, one for cleans and one for glitches. It builds and manages these for you based on input samples and can then compute things like likelihood ratios, efficiency, fap, etc. It can even generate a full ROC curve automatically.
- compute_loglike_cdf(num_quantiles=101, num_mc=10000)[source]¶
compute CDF(loglike|rank) interpolation object via monte carlo
- property hash¶
the identifier used to locate this model.
- is_healthy()[source]¶
Determine the health of the pipeline, based on the calculation of dfap/fap given in the map. Similar results are shown in the calibration accuracy plot.
- loglike_quantile(ranks, q)[source]¶
interpolates between the requested quantile and the stored quantiles first, then interpolates over the resulting function of rank to get the values at the requested ranks. Makes use of self._ranks and self._interp_loglike_cdf to do this, with self._interp_loglike_cdf being computed via a monte-carlo integration as part of self.compute()
NOTE: this function assumes “q” is a float, not an array
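The two-stage interpolation can be pictured with plain numpy. The following is only a sketch of the logic described above, with placeholder grids and a random table standing in for self._ranks and self._interp_loglike_cdf:

import numpy as np

rank_grid = np.linspace(0, 1, 101)      # plays the role of self._ranks
quantile_grid = np.linspace(0, 1, 101)  # num_quantiles=101
# placeholder table of loglike values at each (rank, quantile) pair
loglike_cdf = np.sort(np.random.randn(101, 101), axis=1)

def loglike_quantile(ranks, q):
    # first: interpolate each rank's stored quantiles to the requested q (a float)
    at_q = np.array([np.interp(q, quantile_grid, row) for row in loglike_cdf])
    # then: interpolate the resulting function of rank at the requested ranks
    return np.interp(ranks, rank_grid, at_q)

loglike_quantile(np.array([0.25, 0.75]), 0.5)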
- pglitch(ranks)[source]¶
compute the p(glitch) at these ranks

pglitch = 1 / (1 + np.exp(-loglike(ranks)) / prior_odds)
NOTE: this includes the prior_odds based on approximate rates of glitches and cleans. If that is not desired, users can call loglike directly and apply their own prior_odds
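For example, to apply your own prior instead (the loglike values and prior odds below are made-up numbers):

import numpy as np

loglike = np.array([-2.0, 0.0, 3.0])  # hypothetical log likelihood ratios
my_prior_odds = 0.5                   # assumed p(glitch)/p(clean)

# same functional form as pglitch, but with user-supplied prior odds
pglitch = 1 / (1 + np.exp(-loglike) / my_prior_odds)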
- property prior_cln¶
an estimate of the prior that any given sample is a clean, based on the approximate rate derived from self._cln_dataset
- property prior_gch¶
an estimate of the prior that any given sample is a glitch, based on the approximate rate derived from self._gch_dataset
- property prior_odds¶
the prior odds that a sample is a glitch vs a clean (self.prior_gch/self.prior_cln)
- class idq.calibration.Discrete1DKDE(obs, **kwargs)[source]¶
an object similar to FixedBandwidth1DKDE but assuming ranks only occur at discrete values instead of being drawn from a continuous distribution
- class idq.calibration.DiscreteCalibrationMap(dataset, num_quantiles=101, num_mc=10000, compute=True, clean_segs=None, rate_estimation='livetime', min_livetime=86400, min_clean_livetime=43200, model_id=None, **kwargs)[source]¶
a helper object that represents 2 Discrete1DKDE objects, one for cleans and one for glitches

builds and manages these for you based on input samples and can then compute things like likelihood ratios, efficiency, fap, etc. It can even generate a full ROC curve automatically.
- compute_loglike_cdf(num_quantiles=101, num_mc=10000)[source]¶
compute CDF(loglike|rank) interpolation object via monte carlo
- loglike_quantile(ranks, q)[source]¶
interpolates between the requested quantile and the stored quantiles first, then interpolates over the resulting function of rank to get the values at the requested ranks. Makes use of self._ranks and self._interp_loglike_cdf to do this, with self._interp_loglike_cdf being computed via a monte-carlo integration as part of self.compute()
NOTE: this function assumes “q” is a float, not an array
- class idq.calibration.FixedBandwidth1DKDE(observations, num_points=101, b=0.1, **kwargs)[source]¶
an object that represents the calibration mapping between SupervisedClassifier output (ranks in [0, 1]) and probabilistic statements; essentially, a fancy interpolation device.

SupervisedClassifier and IncrementalSupervisedClassifier objects retain a reference to one of these, so the functionality should be universal for all classifiers!
We accomplish this with a standard fixed-bandwidth Gaussian KDE, represented internally as a vector that’s referenced via interpolation for rapid execution.
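The idea can be sketched in a few lines of numpy (stand-in observations and grid sizes; not the class's actual internals):

import numpy as np

obs = np.random.uniform(0, 1, 500)  # stand-in rank observations
b = 0.1                             # fixed bandwidth
grid = np.linspace(0, 1, 101)       # num_points=101

# evaluate the fixed-bandwidth Gaussian KDE once on the grid...
pdf = np.exp(-0.5 * ((grid[:, None] - obs[None, :]) / b) ** 2).sum(axis=1)
pdf /= len(obs) * b * np.sqrt(2 * np.pi)

# ...then answer queries rapidly by referencing the vector via interpolation
def pdf_at(ranks):
    return np.interp(ranks, grid, pdf)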
- add_observations(observations)[source]¶
update the local data structures using the new observations and compute the interpolation arrays used for fast execution later. This includes error estimates, which we compute by fitting beta distributions to the first 2 moments of the pdf and cdf, respectively
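A moment-matched beta fit looks something like the following sketch (the standard method-of-moments inversion; beta_from_moments is a hypothetical helper, not necessarily the exact code path):

def beta_from_moments(m, v):
    # invert mean m = a/(a+b) and variance v = a*b/((a+b)**2 * (a+b+1));
    # valid when 0 < v < m*(1-m)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

alpha, beta = beta_from_moments(0.3, 0.01)  # -> (6.0, 14.0)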
- cdf_alpha(ranks)[source]¶
NOTE: because we force the cdf to go through (0,0) and (1,1), the error estimates are messed up near the end points because we can't represent that with a beta function. This turns out to affect the second-to-last point as well because of how np.interp works (the NaN from the last point gets mixed in)
- cdf_beta(ranks)[source]¶
NOTE: because we force the cdf to go through (0,0) and (1,1), the error estimates are messed up near the end points because we can't represent that with a beta function. This turns out to affect the second-to-last point as well because of how np.interp works (the NaN from the last point gets mixed in)
- cdf_quantile(ranks, q)[source]¶
NOTE: because we force the cdf to go through (0,0) and (1,1), the error estimates are messed up near the end points because we can't represent that with a beta function. This turns out to affect the second-to-last point as well because of how np.interp works (the NaN from the last point gets mixed in)
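The np.interp behaviour referenced in these notes is easy to demonstrate (toy grid, not the stored arrays):

import numpy as np

x = np.array([0.0, 0.5, 1.0])
y = np.array([0.10, 0.40, np.nan])  # NaN error estimate at the forced end point

# queries between the last two grid points blend in the NaN, so the
# second-to-last point's neighbourhood loses its error estimate too
np.interp([0.4, 0.6], x, y)  # -> array([0.34, nan])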
- compute()[source]¶
this works just like update, except it zeroes all the starting arrays and then delegates to update, handing it all the current observations
- flush(max_samples=inf)[source]¶
remove the observations from the front of the list until the total number of observations is <=max_samples.
- grad(b)[source]¶
computes dloglike/dlogb at b
dloglike/dlogb = (1/N) sum_i frac{sum_{j \neq i} ((x_i-x_j)**2/b**2 - 1) exp(-0.5*(x_i-x_j)**2/b**2)}{sum_{j \neq i} exp(-0.5*(x_i-x_j)**2/b**2)}

A numpy sketch of this computation, together with loglike, follows the loglike entry below.
- loglike(b)[source]¶
compute the loglikelihood at this value of b
loglike = (1/N) sum_i log( (1/(N-1)) sum_{j \neq i} exp(-0.5*(x_i-x_j)**2/b**2) / (b*sqrt(2*pi)) )
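Both loglike and grad can be written directly in numpy. The following is a sketch of the leave-one-out formulas quoted above, with made-up observations, not the class's optimized implementation:

import numpy as np

x = np.random.uniform(0, 1, 200)  # stand-in observations

def loglike(b):
    z = ((x[:, None] - x[None, :]) / b) ** 2
    k = np.exp(-0.5 * z)
    np.fill_diagonal(k, 0.0)  # leave-one-out: j != i
    n = len(x)
    return np.mean(np.log(k.sum(axis=1) / ((n - 1) * b * np.sqrt(2 * np.pi))))

def grad(b):
    z = ((x[:, None] - x[None, :]) / b) ** 2
    k = np.exp(-0.5 * z)
    np.fill_diagonal(k, 0.0)
    # dloglike/dlogb, term by term as in the grad formula above
    return np.mean(((z - 1) * k).sum(axis=1) / k.sum(axis=1))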
- optimize(minb=0.0001, maxb=1.0, tol=0.0001, bounded=False, b_factor=0.08333333333333333, **kwargs)[source]¶
looks for the maximum of logL via Newton's method for the zeros of dlogL/dlogb. Expects dlogL/dlogb to be monotonic in logb, which will likely be true. However, if it is not, then the logic in this loop may fail.
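Under the monotonicity assumption, the bandwidth search amounts to root-finding on dlogL/dlogb in log(b). A minimal sketch, using bisection in place of the Newton iteration described above and the grad function from the previous sketch:

import numpy as np

def optimize_b(grad, minb=1e-4, maxb=1.0, tol=1e-4):
    lo, hi = np.log(minb), np.log(maxb)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(np.exp(mid)) > 0:  # logL still increasing with b
            lo = mid
        else:
            hi = mid
    return np.exp(0.5 * (lo + hi))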