Formalism
iDQ provides statistical inferences about the probability that a glitch exists within the gravitational-wave strain data, h(t), based on information recorded in auxiliary channels.
In this way, iDQ provides robust, automatic data-quality information, improving Gravitational Wave (GW) search sensitivity, facilitating automated alerts in response to GW candidates, and easing the burden of deeper offline investigations into the large data set recorded in the neighborhood of each candidate.
2-class Classification
We stress that we use the term classification (more properly, 2-class classification) to mean predicting whether or not a non-Gaussian noise artifact is present in h(t), rather than identifying which type of glitch is present.
iDQ will provide the necessary data products to consider noise rejection as model selection in a Bayesian sense rather than post facto vetoes. In this framework, we consider 3 models:
pure Gaussian Noise,
Gaussian Noise and GW signals, and
Gaussian Noise and glitches.
such that the data in a single detector can be modeled as

\[
h(t) = n(t), \qquad h(t) = n(t) + s(t), \qquad \text{or} \qquad h(t) = n(t) + g(t),
\]

where n(t) is Gaussian noise, s(t) is a GW signal, and g(t) is a glitch, depending on which of the three models holds.
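To make the model-selection framing concrete, a minimal sketch of the arithmetic is shown below; the likelihood and prior values are invented placeholders rather than anything iDQ computes:

    # A minimal numerical sketch of the three-model selection above.  All of
    # the likelihood and prior values here are made-up placeholders; they are
    # not produced by iDQ.
    likelihoods = {
        "gaussian": 0.20,          # p(data | pure Gaussian noise)
        "gaussian+signal": 0.05,   # p(data | Gaussian noise and a GW signal)
        "gaussian+glitch": 0.60,   # p(data | Gaussian noise and a glitch)
    }
    priors = {
        "gaussian": 0.90,
        "gaussian+signal": 1e-4,
        "gaussian+glitch": 0.0999,
    }

    evidence = sum(likelihoods[m] * priors[m] for m in likelihoods)
    posteriors = {m: likelihoods[m] * priors[m] / evidence for m in likelihoods}
    print(posteriors)  # posterior probability of each model given the data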
In general, we will assume the ability to characterize the statistical properties of the Gaussian noise.
iDQ's fundamental data product should be thought of as a posterior probability that there is a non-Gaussian artifact in h(t) at a given time, conditioned on the available auxiliary information.
iDQ approaches the classification problem via supervised learning techniques.
In particular, it estimates the posterior probability for a non-Gaussian noise artifact to be present in h(t) given the observed auxiliary features, and we specifically consider the glitch model as the union of all possible glitch families without explicitly enumerating each family (i.e., without attempting to determine which family a glitch belongs to), but we stress that the fundamental data product of the iDQ analysis is this posterior probability itself, regardless of glitch type.
This then defines our 2-class classification problem: separating samples into the glitch vs. clean models.
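As a toy illustration of this 2-class setup (not iDQ's actual pipeline, features, or choice of classifier), a generic supervised classifier could be trained on labeled feature vectors:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Toy 2-class (glitch vs. clean) example with synthetic "auxiliary
    # features"; nothing here reflects iDQ's real feature vectors or models.
    rng = np.random.default_rng(0)
    X_clean = rng.normal(0.0, 1.0, size=(500, 4))
    X_glitch = rng.normal(1.5, 1.0, size=(500, 4))
    X = np.vstack([X_clean, X_glitch])
    y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = clean, 1 = glitch

    clf = RandomForestClassifier(n_estimators=100).fit(X, y)
    print(clf.predict_proba(X[:3])[:, 1])  # estimated p(glitch | features)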
Decomposition of the statistical inference
Fundamentally, supervised learning for 2-class classification attempts to compute the likelihoods p(a|glitch) and p(a|clean) rather than the posteriors. Bayes' theorem then gives the posterior probability of a glitch given the auxiliary features a as

\[
p(\mathrm{glitch}\,|\,a) = \frac{p(a\,|\,\mathrm{glitch})\, p(\mathrm{glitch})}{p(a\,|\,\mathrm{glitch})\, p(\mathrm{glitch}) + p(a\,|\,\mathrm{clean})\, p(\mathrm{clean})},
\]

where we note that

\[
p(\mathrm{glitch}) + p(\mathrm{clean}) = 1
\]

by definition.
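Written out as code, the decomposition amounts to the following helper; the function name and arguments are illustrative only, not part of the iDQ API:

    def glitch_posterior(p_a_given_glitch, p_a_given_clean, p_glitch):
        """Posterior p(glitch|a) from the two likelihoods and the glitch prior.

        Uses p(clean) = 1 - p(glitch), since the two classes are exhaustive.
        """
        p_clean = 1.0 - p_glitch
        numerator = p_a_given_glitch * p_glitch
        return numerator / (numerator + p_a_given_clean * p_clean)

    # example with made-up numbers
    print(glitch_posterior(p_a_given_glitch=0.8, p_a_given_clean=0.1, p_glitch=0.01))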
Pragmatically, supervised learning algorithms compute a rank r(a): a mapping from the high-dimensional auxiliary feature space to a single real number, typically restricted to the interval [0, 1].
Each classifier determines a separate mapping r(a), and therefore the likelihoods of the rank under the glitch and clean models, p(r|glitch) and p(r|clean), must be estimated separately for each classifier.
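A rough, histogram-based stand-in for that per-classifier calibration step might look like the sketch below; iDQ's real calibration is more careful, and every name here is illustrative:

    import numpy as np

    def calibrate(ranks_glitch, ranks_clean, bins=20):
        """Histogram estimates of p(r|glitch) and p(r|clean) from labeled ranks."""
        edges = np.linspace(0.0, 1.0, bins + 1)
        pdf_glitch, _ = np.histogram(ranks_glitch, bins=edges, density=True)
        pdf_clean, _ = np.histogram(ranks_clean, bins=edges, density=True)
        return edges, pdf_glitch, pdf_clean

    def rank_to_posterior(r, edges, pdf_glitch, pdf_clean, p_glitch):
        """Map a single rank r in [0, 1] to p(glitch|r) via the calibrated pdfs."""
        i = min(np.searchsorted(edges, r, side="right") - 1, len(pdf_glitch) - 1)
        numerator = pdf_glitch[i] * p_glitch
        denominator = numerator + pdf_clean[i] * (1.0 - p_glitch)
        return numerator / denominator if denominator > 0 else p_glitch

    # usage with made-up training ranks
    edges, pg, pc = calibrate(np.random.uniform(0.6, 1.0, 1000),
                              np.random.uniform(0.0, 0.5, 1000))
    print(rank_to_posterior(0.8, edges, pg, pc, p_glitch=0.01))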
Definitions of glitch and clean models
Our basic notion of what defines a glitch is the non-Gaussian nature of the noise artifact.
For this reason, we base our definitions of when a glitch is present on the signal-to-noise ratio reported by the trigger generator (gstlal).
In the band-limited glitch framework, we will define separate clean models for each glitch frequency band so that we can compute the associated probabilities for glitches within that band alone.
This does mean that the clean model for a given frequency band will overlap with the glitch models in other bands.
Furthermore, because physical noise sources that generate glitches very near our selection thresholds may fall in either class within our training set, based on random realizations of Gaussian noise, we also define a buffer around our thresholds (see dirty_bounds and dirty_window in Configuration).
This, in effect, requires clean samples to be “far away” from glitchy samples within our training sets.
We do note that this means it is possible for samples to be neither clean nor glitchy (i.e., within dirty_bounds but outside of target_bounds), thereby breaking our assumption that p(glitch) + p(clean) = 1.
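A schematic version of that labeling logic, with invented numerical thresholds standing in for the configured target_bounds and dirty_bounds, is:

    def label_sample(snr, target_bounds=(10.0, float("inf")), dirty_bounds=(7.5, float("inf"))):
        """Assign a training label from the observed SNR.

        The numeric bounds are placeholders; the real ranges come from
        target_bounds, dirty_bounds, and dirty_window in the Configuration.
        """
        if target_bounds[0] <= snr <= target_bounds[1]:
            return "glitch"   # inside target_bounds
        if dirty_bounds[0] <= snr <= dirty_bounds[1]:
            return "dirty"    # buffer region: neither glitch nor clean
        return "clean"        # far from the glitch thresholds

    print([label_sample(s) for s in (3.0, 8.0, 25.0)])  # ['clean', 'dirty', 'glitch']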
Another possible way to avoid this is to use a probabilistic estimate for whether an event is likely to be a glitch according to its observed signal-to-noise ratio, or something similar. With this in hand, we can train machine learning algorithms with weighted samples, or use regression techniques instead of classification.
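One way to realize weighted training with off-the-shelf tools (a sketch only, not a description of iDQ internals) is to pass per-sample weights to a standard classifier:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    # Sketch of training with per-sample weights instead of hard labels.  The
    # features, SNR model, and weighting scheme are synthetic placeholders and
    # are not how iDQ actually builds its training sets.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                    # stand-in auxiliary features
    snr = np.abs(rng.normal(8.0, 3.0, size=1000))     # stand-in SNRs
    y = (snr > 10.0).astype(int)                      # nominal glitch/clean labels
    w = np.clip(np.abs(snr - 10.0) / 5.0, 0.1, 1.0)   # down-weight samples near the threshold

    clf = GradientBoostingClassifier().fit(X, y, sample_weight=w)
    p_glitch = clf.predict_proba(X)[:, 1]             # rank-like output in [0, 1]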