Formalism

iDQ provides statistical inferences about the probability that a glitch exists within \(h(t)\), at latencies shorter than the production of calibrated \(h(t)\), so that these inferences can be incorporated within online pipelines. It includes a scalable infrastructure for data transfer and discovery that naturally supports both online and offline inference (see Batch vs Stream modes) by standardizing the feature extraction and data formats used within each classifier (see Abstractions). This also provides a natural playground for investigations into novel learning techniques.

In this way, iDQ provides robust automatic data quality information, improving Gravitational Wave (GW) search sensitivity, facilitating automated alerts in response to GW candidates, and easing the burden of deeper offline investigations into the large data sets recorded in the neighborhood of each candidate.

2-class Classification

We stress that we use the term classification (more properly, 2-class classification) to mean predicting whether or not a non-Gaussian noise artifact is present in \(h(t)\). We do not mean predicting which type of non-Gaussian noise artifact a known artifact belongs to, which we call categorization (more properly, multi-class classification). iDQ could be extended to include categorization, but this is not the fundamental data product the pipeline will produce.

iDQ will provide the necessary data products to consider noise rejection as model selection in a Bayesian sense rather than post facto vetoes. In this framework, we consider 3 models:

  • pure Gaussian Noise,

  • Gaussian Noise and GW signals, and

  • Gaussian Noise and glitches,

such that the data in a single detector can be modeled as

\[d(t) = n(t) + s(t) + g(t)\]

In general, we will assume the ability to characterize \(n(t)\) as a stationary Gaussian noise process described by a power spectrum valid over longer timescales than either \(s(t)\) or \(g(t)\), which are treated as transients. \(s(t)\) is a known gravitational wave signal, e.g., a merging binary black hole system, and \(g(t)\) is any transient signal (duration \(\sim O(1)\) sec) that is not a gravitational wave. We note that this definition may be search-specific. For example, a compact binary coalescence search might choose to consider waveforms that resemble cosmic string cusps as glitches, even though searches for cosmic strings would not. However, iDQ adopts a rather general definition of glitch.
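The single-detector data model above can be illustrated with a short sketch. All numbers here (sample rate, transient shapes, amplitudes) are hypothetical stand-ins, not iDQ defaults: stationary Gaussian noise plus a short "GW-like" transient and a short glitch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical illustration of d(t) = n(t) + s(t) + g(t).
fs = 1024                       # sample rate (Hz), illustrative only
t = np.arange(0, 4.0, 1.0 / fs)

# n(t): stationary Gaussian noise (unit variance, white for simplicity)
n = rng.normal(0.0, 1.0, t.size)

# s(t): a known, GW-like transient (a toy sine-Gaussian near t = 1 s)
s = 0.5 * np.exp(-((t - 1.0) / 0.05) ** 2) * np.sin(2 * np.pi * 200 * t)

# g(t): a glitch, i.e. a non-GW transient of duration ~O(10 ms) near t = 3 s
g = 3.0 * np.exp(-((t - 3.0) / 0.01) ** 2) * np.sin(2 * np.pi * 60 * t)

d = n + s + g                   # the recorded strain data
```

Both \(s(t)\) and \(g(t)\) are short compared with the timescale over which the noise power spectrum is assumed valid, which is the sense in which they are "transients" in the text above.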

iDQ’s fundamental data product should be thought of as a posterior probability that there is a non-Gaussian artifact in \(h(t)\) within an individual detector, based on all information recorded at that detector. This could include \(h(t)\) itself for modeled searches, but we provide inferences based only on safe auxiliary information, which is more natural for un-modeled searches.

iDQ approaches the classification problem via supervised learning techniques. In particular, it estimates the posterior probability for a non-Gaussian noise artifact to be present in \(h(t)\) on a single-detector basis by making use of all information available, conceivably including both \(h(t)\) and auxiliary information: \(a(t)\).

\[p_g(t) = p(g|h(t), a(t))\]

and we specifically consider the glitch-model as the union of all possible glitch families without explicitly enumerating each family (\(g = \oplus_i g_i\)). From this, we may categorize the type of glitch present through conditional probabilities

\[p_{g_i}(t) = p(g_i|g; h(t), a(t))\, p(g|h(t), a(t))\]

but we stress that the fundamental data product of the iDQ analysis is \(p_g\) and not \(p_{g_i}\). Furthermore, we define a clean-model which is simply the complement of the glitch-model such that

\[p(g|h, a) + p(c|h, a) = 1\ \forall\ t\]

This then defines our 2-class classification problem: separating samples into the glitch vs. clean models.

Decomposition of the statistical inference

Fundamentally, supervised learning for 2-class classification attempts to compute the likelihoods rather than the posteriors:

\[\begin{split}p(h, a|g) & = \frac{p(g|h, a) p(h, a)}{p(g)} \\ p(h, a|c) & = \frac{p(c|h, a) p(h, a)}{p(c)}\end{split}\]

where we note that

\[p(h, a) = p(h, a|g)p(g) + p(h, a|c)p(c)\]

by definition. Pragmatically, supervised learning algorithms compute a rank \(r = r(h, a)\) that encapsulates their belief that the data is more similar to one class than another. We then calibrate these ranks into meaningful probability estimates through composition. Explicitly

\[p_g = p(g|h, a) = \frac{p(r(h, a)|g)p(g)}{(p(r(h, a)|g)p(g) + p(r(h, a)|c)p(c))}\]

Each classifier determines a separate mapping \(h, a \rightarrow r\). Therefore, different classifiers can produce a different estimate for \(p_g\).
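The rank-calibration step above can be sketched in a few lines. This is a minimal illustration, not iDQ's actual calibration machinery: the rank distributions, prior, and histogram-based density estimates are all assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ranks r(h, a) from a trained classifier evaluated on
# labeled training samples: glitches tend toward high rank, clean
# samples toward low rank (beta distributions are illustrative only).
ranks_glitch = rng.beta(5, 2, size=5000)
ranks_clean = rng.beta(2, 5, size=5000)

p_glitch_prior = 0.1                 # assumed prior p(g)
p_clean_prior = 1.0 - p_glitch_prior # p(c) = 1 - p(g)

# Estimate p(r|g) and p(r|c) with simple histograms over [0, 1].
bins = np.linspace(0.0, 1.0, 21)
pdf_g, _ = np.histogram(ranks_glitch, bins=bins, density=True)
pdf_c, _ = np.histogram(ranks_clean, bins=bins, density=True)

def p_g_of_rank(r):
    """Calibrate a rank into p(g|h,a) via Bayes' theorem."""
    i = min(int(np.searchsorted(bins, r, side="right")) - 1, len(pdf_g) - 1)
    num = pdf_g[i] * p_glitch_prior
    den = num + pdf_c[i] * p_clean_prior
    return float(num / den)
```

Two classifiers with different mappings \(h, a \rightarrow r\) would yield different histograms here, and hence different calibrated \(p_g\) estimates, exactly as the text notes.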

Definitions of glitch and clean models

Our basic notion of what defines a glitch is the non-Gaussian nature of the noise artifact. For this reason, we base our definitions of when a glitch is present on the signal-to-noise ratio (\(\rho\)) in \(h(t)\), which is a measure of how likely the observed data could be generated by Gaussian noise alone. Historically, the definition of a glitch has been based on a simple threshold on \(\rho\). We continue this practice, but also allow users to downselect glitches based on other features, such as the frequency or duration (see target_bounds in Configuration). Users can then add further specificity to these models by dividing the glitch samples into different frequency bands. While the formal expression for this will be similar to what we have shown for different glitch families, the definition of frequency bands will not depend on the actual noise background experienced (unlike the definition of glitch families). Furthermore, this naturally mirrors several filtering procedures already used in searches (e.g., low- and high-frequency burst searches, multi-rate filtering within gstlal). In the band-limited glitch framework, we will define separate clean models for each glitch frequency band so that we can compute the associated probabilities for glitches within that band alone. This does mean that the clean model for a given bandwidth will overlap with the glitch models in other bandwidths.

Furthermore, because physical noise sources which generate glitches very near our selection thresholds may fall in either class within our training set based on random realizations of Gaussian noise, we also define a buffer around our thresholds (see dirty_bounds and dirty_window in Configuration). This, in effect, requires clean samples to be “far away” from glitchy samples within our training sets. We do note that this means it is possible for samples to be neither clean nor glitchy (i.e., within dirty_bounds but outside of target_bounds), thereby breaking our assumption \(p(g|h,a) + p(c|h,a) = 1\ \forall\ t\). However, this is thought to be a small fraction of the total number of samples and to not significantly affect our results for typical configurations.
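The three-way split this buffer creates can be sketched as a simple labeling rule. The threshold values and function name below are hypothetical illustrations of the roles played by target_bounds and dirty_bounds, not actual configuration defaults:

```python
# Hypothetical SNR thresholds standing in for target_bounds / dirty_bounds.
TARGET_SNR = 8.0   # events at or above this threshold are labeled glitch
DIRTY_SNR = 5.0    # events between the two thresholds are neither class

def label_sample(snr):
    """Assign a training label based on the h(t) signal-to-noise ratio."""
    if snr >= TARGET_SNR:
        return "glitch"
    if snr >= DIRTY_SNR:
        return "dirty"   # buffer zone: excluded from the training set
    return "clean"

labels = [label_sample(s) for s in (2.0, 6.5, 12.0)]
```

Samples labeled "dirty" are the ones for which \(p(g|h,a) + p(c|h,a) = 1\) is broken, since they belong to neither training class.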

Another possible way to avoid this is to use a probabilistic estimate for whether an event is likely to be a glitch according to

\[p(h|g) = 1 - p(c) = 1 - N e^{-\rho^2/2}\]

or something similar. With this in hand, we can train machine learning algorithms with weighted samples, or use regression techniques instead of classification.
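A minimal sketch of such per-sample weighting, assuming the functional form \(1 - N e^{-\rho^2/2}\) given above with an illustrative normalization \(N = 1\) (the function name and chosen SNR values are hypothetical):

```python
import math

def glitch_weight(rho, norm=1.0):
    """Hypothetical training weight 1 - N * exp(-rho^2 / 2)."""
    return 1.0 - norm * math.exp(-rho**2 / 2.0)

# Low-SNR samples, which Gaussian noise could plausibly produce, get
# small weights; high-SNR samples get weights approaching 1.
weights = [glitch_weight(r) for r in (1.0, 4.0, 10.0)]
```

Training on these weights, rather than on hard glitch/clean labels, avoids the need for a dirty buffer entirely, at the cost of requiring learning algorithms that accept weighted samples or a regression formulation.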