.. _formalism:

Formalism
####################################################################################################

*iDQ* provides statistical inferences about the probability that a glitch exists within :math:`h(t)` at latencies shorter than the production of calibrated :math:`h(t)`, so that these inferences can be incorporated within online pipelines.
The pipeline includes a scalable infrastructure for data transfer and discovery that naturally supports both online and offline inference (see :ref:`workflow-batch-vs-stream`) by standardizing the feature extraction and data formats used within each classifier (see :ref:`abstractions`).
This infrastructure also provides a natural playground for investigations into novel learning techniques.

In this way, *iDQ* provides robust, automated data quality information, improving Gravitational Wave (GW) search sensitivity, facilitating automated alerts in response to GW candidates, and easing the burden of deeper offline investigations into the large data sets recorded in the neighborhood of each candidate.

.. _formalism-2class:

2-class Classification
====================================================================================================

We stress that we use the term *classification* (more properly, 2-class classification) to mean predicting whether or not a non-Gaussian noise artifact is present in :math:`h(t)`.
We do *not* mean predicting which type of non-Gaussian noise artifact is present, which we call *categorization* (more properly, multi-class classification).
*iDQ* could be extended to include categorization, but this is not the fundamental data product the pipeline will produce.

*iDQ* will provide the necessary data products to consider noise rejection as model selection in a Bayesian sense rather than as post facto vetoes.
In this framework, we consider 3 models:

* pure *Gaussian Noise*,
* *Gaussian Noise* and *GW signals*, and
* *Gaussian Noise* and *glitches*,

such that the data in a single detector can be modeled as

.. math::
    d(t) = n(t) + s(t) + g(t)

In general, we will assume the ability to characterize :math:`n(t)` as a stationary Gaussian noise process described by a power spectrum valid over longer timescales than either :math:`s(t)` or :math:`g(t)`, which are treated as transients.
:math:`s(t)` is a known gravitational wave signal, e.g., a merging binary black hole system, and :math:`g(t)` is any transient signal (duration :math:`\sim O(1)` sec) that is not a gravitational wave.
We note that this definition might be search specific.
For example, a compact binary coalescence search might choose to consider waveforms that resemble cosmic string cusps as glitches even though searches for cosmic strings would not.
However, *iDQ* adopts a rather general definition of glitch.
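
As a toy illustration of this decomposition, the snippet below builds a synthetic single-detector time series containing stationary Gaussian noise and a short sine-Gaussian transient standing in for :math:`g(t)`; the sample rate, amplitudes, and waveform are arbitrary choices for illustration only.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(0.0, 4.0, 1.0 / 1024.0)   # 4 s at a hypothetical 1024 Hz

    # Stationary Gaussian noise n(t) with a flat (white) spectrum.
    n = rng.normal(scale=1.0, size=t.size)

    # A toy glitch g(t): a short sine-Gaussian transient.
    t0, tau, f0 = 2.0, 0.1, 60.0
    g = 5.0 * np.exp(-((t - t0) / tau) ** 2) * np.sin(2 * np.pi * f0 * (t - t0))

    # Single-detector data with no GW signal present, i.e., s(t) = 0.
    d = n + g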

*iDQ* ’s fundamental data product should be thought of as a posterior probability that there is a non-Gaussian artifact in :math:`h(t)` within an individual detector based on all information recorded at that detector.
This could include :math:`h(t)` itself for modeled searches, but we provide inferences based only on *safe auxiliary information*, which is more natural for unmodeled searches.

*iDQ* approaches the classification problem via supervised learning techniques.
In particular, it estimates the posterior probability for a non-Gaussian noise artifact to be present in :math:`h(t)` on a single-detector basis by making use of all available information, conceivably including both :math:`h(t)` and auxiliary information :math:`a(t)`:

.. math::
    p_g(t) = p(g|h(t), a(t))

and we specifically consider the *glitch-model* as the union of all possible glitch families without explicitly enumerating each family (:math:`g = \oplus_i g_i`).
From this, we may categorize the type of glitch present through conditional probabilities

.. math::
    p_{g_i}(t) = p(g_i|g; h(t), a(t))\, p(g|h(t), a(t))

but we stress that the fundamental data product of the *iDQ* analysis is :math:`p_g` and not :math:`p_{g_i}`.
Furthermore, we define a *clean-model* which is simply the complement of the *glitch-model* such that

.. math::
    p(g|h, a) + p(c|h, a) = 1\ \forall\ t

This then defines our 2-class classification problem: separating samples into the glitch vs. clean models.
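
As a concrete, if schematic, illustration of this 2-class problem, the sketch below trains an off-the-shelf classifier on labeled feature vectors; the synthetic features and the choice of random forests via scikit-learn are assumptions made purely for illustration and do not describe any particular *iDQ* configuration.

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Hypothetical feature vectors: one row per time sample, one column per
    # feature extracted from safe auxiliary channels (e.g., the SNR,
    # frequency, and time offset of the loudest coincident auxiliary event).
    n_glitch, n_clean, n_features = 500, 500, 10
    glitch_features = rng.normal(loc=1.0, size=(n_glitch, n_features))
    clean_features = rng.normal(loc=0.0, size=(n_clean, n_features))

    features = np.vstack([glitch_features, clean_features])
    labels = np.concatenate([np.ones(n_glitch), np.zeros(n_clean)])  # 1 = glitch

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(features, labels)

    # The classifier's continuous score plays the role of the rank r(h, a)
    # introduced in the next subsection; it is not yet a calibrated probability.
    ranks = clf.predict_proba(features)[:, 1]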

.. _formalism-decomposition:

Decomposition of the statistical inference
----------------------------------------------------------------------------------------------------

Fundamentally, supervised learning for 2-class classification attempts to compute the likelihoods rather than the posteriors.

.. math::
    p(h, a|g) & = \frac{p(g|h, a) p(h, a)}{p(g)} \\ 
    p(h, a|c) & = \frac{p(c|h, a) p(h, a)}{p(c)}

where we note that

.. math::
    p(h, a) = p(h, a|g)p(g) + p(h, a|c)p(c)

by definition.
Pragmatically, supervised learning algorithms compute a rank :math:`r = r(h, a)` that encapsulates their belief that the data is more similar to one class than another.
We then calibrate these ranks into meaningful probability estimates through composition. Explicitly,

.. math::
    p_g = p(g|h, a) = \frac{p(r(h, a)|g)\,p(g)}{p(r(h, a)|g)\,p(g) + p(r(h, a)|c)\,p(c)}

Each classifier determines a separate mapping :math:`h, a \rightarrow r`.
Therefore, different classifiers can produce a different estimate for :math:`p_g`.
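
To make this composition concrete, the sketch below estimates the rank likelihoods :math:`p(r|g)` and :math:`p(r|c)` with Gaussian kernel density estimates and combines them with the priors; the use of ``scipy.stats.gaussian_kde`` and the default prior value are illustrative assumptions, not a description of *iDQ*'s actual calibration scheme.

.. code-block:: python

    from scipy.stats import gaussian_kde

    def calibrate(ranks_glitch, ranks_clean, prior_glitch=0.5):
        """Build a map from the rank r(h, a) to p(g|h, a).

        ranks_glitch and ranks_clean are ranks evaluated on labeled glitch
        and clean training samples; prior_glitch is p(g), so p(c) = 1 - p(g).
        """
        kde_g = gaussian_kde(ranks_glitch)   # estimate of p(r|g)
        kde_c = gaussian_kde(ranks_clean)    # estimate of p(r|c)
        prior_clean = 1.0 - prior_glitch

        def p_g(rank):
            num = kde_g(rank) * prior_glitch
            return num / (num + kde_c(rank) * prior_clean)

        return p_g

Because every classifier defines its own mapping to a rank, this calibration must be repeated for each classifier before their :math:`p_g` estimates can be compared.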

.. _formalism-model-definitions:

Definitions of glitch and clean models
----------------------------------------------------------------------------------------------------

Our basic notion of what defines a glitch is the non-Gaussian nature of the noise artifact.
For this reason, we base our definitions of when a glitch is present on the signal-to-noise ratio (:math:`\rho`) in :math:`h(t)`, which is a measure of how likely the observed data could be generated by Gaussian noise alone.
Historically, the definition of a glitch has been based on a simple threshold on :math:`\rho`.
We continue this practice, but also allow users to downselect glitches based on other features, such as the frequency or duration (see *target_bounds* in :ref:`configuration`).
Users can then add further specificity to these models by dividing the *glitch* samples into different frequency bands.
While the formal expression for this will be similar to what we have shown for different glitch families, the definition of frequency bands will not depend on the actual noise background experienced (unlike the definition of glitch families).
Furthermore, this naturally mirrors several filtering procedures already used in searches (e.g.: low- and high-frequency burst searches, multi-rate filtering within ``gstlal``).
In the band-limited *glitch* framework, we will define separate *clean* models for each *glitch* frequency band so that we can compute the associated probabilities for *glitches* within that band alone.
This does mean that the *clean* model for a given bandwidth will overlap with the *glitch* models in other bandwidths.
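
The following sketch illustrates this band-limited bookkeeping with hypothetical frequency bands and thresholds; the band edges and the ``target_rho`` threshold are assumptions for illustration only.

.. code-block:: python

    # Hypothetical frequency bands (Hz), each defining its own glitch model.
    BANDS = [(10.0, 100.0), (100.0, 1000.0)]

    def band_labels(rho, freq, target_rho=8.0):
        """Label one h(t) sample against each band-limited glitch model."""
        labels = {}
        for lo, hi in BANDS:
            is_glitch = (lo <= freq < hi) and (rho >= target_rho)
            labels[(lo, hi)] = "glitch" if is_glitch else "clean"
        return labels

For example, a loud artifact at 50 Hz is labeled *glitch* for the 10-100 Hz model but *clean* for the 100-1000 Hz model, reproducing the overlap described above.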

Furthermore, because physical noise sources which generate glitches very near our selection thresholds may fall in either class within our training set based on random realizations of Gaussian noise, we also define a buffer around our thresholds (see `dirty_bounds` and `dirty_window` in :ref:`configuration`).
This, in effect, requires clean samples to be "far away" from glitchy samples within our training sets.
We do note that this means it is possible for samples to be neither clean nor glitchy (i.e.: within `dirty_bounds` but outside of `target_bounds`), thereby breaking our assumption :math:`p(g|h,a) + p(c|h,a) = 1\ \forall\ t`.
However, this is thought to be a small fraction of the total number of samples and to not significantly affect our results for typical configurations.
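
A minimal sketch of this three-way labeling, with hypothetical numerical thresholds standing in for the values set through *target_bounds* and *dirty_bounds*, might look like:

.. code-block:: python

    def label_sample(rho, target_rho=8.0, dirty_rho=5.0):
        """Assign a training label based on the h(t) signal-to-noise ratio.

        target_rho and dirty_rho are hypothetical stand-ins for thresholds
        configured via target_bounds and dirty_bounds.
        """
        if rho >= target_rho:
            return "glitch"  # confidently non-Gaussian
        elif rho < dirty_rho:
            return "clean"   # far away from the glitch threshold
        else:
            return "dirty"   # buffer zone: excluded from training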

Another possible way to avoid this hard-threshold ambiguity is to use a probabilistic estimate for whether an event is likely to be a glitch according to

.. math::
    p(g|h) = 1 - p(c|h) = 1 - N e^{-\rho^2/2}

or something similar.
With this in hand, we can train machine learning algorithms with weighted samples, or use regression techniques instead of classification.
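
As a sketch of how such probabilistic labels could enter training, the example below converts each sample's :math:`\rho` into a glitch probability and passes it to a classifier as a per-sample weight; the normalization constant ``N`` and the use of scikit-learn's ``sample_weight`` argument are assumptions made for illustration.

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    rho = np.abs(rng.normal(scale=4.0, size=1000))   # placeholder SNR values
    features = rng.random((1000, 10))                # placeholder aux features

    # Probability each sample is a glitch rather than a Gaussian-noise
    # fluctuation; N is a hypothetical normalization constant.
    N = 1.0
    p_glitch = 1.0 - N * np.exp(-rho**2 / 2.0)

    # Hard labels from the probabilistic estimate, with per-sample weights
    # expressing our confidence in each label.
    labels = (p_glitch > 0.5).astype(int)
    weights = np.where(labels == 1, p_glitch, 1.0 - p_glitch)

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(features, labels, sample_weight=weights)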