Abstractions¶
There are several key abstractions implemented within iDQ. These are common tasks that may be performed in several different ways (i.e., with different backends). This prompts some measure of standardization, and we declare several classes to address each task and canonize the associated iDQ API. In this way, users and code can easily interchange different backends without modifying the rest of their code.
Data Discovery¶
iDQ performs statistical inference based on a stream of auxiliary data.
However, it does not ingest the data directly from the detector.
Instead, the timeseries are first processed in some way, for example by an Event Trigger Generator (ETG).
Because different ETGs may have different advantages, users may want to read from multiple data sources.
We therefore introduce a class that abstracts the process of retrieving data from one of many possible sources.
This parent class declares the standard API that must be supported by all its children. Each extension of the parent class is specifically designed to read data from a different type of source. A full list of the currently supported sources can be determined by enumerating the subclasses in idq.io.triggers. This includes a synthetic data source, which is quite useful for testing purposes and is described in detail in Working with MockClassifierData objects.
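To make the pattern concrete, here is a minimal sketch of a parent loader that declares a common query API and a synthetic child backend. The class and method names below are hypothetical illustrations, not the actual idq.io.triggers signatures::

    # Illustrative sketch only: names are hypothetical and do not reproduce
    # the exact idq.io.triggers.DataLoader interface.
    import random

    class DataLoaderSketch:
        """Parent class declaring the common data-discovery API."""

        def __init__(self, start, end, channels):
            self.start = start
            self.end = end
            self.channels = channels

        def triggers(self):
            """Return a dict mapping channel name -> list of trigger dicts."""
            raise NotImplementedError("children must implement triggers()")

    class SyntheticDataLoader(DataLoaderSketch):
        """Child that fabricates triggers, analogous in spirit to a mock data source."""

        def triggers(self):
            return {
                channel: [
                    {"time": random.uniform(self.start, self.end),
                     "snr": 5.0 + random.expovariate(1.0)}
                    for _ in range(10)
                ]
                for channel in self.channels
            }

    # Downstream code relies only on the parent-class API, so backends can be
    # swapped without touching the rest of the pipeline.
    loader = SyntheticDataLoader(0.0, 64.0, ["H1:AUX-CHAN_1", "H1:AUX-CHAN_2"])
    data = loader.triggers()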
Vectorization and Sampling¶
Assuming data can be read in from some source via idq.io.triggers.DataLoader, we still need to represent it as vectorized input suitable for supervised machine learning.
This is often called feature extraction.
For each sample, we abstract the process of extracting specific features from the data streams into a single per-sample object: the idq.features.FeatureVector.
Because we will almost always deal with sets of samples, rather than individual samples, we also represent a group of samples as an idq.features.Dataset.
Vectorization can be thought of as abstracting the process of “building a big matrix” to feed into supervised learning algorithms.
Although developers may be interested in defining new features, they are likely not interested in implementing those features separately for each auxiliary channel.
idq.features.Dataset handles the actual production for each channel as part of its vectorization, so that developers only have to define the procedure once and it can automatically be applied to triggers in any channel.
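As a rough sketch of what this "big matrix" construction looks like, assuming hypothetical per-channel trigger fields and feature choices (this is not the actual idq.features.Dataset implementation)::

    # Illustrative sketch: turn per-channel triggers into one row per sample.
    # The channel names, feature choices, and defaults are hypothetical.
    import numpy as np

    CHANNELS = ["H1:AUX-CHAN_1", "H1:AUX-CHAN_2"]

    def vectorize(sample_times, triggers, window=1.0):
        """Return an (n_samples, n_channels * n_features) matrix."""
        rows = []
        for t0 in sample_times:
            row = []
            for channel in CHANNELS:
                # consider only triggers within +/- window of the sample time
                nearby = [trg for trg in triggers[channel] if abs(trg["time"] - t0) <= window]
                if nearby:
                    best = min(nearby, key=lambda trg: abs(trg["time"] - t0))
                    row += [best["snr"], best["time"] - t0]   # features: snr, delta-t
                else:
                    row += [0.0, 0.0]                         # defaults when nothing is found
            rows.append(row)
        return np.array(rows)

    matrix = vectorize(
        sample_times=[10.0, 20.0],
        triggers={"H1:AUX-CHAN_1": [{"time": 10.2, "snr": 8.0}],
                  "H1:AUX-CHAN_2": [{"time": 19.6, "snr": 12.5}]},
    )   # shape (2, 4): one row per sample, one block of features per channel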
Furthermore, vectorization often involves selecting features from one of multiple coincident triggers in each auxiliary channel and transforming the selected features in some way.
This is standardized through an idq.features.Selector, which contains the mechanisms to downselect sets of features via an idq.features.Downselect as well as to transform various features through an idq.features.ColumnTransformer.
Specific downselects and transformations should be implemented as extensions of these classes. We currently support two simple approaches (both sketched below):

- idq.features.DownselectLoudest, which selects the loudest auxiliary trigger within some window when extracting features.
- idq.features.DeltaTimeTransformer, which maps the absolute time to a relative (delta) time.
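A minimal sketch of these two operations, using hypothetical trigger fields rather than the actual idq.features class interfaces::

    # Illustrative sketch of the two supported operations; field names are hypothetical.
    def downselect_loudest(triggers, t0, window):
        """Keep only the loudest trigger within +/- window of the sample time t0."""
        nearby = [trg for trg in triggers if abs(trg["time"] - t0) <= window]
        return max(nearby, key=lambda trg: trg["snr"]) if nearby else None

    def delta_time_transform(trigger, t0):
        """Map the absolute trigger time to a time relative to the sample time."""
        transformed = dict(trigger)
        transformed["dt"] = transformed.pop("time") - t0
        return transformed

    trigger = downselect_loudest(
        [{"time": 10.2, "snr": 8.0}, {"time": 10.4, "snr": 12.5}], t0=10.0, window=1.0)
    if trigger is not None:
        features = delta_time_transform(trigger, t0=10.0)   # -> snr 12.5, dt ~0.4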
Learning¶
iDQ is a statistical inference pipeline; it therefore needs a way to perform statistical inference. This can be accomplished by many different algorithms, and we standardize their iDQ API by declaring idq.classifiers.SupervisedClassifier.
Note that idq.classifiers.IncrementalSupervisedClassifier is a subclass of idq.classifiers.SupervisedClassifier that supports a slightly different notion of training (see Classifiers for more details).
These classes allow users to develop or update statistical models using the same commands for different backends. They are the “workhorse objects” within iDQ and are necessary in almost every situation. Specifically, they support the iDQ API for the tasks within the pipeline, including Calibration.
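To illustrate the interchangeability this buys, here is a sketch of a toy backend honoring a shared train/evaluate interface; the names are hypothetical and do not reproduce the exact idq.classifiers API::

    # Illustrative sketch: a toy backend honoring a shared train/evaluate API.
    import numpy as np

    class ClassifierSketch:
        """Parent class declaring the shared supervised-learning API."""

        def train(self, features, labels):
            raise NotImplementedError

        def evaluate(self, features):
            """Return a rank in [0, 1] for each feature vector."""
            raise NotImplementedError

    class NaiveMeanClassifier(ClassifierSketch):
        """Toy backend: ranks samples by proximity to the mean glitch feature vector."""

        def train(self, features, labels):
            self.center = features[labels == 1].mean(axis=0)

        def evaluate(self, features):
            distance = np.linalg.norm(features - self.center, axis=1)
            return 1.0 / (1.0 + distance)   # closer to the glitch center -> rank nearer 1

    # Any backend honoring the same API can be dropped into the pipeline unchanged.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    y = (X[:, 0] > 0).astype(int)
    clf = NaiveMeanClassifier()
    clf.train(X, y)
    ranks = clf.evaluate(X)   # 100 values in (0, 1]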
LIST ALL SUBCLASSES HERE VIA AUTODOC
Models¶
Each classifier develops an internal model within its training process.
As we will need to track these models throughout the pipeline, we provide a standard object: idq.classifiers.ClassifierModel.
This object tracks which data was used to generate the model so the pipeline can manage data provenance uniformly and transparently.
Each classifier will likely extend this class, just as it extends idq.classifiers.SupervisedClassifier.
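As a rough sketch of what such provenance tracking can look like (the attribute names here are hypothetical, not the actual idq.classifiers.ClassifierModel attributes)::

    # Illustrative sketch: a model wrapper that records which data produced it.
    from dataclasses import dataclass, field

    @dataclass
    class ModelSketch:
        model: object                          # the trained, backend-specific object
        start: float                           # GPS start of the training data
        end: float                             # GPS end of the training data
        channels: list = field(default_factory=list)

        def provenance(self):
            """Summarize which data was used to generate this model."""
            return {"start": self.start, "end": self.end, "channels": list(self.channels)}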
LIST ALL SUBCLASSES HERE VIA AUTODOC
Calibration¶
Our classification scheme requires classifiers to report results as a single floating point number for each idq.features.FeatureVector (\(0 \leq r \leq 1\)), with \(r\) (called the rank) closer to 1 indicating a higher degree of belief that the sample corresponds to a glitch.
However, \(r\) is not intrinsically meaningful; ranks from different classifiers are not directly comparable, nor are they particularly useful outside of a strict ordinal ranking.
Instead, we are interested in the probability distribution of \(r\) conditioned on each class, from which we can calculate things like the detection efficiency and false alarm probability.
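Concretely, using the standard definitions, both quantities are exceedance probabilities of a threshold rank \(r^*\) conditioned on each class: the detection efficiency is \(P(r \geq r^* \mid \mathrm{glitch})\) and the false alarm probability is \(P(r \geq r^* \mid \mathrm{clean})\).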
For more details on the structure of this calculation, see our formalism for idq.calibration.
The process of modeling distributions and estimating integrals thereof from sets of samples is common to all classifiers, and is therefore handled in a unified way with idq.calibration.CalibrationMap.
This object also manages provenance and tracks which data was used to generate the calibration, similar to idq.classifiers.ClassifierModel.
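A toy sketch of the idea, estimating these probabilities empirically from held-out ranks; this is illustrative only and not the actual idq.calibration.CalibrationMap implementation::

    # Illustrative sketch: empirical calibration from held-out ranks.
    import numpy as np

    class CalibrationSketch:
        def __init__(self, glitch_ranks, clean_ranks):
            # record the samples used, so provenance can be reported later
            self.glitch_ranks = np.asarray(glitch_ranks)
            self.clean_ranks = np.asarray(clean_ranks)

        def efficiency(self, rank):
            """Fraction of known glitch samples ranked at or above `rank`."""
            return float(np.mean(self.glitch_ranks >= rank))

        def fap(self, rank):
            """Fraction of known clean samples ranked at or above `rank`."""
            return float(np.mean(self.clean_ranks >= rank))

    cal = CalibrationSketch(glitch_ranks=[0.9, 0.8, 0.7, 0.95],
                            clean_ranks=[0.1, 0.4, 0.2, 0.6])
    cal.efficiency(0.75), cal.fap(0.75)   # -> (0.75, 0.0)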
Reporting Results¶
iDQ’s jobs generate a variety of data products, whether they be an idq.classifiers.ClassifierModel or an idq.calibration.CalibrationMap.
We need to record these in a standard way so that asynchronous processes can reference them easily.
In particular, we want a simple way to look up the preferred product at any given time.
This is done with the reporter objects defined in idq.io.reporters.
What’s more, we need a standard iDQ API that can handle I/O for various different backends, such as writing to a filesystem (idq.io.reporters.DiskReporter).
Subclasses are defined for specific backends and even for specific data products.
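To illustrate the pattern, here is a minimal sketch of a parent reporter API and a filesystem-backed child, analogous in spirit to idq.io.reporters.DiskReporter; the names and signatures are hypothetical::

    # Illustrative sketch: a parent reporting API plus a filesystem backend.
    import json
    import os

    class ReporterSketch:
        """Parent class declaring the common reporting API."""

        def report(self, name, obj, start, end):
            raise NotImplementedError

        def retrieve(self, name, start, end):
            raise NotImplementedError

    class JSONDiskReporter(ReporterSketch):
        """Child that writes products to a filesystem, keyed by name and time span."""

        def __init__(self, rootdir):
            self.rootdir = rootdir
            os.makedirs(rootdir, exist_ok=True)

        def _path(self, name, start, end):
            return os.path.join(self.rootdir, "%s-%d-%d.json" % (name, start, end))

        def report(self, name, obj, start, end):
            with open(self._path(name, start, end), "w") as f:
                json.dump(obj, f)

        def retrieve(self, name, start, end):
            with open(self._path(name, start, end)) as f:
                return json.load(f)

    reporter = JSONDiskReporter("/tmp/idq-sketch")
    reporter.report("model", {"trained_on": [0, 64]}, 0, 64)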
Cadence and job management¶
iDQ manages parallel processes in a variety of circumstances in both the batch and streaming modes (see Batch vs Stream modes). To standardize this, we introduce a few objects that manage waiting and time-out logic in a uniform way.
Users can control the rate at which analyses are performed by configuring these objects through the corresponding INI sections (see Configuration).
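As a rough sketch of the waiting and time-out logic such an object encapsulates (with hypothetical names and parameters, not the actual iDQ implementation, and using wall-clock timestamps for simplicity)::

    # Illustrative sketch of stride-based cadence with a timeout.
    import time

    class CadenceSketch:
        def __init__(self, stride=60.0, timeout=600.0):
            self.stride = stride      # seconds between successive analysis strides
            self.timeout = timeout    # give up waiting for data after this long

        def wait(self, next_start):
            """Sleep until the current stride should be complete, or raise on timeout."""
            waited = 0.0
            while time.time() < next_start + self.stride:
                if waited > self.timeout:
                    raise TimeoutError("gave up waiting for stride starting at %f" % next_start)
                time.sleep(1.0)
                waited += 1.0
            return next_start + self.stride   # start of the following stride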