Configuration¶
The configuration file (in TOML format) is essential to running batch or streaming workflows within the iDQ framework. It defines which classifiers to use, where to find data and segment information, and how data is transferred between the various tasks, e.g. training and evaluation jobs.
An example configuration file is provided in etc/config.toml.
Analysis Settings¶
[general]¶
These are general parameters specifying high-level information, like where results are stored and which classifiers to run over. These will be used by the various tasks when running workflows.
Here is an example:
[general]
tag = "test"
instrument = "L1"
rootdir = "/path/to/analysis"
classifiers = ["ovl", "forest", "svm"]
The tag and the rootdir define the analysis name and where it will run. The instrument is used when generating timeseries containing various data products. Finally, the classifiers here specify which of the classifiers defined in the configuration we choose to run over. Here, we want to run an analysis over the ovl, forest and svm classifiers.
The following keyword arguments are required:
tag
instrument
rootdir
classifiers
[samples]¶
These parameters define everything needed to select glitch and clean samples, such as the channel used to identify glitches and the thresholds used to distinguish between the two classes.
Here is an example using Kleine-Welle features:
[samples]
target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
dirty_window = 0.1
[samples.target_bounds]
significance = ["35.0", "inf"]
frequency = [16, 2048]
[samples.dirty_bounds]
significance = ["25.0", "inf"]
frequency = [16, 2048]
The target_channel defines which channel to look at when determining whether a sample is a glitch or not. The target_bounds define min and max values for various features within the target channel, used to downselect targets. In this case, we only consider a sample to be a glitch if its significance is >= 35, with the frequency bounds spanning the full band (effectively no restriction on frequency).
The dirty_bounds and dirty_window together define how we select clean samples. First, all samples with significance above 25 are automatically excluded. In addition, a window of 0.1 seconds is created around each dirty sample and those times are excluded as well. Any segments that remain are considered fair game for clean times, and all clean samples are generated from these clean segments, sampled at a random_rate specified in the various jobs, e.g. training.
The following keyword arguments are required:
target_channel
target_bounds
dirty_bounds
dirty_window
In addition, the following optional keyword arguments can be passed in:
random_seed: set a seed to make results reproducible across runs
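For example, a fixed seed could be added to the [samples] section shown above; the seed value here is arbitrary:
[samples]
target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
dirty_window = 0.1
random_seed = 123456789  # arbitrary value; makes clean-sample selection reproducible across runs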
[features]¶
The [features] section specifies how we discover features, which columns we want to process, and what some specific columns represent (like "snr" for determining the significance of the trigger).
Example:
[features]
flavor = "kw"
rootdir = "/path/to/triggers/"
columns = ['time', 'significance', 'frequency']
time = "time"
significance = "significance"
In the example above, we configure the kw flavor, which searches for Kleine-Welle triggers. If trying to discover triggers in a non-standard location (say, for loading your own custom triggers), you'll need to supply the rootdir kwarg.
The following keyword arguments are required:
flavor: a flavor of DataLoader, used to ingest features
columns: determine which columns to process
time: determine which column to use for determining target/clean times
significance: determine which column to use to determine significance
In addition, the following optional keyword arguments can be passed in:
nproc: how many cores to use when reading in data
In addition to these generic keyword arguments, different feature backends may have additional required and optional keyword arguments.
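As a sketch, the optional nproc argument could be added to the [features] example above to read features in parallel; the value here is illustrative:
[features]
flavor = "kw"
rootdir = "/path/to/triggers/"
columns = ['time', 'significance', 'frequency']
time = "time"
significance = "significance"
nproc = 4  # illustrative; number of cores to use when reading in data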
[segments]¶
This section sets up queries to DQSegDB and defines what segments to analyze.
Example:
[segments]
segdb_url = "https://segments.ligo.org"
intersect = "H1:DMT-ANALYSIS_READY:1"
The following keyword arguments are required:
segdb_url: the URL of the segment database (DQSegDB) to query
intersect: select which segments we want to analyze
In addition, the following optional keyword arguments can be passed in:
exclude: select which segments we want to exclude
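For instance, an exclude flag could be set alongside intersect; the flag name below is a placeholder:
[segments]
segdb_url = "https://segments.ligo.org"
intersect = "H1:DMT-ANALYSIS_READY:1"
exclude = "H1:DMT-INJECTION:1"  # placeholder flag; segments matching this are excluded from the analysis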
[condor]¶
These parameters specify various configuration options when submitting jobs under the condor workflow. All of these keyword arguments are optional if you're using the block or fork workflows.
Example configuration:
[condor]
universe = "local"
accounting_group = "not.real"
accounting_group_user = "albert.einstein"
retry = 0
Workflow Settings¶
The four core parts of the analysis (training, evaluation, calibration, and timeseries generation) share many similarities in how they are configured. They consist of three distinct parts: a general section, a reporting section that configures how data products are saved, and an optional stream section which contains stream-specific settings such as timeouts and processing cadences.
In addition to the four core processes, there are also reporting and monitoring jobs which each only need a single section to specify configuration variables.
Here is an example of a training configuration:
[train]
workflow = "block"
random_rate = 0.01
ignore_segdb = false
[train.reporting]
flavor = "pickle"
[train.stream]
stride = 5
delay = 60
The general section, [train], specifies the parallelization scheme, workflow, in which to train. In addition, it specifies a random_rate at which to generate clean samples from the clean segments defined by the [samples] section. Finally, there's a special option here, ignore_segdb, which is used to optionally ignore the segments specified in the [segments] section. This can be useful, for example, when generating timeseries, where we may want to generate timeseries for all times rather than restrict them to science-mode data.
[train.reporting] is used to configure how models are persisted. In this case, we specify the flavor of the model reporter as pickle to serialize models using the pickle format.
Finally, there's a [train.stream] section which is required to run stream-based workflows. In this case, we process incoming features in 5 second strides and we allow incoming features to lag behind realtime by up to 60 seconds.
Training¶
The training configuration consists of general workflow configuration in [train], a [train.reporting] section to configure the model reporter, and, optionally, a [train.stream] section to specify stream-specific parameters.
[train]¶
[train]
workflow = "block"
random_rate = 0.01
The following keyword arguments are required:
workflow: one of block, fork, condor
random_rate: rate at which to sample clean features
In addition, the following optional keyword arguments can be passed in:
ignore_segdb: whether to ignore querying DQSegDB for segment information
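As a sketch, the optional flag would simply be set alongside the required arguments; the value here is illustrative:
[train]
workflow = "block"
random_rate = 0.01
ignore_segdb = true  # illustrative; skip DQSegDB queries and use all available times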
[train.reporting]¶
[train.reporting]
flavor = "pickle"
The following keyword arguments are required:
flavor: a flavor of Reporter, used to load/save models
In addition, whatever keyword arguments are required by the specific Reporter flavor.
[train.stream]¶
[train.stream]
stride = 5
delay = 60
No keyword arguments are required; however, it is strongly encouraged to set the following keyword arguments:
stride: the length of time to process at a given time (in seconds)
delay: the delay from real time to process data (in seconds)
Evaluation¶
The evaluation configuration takes essentially the same set of configuration options as the training configuration. The main difference is that the flavor of Reporter used in [evaluate.reporting] needs to be a type of quiver reporter.
[evaluate]¶
[evaluate]
workflow = "block"
log_level = 10
random_rate = 0.01
The following keyword arguments are required:
workflow: one of block, fork, condor
log_level: specifies the verbosity of log messages
random_rate: rate at which to sample clean features
[evaluate.reporting]¶
[evaluate.reporting]
flavor = "dataset"
The following keyword arguments are required:
flavor: a flavor of Reporter, used to load/save quivers
In addition, whatever keyword arguments are required by the specific Reporter flavor.
[evaluate.stream]¶
[evaluate.stream]
stride = 5
delay = 60
No keyword arguments are required; however, it is strongly encouraged to set the following keyword arguments:
stride: the length of time to process at a given time (in seconds)
delay: the delay from real time to process data (in seconds)
Calibration¶
The calibration jobs don't need a mechanism to discover triggers; instead, the datasets generated by the evaluation jobs are used to calibrate the models produced by the training jobs.
[calibrate]¶
[calibrate]
workflow = "block"
log_level = 10
The following keyword arguments are required:
workflow: one of block, fork, condor
log_level: specifies the verbosity of log messages
[calibrate.reporting]¶
[calibrate.reporting]
flavor = "calib"
The following keyword arguments are required:
flavor: a flavor of Reporter, used to load/save calibration maps
In addition, whatever keyword arguments are required by the specific Reporter flavor.
Note
The [calibrate.stream] section is not present because in the online workflow the calibration and evaluation jobs are tightly coupled. Therefore, the calibration jobs use the configuration from [evaluate.stream].
Timeseries¶
The timeseries jobs are similar in configuration to the training and evaluation jobs. One of the main differences is that instead of passing in a random_rate to control the rate of generating clean samples, we pass in an srate which determines the sampling rate of the produced timeseries.
[timeseries]¶
[timeseries]
workflow = "block"
log_level = 10
srate = 128
The following keyword arguments are required:
workflow: one of block, fork, condor
log_level: specifies the verbosity of log messages
srate: the sample rate at which timeseries are produced
[timeseries.reporting]¶
[timeseries.reporting]
flavor = "series:gwf"
The following keyword arguments are required:
flavor: a flavor of Reporter, used to load/save timeseries
In addition, whatever keyword arguments are required by the specific Reporter flavor.
The following keyword arguments are optional:
shmdir: a second directory in which to save timeseries, storing only the last N seconds of timeseries data, possibly to be picked up by another process.
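A minimal sketch with the optional shmdir; the path below is a placeholder:
[timeseries.reporting]
flavor = "series:gwf"
shmdir = "/dev/shm/idq"  # placeholder path; holds the most recent timeseries for other processes to pick up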
[timeseries.stream]¶
[timeseries.stream]
stride = 5
delay = 60
No keyword arguments are required; however, it is strongly encouraged to set the following keyword arguments:
stride: the length of time to process at a given time (in seconds)
delay: the delay from real time to process data (in seconds)
Report¶
The reporting jobs generate summary pages and plots to give an overview of batch and stream jobs that were run.
Example:
[report]
legend = true
overview_only = true
All keyword arguments are optional:
ignore_segdb: whether to skip DQSegDB queries for determining segments to process
legend: whether to display a legend in plots
annotate_gch: whether to annotate glitch samples
single_calib_plots: whether to display individual calibration plots for each bin. If not set, only an overview across all bins is displayed.
overview_only: whether to only show the overview page rather than also show pages for each individual classifier
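Putting these together, a more complete [report] section might look like the following; all values are illustrative:
[report]
ignore_segdb = false
legend = true
annotate_gch = true
single_calib_plots = false
overview_only = true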
Monitor¶
WRITEME
Classifier Settings¶
Classifiers are special in their configuration in that each classifier has its own section with a custom nickname. These classifiers are then selected by name in the [general] section for a given analysis. You can have more classifier sections than classifiers you run over for an analysis.
Here’s an example of how this works:
[general]
classifiers = ["forest"]
...
[[classifier]]
name = "ovl"
flavor = "ovl"
incremental = 100
num_recalculate = 10
metric = "eff_fap"
[classifier.minima]
eff_fap = 3
poisson_signif = 5
use_percentage = 1e-3
[[classifier]]
name = "forest"
flavor = "sklearn:random_forest"
verbose = true
# feature vector options
default = -100
window = 0.1
whitener = "standard"
# parallelization options
num_cv_proc = 1
# hyperparameters
[classifier.params]
classifier__criterion = "gini"
classifier__n_estimators = 200
classifier__class_weight = "balanced"
Here, we configure two different classifiers, ovl and forest, with various configuration settings. In the [general] section, only one classifier is called, forest, which indicates that when we run analyses, we will only train, evaluate, etc. with the forest classifier.
Default Settings¶
Instead of specifying some options in multiple sections, they can be placed in the [defaults] section and inherited, so that they don't need to be repeated multiple times.
The following can be passed in here:
workflow: one of block, fork, condor
ignore_segdb: whether to skip DQSegDB queries for determining segments to process
log_level: the verbosity of logs
safe_channels_path: path to a file listing the auxiliary channels from which features are generated for training, evaluation, etc. This is distinct from the target channel in that only labels are generated from the target channel, while these channels are used to generate features. The list should contain only safe channels.
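As a sketch, a [defaults] section using these options might look like the following; the path is a placeholder:
[defaults]
workflow = "block"
ignore_segdb = false
log_level = 10
safe_channels_path = "/path/to/safe_channels.txt"  # placeholder path to a list of safe auxiliary channels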