Configuration

The configuration file (in TOML format) is essential to running batch or streaming workflows within the iDQ framework. It defines which classifiers to use, where to find data and segment information, and how data is transferred between the various tasks, e.g. training and evaluation jobs.

An example configuration file is provided in etc/config.toml.

Analysis Settings

[general]

These are general parameters specifying high-level information, such as where results are stored and which classifiers to run. They are used by the various tasks when running workflows.

Here is such an example:

[general]
tag = "test"
instrument = "L1"
rootdir = "/path/to/analysis"

classifiers = ["ovl", "forest", "svm"]

The tag and the rootdir define the analysis name and where it will run. The instrument is used when generating timeseries containing various data products. Finally, classifiers selects which of the classifiers defined in the configuration to run. Here, we want to run an analysis over the ovl, forest, and svm classifiers.

The following keyword arguments are required:

  • tag

  • instrument

  • rootdir

  • classifiers

[samples]

These parameters define everything needed to select glitch and clean samples, such as the channel used to identify glitches and the thresholds that distinguish the two classes.

Here is such an example looking at Kleine-Welle features:

[samples]
target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
dirty_window = 0.1

[samples.target_bounds]
significance = ["35.0", "inf"]
frequency = [16, 2048]

[samples.dirty_bounds]
significance = ["25.0", "inf"]
frequency = [16, 2048]

The target_channel defines which channel to look at when determining whether a sample is a glitch or not. The target_bounds define [min, max] values for various features in the target channel, used to downselect targets. In this case, we only consider a sample to be a glitch if its significance is >= 35, with no restriction on frequency.

The dirty_bounds and dirty_window together define how we select clean samples. First, all samples with significance >= 25 are automatically excluded. In addition, a window of 0.1 seconds is created around each dirty sample and those times are excluded as well. Any segments that remain are considered fair game for clean times, and all clean samples are generated from these clean segments, sampled at the random_rate specified in the various jobs, e.g. training.

The following keyword arguments are required:

  • target_channel

  • target_bounds

  • dirty_bounds

  • dirty_window

In addition, the following optional keyword arguments can be passed in:

  • random_seed: set a seed to make results reproducible across runs
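
For example, a fixed seed can be added to the [samples] section above to make sample selection reproducible across runs (the seed value here is purely illustrative):

[samples]
target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
dirty_window = 0.1
random_seed = 123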

[features]

The [features] section specifies how we discover features, which columns we want to process, and what certain columns represent (like “snr” for determining the significance of a trigger).

Example:

[features]
flavor = "kw"
rootdir = "/path/to/triggers/"

columns = ['time', 'significance', 'frequency']
time = "time"
significance = "significance"

In the example above, we configure the kw flavor which searches for Kleine-Welle triggers. If trying to discover triggers in a non-standard location (say for loading your own custom triggers), you’ll need to supply the rootdir kwarg.

The following keyword arguments are required:

  • flavor: a flavor of DataLoader, used to ingest features

  • columns: which columns to process

  • time: which column to use for determining target/clean times

  • significance: which column to use for determining the significance of a trigger

In addition, the following optional keyword arguments can be passed in:

  • nproc: how many cores to use when reading in data

In addition to these generic keyword arguments, different feature backends may have additional required and optional keyword arguments.
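
For example, the earlier [features] section could be extended to read in triggers with multiple cores (the core count here is arbitrary):

[features]
flavor = "kw"
rootdir = "/path/to/triggers/"
nproc = 4

columns = ['time', 'significance', 'frequency']
time = "time"
significance = "significance"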

[segments]

This section sets up queries to DQSegDB and defines what segments to analyze.

Example:

[segments]
segdb_url = "https://segments.ligo.org"

intersect = "H1:DMT-ANALYSIS_READY:1"

The following keyword arguments are required:

  • segdb_url

  • intersect: select which segments we want to analyze

In addition, the following optional keyword arguments can be passed in:

  • exclude: select which segments we want to exclude
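
For example, an exclude flag can be supplied alongside intersect (the veto flag name below is hypothetical and should be replaced with a real DQSegDB flag):

[segments]
segdb_url = "https://segments.ligo.org"

intersect = "H1:DMT-ANALYSIS_READY:1"
exclude = "H1:DMT-EXAMPLE_VETO:1"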

[condor]

These parameters specify various configuration options used when submitting jobs under the condor workflow. All of these keyword arguments can be omitted if you’re using the block or fork workflows.

Example configuration:

[condor]
universe = "local"

accounting_group = "not.real"
accounting_group_user = "albert.einstein"

retry = 0

Workflow Settings

The four core parts of the analysis (training, evaluation, calibration, and timeseries generation) share many similarities in how they are configured. Each consists of three distinct parts: a general section, a reporting section that configures how data products are saved, and an optional stream section which contains stream-specific settings such as timeouts and processing cadences.

In addition to the four core processes, there are also reporting and monitoring jobs, each of which only needs a single section to specify configuration variables.

Here is an example of a training configuration:

[train]
workflow = "block"

random_rate = 0.01

ignore_segdb = false

[train.reporting]
flavor = "pickle"

[train.stream]
stride = 5
delay = 60

The general section, [train], specifies the parallelization scheme, workflow, with which to train. In addition, it specifies the random_rate at which clean samples are generated from the clean segments defined by the [samples] section. Finally, there’s a special option here, ignore_segdb, which can be used to ignore the segments specified in the [segments] section. This can be useful, for example, when generating timeseries for all times rather than restricting them to science-mode data.

[train.reporting] is used to configure how models are persisted. In this case, we specify the flavor of the model reporter as pickle to serialize models using the pickle format.

Finally, there’s a [train.stream] section which is required to run stream-based workflows. In this case, we process incoming features in 5-second strides and allow incoming features to lag behind real time by up to 60 seconds.

Training

The training configuration consists of general workflow configuration in [train], a [train.reporting] section to configure the model reporter, and optionally, a [train.stream] section to specify stream-specific parameters.

[train]

[train]
workflow = "block"

random_rate = 0.01

The following keyword arguments are required:

  • workflow: one of block, fork, condor

  • random_rate: rate at which to sample clean features

In addition, the following optional keyword arguments can be passed in:

  • ignore_segdb: whether to ignore querying DQSegDB for segment information

[train.reporting]

[train.reporting]
flavor = "pickle"

The following keyword arguments are required:

  • flavor: a flavor of Reporter, used to load/save models

In addition, whatever keyword arguments are required by the specific Reporter flavor.

[train.stream]

[train.stream]
stride = 5
delay = 60

No keyword arguments are required; however, it is strongly encouraged to set the following keyword arguments:

  • stride: the length of time to process at a given time (in seconds)

  • delay: the delay from real time at which to process data (in seconds)

Evaluation

Like the training configuration, the evaluation configuration takes essentially the same set of configuration options. The main difference is that the flavor of Reporter used in [evaluate.reporting] needs to be a type of quiver reporter.

[evaluate]

[evaluate]
workflow = "block"
log_level = 10

random_rate = 0.01

The following keyword arguments are required:

  • workflow: one of block, fork, condor

  • log_level: specifies the verbosity of log messages

  • random_rate: rate at which to sample clean features

[evaluate.reporting]

[evaluate.reporting]
flavor = "dataset"

The following keyword arguments are required:

  • flavor: a flavor of Reporter, used to load/save quivers

In addition, whatever keyword arguments are required by the specific Reporter flavor.

[evaluate.stream]

[evaluate.stream]
stride = 5
delay = 60

No keyword arguments are required; however, it is strongly encouraged to set the following keyword arguments:

  • stride: the length of time to process at a given time (in seconds)

  • delay: the delay from real time at which to process data (in seconds)

Calibration

The calibration jobs don’t need a mechanism to discover triggers; instead, the datasets generated by the evaluation jobs are used to calibrate the models produced by the training jobs.

[calibrate]

[calibrate]
workflow = "block"
log_level = 10

The following keyword arguments are required:

  • workflow: one of block, fork, condor

  • log_level: specifies the verbosity of log messages

[calibrate.reporting]

[calibrate.reporting]
flavor = "calib"

The following keyword arguments are required:

  • flavor: a flavor of Reporter, used to load/save calibration maps

In addition, whatever keyword arguments are required by the specific Reporter flavor.

Note

The [calibrate.stream] section is not present because in the online workflow, the calibration and evaluation jobs are tightly coupled. Therefore, the calibration jobs use the configuration from [evaluate.stream].

Timeseries

The timeseries jobs are similar in configuration to the training and evaluation jobs. One of the main differences is that instead of passing in a random_rate to control the rate at which clean samples are generated, we pass in an srate which determines the sample rate of the produced timeseries.

[timeseries]

[timeseries]
workflow = "block"
log_level = 10

srate = 128

The following keyword arguments are required:

  • workflow: one of block, fork, condor

  • log_level: specifies the verbosity of log messages

  • srate: the sample rate at which timeseries are produced

[timeseries.reporting]

[timeseries.reporting]
flavor = "series:gwf"

The following keyword arguments are required:

  • flavor: a flavor of Reporter, used to load/save timeseries

In addition, whatever keyword arguments are required by the specific Reporter flavor.

The following keyword arguments are optional:

  • shmdir: a second directory in which to save timeseries, storing the last N seconds of timeseries data, possibly to be picked up by another process.
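
For example, the reporting section could be extended with a shared-memory directory (the path below is hypothetical):

[timeseries.reporting]
flavor = "series:gwf"
shmdir = "/dev/shm/idq"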

[timeseries.stream]

[timeseries.stream]
stride = 5
delay = 60

No keyword arguments are required; however, it is strongly encouraged to set the following keyword arguments:

  • stride: the length of time to process at a given time (in seconds)

  • delay: the delay from real time at which to process data (in seconds)

Report

The reporting jobs generate summary pages and plots to give an overview of batch and stream jobs that were run.

Example:

[report]

legend = true
overview_only = true

All keyword arguments are optional:

  • ignore_segdb: whether to skip DQSegDB queries for determining segments to process

  • legend: whether to display a legend in plots

  • annotate_gch: whether to annotate glitch samples

  • single_calib_plots: whether to display individual calibration plots for each bin. If not set, only an overview across all bins is displayed.

  • overview_only: whether to only show the overview page rather than also show pages for each individual classifier
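
Putting these together, a fuller [report] section might look like the following (all values are illustrative):

[report]
ignore_segdb = false
legend = true
annotate_gch = true
single_calib_plots = false
overview_only = false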

Monitor

WRITEME

Classifier Settings

Classifiers are special in their configuration in that each classifier has its own section with a custom nickname. Classifiers are then selected by name in the [general] section for a given analysis. You can define more classifier sections than you actually run in a given analysis.

Here’s an example of how this works:

[general]
classifiers = ["forest"]

...

[[classifier]]
name = "ovl"
flavor = "ovl"

incremental = 100
num_recalculate = 10
metric = "eff_fap"

[classifier.minima]
eff_fap = 3
poisson_signif = 5
use_percentage = 1e-3

[[classifier]]
name = "forest"
flavor = "sklearn:random_forest"

verbose = true

# feature vector options
default = -100
window = 0.1
whitener = "standard"

# parallelization options
num_cv_proc = 1

# hyperparameters
[classifier.params]
classifier__criterion = "gini"
classifier__n_estimators = 200
classifier__class_weight = "balanced"

Here, we configure two different classifiers, ovl and forest, with various configuration settings. In the [general] section, only one classifier is selected, forest, which means that when we run analyses, we will only train, evaluate, etc. on the forest classifier.

Default Settings

Options that would otherwise be repeated across multiple sections can instead be placed in the [defaults] section, where they are inherited by the individual job sections so they don’t need to be specified multiple times.

The following can be passed in here:

  • workflow: one of block, fork, condor

  • ignore_segdb: whether to skip DQSegDB queries for determining segments to process

  • log_level: the verbosity of logs

  • safe_channels_path: a path to the list of auxiliary channels used to generate features for training, evaluation, etc. This is distinct from the target channel in that labels are generated from the target channel, while these channels are used to generate features. The list should only contain safe channels.
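
For example, a [defaults] section might look like the following (the path and values are illustrative):

[defaults]
workflow = "block"
ignore_segdb = false
log_level = 10
safe_channels_path = "/path/to/safe_channels.txt"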