.. _configuration:

Configuration
####################################################################################################

The configuration file (in TOML format) is essential to running batch or
streaming workflows within the iDQ framework. It defines which classifiers to
use and where to find data and segment information, and it specifies how data
is transferred between the various tasks, e.g. training and evaluation jobs.

An example configuration file is provided in ``etc/config.toml``.

.. _configuration-analysis:

Analysis Settings
====================================================================================================

.. _configuration-general:

[general]
----------------------------------------------------------------------------------------------------

These general parameters specify high-level information such as where results
are stored and which classifiers to run. They are used by the various tasks
when running any of the workflows.

Here is an example:

.. code:: bash

    [general]
    tag = "test"
    instrument = "L1"
    rootdir = "/path/to/analysis"

    classifiers = ["ovl", "forest", "svm"]

The ``tag`` and the ``rootdir`` define the analysis name and where it will run.
The ``instrument`` is used for generating timeseries containing various data
products. Finally, ``classifiers`` selects which of the classifiers defined
elsewhere in the configuration to run. Here, we want to run an analysis using
the ``ovl``, ``forest`` and ``svm`` classifiers.

The following keyword arguments are required:

* **tag**
* **instrument**
* **rootdir**
* **classifiers**

.. _configuration-samples:

[samples]
----------------------------------------------------------------------------------------------------

These parameters define everything needed to construct glitch and clean
samples, such as the channel used to identify glitches and the thresholds used
to distinguish between the two classes.

Here is an example based on Kleine-Welle features:

.. code:: bash

    [samples]
    target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
    dirty_window = 0.1

    [samples.target_bounds]
    significance = ["35.0", "inf"]
    frequency = [16, 2048]

    [samples.dirty_bounds]
    significance = ["25.0", "inf"]
    frequency = [16, 2048]


The ``target_channel`` defines which channel is examined when determining
whether a sample is a glitch or not. The ``target_bounds`` define the minimum
and maximum values of various features within the target channel, used to
downselect targets. In this case, we only consider a sample to be a glitch if
its significance is >= 35; the frequency bounds span the full band and so
impose no effective restriction.

The ``dirty_bounds`` and ``dirty_window`` together define how we select clean
samples. First, all samples with significance above 25 are automatically
excluded. In addition, a window of 0.1 seconds is created around each dirty
sample and those times are excluded as well. Any segments that remain are
considered fair game for clean times, and all clean samples are drawn from
these clean segments at the ``random_rate`` specified by the various jobs,
e.g. training.

The following keyword arguments are required:

* **target_channel**
* **target_bounds**
* **dirty_bounds**
* **dirty_window**

In addition, the following optional keyword arguments can be passed in:

* **random_seed:** set a seed to make results reproducible across runs
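
For example, to make the clean-sample selection reproducible across runs, the
seed can be set directly in the ``[samples]`` section (the value below is
purely illustrative):

.. code:: bash

    [samples]
    target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
    dirty_window = 0.1

    # seed the random number generator for reproducibility
    random_seed = 123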

.. _configuration-features:

[features]
----------------------------------------------------------------------------------------------------

The ``[features]`` section specifies how features are discovered, which columns
to process, and what certain columns represent (like "snr" for determining the
significance of a trigger).

Example:

.. code:: bash

    [features]
    flavor = "kw"
    rootdir = "/path/to/triggers/"

    columns = ['time', 'significance', 'frequency']
    time = "time"
    significance = "significance"

In the example above, we configure the ``kw`` flavor, which searches for
Kleine-Welle triggers. If triggers need to be discovered in a non-standard
location (say, when loading your own custom triggers), you'll need to supply
the ``rootdir`` keyword argument.

The following keyword arguments are required:

* **flavor:** a flavor of ``DataLoader``, used to ingest features
* **columns:** which columns to process
* **time:** which column to use for determining target/clean times
* **significance:** which column to use for determining significance

In addition, the following optional keyword arguments can be passed in:

* **nproc:** how many cores to use when reading in data

Beyond these generic keyword arguments, individual feature backends may have
additional required and optional keyword arguments of their own.

.. _configuration-segments:

[segments]
----------------------------------------------------------------------------------------------------

This section sets up queries to DQSegDB and defines what segments to analyze.

Example:

.. code:: bash

    [segments]
    segdb_url = "https://segments.ligo.org"

    intersect = "H1:DMT-ANALYSIS_READY:1"

The following keyword arguments are required:

* **segdb_url**
* **intersect:** select which segments we want to analyze

In addition, the following optional keyword arguments can be passed in:

* **exclude:** select which segments we want to exclude
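
As a sketch, a query that also removes unwanted times might look like the
following (the ``exclude`` flag name below is hypothetical):

.. code:: bash

    [segments]
    segdb_url = "https://segments.ligo.org"

    intersect = "H1:DMT-ANALYSIS_READY:1"

    # hypothetical flag; substitute whichever segments should be removed
    exclude = "H1:DMT-INJECTION:1"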

.. _configuration-condor:

[condor]
----------------------------------------------------------------------------------------------------

These parameters specify configuration options used when submitting jobs under
the ``condor`` workflow. All of these keyword arguments are optional and can be
omitted entirely if you're using the ``block`` or ``fork`` workflows.

Example configuration:

.. code:: bash

    [condor]
    universe = "local"

    accounting_group = "not.real"
    accounting_group_user = "albert.einstein"

    retry = 0

.. _configuration-workflow:

Workflow Settings
====================================================================================================

The four core parts of the analysis (training, evaluation, calibration, and
timeseries generation) are configured in much the same way. Each consists of
three distinct parts: a general section, a reporting section that configures
how data products are saved, and an optional stream section containing
stream-specific settings such as timeouts and processing cadences.

In addition to the four core processes, there are also reporting and monitoring
jobs, each of which needs only a single section to specify its configuration.

Here is an example of a training configuration:

.. code:: bash

    [train]
    workflow = "block"

    random_rate = 0.01

    ignore_segdb = false

    [train.reporting]
    flavor = "pickle"

    [train.stream]
    stride = 5
    delay = 60

The general section, ``[train]``, specifies the parallelization scheme,
``workflow``, with which to train. In addition, it specifies the
``random_rate`` at which to generate clean samples from the clean segments
defined by the ``[samples]`` section. Finally, there's a special option here,
``ignore_segdb``, which can be used to ignore the segments specified in the
``[segments]`` section. This can be useful, for example, when generating
timeseries: we may want to produce timeseries for all times rather than
restrict them to science-mode data only.

``[train.reporting]`` is used to configure how models are persisted. In this
case, we specify the ``flavor`` of the model reporter as ``pickle`` to
serialize models using the pickle format.

Finally, there's a ``[train.stream]`` section which is required to run
stream-based workflows. In this case, we process incoming features in 5 second
strides and we allow incoming features to lag behind realtime up to 60 seconds.


.. _configuration-train:

Training
----------------------------------------------------------------------------------------------------

The training configuration consists of general workflow configuration in
``[train]``, a ``[train.reporting]`` section to configure the model reporter,
and optionally, a ``[train.stream]`` section to specify stream-specific
parameters.

[train]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [train]
    workflow = "block"

    random_rate = 0.01

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **random_rate:** rate at which to sample clean features

In addition, the following optional keyword arguments can be passed in:

* **ignore_segdb:** whether to ignore querying DQSegDB for segment information


[train.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [train.reporting]
    flavor = "pickle"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save models

In addition, pass whatever keyword arguments are required by the specific ``Reporter`` flavor.

[train.stream]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [train.stream]
    stride = 5
    delay = 60

No keyword arguments are required; however, it is strongly encouraged to set
the following keyword arguments:

* **stride:** the length of time to process at a given time (in seconds)
* **delay:** the delay from real time to process data (in seconds)

.. _configuration-evaluate:

Evaluation
----------------------------------------------------------------------------------------------------

The evaluation configuration takes essentially the same set of options as the
training configuration. The main difference is that the ``flavor`` of
``Reporter`` used in ``[evaluate.reporting]`` needs to be a type of quiver
reporter.

[evaluate]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [evaluate]
    workflow = "block"
    log_level = 10

    random_rate = 0.01

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **log_level:** specifies the verbosity of log messages
* **random_rate:** rate at which to sample clean features

[evaluate.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [evaluate.reporting]
    flavor = "dataset"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save quivers

In addition, pass whatever keyword arguments are required by the specific ``Reporter`` flavor.

[evaluate.stream]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [evaluate.stream]
    stride = 5
    delay = 60


No keyword arguments are required; however, it is strongly encouraged to set
the following keyword arguments:

* **stride:** the length of time to process at a given time (in seconds)
* **delay:** the delay from real time to process data (in seconds)

.. _configuration-calibrate:

Calibration
----------------------------------------------------------------------------------------------------

The calibration jobs don't need a mechanism for discovering triggers; instead,
the datasets generated by the evaluation jobs are used to calibrate the models
produced by the training jobs.

[calibrate]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [calibrate]
    workflow = "block"
    log_level = 10

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **log_level:** specifies the verbosity of log messages

[calibrate.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [calibrate.reporting]
    flavor = "calib"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save calibration maps

In addition, pass whatever keyword arguments are required by the specific ``Reporter`` flavor.

.. note::

    The ``[calibrate.stream]`` section is not present because in the online workflow, the
    calibration and evaluation jobs are tightly coupled. Therefore, the calibration jobs
    use the configuration from ``[evaluate.stream]``.

.. _configuration-timeseries:

Timeseries
----------------------------------------------------------------------------------------------------

The timeseries jobs are similar in configuration to the training and evaluation
jobs. The main difference is that instead of passing in a ``random_rate`` to
control the rate at which clean samples are generated, we pass in an ``srate``
which determines the sampling rate of the produced timeseries.

[timeseries]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [timeseries]
    workflow = "block"
    log_level = 10

    srate = 128

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **log_level:** specifies the verbosity of log messages
* **srate:** the sample rate at which timeseries are produced

[timeseries.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [timeseries.reporting]
    flavor = "series:gwf"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save timeseries

In addition, pass whatever keyword arguments are required by the specific ``Reporter`` flavor.

The following keyword arguments are optional:

* **shmdir:** a second directory in which to save timeseries, storing the last
  N seconds of timeseries data, possibly to be picked up by another process.
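
For example, a reporter that also mirrors its most recent output into a
low-latency directory might look like the following (the ``shmdir`` path is a
placeholder):

.. code:: bash

    [timeseries.reporting]
    flavor = "series:gwf"

    # placeholder path; the most recent timeseries are duplicated here
    shmdir = "/dev/shm/idq"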

[timeseries.stream]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [timeseries.stream]
    stride = 5
    delay = 60


No keyword arguments are required; however, it is strongly encouraged to set
the following keyword arguments:

* **stride:** the length of time to process at a given time (in seconds)
* **delay:** the delay from real time to process data (in seconds)

.. _configuration-report:

Report
----------------------------------------------------------------------------------------------------

The reporting jobs generate summary pages and plots to give an overview of
batch and stream jobs that were run.

Example:

.. code:: bash

    [report]

    legend = true
    overview_only = true

All keyword arguments are optional:

* **ignore_segdb:** whether to skip DQSegDB queries for determining segments to process
* **legend:** whether to display a legend in plots
* **annotate_gch:** whether to annotate glitch samples
* **single_calib_plots:** whether to display individual calibration plots for each bin. If not set, only an overview across all bins is displayed.
* **overview_only:** whether to only show the overview page rather than also show pages for each individual classifier
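
Putting these together, a fuller example might look like the following (all
values are illustrative):

.. code:: bash

    [report]
    ignore_segdb = false
    legend = true
    annotate_gch = true
    single_calib_plots = false
    overview_only = false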

.. _configuration-monitor:

Monitor
----------------------------------------------------------------------------------------------------

WRITEME

.. _configuration-classifier:

Classifier Settings
====================================================================================================

Classifiers are special in their configuration in that each classifier has its
own section with a custom nickname. Classifiers are then selected by name in
the ``[general]`` section for a given analysis. You can define more classifier
sections than the number of classifiers you actually run in a given analysis.

Here's an example of how this works:

.. code:: bash

    [general]
    classifiers = ["forest"]

    ...

    [[classifier]]
    name = "ovl"
    flavor = "ovl"

    incremental = 100
    num_recalculate = 10
    metric = "eff_fap"

    [classifier.minima]
    eff_fap = 3
    poisson_signif = 5
    use_percentage = 1e-3

    [[classifier]]
    name = "forest"
    flavor = "sklearn:random_forest"

    verbose = true

    # feature vector options
    default = -100
    window = 0.1
    whitener = "standard"

    # parallelization options
    num_cv_proc = 1

    # hyperparameters
    [classifier.params]
    classifier__criterion = "gini"
    classifier__n_estimators = 200
    classifier__class_weight = "balanced"

Here, we configure two different classifiers, ``ovl`` and ``forest``, with
various configuration settings. In the ``[general]`` section, only one
classifier is selected, ``forest``, which means that when we run analyses, we
will only train, evaluate, etc. with the ``forest`` classifier.

.. _configuration-default:

Default Settings
====================================================================================================

Options that would otherwise be repeated across multiple sections can instead
be placed in the ``[defaults]`` section, from which the individual sections
inherit them.

The following can be passed in here:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **ignore_segdb:** whether to skip DQSegDB queries for determining segments to process
* **log_level:** the verbosity of logs
* **safe_channels_path:** the path to a list of auxiliary channels from which
  features are generated for training, evaluation, etc. This is distinct from
  the target channel in that labels are generated from the target channel,
  while features are generated from these auxiliary channels. The list should
  contain only safe channels.
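
For example, a minimal ``[defaults]`` section shared by the various jobs might
look like the following (the channel-list path is a placeholder):

.. code:: bash

    [defaults]
    workflow = "block"
    ignore_segdb = false
    log_level = 10

    # placeholder path to a file listing safe auxiliary channels
    safe_channels_path = "/path/to/safe_channels.txt"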