.. _configuration:

Configuration
####################################################################################################

The configuration file (in TOML format) is essential to running batch or streaming workflows within
the iDQ framework. It defines which classifiers to use, where to find data and segment information,
and how data is transferred between the various tasks, i.e. training and evaluation jobs. An example
configuration file is provided in ``etc/config.toml``.

.. _configuration-analysis:

Analysis Settings
====================================================================================================

.. _configuration-general:

[general]
----------------------------------------------------------------------------------------------------

These are general parameters specifying high-level information, such as where results are stored and
which classifiers to run. They are used by the various tasks when running the various workflows.
Here is an example:

.. code:: bash

    [general]
    tag = "test"
    instrument = "L1"
    rootdir = "/path/to/analysis"
    classifiers = ["ovl", "forest", "svm"]

The ``tag`` and the ``rootdir`` define the analysis name and where it will run. The ``instrument``
is used for generating timeseries containing various data products. Finally, ``classifiers``
specifies which of the classifiers defined in the configuration we choose to run. Here, we want to
run an analysis over the ``ovl``, ``forest`` and ``svm`` classifiers.

The following keyword arguments are required:

* **tag**
* **instrument**
* **rootdir**
* **classifiers**

.. _configuration-samples:

[samples]
----------------------------------------------------------------------------------------------------

These parameters define everything needed to select glitch and clean samples, such as the channel
used to identify glitches and the various thresholds that distinguish between the two classes.
Here is an example using Kleine-Welle features:

.. code:: bash

    [samples]
    target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
    dirty_window = 0.1

    [samples.target_bounds]
    significance = ["35.0", "inf"]
    frequency = [16, 2048]

    [samples.dirty_bounds]
    significance = ["25.0", "inf"]
    frequency = [16, 2048]

The ``target_channel`` defines which channel to look at when determining whether a sample is a
glitch or not. The ``target_bounds`` define min/max values for various features within the target
channel, used to downselect targets. In this case, we only consider a sample to be a glitch if its
significance is >= 35 and its frequency lies between 16 and 2048 Hz.

The ``dirty_bounds`` and ``dirty_window`` together define how we select clean samples. First, all
samples with significance above 25 are automatically excluded. In addition, a window of 0.1 seconds
is created around each dirty sample and those times are excluded as well. Any segments that remain
are considered fair game for clean times, and all clean samples are generated from these clean
segments, sampled at a ``random_rate`` specified in the various jobs, i.e. training.

The following keyword arguments are required:

* **target_channel**
* **target_bounds**
* **dirty_bounds**
* **dirty_window**

In addition, the following optional keyword arguments can be passed in:

* **random_seed**: set a seed to make results reproducible across runs (see the sketch below)
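For example, the seed could be added directly to the ``[samples]`` section shown above. This is
only a sketch; the value is arbitrary:

.. code:: bash

    [samples]
    target_channel = "L1_CAL-DELTAL_EXTERNAL_DQ_32_2048"
    dirty_window = 0.1
    # arbitrary value; fixing it makes results reproducible across runs
    random_seed = 123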
.. _configuration-features:

[features]
----------------------------------------------------------------------------------------------------

The ``[features]`` section specifies how we discover features, which columns we want to process, and
what certain columns represent (like "snr" for determining the significance of a trigger).

Example:

.. code:: bash

    [features]
    flavor = "kw"
    rootdir = "/path/to/triggers/"
    columns = ['time', 'significance', 'frequency']
    time = "time"
    significance = "significance"

In the example above, we configure the ``kw`` flavor, which searches for Kleine-Welle triggers. If
trying to discover triggers in a non-standard location (say, for loading your own custom triggers),
you'll need to supply the ``rootdir`` keyword argument.

The following keyword arguments are required:

* **flavor:** a flavor of ``DataLoader``, used to ingest features
* **columns:** which columns to process
* **time:** which column to use for determining target/clean times
* **significance:** which column to use to determine significance

In addition, the following optional keyword arguments can be passed in:

* **nproc:** how many cores to use when reading in data

Beyond these generic keyword arguments, different feature backends may have additional required and
optional keyword arguments.

.. _configuration-segments:

[segments]
----------------------------------------------------------------------------------------------------

This section sets up queries to DQSegDB and defines which segments to analyze.

Example:

.. code:: bash

    [segments]
    segdb_url = "https://segments.ligo.org"
    intersect = "H1:DMT-ANALYSIS_READY:1"

The following keyword arguments are required:

* **segdb_url**
* **intersect:** select which segments we want to analyze

In addition, the following optional keyword arguments can be passed in:

* **exclude:** select which segments we want to exclude

.. _configuration-condor:

[condor]
----------------------------------------------------------------------------------------------------

These parameters specify various configuration options when submitting jobs under the ``condor``
workflow. All of these keyword arguments are optional if you're using the ``block`` or ``fork``
workflows.

Example configuration:

.. code:: bash

    [condor]
    universe = "local"
    accounting_group = "not.real"
    accounting_group_user = "albert.einstein"
    retry = 0

.. _configuration-workflow:

Workflow Settings
====================================================================================================

The four core parts of the analysis (training, evaluation, calibration, and timeseries generation)
share many similarities in how they are configured. Each consists of three distinct parts: a general
section, a reporting section that configures how data products are saved, and an optional stream
section containing stream-specific settings such as timeouts and processing cadences. In addition to
the four core processes, there are also reporting and monitoring jobs, which each need only a single
section to specify configuration variables.

Here is an example of a training configuration:

.. code:: bash

    [train]
    workflow = "block"
    random_rate = 0.01
    ignore_segdb = false

    [train.reporting]
    flavor = "pickle"

    [train.stream]
    stride = 5
    delay = 60
The general section, ``[train]``, specifies the parallelization scheme, ``workflow``, in which to
train. In addition, it specifies a ``random_rate`` at which to generate clean samples from the clean
segments defined by the ``[samples]`` section. Finally, there's a special option here,
``ignore_segdb``, which is used to optionally ignore the segments specified in the ``[segments]``
section. This can be useful, for example, when generating timeseries, where we may want to produce
timeseries for all times rather than restrict them to science-mode data.

``[train.reporting]`` is used to configure how models are persisted. In this case, we specify the
``flavor`` of the model reporter as ``pickle`` to serialize models using the pickle format.

Finally, there's a ``[train.stream]`` section, which is required to run stream-based workflows. In
this case, we process incoming features in 5 second strides and we allow incoming features to lag
behind real time by up to 60 seconds.

.. _configuration-train:

Training
----------------------------------------------------------------------------------------------------

The training configuration consists of general workflow configuration in ``[train]``, a
``[train.reporting]`` section to configure the model reporter, and, optionally, a ``[train.stream]``
section to specify stream-specific parameters.

[train]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [train]
    workflow = "block"
    random_rate = 0.01

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **random_rate:** rate at which to sample clean features

In addition, the following optional keyword arguments can be passed in:

* **ignore_segdb:** whether to ignore querying DQSegDB for segment information

[train.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [train.reporting]
    flavor = "pickle"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save models

In addition, pass in whatever keyword arguments are required by the specific ``Reporter`` flavor.

[train.stream]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [train.stream]
    stride = 5
    delay = 60

No keyword arguments are required; however, it is strongly encouraged to set the following:

* **stride:** the length of time to process at a given time (in seconds)
* **delay:** the delay from real time to process data (in seconds)

.. _configuration-evaluate:

Evaluation
----------------------------------------------------------------------------------------------------

The evaluation configuration takes essentially the same set of options as the training
configuration. The main difference is that the ``flavor`` of ``Reporter`` used in
``[evaluate.reporting]`` needs to be a type of quiver reporter.

[evaluate]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [evaluate]
    workflow = "block"
    log_level = 10
    random_rate = 0.01

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **log_level:** specifies the verbosity of log messages
* **random_rate:** rate at which to sample clean features
[evaluate.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [evaluate.reporting]
    flavor = "dataset"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save quivers

In addition, pass in whatever keyword arguments are required by the specific ``Reporter`` flavor.

[evaluate.stream]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [evaluate.stream]
    stride = 5
    delay = 60

No keyword arguments are required; however, it is strongly encouraged to set the following:

* **stride:** the length of time to process at a given time (in seconds)
* **delay:** the delay from real time to process data (in seconds)

.. _configuration-calibrate:

Calibration
----------------------------------------------------------------------------------------------------

The calibration jobs don't need a mechanism to discover triggers; instead, the datasets generated by
the evaluation jobs are used to calibrate the models produced by the training jobs.

[calibrate]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [calibrate]
    workflow = "block"
    log_level = 10

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **log_level:** specifies the verbosity of log messages

[calibrate.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [calibrate.reporting]
    flavor = "calib"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save calibration maps

In addition, pass in whatever keyword arguments are required by the specific ``Reporter`` flavor.

.. note::

    The ``[calibrate.stream]`` section is not present because, in the online workflow, the
    calibration and evaluation jobs are tightly coupled. The calibration jobs therefore use the
    configuration from ``[evaluate.stream]``.

.. _configuration-timeseries:

Timeseries
----------------------------------------------------------------------------------------------------

The timeseries jobs are similar in configuration to the training and evaluation jobs. One of the
main differences is that instead of passing in a ``random_rate`` to control the rate at which clean
samples are generated, we pass in an ``srate``, which determines the sampling rate of the produced
timeseries.

[timeseries]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [timeseries]
    workflow = "block"
    log_level = 10
    srate = 128

The following keyword arguments are required:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **log_level:** specifies the verbosity of log messages
* **srate:** the sample rate at which timeseries are produced

[timeseries.reporting]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [timeseries.reporting]
    flavor = "series:gwf"

The following keyword arguments are required:

* **flavor:** a flavor of ``Reporter``, used to load/save timeseries

In addition, pass in whatever keyword arguments are required by the specific ``Reporter`` flavor.

The following keyword arguments are optional:

* **shmdir:** a second directory in which to save timeseries, storing the last N seconds of
  timeseries data, possibly to be picked up by another process (see the sketch below)
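For instance, the shared directory could be supplied alongside the reporter flavor. This is only a
sketch; the path is an illustrative placeholder, not a default:

.. code:: bash

    [timeseries.reporting]
    flavor = "series:gwf"
    # illustrative placeholder path for the low-latency copy of the timeseries
    shmdir = "/path/to/shm"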
[timeseries.stream]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

    [timeseries.stream]
    stride = 5
    delay = 60

No keyword arguments are required; however, it is strongly encouraged to set the following:

* **stride:** the length of time to process at a given time (in seconds)
* **delay:** the delay from real time to process data (in seconds)

.. _configuration-report:

Report
----------------------------------------------------------------------------------------------------

The reporting jobs generate summary pages and plots to give an overview of the batch and stream jobs
that were run.

Example:

.. code:: bash

    [report]
    legend = true
    overview_only = true

All keyword arguments are optional:

* **ignore_segdb:** whether to skip DQSegDB queries for determining segments to process
* **legend:** whether to display a legend in plots
* **annotate_gch:** whether to annotate glitch samples
* **single_calib_plots:** whether to display individual calibration plots for each bin. If not set,
  only an overview across all bins is displayed.
* **overview_only:** whether to show only the overview page rather than also showing pages for each
  individual classifier

.. _configuration-monitor:

Monitor
----------------------------------------------------------------------------------------------------

WRITEME

.. _configuration-classifier:

Classifier Settings
====================================================================================================

Classifiers are special in their configuration in that each classifier has its own section with a
custom nickname. These classifiers are then selected by name in the ``[general]`` section for a
given analysis. You can define more classifier sections than the classifiers you actually run in an
analysis. Here's an example of how this works:

.. code:: bash

    [general]
    classifiers = ["forest"]

    ...

    [[classifier]]
    name = "ovl"
    flavor = "ovl"

    incremental = 100
    num_recalculate = 10
    metric = "eff_fap"

    [classifier.minima]
    eff_fap = 3
    poisson_signif = 5
    use_percentage = 1e-3

    [[classifier]]
    name = "forest"
    flavor = "sklearn:random_forest"

    verbose = true

    # feature vector options
    default = -100
    window = 0.1
    whitener = "standard"

    # parallelization options
    num_cv_proc = 1

    # hyperparameters
    [classifier.params]
    classifier__criterion = "gini"
    classifier__n_estimators = 200
    classifier__class_weight = "balanced"

Here, we configure two different classifiers, ``ovl`` and ``forest``, with various configuration
settings. In the ``[general]`` section, only one classifier, ``forest``, is selected, which
indicates that when we run analyses, we will only train, evaluate, etc. on the ``forest``
classifier.

.. _configuration-default:

Default Settings
====================================================================================================

Options that would otherwise be specified across multiple sections can instead be placed in the
``[defaults]`` section, from which they are inherited so they don't need to be repeated multiple
times. The following can be passed in here; an illustrative sketch follows the list:

* **workflow:** one of ``block``, ``fork``, ``condor``
* **ignore_segdb:** whether to skip DQSegDB queries for determining segments to process
* **log_level:** the verbosity of logs
* **safe_channels_path:** the path to the list of auxiliary channels used to generate features for
  training, evaluation, etc. This is distinct from the target channel in that labels are generated
  only from the target channel, while features are generated from these auxiliary channels. The
  list should contain only safe channels.
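As a sketch only (the values simply mirror the examples above and the path is a placeholder), a
``[defaults]`` section might look like:

.. code:: bash

    [defaults]
    # values inherited by sections that would otherwise repeat them
    workflow = "block"
    log_level = 10
    ignore_segdb = false
    safe_channels_path = "/path/to/safe_channels.txt"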