Running the Batch Pipeline
This tutorial walks through setting up the configuration file that iDQ needs to run the batch pipeline.
Configuration file
In order to run one-off or batch tasks, you’ll need to provide iDQ with a TOML-formatted file. An example configuration is located at etc/config.toml. Below is a guide to get you started with common configuration options; an exhaustive list of options is located at Configuration.
Common options that will be used throughout batch jobs:
[general]
tag = "test"
instrument = "L1"
rootdir = "/path/to/analysis"
classifiers = ["classifier1", "classifier2"]
Options for defining glitch/clean samples:
[samples]
target_channel = "channel_name"
dirty_window = time_window  # in seconds
[samples.target_bounds]
significance = [lower_bound, upper_bound]
frequency = [lower_bound, upper_bound]
[samples.dirty_bounds]
significance = [lower_bound, upper_bound]
frequency = [lower_bound, upper_bound]
Glitches are determined by looking at the specified target channel and finding features that fall within the target bounds.
Clean samples are determined by first finding all times that fall within the dirty bounds, then removing all times within the specified dirty window of them. Whatever time is left is sampled to generate clean samples.
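The selection logic described above can be sketched in a few lines of Python. This is an illustration only, not iDQ's implementation: the function names, the sample triggers and the bound values are hypothetical, bounds are assumed to be inclusive, and the dirty window is assumed to apply symmetrically (plus or minus dirty_window seconds) around each dirty time.

```python
def select_times(triggers, bounds):
    """Return times of triggers whose features all fall inside the given bounds."""
    return [
        trig["time"]
        for trig in triggers
        if all(lo <= trig[key] <= hi for key, (lo, hi) in bounds.items())
    ]


def clean_segments(start, end, dirty_times, dirty_window):
    """Return the parts of [start, end) farther than dirty_window from any dirty time."""
    segments = []
    cursor = start
    for t in sorted(dirty_times):
        veto_lo, veto_hi = t - dirty_window, t + dirty_window
        if veto_lo > cursor:
            # Time between the previous veto and this one is clean.
            segments.append((cursor, min(veto_lo, end)))
        cursor = max(cursor, veto_hi)
        if cursor >= end:
            break
    if cursor < end:
        segments.append((cursor, end))
    return segments


# Hypothetical triggers from the target channel.
triggers = [
    {"time": 3.0, "significance": 12.0, "frequency": 100.0},
    {"time": 3.5, "significance": 6.0, "frequency": 50.0},
    {"time": 8.0, "significance": 30.0, "frequency": 2000.0},
]
target_bounds = {"significance": (10.0, float("inf")), "frequency": (16.0, 2048.0)}
dirty_bounds = {"significance": (5.0, float("inf")), "frequency": (16.0, 2048.0)}

glitch_times = select_times(triggers, target_bounds)
dirty_times = select_times(triggers, dirty_bounds)
clean = clean_segments(0.0, 10.0, dirty_times, dirty_window=1.0)
```

Using looser dirty bounds than target bounds, as in this sketch, keeps marginal triggers out of the clean sample without labelling them as glitches.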
Options for reading input features:
[features]
flavor = "omicron"
columns = ["time", "snr", "frequency"]
time = "time"
significance = "snr"
frequency = "frequency"
This section tells iDQ where and how to read input features, which columns to use, and which of those columns determine the time, significance and frequency.
In this case, we use the omicron flavor, which searches for Omicron triggers. We use the three columns defined above. In addition, trigger times are read from the time column, the significance used to evaluate the target and dirty bounds is read from the snr column, and frequencies are read from the frequency column.
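To make the column mapping concrete, here is a small Python sketch of how the time, significance and frequency roles could be resolved from a trigger row. The helper name and the sample row are hypothetical and only mirror the [features] options shown above; iDQ's own reader may work differently.

```python
# Mirrors the [features] options above as a plain dictionary.
features_config = {
    "columns": ["time", "snr", "frequency"],
    "time": "time",
    "significance": "snr",
    "frequency": "frequency",
}


def to_roles(row, config):
    """Pick the time, significance and frequency values out of one trigger row."""
    missing = [col for col in config["columns"] if col not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")
    return {role: row[config[role]] for role in ("time", "significance", "frequency")}


# Hypothetical trigger row; extra columns are simply ignored.
row = {"time": 1000000000.5, "snr": 7.5, "frequency": 128.0, "extra": 1}
roles = to_roles(row, features_config)
```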
Options for batch jobs:
Possible job options are:
train
evaluate
calibrate
timeseries
[job]
workflow = "workflow_type"
random_rate = rate  # train/evaluate only
min_stride = stride  # train only
srate = rate  # timeseries only
[job.reporting]
flavor = "reporter_type"
# whatever kwargs are needed by this reporter
Classifier options:
Here, you’ll be creating a section, one per classifier, with the keyword arguments needed for that particular classifier. For example, for a support vector machine classifier:
[[classifier]]
name = "svm"
flavor = "sklearn:svm"
# feature vector options
default = 0
window = 0.1
whitener = "standard"
# parallelization options
num_cv_proc = 8
# hyperparameters
[classifier.params]
classifier__C = 100
classifier__gamma = 10
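As a quick sanity check before launching jobs, a configuration like this can be loaded in Python to see how the classifier sections are parsed. The sketch below uses the standard-library tomllib module (Python 3.11+, with the third-party tomli package as a fallback). It assumes that the names listed under general.classifiers refer to the name field of each [[classifier]] entry; that link is an assumption made for illustration, and the abbreviated config text is hypothetical.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older Pythons

# Abbreviated, hypothetical configuration for illustration.
CONFIG_TEXT = """
[general]
classifiers = ["svm"]

[[classifier]]
name = "svm"
flavor = "sklearn:svm"
window = 0.1

[classifier.params]
classifier__C = 100
classifier__gamma = 10
"""

config = tomllib.loads(CONFIG_TEXT)

# [[classifier]] parses to a list of tables; [classifier.params] attaches
# to the most recently declared entry of that list.
by_name = {entry["name"]: entry for entry in config["classifier"]}

# Names requested in [general] that have no matching [[classifier]] entry.
missing = [name for name in config["general"]["classifiers"] if name not in by_name]
svm_params = by_name["svm"]["params"]
```

Because TOML requires `key = value` pairs, a syntax slip in any section surfaces here as a `tomllib.TOMLDecodeError` rather than partway through a batch job.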
Segment options:
In addition, you’ll need to provide a way for iDQ to query DQSegDB for valid segments.
[segments]
segdb_url = "https://segments.ligo.org"
intersect = "H1:DMT-ANALYSIS_READY:1"
Condor options:
If you’re planning on using condor workflows in any part of iDQ, you’ll also have to specify options for condor submission.
[condor]
universe = "vanilla"
retry = 3
accounting_group = "your.accounting.group"
accounting_group_user = "albert.einstein"
After you’ve set up your configuration file, you’re ready to launch one-off iDQ tasks or batch tasks, which run the full workflow.
One-off Tasks
idq-train:
usage: idq-train [-h] [-q | -v] [-e EXCLUDE EXCLUDE] CONFIG START END
positional arguments:
CONFIG
START
END
options:
-h, --help show this help message and exit
-q, --quiet If set, only display warnings and errors.
-v, --verbose If set, display additional logging messages.
-e EXCLUDE EXCLUDE, --exclude EXCLUDE EXCLUDE
exclude this segment from the analysis. Can be
repeated to exclude multiple segments. Useful for
round-robin training/evaluation.
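As an illustration of the usage line above, the following Python sketch assembles an idq-train command that excludes one segment. It only builds the argument list and does not run anything. The config path and the numeric values are hypothetical, and treating START, END and each EXCLUDE pair as start/end times in seconds is an assumption.

```python
import shlex


def train_command(config, start, end, exclude=()):
    """Build an idq-train argument list; each exclude entry is a (start, end) pair."""
    argv = ["idq-train"]
    for seg_start, seg_end in exclude:
        # --exclude takes two values and may be repeated.
        argv += ["--exclude", str(seg_start), str(seg_end)]
    argv += [config, str(start), str(end)]
    return argv


argv = train_command("config.toml", 1000, 2000, exclude=[(1200, 1400)])
command_line = shlex.join(argv)
print(command_line)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` to launch the job from a script.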
idq-evaluate:
usage: idq-evaluate [-h] [-q | -v] CONFIG START END
positional arguments:
CONFIG
START
END
options:
-h, --help show this help message and exit
-q, --quiet If set, only display warnings and errors.
-v, --verbose If set, display additional logging messages.
idq-calibrate:
usage: idq-calibrate [-h] [-q | -v] CONFIG START END
positional arguments:
CONFIG
START
END
options:
-h, --help show this help message and exit
-q, --quiet If set, only display warnings and errors.
-v, --verbose If set, display additional logging messages.
idq-timeseries:
usage: idq-timeseries [-h] [-q | -v] CONFIG START END
positional arguments:
CONFIG
START
END
options:
-h, --help show this help message and exit
-q, --quiet If set, only display warnings and errors.
-v, --verbose If set, display additional logging messages.
Batch Tasks
idq-batch:
usage: idq-batch [-h] [-q | -v] [-w WORKFLOW] [-i INITIAL_LOOKBACK]
[--skip-timeseries] [--skip-report] [-c] [-n NUM_BINS]
[-N NUM_SEGS_PER_BIN] [-b]
CONFIG START END
positional arguments:
CONFIG
START
END
options:
-h, --help show this help message and exit
-q, --quiet If set, only display warnings and errors.
-v, --verbose If set, display additional logging messages.
-w WORKFLOW, --workflow WORKFLOW
workflow for launching batch jobs
-i INITIAL_LOOKBACK, --initial-lookback INITIAL_LOOKBACK
if causal batch is specified, then look back this much
before t_start to use as a seed to evaluate starting
at t_start.
--skip-timeseries do not generate timeseries
--skip-report do not generate report
-c, --causal use causal round-robin binning
-n NUM_BINS, --num-bins NUM_BINS
the number of round-robin bins to generate. Divisions
are made according to walltime
-N NUM_SEGS_PER_BIN, --num-segs-per-bin NUM_SEGS_PER_BIN
the number of segments per bin within the round-robin
procedure. If this is greater than 1, segments will be
organized in a checkerboard pattern in order to sample
from the entire range in both training and evaluation.
Note, this is only used if --causal is NOT supplied.
-b, --block if supplied, this process will block until the DAG has
completed. Used when workflow=condor.
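The non-causal round-robin binning described by --num-bins and --num-segs-per-bin can be pictured with the Python sketch below. It is not iDQ's implementation: the function names are hypothetical, the time range is assumed to be split into equal-width pieces by wall time, and each fold is assumed to evaluate on one bin while training on all the others.

```python
def round_robin_bins(start, end, num_bins, num_segs_per_bin=1):
    """Split [start, end) into equal segments and deal them out to bins in turn."""
    total = num_bins * num_segs_per_bin
    width = (end - start) / total
    edges = [start + i * width for i in range(total)] + [end]
    bins = [[] for _ in range(num_bins)]
    for i in range(total):
        # Dealing segments out cyclically gives the checkerboard pattern.
        bins[i % num_bins].append((edges[i], edges[i + 1]))
    return bins


def round_robin_folds(bins):
    """For each bin, evaluate on it and train on the segments of all other bins."""
    folds = []
    for i, evaluate in enumerate(bins):
        train = sorted(seg for j, other in enumerate(bins) if j != i for seg in other)
        folds.append({"train": train, "evaluate": evaluate})
    return folds


bins = round_robin_bins(0, 1200, num_bins=3, num_segs_per_bin=2)
folds = round_robin_folds(bins)
```

With more than one segment per bin, every bin draws from both early and late parts of the range, which is the point of the checkerboard layout.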