Working with MockClassifierData objects

This brief tutorial will lay out the ideas behind our idq.io.MockClassifierData objects and, in particular, how to configure them.

Architecture

The basic idea behind idq.io.MockClassifierData is to simulate various “trigger streams”, each of which may be witnessed (imperfectly) by multiple auxiliary channels. In this way, we can quickly specify correlated triggers between arbitrarily many auxiliary channels as well as specify the ability of each aux channel to witness the signal.

Typically, we will generate several correlated streams shared between the target_channel and the aux channels, with each aux channel witnessing those streams with some level of measurement error (called “jitters”). Each channel additionally witnesses its own separate noise stream. In this way, we build up a mixture model for the feature distributions actually observed.

idq.io.MockClassifierData objects support a small number of columns (time, frequency, and snr), and synthetic triggers are generated assuming independent distributions for each feature within each stream. More complicated simulation algorithms could be developed in the future if the need arises, but for now this is thought to be sufficient.
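
To make the mixture-model picture concrete, here is a rough sketch of the idea with arbitrary numbers. It illustrates the simulation scheme described above rather than the actual idq implementation: a single stream is drawn with Poisson-distributed trigger times and independent frequency and snr distributions, and each witnessing channel then sees that stream with Gaussian jitter added to each feature.

# Rough sketch of the simulation scheme described above; not idq's code.
# Distribution choices and jitter sizes are arbitrary examples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# one "trigger stream": homogeneous Poisson times plus independent features
rate, duration = 0.1, 1000.0                       # Hz, sec
n = rng.poisson(rate * duration)
stream = {
    "time": np.sort(rng.uniform(0.0, duration, n)),
    "frequency": stats.norm(300.0, 50.0).rvs(size=n, random_state=rng),
    "snr": stats.pareto(3.0).rvs(size=n, random_state=rng),
}

# each channel witnesses the same stream imperfectly via Gaussian "jitters";
# a missing feature (or a zero) means that feature is witnessed perfectly
jitters = {
    "channel a": {"time": 0.001, "frequency": 5.0, "snr": 0.1},
    "channel b": {"time": 0.010, "frequency": 1.0, "snr": 0.1},
}
observed = {
    channel: {
        feature: values + rng.normal(0.0, jitter.get(feature, 0.0), n)
        for feature, values in stream.items()
    }
    for channel, jitter in jitters.items()
}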

Configuration

We configure idq.io.MockClassifierData objects via a simple INI file. Each section corresponds to a different stream of triggers and requires a few common options:

  • rate is the Poisson rate of triggers in that stream, measured in Hz

  • frequency distribution is the distribution of frequencies in that stream

  • snr distribution is the distribution of snrs in that stream

Both frequency distribution and snr distribution follow a simple convention. They are based on the scipy.stats.rv_continuous distributions: any distribution available there is also available here. Users specify the name of the distribution they’d like to use followed by a space-delimited list of all the necessary parameters for that distribution (see [scipy.stats docs](https://docs.scipy.org/doc/scipy/reference/stats.html)). Note that order matters for the parameters and they are all expected to be floats. An example of this is included below.
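
For instance, a reasonable reading of frequency distribution = norm 300 50 under this convention is a frozen scipy.stats.norm(300.0, 50.0) distribution. The short sketch below shows the mapping; it is illustrative only, and the parsing code inside idq may differ.

# Sketch of the distribution-string convention; not idq's actual parser.
from scipy import stats

def parse_distribution(spec):
    """Map e.g. 'norm 300 50' onto scipy.stats.norm(300.0, 50.0)."""
    name, *params = spec.split()
    return getattr(stats, name)(*map(float, params))

freq_dist = parse_distribution("norm 300 50")   # normal with loc=300, scale=50
snr_dist = parse_distribution("pareto 3")       # Pareto with shape parameter b=3
samples = freq_dist.rvs(size=5)                 # draw a few example frequencies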

The rest of the options in each section are interpreted as channel names. These are expected to follow another simple syntax, similar to what is used for things like target_bounds in the main iDQ INI file. Specifically, each channel can specify “jitters” for each feature in that stream, corresponding to the imperfect ability of that channel to witness that stream. A jitter is specified as a feature name followed by the standard deviation of a Gaussian jitter around the true feature’s value. For example, if a channel specifies time 0.01, then the time features of the triggers from this stream associated with that channel will be the times of the actual simulated triggers plus random offsets drawn from a zero-mean Gaussian with a standard deviation of 0.01 sec. Channels can specify jitters for multiple features by listing them on consecutive lines. If a channel is included without any jitters, it will witness that stream perfectly. Again, an example config file is provided below:

[correlated stream]
rate = 0.1
frequency distribution = norm 300 50
snr distribution = pareto 3

### because both channels witness this stream, we add jitters to both
channel a =
    time 0.001
    frequency 5
    snr 0.1

channel b =
    time 0.010
    frequency 1
    snr 0.1

[noise for channel a]
rate = 1
frequency distribution = norm 500 100
snr distribution = pareto 4

### because only channel a witnesses this stream, we let it witness it perfectly
channel a =

[noise for channel b]
rate = 1
frequency distribution = norm 500 100
snr distribution = pareto 4

### because only channel b witnesses this stream, we let it witness it perfectly
channel b =
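
To make the channel syntax concrete, the following sketch reads the [correlated stream] section of the example above with Python’s configparser and turns each channel option into a dictionary of per-feature jitters (an empty value yields an empty dictionary, i.e. a perfect witness). This is illustrative only and assumes the example has been saved as mock_config.ini; the parsing inside idq may differ.

# Sketch of reading the example config above; not the actual idq parser.
# Assumes the example has been saved as "mock_config.ini".
from configparser import ConfigParser

config = ConfigParser()
config.read("mock_config.ini")

section = config["correlated stream"]
rate = section.getfloat("rate")                       # Poisson rate in Hz

# every option besides the reserved ones is treated as a channel name whose
# value lists "feature stdev" pairs, one per line
reserved = {"rate", "frequency distribution", "snr distribution"}
channels = {}
for option in section:
    if option in reserved:
        continue
    jitters = {}
    for line in section[option].strip().splitlines():
        feature, stdev = line.split()
        jitters[feature] = float(stdev)               # e.g. {"time": 0.001, ...}
    channels[option] = jitters                        # empty dict -> perfect witness

print(rate, channels)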

Usage

idq.io.MockClassifierData can be used in both the batch and streaming pipelines just like any other idq.io.ClassifierData. It requires a single kwarg: config, the path to the config file described above. All data is generated on-the-fly and cached so that repeated calls to MockClassifierData.triggers return reproducible results.
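
A hedged instantiation might look like the sketch below. Only the config kwarg is documented here; the remaining arguments are hypothetical placeholders standing in for whatever your version of idq.io.ClassifierData expects (e.g. an analysis span and a channel list).

# Hedged sketch; only the `config` kwarg is documented above.  The other
# arguments are hypothetical placeholders for whatever the ClassifierData
# parent class expects (e.g. start/end times and a channel list).
from idq import io

mock = io.MockClassifierData(
    0, 1000,                        # hypothetical analysis span
    ["channel a", "channel b"],     # hypothetical channel list
    config="mock_config.ini",       # path to the INI file described above
)
first = mock.triggers()             # generated on-the-fly and cached...
second = mock.triggers()            # ...so repeated calls are reproducible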

One might use idq.io.MockClassifierData objects in several scenarios, such as

  • unit tests (need a source of triggers)

  • benchmarking
      ◦ need a known “optimal solution” for classifier performance
      ◦ latency measurements for passing data through the iDQ infrastructure

Dynamic Mock Classifier Data

idq.io.DynamicMockClassifierData extends idq.io.UmbrellaClassifierData to allow for time-varying correlations. Specifically, users must specify a single option in the INI file following our target_bounds syntax, describing which idq.io.MockClassifierData INI file to use over an associated time range. For example:

[train data discovery]
flavor = DynamicMockClassifierData
time = time
configs = config1.ini -inf 10
          config2.ini 10 100
          config3.ini 100 +inf

Upon instantiation, an idq.io.MockClassifierData object will be created for each relevant time segment, having been passed the appropriate config file. We also explicitly check whether the specified segments overlap, raising a ValueError if any overlap is found. Users can specify as many or as few configs as they’d like. idq.io.DynamicMockClassifierData effectively reduces to idq.io.MockClassifierData if users specify a single config with an infinite range (-inf +inf).
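
The parsing and overlap check amount to something like the following sketch. It is illustrative only: it assumes the whitespace-delimited config start end format shown above and is not the actual idq implementation.

# Illustrative sketch of parsing the `configs` option and rejecting
# overlapping segments; not the actual idq implementation.
def parse_configs(option):
    """Map the multi-line configs option onto [(path, start, end), ...]."""
    segments = []
    for line in option.strip().splitlines():
        path, start, end = line.split()
        segments.append((path, float(start), float(end)))
    return segments

def check_disjoint(segments):
    """Raise ValueError if any two (path, start, end) segments overlap."""
    ordered = sorted(segments, key=lambda seg: seg[1])
    for (_, _, prev_end), (_, start, _) in zip(ordered, ordered[1:]):
        if start < prev_end:
            raise ValueError("MockClassifierData config segments overlap")
    return ordered

configs = parse_configs("""
config1.ini -inf 10
config2.ini 10 100
config3.ini 100 +inf
""")
check_disjoint(configs)   # passes: segments share endpoints but do not overlap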