.. _mock-classifier-data:

Working with MockClassifierData objects
####################################################################################################

This brief tutorial will lay out the ideas behind our :class:`idq.io.MockClassifierData` objects and, in particular, how to configure them.

.. contents::

.. _mcd-architecture:

Architecture
====================================================================================================

The basic idea behind :class:`idq.io.MockClassifierData` is to simulate various "trigger streams", each of which may be witnessed (imperfectly) by multiple auxiliary channels.
In this way, we can quickly specify correlated triggers between arbitrarily many auxiliary channels and control how well each aux channel witnesses the signal.

Typically, we will generate one or more streams that are shared between the `target_channel` and the aux channels, with the aux channels recording those triggers with some level of measurement error (called "jitters").
Each channel will additionally witness its own separate noise stream, which no other channel sees.
In this way, we build up a mixture-model for the feature distributions actually observed.

:class:`idq.io.MockClassifierData` objects support a small number of `columns` (`time`, `frequency`, and `snr`), and synthetic triggers are generated assuming independent distributions for each feature within each stream.
More complicated simulation algorithms could be developed in the future if the need appears, but for now this is thought to be sufficient.
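
The mixture model described above can be sketched with the standard library alone.
This is an illustration of the idea rather than the iDQ implementation: the helper names `simulate_stream` and `witness` are hypothetical, and only the `time` feature is simulated.

.. code-block:: python

    import random

    def simulate_stream(rate, start, end, rng):
        """Draw trigger times from a homogeneous Poisson process (rate in Hz)."""
        times = []
        t = start + rng.expovariate(rate)
        while t < end:
            times.append(t)
            t += rng.expovariate(rate)  # exponential waiting time between triggers
        return times

    def witness(times, sigma, rng):
        """Return the times as recorded by a channel with Gaussian jitter sigma (sec)."""
        if sigma == 0:
            return list(times)  # no jitter: the channel witnesses the stream perfectly
        return [t + rng.gauss(0.0, sigma) for t in times]

    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    shared = simulate_stream(rate=0.1, start=0.0, end=1000.0, rng=rng)
    noise_a = simulate_stream(rate=1.0, start=0.0, end=1000.0, rng=rng)

    # channel a sees a mixture: the jittered shared stream plus its own noise stream
    channel_a = sorted(witness(shared, 0.001, rng) + witness(noise_a, 0.0, rng))
    # channel b sees the same shared stream, but with a larger timing jitter
    channel_b = witness(shared, 0.010, rng)

Because both channels are derived from the same `shared` list, their triggers are correlated in time up to the jitter, while `noise_a` appears only in `channel_a`.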

.. _mcd-configuration:

Configuration
====================================================================================================

We configure :class:`idq.io.MockClassifierData` objects via a simple INI file.
Each section corresponds to a different stream of triggers and requires a few common options:

  * `rate` is the Poisson rate of triggers in that stream, measured in Hz
  * `frequency distribution` is the distribution of frequencies in that stream
  * `snr distribution` is the distribution of snrs in that stream

Both `frequency distribution` and `snr distribution` follow a simple convention.
These are based on the `scipy.stats.rv_continuous` distributions, so any distribution available there is also available here.
Users specify the name of the distribution they'd like to use, followed by a space-delimited list of all the necessary parameters for that distribution (see the `scipy.stats docs <https://docs.scipy.org/doc/scipy/reference/stats.html>`_).
**Note**, order matters for the parameters and they're all expected to be floats.
An example of this is included below.
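
As a rough sketch of this convention, a value such as `norm 300 50` can be split into a distribution name and a list of floats.
The function names below are hypothetical, and passing the parameters positionally to `scipy.stats` is an assumption about how the values are consumed; under that assumption the order is the distribution's shape parameters first, then `loc`, then `scale`.

.. code-block:: python

    def parse_distribution(spec):
        """Split e.g. 'norm 300 50' into ('norm', [300.0, 50.0])."""
        name, *params = spec.split()
        return name, [float(p) for p in params]

    def freeze(spec):
        """Build a frozen scipy.stats distribution from a spec string."""
        import scipy.stats  # imported lazily so that parsing works without SciPy
        name, params = parse_distribution(spec)
        # parameters are passed positionally, which is why their order matters
        return getattr(scipy.stats, name)(*params)

    print(parse_distribution("norm 300 50"))  # ('norm', [300.0, 50.0])
    print(parse_distribution("pareto 3"))     # ('pareto', [3.0])

With SciPy installed, `freeze("norm 300 50").rvs(size=5)` would draw five frequencies from a Gaussian with mean 300 and standard deviation 50.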

The rest of the options in each section are interpreted as channel names.
These are expected to follow another simple syntax similar to what is used for things like `target_bounds` in the main iDQ INI file.
Specifically, each channel can specify "jitters" for each feature in that stream corresponding to the imperfect ability of that channel to witness that stream.
These are specified via `feature_name` and then the standard deviation of a Gaussian jitter around the true feature's value.
To wit, if a channel specifies `time 0.01` then the `time` features of the triggers from this stream associated with that channel will be given by the time of the actual triggers simulated for that stream plus random offsets drawn from a zero-mean Gaussian with standard deviation of 0.01 sec.
Channels can specify jitters for multiple features by listing them on consecutive lines.
If channels are included without specifying any jitters, they will witness that stream perfectly.
Again, an example config file is provided below::

    [correlated stream]
    rate = 0.1
    frequency distribution = norm 300 50
    snr distribution = pareto 3

    ### because both channels witness this stream, we add jitters to both
    channel a =
        time 0.001
        frequency 5
        snr 0.1

    channel b =
        time 0.010
        frequency 1
        snr 0.1

    [noise for channel a]
    rate = 1
    frequency distribution = norm 500 100
    snr distribution = pareto 4

    ### because only channel a witnesses this stream, we let it witness it perfectly
    channel a =

    [noise for channel b]
    rate = 1
    frequency distribution = norm 500 100
    snr distribution = pareto 4

    ### because only channel b witnesses this stream, we let it witness it perfectly
    channel b =
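
A file in this layout can be read with Python's standard `configparser`.
The sketch below is not how iDQ itself loads the file; `parse_streams` and `parse_jitters` are hypothetical helpers, and the parser settings (only `=` as a delimiter, case-preserving option names) are assumptions chosen so that channel names survive unchanged.

.. code-block:: python

    import configparser

    EXAMPLE = """\
    [correlated stream]
    rate = 0.1
    frequency distribution = norm 300 50
    snr distribution = pareto 3
    channel a =
        time 0.001
        frequency 5
        snr 0.1
    channel b =
        time 0.010

    [noise for channel a]
    rate = 1
    frequency distribution = norm 500 100
    snr distribution = pareto 4
    channel a =
    """

    COMMON = {"rate", "frequency distribution", "snr distribution"}

    def parse_jitters(value):
        """Turn lines of 'feature sigma' into {feature: sigma}; empty means no jitter."""
        jitters = {}
        for line in value.strip().splitlines():
            feature, sigma = line.split()
            jitters[feature] = float(sigma)
        return jitters

    def parse_streams(text):
        """Map each section (stream) to its rate and per-channel jitters."""
        parser = configparser.ConfigParser(delimiters=("=",))
        parser.optionxform = str  # keep option (channel) names case-sensitive
        parser.read_string(text)
        streams = {}
        for section in parser.sections():
            opts = parser[section]
            # every option that is not a common option is treated as a channel name
            channels = {k: parse_jitters(opts[k]) for k in opts if k not in COMMON}
            streams[section] = {"rate": opts.getfloat("rate"), "channels": channels}
        return streams

    streams = parse_streams(EXAMPLE)

Here `streams["noise for channel a"]["channels"]` is `{"channel a": {}}`, reflecting a channel that witnesses its stream without any jitter.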


.. _mcd-usage:

Usage
====================================================================================================

:class:`idq.io.MockClassifierData` can be used in both the batch and streaming pipelines just like any other :class:`idq.io.ClassifierData`.
It requires a single *kwarg*: `config`, which is the path to the config file described above.
All data will be generated on-the-fly and cached so that repeated calls to `MockClassifierData.triggers` return reproducible results.

One might use :class:`idq.io.MockClassifierData` objects in several scenarios, such as

  * unit tests (need a source of triggers)
  * benchmarking

    * need a known "optimal solution" for classifier performance
    * latency measurements for passing data through the iDQ infrastructure

Dynamic Mock Classifier Data
====================================================================================================

:class:`idq.io.DynamicMockClassifierData` extends :class:`idq.io.UmbrellaClassifierData` to allow for time-varying correlations.
Specifically, users must specify a single option (`configs`) in the INI file, following our `target_bounds` syntax, that lists which :class:`idq.io.MockClassifierData` INI file to use and its associated time range.
For example::

    [train data discovery]
    flavor = DynamicMockClassifierData
    time = time
    configs = config1.ini -inf 10
              config2.ini 10 100
              config3.ini 100 +inf

Upon instantiation, a :class:`idq.io.MockClassifierData` object will be created for each relevant time segment, having been passed the appropriate config file.
We also explicitly check whether the specified segments overlap and raise a `ValueError` if they do.
Users can specify as many or as few configs as they'd like.
:class:`idq.io.DynamicMockClassifierData` effectively reduces to :class:`idq.io.MockClassifierData` if users specify a single config with infinite range (`-inf +inf`).
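
The `configs` syntax and the overlap check can be illustrated with a short sketch.
The helper names are hypothetical, and treating each segment as half-open (start included, end excluded) is an assumption rather than documented iDQ behavior.

.. code-block:: python

    def parse_configs(value):
        """Parse lines of 'path start end' into a list of (path, start, end)."""
        entries = []
        for line in value.strip().splitlines():
            path, start, end = line.split()
            # float() accepts '-inf' and '+inf' directly
            entries.append((path, float(start), float(end)))
        return entries

    def check_no_overlap(entries):
        """Raise ValueError if any two segments overlap; return them sorted by start."""
        ordered = sorted(entries, key=lambda entry: entry[1])
        for (path1, _, end1), (path2, start2, _) in zip(ordered, ordered[1:]):
            if start2 < end1:
                raise ValueError(f"segments for {path1} and {path2} overlap")
        return ordered

    def select_config(entries, gps):
        """Return the config whose segment contains gps, or None if none does."""
        for path, start, end in entries:
            if start <= gps < end:
                return path
        return None

    entries = check_no_overlap(parse_configs("""
        config1.ini -inf 10
        config2.ini 10 100
        config3.ini 100 +inf
    """))
    print(select_config(entries, 50))  # config2.ini

A single entry such as `config1.ini -inf +inf` passes the check trivially and selects the same config for every time, which mirrors the reduction to a single :class:`idq.io.MockClassifierData` described above.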