Data Registration

Registration involves recording the physical location, sizes and checksums (Adler32 and MD5) of files in the rucio database. Files are initially registered at a specific RSE.
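
As a rough illustration of the checksum part of registration, the sketch below computes the size, Adler32 and MD5 of a file in a single pass using only the Python standard library. The function name file_checksums and the chunk size are assumptions made for this example; this is not the code used inside gwrucio_registrar.

```python
import hashlib
import zlib


def file_checksums(path, chunk_size=1024 * 1024):
    """Return (size_bytes, adler32_hex, md5_hex) for the file at path.

    Hypothetical helper for illustration only.
    """
    adler = 1  # Adler-32 running value starts at 1
    md5 = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large frame files are never held fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            adler = zlib.adler32(chunk, adler)
            md5.update(chunk)
            size += len(chunk)
    # Format Adler-32 as 8 hex digits, the form shown in rucio listings
    return size, "%08x" % (adler & 0xFFFFFFFF), md5.hexdigest()
```

Reading the file once and updating both checksums per chunk avoids a second pass over the data, which matters when many large files are registered together.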

This page discusses:

  • Ad hoc File Registration: create datasets and register individual files, or lists of files, interactively.

    • Typical use cases: testing and development.

  • Using DiskCache: create datasets and register files whose locations, types and times are recorded in an LDAS_tools diskcache. Can run interactively or as a background process.

    • Typical use cases: production environments, rolling buffers.

Registration is performed using the gwrucio_registrar utility:

usage: gwrucio_registrar [-h] -r REG_SCRIPT [--dry-run] [--verbose]
                             [--lifetime LIFETIME] [--force-checksums]
                             {add-files,daemon} ...

Command line tool to register LIGO/Virgo datasets into rucio. Data may be
registered as individual files, ascii lists of files, or registered on the fly
as a background process monitoring a DiskCacheFile.

positional arguments:
  {add-files,daemon}
    add-files           Register individual files.
    daemon              Monitor a diskcache and register files on the fly.

optional arguments:
  -h, --help            show this help message and exit
  -r REG_SCRIPT, --reg-script REG_SCRIPT
                        YAML instructions for end point and data naming
  --dry-run             Find files, construct replica list but don't actually
                        upload to rucio
  --verbose             Print all logging info
  --lifetime LIFETIME   Dataset lifetime in seconds
  --force-checksums     Compute checksums and register files even if they are
                        already present

This utility registers files and attaches them to a dataset (simply a collection of files).

All registration operations with gwrucio_registrar require a registration script which defines the dataset to which files will be attached. The registration script is a small YAML file (much like JSON psets in LDR). Here’s an example which defines an ER8 C02 h(t) dataset:

H-H1_HOFT_C02:
  scope: "ER8"
  regexp: "H-H1_HOFT_C02"
  minimum-gps: 1125969920
  maximum-gps: 2000000000
  rse: LIGO-WA-ARCHIVE

  • The section heading H-H1_HOFT_C02 names the dataset to be registered. This name is the rset. The DID of this dataset in rucio will be ER8:H-H1_HOFT_C02 (i.e., scope:rset-name). The dataset will be created if it does not already exist.

  • scope: determines the scope for the dataset and any associated files.

  • regexp: a pattern used to identify files when using a diskcache.

  • minimum-gps, maximum-gps: used to select files within a GPS time range when using a diskcache.

  • rse: the files will be registered as being at LIGO-WA-ARCHIVE.

The registration script may contain any number of rsets.
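
To make the roles of regexp, minimum-gps and maximum-gps concrete, here is a minimal sketch of how a file path could be tested against an rset. The dictionary below stands in for what a YAML parser would return for the example above. The helper names, the assumed file name layout (prefix, GPS start, duration) and the choice to require the whole file to lie inside the GPS range are all assumptions for illustration; gwrucio_registrar may apply different rules.

```python
import os
import re

# Hypothetical in-memory form of the example rset (as a YAML parser might return it)
RSET = {
    "scope": "ER8",
    "regexp": "H-H1_HOFT_C02",
    "minimum-gps": 1125969920,
    "maximum-gps": 2000000000,
    "rse": "LIGO-WA-ARCHIVE",
}

# Assumed frame file name layout: <prefix>-<gps start>-<duration>.gwf
FRAME_NAME = re.compile(r"^(?P<prefix>.+)-(?P<start>\d+)-(?P<duration>\d+)\.gwf$")


def belongs_to_rset(path, rset):
    """True if the file name matches the rset pattern and GPS range."""
    name = os.path.basename(path)
    match = FRAME_NAME.match(name)
    if match is None or re.search(rset["regexp"], name) is None:
        return False
    start = int(match.group("start"))
    end = start + int(match.group("duration"))
    # Assumption: the whole file must lie inside [minimum-gps, maximum-gps]
    return rset["minimum-gps"] <= start and end <= rset["maximum-gps"]


def file_did(path, rset):
    """Assumed file DID: the rset scope followed by the file name."""
    return "%s:%s" % (rset["scope"], os.path.basename(path))
```

For example, belongs_to_rset("/archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf", RSET) is true under these assumptions, while a file named L-L1_HOFT_C02-1125986304-4096.gwf is rejected by the pattern.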

Register A List Of Files

In this example, we will use the add-files sub-command:

usage: gwrucio_registrar add-files [-h] --rset RSET files [files ...]

positional arguments:
  files        Files for registration

optional arguments:
  -h, --help   show this help message and exit
  --rset RSET  Registration set in the YAML configuration you wish to register
               (only 1 permitted at this time)

The add-files command requires specification of the rset to be registered: it is assumed that the files being supplied belong to a single rset. After that, the files for registration are just supplied as positional arguments.

In this example we register a single frame file from ER8 and attach it to the rset H-H1_HOFT_C02, using the registration script listed earlier on this page (saved here as ER8-HOFT_C02.yml):

(gwrucio) $ export OMP_NUM_THREADS=10
(gwrucio) $ gwrucio_registrar -r ER8-HOFT_C02.yml \
    add-files --rset H-H1_HOFT_C02 \
    /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf
2018-12-12 18:24:48,781	INFO	Rset contains: /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf
2018-12-12 18:24:49,988	INFO	1 new files to register
2018-12-12 18:24:49,988	INFO	Computing file checksums
2018-12-12 18:24:50,335	INFO	Time spent on checksums: 0.01 mins [0.35 s]
2018-12-12 18:24:50,577	INFO	Registering files
2018-12-12 18:24:50,799	INFO	Files registered
2018-12-12 18:24:50,801	INFO	Total uptime: 2.0204 sec.

Note that gwrucio_registrar uses parallel processing, via the Python multiprocessing module, to speed up checksum calculations.

We now examine the replica in rucio.

List all DIDs in the ER8 scope:

(gwrucio) $ rucio list-dids ER8:*
+-------------------------+--------------+
| SCOPE:NAME              | [DID TYPE]   |
|-------------------------+--------------|
| ER8:H-H1_HOFT_C02       | DATASET      |
+-------------------------+--------------+

Show the members of the ER8:H-H1_HOFT_C02 dataset:

(gwrucio) $ rucio list-content ER8:H-H1_HOFT_C02
+---------------------------------------+--------------+
| SCOPE:NAME                            | [DID TYPE]   |
|---------------------------------------+--------------|
| ER8:H-H1_HOFT_C02-1125986304-4096.gwf | FILE         |
+---------------------------------------+--------------+

Details of all replicas of this file:

(gwrucio) $ rucio list-file-replicas ER8:H-H1_HOFT_C02-1125986304-4096.gwf
+---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| SCOPE   | NAME                              | FILESIZE   | ADLER32   | RSE: REPLICA                                                                                                                                        |
|---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------|
| ER8     | H-H1_HOFT_C02-1125986304-4096.gwf | 12.409 MB  | 9126a173  | LIGO-WA-ARCHIVE: gsiftp://ldas-pcdev6.ligo-wa.caltech.edu:2811/archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf |
+---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+

Finally, a list of files can be registered by writing the paths to a text file and expanding that file on the command line. For example:

(gwrucio) $ find /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259 -name '*gwf' -type f > ER-hoft_C02-H1-H-H1_HOFT_C02-11259.txt
(gwrucio) $ gwrucio_registrar -r ER8-HOFT_C02.yml \
    add-files --rset H-H1_HOFT_C02 \
    $(< ER-hoft_C02-H1-H-H1_HOFT_C02-11259.txt)

Register Files From DiskCache

This example assumes the presence of an LDAS_tools diskcache file. The examples directory of this repository contains scripts which:

  1. Rsync a directory of frame files, one file at a time with a 60 second delay, to a user-specified location to simulate frame production.

  2. Launch the DiskCache daemon to produce a diskcache file which updates as frames arrive.

This simulated frame production process can be controlled via a Makefile. For the rest of this exercise, we assume that the diskcache daemon is running.

The subcommand to register data from a diskcache is daemon:

usage: gwrucio_registrar daemon [-h] [--run-once] [--force-check]
                                    [--daemon-sleep DAEMON_SLEEP]
                                    [cachefile]

positional arguments:
  cachefile             Path to diskcache ascii dump [default:
                        /var/lib/diskcache/frame_cache_dump]

optional arguments:
  -h, --help            show this help message and exit
  --run-once            Run a single iteration
  --force-check         Always attempt to register files (regardless of
                        whether diskcache has been modified)
  --daemon-sleep DAEMON_SLEEP
                        Seconds to wait between checking diskcache for new
                        entries

When running in daemon mode, gwrucio_registrar performs the following workflow for each rset defined in the reg-file:

  1. Creates a dataset defined in the reg-file (unless it already exists).

  2. Scans the diskcache to identify any new files present since its last pass.

  3. Registers new files and attaches them to the current dataset.

  4. Pauses a configurable number of seconds.

  5. Repeat.
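
The workflow above can be sketched as a simple polling loop. This is a minimal illustration under stated assumptions, not the actual daemon: the callables find_files and register are hypothetical stand-ins for parsing the diskcache and for talking to rucio (dataset creation is assumed to happen inside register), and the parameter names merely mirror the command line options.

```python
import os
import time


def run_registration_loop(cachefile, rsets, find_files, register,
                          daemon_sleep=30, run_once=False, force_check=False):
    """Poll cachefile and register files not seen on earlier passes.

    find_files(cachefile, rset) -> iterable of paths, and
    register(rset_name, paths), are hypothetical callables from the caller.
    """
    seen = {name: set() for name in rsets}
    last_mtime = None
    while True:
        mtime = os.path.getmtime(cachefile)
        # Skip the scan when the cache is unchanged, unless forced
        if force_check or mtime != last_mtime:
            for name, rset in rsets.items():
                new_files = set(find_files(cachefile, rset)) - seen[name]
                if new_files:
                    register(name, sorted(new_files))
                    seen[name] |= new_files
            last_mtime = mtime
        if run_once:
            return seen
        # Pause before checking the diskcache again
        time.sleep(daemon_sleep)
```

Keeping a per-rset set of already handled paths is one simple way to make repeated passes skip files registered earlier; a real implementation could instead ask rucio which files are already attached to the dataset.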

The diskcache information is found by parsing the ascii dump of the cache using the DiskCache module originally from LDR.

One-time Registration

In this example, we use the daemon sub-command for a one-off registration of a dataset from a diskcache file.

First, define a reg-file:

H-H1_HOFT_C00:
  scope: "ER9"
  regexp: "H-H1_HOFT_C00"
  minimum-gps: 1151991808
  maximum-gps: 2000000000
  rse: LIGO-WA-ARCHIVE

Note that we have just chosen a subset of ER9 H1_HOFT_C00 data for this demonstration. Should we want to extend the dataset, we can increase the maximum-gps in the reg-file and repeat the registration; files registered the first time will be skipped when we repeat the process.

Finally, we register this data with:

(gwrucio) $ gwrucio_registrar -r ER9-H1_HOFT_C00.yml \
                        daemon --run-once    \
                        /home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump
2018-12-12 21:00:45,476	INFO	Starting gwrucio_registrar as daemon
2018-12-12 21:00:45,478	INFO	H-H1_HOFT_C00: reading diskcache [/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump]
2018-12-12 21:00:45,481	INFO	--------------------------------------------------
2018-12-12 21:00:45,481	INFO	H-H1_HOFT_C00: looking for new data
2018-12-12 21:00:47,487	INFO	9 new files to register
2018-12-12 21:00:47,487	INFO	Computing file checksums
2018-12-12 21:00:48,944	INFO	Time spent on checksums: 0.02 mins [1.46 s]
2018-12-12 21:00:52,476	INFO	Registering files
2018-12-12 21:00:56,016	INFO	Files registered
2018-12-12 21:00:56,017	INFO	Total uptime: 10.5417 sec.

where /home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump is the ASCII version of the frame cache we created for the exercise.

Finally, check that the data have been registered (this time using a filter to restrict the list of DIDs to files, rather than datasets and containers):

(gwrucio) $ rucio list-dids ER9:* --filter type=file
+---------------------------------------+--------------+
| SCOPE:NAME                            | [DID TYPE]   |
|---------------------------------------+--------------|
| ER9:H-H1_HOFT_C00-1151848448-4096.gwf | FILE         |
| ER9:H-H1_HOFT_C00-1151852544-4096.gwf | FILE         |
<snip>

Registration As A Daemon

Running gwrucio_registrar as a daemon is almost identical to the above; we just remove the --run-once option and run it as a background process, or as a daemon under, e.g., supervisord.

The command to run as a background process is best contained in a script (see, e.g., examples/start_daemon):

#!/bin/sh -e

export OMP_NUM_THREADS=10
register_cmd="gwrucio_registrar"
configfile="ER10-HOFT_C02.yml"
daemon_sleep=30
cachefile="/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump"
logfile="${register_cmd}_stdout_stderr.log"

## Build command
cmdline="${register_cmd} -r ${configfile} \
    daemon --daemon-sleep ${daemon_sleep} \
    ${cachefile}"

echo "Executing:"
echo "${cmdline}"
echo "Outputting logs to: ${logfile}"

## Run process in background
## ${cmdline} is deliberately unquoted so the shell splits it into arguments
nohup ${cmdline} > "${logfile}" 2>&1 &
echo $! > register_pid
echo "Process started as $(cat register_pid)"

exit 0

In this example, the reg-file contains two rsets (i.e., datasets to be registered):

H-H1_HOFT_C02:
  scope: "ER10"
  regexp: "H-H1_HOFT_C02"
  minimum-gps: 1163174417
  maximum-gps: 1164556817
  rse: LIGO-CIT-ARCHIVE

L-L1_HOFT_C02:
  scope: "ER10"
  regexp: "L-L1_HOFT_C02"
  minimum-gps: 1161964817
  maximum-gps: 1164556817
  rse: LIGO-CIT-ARCHIVE

This process produces datasets ER10:H-H1_HOFT_C02 and ER10:L-L1_HOFT_C02 which will continuously update as long as new files whose names match the given regexp patterns arrive in the diskcache within the specified time intervals.