Data Registration¶
Registration involves recording the physical location, sizes and checksums (Adler32 and MD5) of files in the rucio database. Files are initially registered at a specific RSE.
This page discusses:
Ad hoc File Registration: create datasets and register individual files, or lists of files, interactively.
Typical use cases: testing and development.
Using DiskCache: create datasets and register files whose locations, types and times are recorded in an LDAS_tools diskcache. Can run interactively or as a background process.
Typical use cases: production environments, rolling buffers.
Registration is performed using the gwrucio_registrar utility:
usage: gwrucio_registrar [-h] -r REG_SCRIPT [--dry-run] [--verbose]
[--lifetime LIFETIME] [--force-checksums]
{add-files,daemon} ...
Command line tool to register LIGO/Virgo datasets into rucio. Data may be
registered as individual files, ascii lists of files, or registered on the fly
as a background process monitoring a DiskCacheFile.
positional arguments:
{add-files,daemon}
add-files Register individual files.
daemon Monitor a diskcache and register files on the fly.
optional arguments:
-h, --help show this help message and exit
-r REG_SCRIPT, --reg-script REG_SCRIPT
YAML instructions for end point and data naming
--dry-run Find files, construct replica list but don't actually
upload to rucio
--verbose Print all logging info
--lifetime LIFETIME Dataset lifetime in seconds
--force-checksums Compute checksums and register files even if they are
already present
This utility registers files and attaches them to a dataset (simply a collection of files).
All registration operations with gwrucio_registrar require a registration
script which defines the dataset to which files will be attached. The
registration script is a small YAML file (much like JSON psets in LDR). Here's
an example which defines an ER8 C02 h(t) dataset:
H-H1_HOFT_C02:
scope: "ER8"
regexp: "H-H1_HOFT_C02"
minimum-gps: 1125969920
maximum-gps: 2000000000
rse: LIGO-WA-ARCHIVE
The section heading H-H1_HOFT_C02 is used to name the dataset to be registered. This is the rset. The DID of this dataset in rucio will be ER8:H-H1_HOFT_C02 (i.e., scope:rset-name). The dataset will be created if it does not already exist.
scope: determines the scope for the dataset and any associated files.
regexp: a pattern used to identify files when using a diskcache.
minimum/maximum-gps: used to identify files within some time range when using a diskcache.
rse: the files will be registered as being at LIGO-WA-ARCHIVE.
The registration script may contain any number of rsets.
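The way an rset's fields combine to select files and form DIDs can be sketched in Python. This is only an illustration, not the gwrucio_registrar implementation: the `RSet` tuple and the helper names are hypothetical, and the sketch assumes that a file matches when its name satisfies regexp and its GPS start time (taken from the conventional `<site>-<description>-<gps start>-<duration>.gwf` frame file name) lies between minimum-gps and maximum-gps inclusive. The actual boundary handling in the tool may differ.

```python
import os
import re
from typing import NamedTuple


class RSet(NamedTuple):
    """Hypothetical in-memory form of one rset from the registration script."""
    name: str
    scope: str
    regexp: str
    minimum_gps: int
    maximum_gps: int
    rse: str


# Frame file names end in -<gps start>-<duration>.gwf
FRAME_NAME = re.compile(r"^(?P<prefix>.+)-(?P<start>\d+)-(?P<duration>\d+)\.gwf$")


def dataset_did(rset):
    """DID of the dataset: scope:rset-name."""
    return f"{rset.scope}:{rset.name}"


def file_did(rset, path):
    """DID of a file: the rset's scope plus the file's base name."""
    return f"{rset.scope}:{os.path.basename(path)}"


def matches(rset, path):
    """Assumed selection rule: name matches regexp and GPS start is in range."""
    name = os.path.basename(path)
    parsed = FRAME_NAME.match(name)
    if parsed is None or re.search(rset.regexp, name) is None:
        return False
    start = int(parsed.group("start"))
    return rset.minimum_gps <= start <= rset.maximum_gps
```

With the ER8 rset shown above, `dataset_did` gives ER8:H-H1_HOFT_C02, and a file named H-H1_HOFT_C02-1125986304-4096.gwf is selected because its start time falls after minimum-gps.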
Register A List Of Files¶
In this example, we will use the add-files sub-command:
usage: gwrucio_registrar add-files [-h] --rset RSET files [files ...]
positional arguments:
files Files for registration
optional arguments:
-h, --help show this help message and exit
--rset RSET Registration set in the YAML configuration you wish to register
(only 1 permitted at this time)
The add-files command requires specification of the rset to be registered: it
is assumed that the files being supplied belong to a single rset. After that,
the files for registration are just supplied as positional arguments.
In this example we use the registration script listed earlier on this page to
register a single frame file from ER8 and attach it to the rset H-H1_HOFT_C02:
(gwrucio) $ export OMP_NUM_THREADS=10
(gwrucio) $ gwrucio_registrar -r ER8-HOFT_C02.yml \
add-files --rset H-H1_HOFT_C02 \
/archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf
2018-12-12 18:24:48,781 INFO Rset contains: /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf
2018-12-12 18:24:49,988 INFO 1 new files to register
2018-12-12 18:24:49,988 INFO Computing file checksums
2018-12-12 18:24:50,335 INFO Time spent on checksums: 0.01 mins [0.35 s]
2018-12-12 18:24:50,577 INFO Registering files
2018-12-12 18:24:50,799 INFO Files registered
2018-12-12 18:24:50,801 INFO Total uptime: 2.0204 sec.
Note that gwrucio_registrar uses parallel processing, via the Python
multiprocessing module, to speed up checksum calculations.
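As a rough illustration of what that step involves, the following sketch computes the size, Adler32 and MD5 of each file in a pool of worker processes. It is not taken from gwrucio_registrar: the function names, the chunk size and the record layout are assumptions made for this example, and the 8-digit hexadecimal Adler32 format is chosen to match the value shown by rucio list-file-replicas on this page.

```python
import hashlib
import os
import sys
import zlib
from multiprocessing import Pool

CHUNK = 1024 * 1024  # read files in 1 MiB blocks (arbitrary choice)


def file_record(path):
    """Return size and checksums for one file, reading it only once."""
    adler = 1  # Adler-32 running value starts at 1
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(CHUNK), b""):
            adler = zlib.adler32(block, adler)
            md5.update(block)
    return {
        "path": path,
        "bytes": os.path.getsize(path),
        "adler32": f"{adler & 0xFFFFFFFF:08x}",
        "md5": md5.hexdigest(),
    }


def file_records(paths, processes=None):
    """Checksum many files in parallel; processes=None uses all CPUs."""
    with Pool(processes=processes) as pool:
        return pool.map(file_record, paths)


if __name__ == "__main__" and len(sys.argv) > 1:
    for record in file_records(sys.argv[1:]):
        print(record)
```

Each file is handled by a single worker, so the speed-up comes from checksumming several files at once and is limited by disk throughput as much as by CPU count.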
We now examine the replica in rucio.
List all DIDs in the ER8 scope:
(gwrucio) $ rucio list-dids ER8:*
+-------------------------+--------------+
| SCOPE:NAME | [DID TYPE] |
|-------------------------+--------------|
| ER8:H-H1_HOFT_C02 | DATASET |
+-------------------------+--------------+
Show the members of the ER8:H-H1_HOFT_C02 dataset:
(gwrucio) $ rucio list-content ER8:H-H1_HOFT_C02
+---------------------------------------+--------------+
| SCOPE:NAME | [DID TYPE] |
|---------------------------------------+--------------|
| ER8:H-H1_HOFT_C02-1125986304-4096.gwf | FILE |
+---------------------------------------+--------------+
Details of all replicas of this file:
(gwrucio) $ rucio list-file-replicas ER8:H-H1_HOFT_C02-1125986304-4096.gwf
+---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| SCOPE | NAME | FILESIZE | ADLER32 | RSE: REPLICA |
|---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------|
| ER8 | H-H1_HOFT_C02-1125986304-4096.gwf | 12.409 MB | 9126a173 | LIGO-WA-ARCHIVE: gsiftp://ldas-pcdev6.ligo-wa.caltech.edu:2811/archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf |
+---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
Finally, a list of files can be supplied using shell command substitution. For example:
(gwrucio) $ find /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259 -name '*gwf' -type f > ER-hoft_C02-H1-H-H1_HOFT_C02-11259.txt
(gwrucio) $ gwrucio_registrar -r ER8-HOFT_C02.yml \
add-files --rset H-H1_HOFT_C02 \
$(< ER-hoft_C02-H1-H-H1_HOFT_C02-11259.txt)
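For very long file lists, expanding everything onto one command line can exceed the operating system's argument-length limit. One possible workaround, sketched here in Python, is to split the list into chunks and call gwrucio_registrar once per chunk. The helper names and the default chunk size are assumptions for illustration; only the command-line options documented on this page are used.

```python
import subprocess


def read_file_list(list_path):
    """Read one file path per line, skipping blank lines."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def add_files_commands(reg_script, rset, paths, chunk_size=500, dry_run=False):
    """Build one add-files command (as an argument list) per chunk of paths."""
    base = ["gwrucio_registrar", "-r", reg_script]
    if dry_run:
        base.append("--dry-run")  # global option, placed before the sub-command
    commands = []
    for start in range(0, len(paths), chunk_size):
        chunk = paths[start:start + chunk_size]
        commands.append(base + ["add-files", "--rset", rset] + chunk)
    return commands


def register_from_list(reg_script, rset, list_path, **kwargs):
    """Run each command in turn, stopping at the first failure."""
    for command in add_files_commands(reg_script, rset,
                                      read_file_list(list_path), **kwargs):
        subprocess.run(command, check=True)
```

Because files that are already registered are skipped on later runs, an interrupted batch can in principle be restarted from the same list.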
Register Files From DiskCache¶
This example assumes the presence of an LDAS_tools diskcache file. The examples directory of this repository contains scripts which:
Rsync a directory of frame files, one file at a time with a 60 second delay, to a user-specified location to simulate frame production.
Launch the DiskCache daemon to produce a diskcache file which updates as frames arrive.
This simulated frame production process can be controlled via a Makefile. For the rest of this exercise, we assume that the diskcache daemon is running.
The subcommand to register data from a diskcache is daemon:
usage: gwrucio_registrar daemon [-h] [--run-once] [--force-check]
[--daemon-sleep DAEMON_SLEEP]
[cachefile]
positional arguments:
cachefile Path to diskcache ascii dump [default:
/var/lib/diskcache/frame_cache_dump]
optional arguments:
-h, --help show this help message and exit
--run-once Run a single iteration
--force-check Always attempt to register files (regardless of
whether diskcache has been modified)
--daemon-sleep DAEMON_SLEEP
Seconds to wait between checking diskcache for new
entries
When running in daemon mode, gwrucio_registrar performs the following workflow
for each rset defined in the reg-file:
Creates a dataset defined in the reg-file (unless it already exists).
Scans the diskcache to identify any new files present since its last pass.
Registers new files and attaches them to the current dataset.
Pauses a configurable number of seconds.
Repeats.
The diskcache information is found by parsing the ASCII dump of the cache using the DiskCache module originally from LDR.
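The workflow above can be sketched as a simple polling loop. This is a schematic under stated assumptions rather than the tool's actual code: `ensure_dataset`, `scan` and `register` are hypothetical callables standing in for dataset creation, diskcache parsing and rucio registration; the set of already registered files is kept in memory here; and the modification-time check is one guess at how --force-check might relate to an unchanged diskcache.

```python
import os
import time


def run_registration_loop(cachefile, ensure_dataset, scan, register,
                          sleep_seconds=30, run_once=False, force_check=False):
    """Poll a diskcache dump and register files not seen on earlier passes."""
    ensure_dataset()       # create the dataset unless it already exists
    registered = set()     # assumption: track registered files in memory
    last_mtime = None
    while True:
        mtime = os.path.getmtime(cachefile)
        # Skip the scan when the dump is unchanged, unless forced.
        if force_check or mtime != last_mtime:
            last_mtime = mtime
            new_files = [path for path in scan(cachefile)
                         if path not in registered]
            if new_files:
                register(new_files)
                registered.update(new_files)
        if run_once:
            return registered
        time.sleep(sleep_seconds)
```

With `run_once=True` the function performs a single pass and returns, which mirrors the --run-once option; otherwise it loops until the process is stopped.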
One-time Registration¶
In this example, we use the daemon sub-command for a one-off registration of
a dataset from a diskcache file.
First, define a reg-file:
H-H1_HOFT_C00:
scope: "ER9"
regexp: "H-H1_HOFT_C00"
minimum-gps: 1151991808
maximum-gps: 2000000000
rse: LIGO-WA-ARCHIVE
Note that we have just chosen a subset of ER9 H1_HOFT_C00 data for this
demonstration. Should we want to extend the dataset, we can increase the
maximum-gps in the reg-file and repeat the registration; files registered
the first time will be skipped when we repeat the process.
We then register this data with:
(gwrucio) $ gwrucio_registrar -r ER9-H1_HOFT_C00.yml \
daemon --run-once \
/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump
2018-12-12 21:00:45,476 INFO Starting gwrucio_registrar as daemon
2018-12-12 21:00:45,478 INFO H-H1_HOFT_C00: reading diskcache [/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump]
2018-12-12 21:00:45,481 INFO --------------------------------------------------
2018-12-12 21:00:45,481 INFO H-H1_HOFT_C00: looking for new data
2018-12-12 21:00:47,487 INFO 9 new files to register
2018-12-12 21:00:47,487 INFO Computing file checksums
2018-12-12 21:00:48,944 INFO Time spent on checksums: 0.02 mins [1.46 s]
2018-12-12 21:00:52,476 INFO Registering files
2018-12-12 21:00:56,016 INFO Files registered
2018-12-12 21:00:56,017 INFO Total uptime: 10.5417 sec.
Here, /home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump is the
ASCII version of the frame cache we created for the exercise.
Finally, check the data has been registered (this time using a filter to restrict the list of DIDs to files, rather than datasets and containers):
(gwrucio) $ rucio list-dids ER9:* --filter type=file
+---------------------------------------+--------------+
| SCOPE:NAME | [DID TYPE] |
|---------------------------------------+--------------|
| ER9:H-H1_HOFT_C00-1151848448-4096.gwf | FILE |
| ER9:H-H1_HOFT_C00-1151852544-4096.gwf | FILE |
<snip>
Registration As A Daemon¶
Running gwrucio_registrar as a daemon is almost identical to the above;
we just remove the --run-once option and run it as a background process or as a
daemon under, e.g., supervisord.
The command to run as a background process is best contained in a script (see,
e.g., examples/start_daemon).
#!/bin/sh -e
export OMP_NUM_THREADS=10
register_cmd="gwrucio_registrar"
configfile="ER10-HOFT_C02.yml"
daemon_sleep=30
cachefile="/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump"
logfile="${register_cmd}_stdout_stderr.log"
## Build command
cmdline="${register_cmd} -r ${configfile} \
daemon --daemon-sleep ${daemon_sleep} \
${cachefile}"
echo "Executing:"
echo "${cmdline}"
echo "Outputting logs to: ${logfile}"
## Run process in background
## (${cmdline} is deliberately left unquoted so the shell splits it into arguments)
nohup ${cmdline} > "${logfile}" 2>&1 &
echo $! > register_pid
echo "Process started as $(cat register_pid)"
exit 0
In this example, the reg-file contains two rsets (i.e., datasets to be registered):
H-H1_HOFT_C02:
scope: "ER10"
regexp: "H-H1_HOFT_C02"
minimum-gps: 1163174417
maximum-gps: 1164556817
rse: LIGO-CIT-ARCHIVE
L-L1_HOFT_C02:
scope: "ER10"
regexp: "L-L1_HOFT_C02"
minimum-gps: 1161964817
maximum-gps: 1164556817
rse: LIGO-CIT-ARCHIVE
This process produces the datasets ER10:H-H1_HOFT_C02 and ER10:L-L1_HOFT_C02,
which will continue to update as long as new files arrive in the diskcache whose
names match the given regexp patterns and whose times fall within the specified
GPS intervals.
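The start script above writes the PID of the background process to a file named register_pid. A minimal sketch of how that file could be used to check on or stop the process is given below. The helper names are hypothetical, the code is POSIX-only, and it assumes that sending SIGTERM is an acceptable way to stop gwrucio_registrar; this page does not document how the tool handles signals.

```python
import os
import signal


def read_pid(pid_file="register_pid"):
    """Read the PID recorded by the start script."""
    with open(pid_file) as handle:
        return int(handle.read().strip())


def is_running(pid):
    """Check for a live process by sending signal 0 (no signal is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_daemon(pid_file="register_pid"):
    """Send SIGTERM to the recorded process; return True if a signal was sent."""
    pid = read_pid(pid_file)
    if not is_running(pid):
        return False
    os.kill(pid, signal.SIGTERM)
    return True
```

When the registrar runs under a process manager such as supervisord, that manager's own start and stop commands would normally take the place of a PID file.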