# Data Registration

Registration involves recording the physical location, sizes and checksums (Adler32 and MD5) of files in the rucio database. Files are initially registered at a specific RSE.

This page discusses:

* [Ad hoc File Registration](#ad-hoc-file-registration): create datasets and register individual files, or lists of files, interactively.
    * Typical use cases: testing and development.
* [Using DiskCache](#using-diskcache): create datasets and register files whose locations, types and times are recorded in an [LDAS_tools](https://git.ligo.org/ldastools/LDAS_Tools) diskcache. Can run interactively or as a background process.
    * Typical use cases: production environments, rolling buffers.

Registration is performed using the `gwrucio_registrar` utility:

```
usage: gwrucio_registrar [-h] -r REG_SCRIPT [--dry-run] [--verbose]
                         [--lifetime LIFETIME] [--force-checksums]
                         {add-files,daemon} ...

Command line tool to register LIGO/Virgo datasets into rucio. Data may be
registered as individual files, ascii lists of files, or registered on the fly
as a background process monitoring a DiskCacheFile.

positional arguments:
  {add-files,daemon}
    add-files           Register individual files.
    daemon              Monitor a diskcache and register files on the fly.

optional arguments:
  -h, --help            show this help message and exit
  -r REG_SCRIPT, --reg-script REG_SCRIPT
                        YAML instructions for end point and data naming
  --dry-run             Find files, construct replica list but don't actually
                        upload to rucio
  --verbose             Print all logging info
  --lifetime LIFETIME   Dataset lifetime in seconds
  --force-checksums     Compute checksums and register files even if they are
                        already present
```

This utility registers files and attaches them to a dataset (simply a collection of files).

All registration operations with `gwrucio_registrar` require a registration script which defines the dataset to which files will be attached. The registration script is a small YAML file (much like JSON psets in LDR).

Here's an example which defines an ER8 C02 h(t) dataset:

```
H-H1_HOFT_C02:
  scope: "ER8"
  regexp: "H-H1_HOFT_C02"
  minimum-gps: 1125969920
  maximum-gps: 2000000000
  rse: LIGO-WA-ARCHIVE
```

* The section heading `H-H1_HOFT_C02` names the dataset to be registered. This is the `rset`. The DID of this dataset in rucio will be `ER8:H-H1_HOFT_C02` (i.e., `scope:rset-name`). The dataset will be created if it does not already exist.
* `scope`: determines the scope for the dataset and any associated files.
* `regexp`: a pattern used to identify files *when using a diskcache*.
* `minimum-gps`/`maximum-gps`: used to identify files within some time range *when using a diskcache*.
* `rse`: the files will be registered as being at `LIGO-WA-ARCHIVE`.

The registration script may contain any number of rsets.

## Register A List Of Files

In this example, we will use the `add-files` sub-command:

```
usage: gwrucio_registrar add-files [-h] --rset RSET files [files ...]

positional arguments:
  files        Files for registration

optional arguments:
  -h, --help   show this help message and exit
  --rset RSET  Registration set in the YAML configuration you wish to register
               (only 1 permitted at this time)
```

The `add-files` command requires the rset to be specified: the files supplied are assumed to belong to a single rset. The files for registration are then supplied as positional arguments.

In this example we register a single frame file from ER8 and attach it to the rset `H-H1_HOFT_C02`.

Using the registration file listed earlier on this page:

```
(gwrucio) $ export OMP_NUM_THREADS=10
(gwrucio) $ gwrucio_registrar -r ER8-HOFT_C02.yml \
    add-files --rset H-H1_HOFT_C02 \
    /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf
2018-12-12 18:24:48,781 INFO Rset contains: /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf
2018-12-12 18:24:49,988 INFO 1 new files to register
2018-12-12 18:24:49,988 INFO Computing file checksums
2018-12-12 18:24:50,335 INFO Time spent on checksums: 0.01 mins [0.35 s]
2018-12-12 18:24:50,577 INFO Registering files
2018-12-12 18:24:50,799 INFO Files registered
2018-12-12 18:24:50,801 INFO Total uptime: 2.0204 sec.
```

Note that `gwrucio_registrar` uses parallel processing via the [`multiprocessing`](https://docs.python.org/2/library/multiprocessing.html) Python module to speed up checksum calculations.

We now examine the replica in rucio. List all DIDs in the ER8 scope:

```
(gwrucio) $ rucio list-dids ER8:*
+-------------------------+--------------+
| SCOPE:NAME              | [DID TYPE]   |
|-------------------------+--------------|
| ER8:H-H1_HOFT_C02       | DATASET      |
+-------------------------+--------------+
```

Show the members of the `ER8:H-H1_HOFT_C02` dataset:

```
(gwrucio) $ rucio list-content ER8:H-H1_HOFT_C02
+---------------------------------------+--------------+
| SCOPE:NAME                            | [DID TYPE]   |
|---------------------------------------+--------------|
| ER8:H-H1_HOFT_C02-1125986304-4096.gwf | FILE         |
+---------------------------------------+--------------+
```

Details of all replicas of this file:

```
(gwrucio) $ rucio list-file-replicas ER8:H-H1_HOFT_C02-1125986304-4096.gwf
+---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| SCOPE   | NAME                              | FILESIZE   | ADLER32   | RSE: REPLICA                                                                                                                                        |
|---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------|
| ER8     | H-H1_HOFT_C02-1125986304-4096.gwf | 12.409 MB  | 9126a173  | LIGO-WA-ARCHIVE: gsiftp://ldas-pcdev6.ligo-wa.caltech.edu:2811/archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259/H-H1_HOFT_C02-1125986304-4096.gwf |
+---------+-----------------------------------+------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
```

Finally, a list of files can be added using shell expansion. For example:

```
(gwrucio) $ find /archive/frames/ER8/hoft_C02/H1/H-H1_HOFT_C02-11259 -name '*gwf' -type f > ER-hoft_C02-H1-H-H1_HOFT_C02-11259.txt
(gwrucio) $ gwrucio_registrar -r ER8-HOFT_C02.yml \
    add-files --rset H-H1_HOFT_C02 \
    $(< ER-hoft_C02-H1-H-H1_HOFT_C02-11259.txt)
```

## Register Files From DiskCache

This example assumes the presence of an [LDAS_tools](https://git.ligo.org/ldastools/LDAS_Tools) diskcache file. The [examples](https://git.ligo.org/james-clark/gwrucio/tree/master/examples/diskcache) directory of this repository contains scripts which:

1. Rsync a directory of frame files, one file at a time with a 60 second delay, to a user-specified location to simulate frame production.
1. Launch the DiskCache daemon to produce a diskcache file which updates as frames arrive.

This simulated frame production process can be controlled via a [Makefile](https://git.ligo.org/james-clark/gwrucio/blob/master/examples/diskcache/Makefile).

For the rest of this exercise, we assume that the diskcache daemon is running.

The subcommand to register data from a diskcache is `daemon`:

```
usage: gwrucio_registrar daemon [-h] [--run-once] [--force-check]
                                [--daemon-sleep DAEMON_SLEEP]
                                [cachefile]

positional arguments:
  cachefile             Path to diskcache ascii dump [default:
                        /var/lib/diskcache/frame_cache_dump]

optional arguments:
  -h, --help            show this help message and exit
  --run-once            Run a single iteration
  --force-check         Always attempt to register files (regardless of
                        whether diskcache has been modified)
  --daemon-sleep DAEMON_SLEEP
                        Seconds to wait between checking diskcache for new
                        entries
```

When running in `daemon` mode, `gwrucio_registrar` performs the following workflow for each rset defined in the reg-file:

1. Creates the dataset defined in the reg-file (unless it already exists).
1. Scans the diskcache to identify any new files present since its last pass.
1. Registers new files and attaches them to the current dataset.
1. Pauses for a configurable number of seconds.
1. Repeats.

The diskcache information is found by parsing the ascii dump of the cache using the [`DiskCache`](https://git.ligo.org/lvcomputing/ligo-data-replicator/blob/master/python-ldrv1-master/ldrv1/diskcache.py) module originally from LDR.

### One-time Registration

In this example, we use the `daemon` sub-command for a one-off registration of a dataset from a diskcache file. First, define a reg-file:

```
H-H1_HOFT_C00:
  scope: "ER9"
  regexp: "H-H1_HOFT_C00"
  minimum-gps: 1151991808
  maximum-gps: 2000000000
  rse: LIGO-WA-ARCHIVE
```

Note that we have chosen only a subset of ER9 `H1_HOFT_C00` data for this demonstration. To extend the dataset later, increase `maximum-gps` in the reg-file and repeat the registration; files registered the first time will be skipped when the process is repeated.

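
The workflow above, and the reason a repeated registration skips files it has already handled, can be illustrated with a minimal Python sketch. This is not the `gwrucio_registrar` implementation: `find_candidates` and `register` are hypothetical callables standing in for the diskcache scan and the rucio registration, and the in-memory `registered` set stands in for whatever record of existing replicas the real tool consults.

```python
import time


def register_new_files(candidates, registered, register):
    """Register only the candidates that are not already in `registered`.

    candidates: iterable of file paths found in the diskcache (hypothetical input)
    registered: set of paths already registered; updated in place
    register:   hypothetical callable that receives the list of new files
    """
    new_files = sorted(set(candidates) - registered)
    if new_files:
        register(new_files)
        registered.update(new_files)
    return new_files


def run(find_candidates, registered, register, run_once=False, daemon_sleep=30):
    """Scan, register, sleep, repeat; stop after one pass if run_once is set."""
    while True:
        register_new_files(find_candidates(), registered, register)
        if run_once:
            break
        time.sleep(daemon_sleep)
```

Calling `register_new_files` twice with an overlapping list of paths passes only the previously unseen paths to `register` on the second call, which mirrors the skip-on-repeat behaviour described above.
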
Finally, we register this data with:

```
(gwrucio) $ gwrucio_registrar -r ER9-H1_HOFT_C00.yml \
    daemon --run-once \
    /home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump
2018-12-12 21:00:45,476 INFO Starting gwrucio_registrar as daemon
2018-12-12 21:00:45,478 INFO H-H1_HOFT_C00: reading diskcache [/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump]
2018-12-12 21:00:45,481 INFO --------------------------------------------------
2018-12-12 21:00:45,481 INFO H-H1_HOFT_C00: looking for new data
2018-12-12 21:00:47,487 INFO 9 new files to register
2018-12-12 21:00:47,487 INFO Computing file checksums
2018-12-12 21:00:48,944 INFO Time spent on checksums: 0.02 mins [1.46 s]
2018-12-12 21:00:52,476 INFO Registering files
2018-12-12 21:00:56,016 INFO Files registered
2018-12-12 21:00:56,017 INFO Total uptime: 10.5417 sec.
```

Here, `/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump` is the ASCII version of the frame cache we created for the exercise.

Finally, check that the data has been registered (this time using a filter to restrict the list of DIDs to files, rather than datasets and containers):

```
(gwrucio) $ rucio list-dids ER9:* --filter type=file
+---------------------------------------+--------------+
| SCOPE:NAME                            | [DID TYPE]   |
|---------------------------------------+--------------|
| ER9:H-H1_HOFT_C00-1151848448-4096.gwf | FILE         |
| ER9:H-H1_HOFT_C00-1151852544-4096.gwf | FILE         |
```

### Registration As A Daemon

Running `gwrucio_registrar` as a daemon is almost identical to the above; we just remove the `--run-once` option and run as a background process or as a daemon under e.g., [supervisord](http://supervisord.org/).

The command to run as a background process is best contained in a script (see e.g., [`examples/start_daemon`](https://git.ligo.org/james-clark/gwrucio/blob/master/examples/gwrucio_registrar/start_daemon)).

```
#!/bin/sh -e
export OMP_NUM_THREADS=10

register_cmd="gwrucio_registrar"
configfile="ER10-HOFT_C02.yml"
daemon_sleep=30
cachefile="/home/jclark/Projects/ligo-rucio/diskcache/frame_cache_dump"
logfile="${register_cmd}_stdout_stderr.log"

## Build command
cmdline="${register_cmd} -r ${configfile} \
    daemon --daemon-sleep ${daemon_sleep} \
    ${cachefile}"

echo "Executing:"
echo $cmdline
echo "Outputting logs to: ${logfile}"

## Run process in background
nohup ${cmdline} > ${logfile} 2>&1 &
echo $! > register_pid
echo Process started as `cat register_pid`

exit 0
```

In this example, the reg-file contains two rsets (i.e., datasets to be registered):

```
H-H1_HOFT_C02:
  scope: "ER10"
  regexp: "H-H1_HOFT_C02"
  minimum-gps: 1163174417
  maximum-gps: 1164556817
  rse: LIGO-CIT-ARCHIVE
L-L1_HOFT_C02:
  scope: "ER10"
  regexp: "L-L1_HOFT_C02"
  minimum-gps: 1161964817
  maximum-gps: 1164556817
  rse: LIGO-CIT-ARCHIVE
```

This process produces the datasets `ER10:H-H1_HOFT_C02` and `ER10:L-L1_HOFT_C02`, which will continue to update as long as new files whose names match the given regexp patterns arrive in the diskcache within the specified time intervals.
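
To make the selection rule concrete, here is a minimal Python sketch of how a file name could be tested against an rset's `regexp`, `minimum-gps` and `maximum-gps`. It is an illustration only, not the code used by `gwrucio_registrar`: the function name `in_rset` is hypothetical, it assumes frame files are named `<prefix>-<gps-start>-<duration>.gwf` as in the examples on this page, and it assumes a file qualifies only if its whole time span lies inside the GPS window (the real tool may treat the boundaries differently).

```python
import os
import re


def in_rset(path, regexp, minimum_gps, maximum_gps):
    """Return True if the frame file at `path` would belong to the rset.

    Assumes names of the form <prefix>-<gps-start>-<duration>.gwf and requires
    the file's full [start, start + duration] span to lie in the GPS window.
    """
    name = os.path.basename(path)
    if re.search(regexp, name) is None:
        return False
    stem = name.rsplit(".", 1)[0]                # drop the .gwf extension
    _, start, duration = stem.rsplit("-", 2)     # last two fields are GPS start and duration
    start = int(start)
    end = start + int(duration)
    return minimum_gps <= start and end <= maximum_gps


# Values taken from the H-H1_HOFT_C02 rset in the reg-file above
rset = {"regexp": "H-H1_HOFT_C02", "minimum_gps": 1163174417, "maximum_gps": 1164556817}
matches = in_rset("/archive/H-H1_HOFT_C02-1163177984-4096.gwf", **rset)
```

With these settings, an `L-L1_HOFT_C02` file is rejected by the pattern, and an ER8-era `H-H1_HOFT_C02` file is rejected because its GPS start falls before `minimum-gps`.
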