Architecture

This page explains how the Arrakis system works and how the Python client interacts with it. Understanding the architecture is not required to use the library, but it helps when debugging connection issues or reasoning about streaming behavior.

Overview

Arrakis is a distributed timeseries data system built on two core technologies:

  • Apache Arrow Flight for request/response communication (metadata queries, historical data retrieval, publisher registration).
  • Apache Kafka for real-time data streaming and publishing.

The Python client abstracts both behind a unified API.

Request Flow

When you call a function like arrakis.fetch() or arrakis.find(), the client:

  1. Opens a gRPC connection to the Arrakis Flight server.
  2. Sends a flight descriptor encoding the request type and parameters as JSON.
  3. Receives a FlightInfo response listing one or more endpoints.
  4. Connects to each endpoint and reads Arrow record batches.
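The descriptor in step 2 is just a small JSON payload naming the request type and its parameters. A minimal sketch of that encoding, with hypothetical field names chosen for illustration (the real client's payload schema may differ):

```python
import json

def make_descriptor(request_type, **params):
    """Encode a request type and its parameters as a JSON payload
    suitable for a Flight descriptor (field names are illustrative)."""
    return json.dumps({"request": request_type, "args": params}).encode()

# A Find request searching for channels by pattern:
payload = make_descriptor("find", pattern="H1:*", min_rate=16)
decoded = json.loads(payload)
```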

The supported request types are:

Request    Description                                  API
-------    -----------                                  ---
Find       Search for channels by pattern and filters   find(), count()
Describe   Get metadata for named channels              describe()
Stream     Open a data stream (Flight or Kafka)         stream()
Publish    Get Kafka producer configuration             Publisher.__enter__()
Count      Count matching channels                      count()

The client also supports Flight actions for server-level queries that don't involve data transfer:

Action       Description                                  API
------       -----------                                  ---
server-info  Server version, backend type, capabilities   server_info()
scope-map    Available domains and endpoint routing       domains()

Metadata queries (find, count, describe) accept an optional time parameter that routes the request to the appropriate backend for that point in time (e.g., a historical archive vs the live system).
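The routing decision can be sketched as a simple time-based dispatch. This is not the server's actual logic, just an illustration of the idea, with an assumed cutoff value and backend names:

```python
# Assumed boundary between archived and live data (illustrative value).
LIVE_CUTOFF = 1_400_000_000

def route_backend(time=None):
    """Pick which backend should serve a metadata query: the live
    system when no time is given or the time is recent, otherwise
    a historical archive. Purely a sketch of the routing concept."""
    if time is None or time >= LIVE_CUTOFF:
        return "live"
    return "archive"
```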

Streaming Architecture

Streaming is more complex than simple request/response. Depending on the server configuration, data may arrive via Arrow Flight endpoints or Kafka topics.

Flight streaming

For historical data, the server typically returns Flight endpoints. The client reads record batches from each endpoint in parallel using MultiFlightReader.
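The parallel-read pattern can be sketched with a thread pool: one worker per endpoint, with the resulting batches concatenated. The real MultiFlightReader reads Arrow record batches over Flight; here plain lists stand in for endpoints and batches:

```python
from concurrent.futures import ThreadPoolExecutor

def read_endpoint(endpoint):
    """Stand-in for reading all record batches from one Flight
    endpoint; each 'endpoint' here is just a list of fake batches."""
    return list(endpoint)

def read_all(endpoints):
    """Read every endpoint in parallel and concatenate the batches,
    roughly the shape of a multi-endpoint reader."""
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        results = pool.map(read_endpoint, endpoints)
        return [batch for batches in results for batch in batches]
```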

Kafka streaming

For live data, the server may return a Kafka endpoint. The client creates a KafkaReader that subscribes to the relevant Kafka topics and deserializes Arrow record batches from each message.
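The consume-and-deserialize loop looks roughly like this. A generator stands in for the Kafka consumer, and JSON stands in for the Arrow IPC deserialization step, purely for brevity:

```python
import json

def fake_consumer(topic_messages):
    """Stand-in for a Kafka consumer subscribed to several topics,
    yielding raw message payloads in arrival order."""
    for messages in topic_messages:
        yield from messages

def read_blocks(consumer):
    """Deserialize each message payload into a block. The real client
    decodes Arrow record batches; JSON is used here as a placeholder."""
    for payload in consumer:
        yield json.loads(payload)
```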

The Multiplexer

Regardless of transport, incoming blocks feed into the BlockMuxStream (arrakis.mux.BlockMuxStream). The multiplexer:

  • Maintains a timed queue per stream, where each queue has a known stride (block duration) and timeout.
  • Synchronizes blocks across streams so that all channels are aligned to the same timestamp before being yielded.
  • Fills gaps when data does not arrive before the timeout. Gap blocks contain NumPy masked arrays with all values masked.
  • Aligns queues by draining elements from queues that are ahead until all queues start at the same timestamp.

The overall stride of the muxer is the least common multiple of the individual stream strides. The timeout is the maximum of the individual stream latencies.

For single-stream scenarios (e.g., a client reading pre-muxed data from a Flight endpoint), timeout-based gap filling is disabled to avoid racing incoming data.
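Two of the muxer's rules are easy to show concretely: the combined stride/timeout computation, and what a gap block looks like. A minimal sketch using NumPy masked arrays (function names are illustrative, not the library's API):

```python
import math
import numpy as np

def mux_parameters(strides, latencies):
    """The overall stride is the least common multiple of the
    per-stream strides; the timeout is the largest stream latency."""
    return math.lcm(*strides), max(latencies)

def gap_block(n_samples):
    """A gap block: a fully masked array standing in for data that
    did not arrive before the timeout."""
    return np.ma.masked_all(n_samples)

# Two streams with 4 s and 6 s blocks combine into a 12 s stride:
stride, timeout = mux_parameters([4, 6], [0.5, 2.0])
```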

Publishing Architecture

Publishing reverses the data flow:

  1. The Publisher registers with the server via a Flight descriptor, receiving Kafka producer configuration (broker addresses, authentication).
  2. Each published SeriesBlock is converted to Arrow record batches, serialized to IPC format, and sent to the appropriate Kafka topic.
  3. Channels are grouped by partition: each partition corresponds to a Kafka topic (arrakis-{partition_id}). Channels within a partition share the same data type and are identified by a compact integer index rather than the full channel name string.
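The grouping in step 3 can be sketched as follows: channels are bucketed by partition, each bucket maps to an arrakis-{partition_id} topic, and each channel receives a compact integer index within its bucket. The input mapping here is illustrative:

```python
from collections import defaultdict

def group_by_partition(channels):
    """Group channels by partition id and assign each a compact
    integer index within that partition. `channels` maps channel
    name -> partition id (an illustrative input shape)."""
    partitions = defaultdict(dict)
    for name, partition_id in channels.items():
        topic = f"arrakis-{partition_id}"
        # The index, not the name, identifies the channel on the wire.
        partitions[topic][name] = len(partitions[topic])
    return dict(partitions)
```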

Data Serialization

Arrakis uses two record batch layouts:

Column-based (used for publishing via Flight)
One time column (int64 nanoseconds) followed by one column per channel, each containing a list array of samples.
Row-based (used for Kafka)
Three columns: time (int64), id (uint32 channel index), and data (list of samples). Multiple channels share the same batch, identified by index rather than name.

The row-based format is more compact for Kafka because it avoids repeating channel name strings in every message.
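The two layouts can be contrasted with plain dictionaries standing in for Arrow record batches (the real batches use Arrow arrays with the types given above; these helpers are illustrative only):

```python
def column_based(time, blocks):
    """One time column plus one list-valued column per channel,
    keyed by channel name."""
    batch = {"time": time}
    batch.update(blocks)
    return batch

def row_based(time, blocks, channel_ids):
    """Three columns: time, id, data. Channels are identified by a
    compact integer index instead of repeating the name per message."""
    return {
        "time": [time] * len(blocks),
        "id": [channel_ids[name] for name in blocks],
        "data": list(blocks.values()),
    }
```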