Architecture¶
This page explains how the Arrakis system works and how the Python client interacts with it. Understanding the architecture is not required to use the library, but it helps when debugging connection issues or reasoning about streaming behavior.
Overview¶
Arrakis is a distributed timeseries data system built on two core technologies:
- Apache Arrow Flight for request/response communication (metadata queries, historical data retrieval, publisher registration).
- Apache Kafka for real-time data streaming and publishing.
The Python client abstracts both behind a unified API.
Request Flow¶
When you call a function like arrakis.fetch() or arrakis.find(), the client:
- Opens a gRPC connection to the Arrakis Flight server.
- Sends a flight descriptor encoding the request type and parameters as JSON.
- Receives a FlightInfo response listing one or more endpoints.
- Connects to each endpoint and reads Arrow record batches.
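The four steps above can be sketched with `pyarrow.flight`, the reference Arrow Flight client. The JSON payload shape and helper names here are illustrative assumptions, not the Arrakis wire format:

```python
import json


def make_command(request_type: str, args: dict) -> bytes:
    # JSON payload carried in the Flight descriptor; the field names
    # here are illustrative, not the actual Arrakis wire format.
    return json.dumps({"request": request_type, "args": args}).encode()


def fetch_tables(server: str, request_type: str, args: dict):
    import pyarrow.flight as flight

    client = flight.FlightClient(server)            # 1. open gRPC connection
    descriptor = flight.FlightDescriptor.for_command(
        make_command(request_type, args)            # 2. request type + params as JSON
    )
    info = client.get_flight_info(descriptor)       # 3. FlightInfo with endpoints
    for endpoint in info.endpoints:                 # 4. read batches per endpoint
        yield client.do_get(endpoint.ticket).read_all()
```

The generator yields one Arrow table per endpoint; a real client would merge or stream these incrementally.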
The supported request types are:
| Request | Description | API |
|---|---|---|
| Find | Search for channels by pattern and filters | find(), count() |
| Describe | Get metadata for named channels | describe() |
| Stream | Open a data stream (Flight or Kafka) | stream() |
| Publish | Get Kafka producer configuration | Publisher.__enter__() |
| Count | Count matching channels | count() |
The client also supports Flight actions for server-level queries that don't involve data transfer:
| Action | Description | API |
|---|---|---|
| server-info | Server version, backend type, capabilities | server_info() |
| scope-map | Available domains and endpoint routing | domains() |
Metadata queries (find, count, describe) accept an optional time
parameter that routes the request to the appropriate backend for that point in
time (e.g., a historical archive vs the live system).
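Because actions carry no Arrow data stream, they map to a single `do_action` call in `pyarrow.flight`. A minimal sketch (the raw result bodies are opaque bytes here; parsing is left to the caller):

```python
def query_action(server: str, action_name: str) -> list[bytes]:
    """Run a server-level Flight action and collect the raw result bodies."""
    import pyarrow.flight as flight

    client = flight.FlightClient(server)
    return [result.body.to_pybytes() for result in client.do_action(action_name)]


# Action names from the table above
KNOWN_ACTIONS = ("server-info", "scope-map")
```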
Streaming Architecture¶
Streaming is more complex than simple request/response. Depending on the server configuration, data may arrive via Arrow Flight endpoints or Kafka topics.
Flight streaming¶
For historical data, the server typically returns Flight endpoints. The client
reads record batches from each endpoint in parallel using
MultiFlightReader.
Kafka streaming¶
For live data, the server may return a Kafka endpoint. The client creates a
KafkaReader that subscribes to the relevant Kafka topics and deserializes
Arrow record batches from each message.
The Multiplexer¶
Regardless of transport, incoming blocks feed into the BlockMuxStream (arrakis.mux.BlockMuxStream). The multiplexer:
- Maintains a timed queue per stream, where each queue has a known stride (block duration) and timeout.
- Synchronizes blocks across streams so that all channels are aligned to the same timestamp before being yielded.
- Fills gaps when data does not arrive before the timeout. Gap blocks contain NumPy masked arrays with all values masked.
- Aligns queues by draining elements from queues that are ahead until all queues start at the same timestamp.
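The alignment step can be sketched as follows. Real queue entries carry full blocks with metadata; here each entry is a bare `(timestamp, data)` pair:

```python
from collections import deque


def align_queues(queues: dict[str, deque]) -> None:
    """Drop leading entries until every queue's head has the same timestamp.

    Each queue holds (timestamp, data) pairs in increasing time order.
    """
    target = max(q[0][0] for q in queues.values() if q)
    for q in queues.values():
        while q and q[0][0] < target:
            q.popleft()


# Stream "a" has two older blocks that stream "b" never produced
queues = {"a": deque([(0, "x"), (2, "y"), (4, "z")]), "b": deque([(4, "w")])}
align_queues(queues)
```

After alignment, both queues start at timestamp 4 and can be muxed together.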
The overall stride of the muxer is the least common multiple of the individual stream strides. The timeout is the maximum of the individual stream latencies.
For single-stream scenarios (e.g., a client reading pre-muxed data from a Flight endpoint), timeout-based gap filling is disabled to avoid racing incoming data.
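The stride and timeout rules above are simple aggregates; a minimal sketch using the standard library (function names are illustrative):

```python
from math import lcm


def mux_stride(strides: list[int]) -> int:
    # Overall stride: least common multiple of per-stream block durations
    return lcm(*strides)


def mux_timeout(latencies: list[float]) -> float:
    # Gap-fill timeout: bounded by the slowest stream
    return max(latencies)
```

For example, streams producing blocks every 1, 4, and 6 time units only align every 12 units, so the muxer yields on a 12-unit stride.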
Publishing Architecture¶
Publishing reverses the data flow:
- The `Publisher` registers with the server via a Flight descriptor, receiving Kafka producer configuration (broker addresses, authentication).
- Each published `SeriesBlock` is converted to Arrow record batches, serialized to IPC format, and sent to the appropriate Kafka topic.
- Channels are grouped by partition -- each partition corresponds to a Kafka topic (`arrakis-{partition_id}`). Channels within a partition share the same data type and are identified by a compact integer index rather than the full channel name string.
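The partition-to-topic mapping can be sketched as below. The channel-to-partition assignment normally comes from the server during publisher registration; here it is simply passed in:

```python
from collections import defaultdict


def topic_for(partition_id: str) -> str:
    # One Kafka topic per partition, named as described above
    return f"arrakis-{partition_id}"


def group_channels(channel_partitions: dict[str, str]) -> dict[str, list[str]]:
    """Map each Kafka topic to the channels published on it.

    channel_partitions: channel name -> partition id (illustrative input;
    normally assigned by the server at registration time).
    """
    topics: dict[str, list[str]] = defaultdict(list)
    for channel, partition_id in channel_partitions.items():
        topics[topic_for(partition_id)].append(channel)
    return dict(topics)
```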
Data Serialization¶
Arrakis uses two record batch layouts:
- Column-based (used for publishing via Flight): one `time` column (int64 nanoseconds) followed by one column per channel, each containing a list array of samples.
- Row-based (used for Kafka): three columns: `time` (int64), `id` (uint32 channel index), and `data` (list of samples). Multiple channels share the same batch, identified by index rather than name.
The row-based format is more compact for Kafka because it avoids repeating channel name strings in every message.