Concepts
This section provides an overview of the concepts that are important to understand when working with Kaspr.
Application (App)
An app (KasprApp) is a program that runs all the components of a distributed stream processing application. It is composed of agents (stream processors), tasks, channels, tables, and web views to perform useful work. We can have multiple instances of an app to distribute a processing and scale in a horizontal manner.
An app can be seen as a service in a microservice architecture. It’s common to have many different apps, each responsible for a set of functions, that are part of a larger or complex system. …
Stream
A stream is an infinite sequence of events coming from a Kafka topic or channel. In Kaspr, a stream is implicitly created through an (KasprAgent), which manages the stream’s lifecycle, determines what enters the stream, and defines how its events are processed.
An event serves as a general wrapper for messages. Each event contains references to both the serialized and deserialized versions of the key and value messages, along with additional metadata, such as the message offset.
Agent
An agent (KasprAgent) is a distributed system that processes events in a stream. Each agent runs within an app, and an app can host multiple agents. An agent consumes data from an input source, such as a Kafka topic or a channel, and performs one or more processing operations on either individual events or batches of events.
Streams can be divided among agents either in a round-robin fashion or by partitioning them based on the message key. This determines how the stream is distributed across available agent instances within all app instances. For instance, if the stream’s messages are keyed by account ID and the value is a high score, the partitioning ensures that all messages with the same account ID are consistently processed by the same agent instance.
Agents are at the core of stream processing in Kaspr, capable of performing a variety of operations, including transformations and aggregations, right out of the box. Additionally, agents can define custom processing logic using arbitrary Python code, providing flexibility for more complex operations.
Table
A (KasprTable) provides durable, fault-tolerant memory for stream processing. While similar to a database, a Table differs in key ways: instead of residing on a remote host and offering a rich query interface, a Table is a simple key-value store embedded directly within an application. This local embedding allows for ultra-fast reads and writes.
Each Table is backed by a Kafka topic, often compacted and referred to as a changelog topic. Every record written to a Table is also published to its changelog topic. This design enables the system to rebuild the entire state of the Table in case of a failure, ensuring data consistency and fault tolerance.
Internally, a Table leverages an embedded RocksDB database. The data capacity of RocksDB is limited by the disk size of the machine, not its memory, making it suitable for managing large datasets.
Tables play a critical role in enabling applications to store state in a fault-tolerant manner, allowing stream processors to perform stateful computations such as aggregations and data enrichments.
There are two types of Tables: normal and global.
- Normal: A normal Table is distributed across instances of an application, as it is partitioned based on the partitions of the underlying changelog topic. In a multi-instance setup, each application instance holds a subset of the Table’s keys. However, in a single-instance setup, a normal Table behaves similarly to a global Table.
- Global: A global Table, by contrast, provides each application instance with a complete copy of the data. This distinction becomes important when scaling an application to run across multiple instances. Unlike normal Tables, which divide the dataset among instances, global Tables replicate the entire dataset to each instance. This flexibility allows developers to choose the appropriate Table type based on their application’s requirements for scalability and data locality.
Task
A task (KasprTask) represents arbitrary work that is performed asynchronously in the background, independent of agents. Tasks can be defined to run as one-time operations, on fixed time intervals, or on a recurring schedule using loops or cron expressions.
Tasks operate within an app, and an app can run multiple tasks simultaneously.
Some examples of how tasks can be used include:
- Polling external APIs and publishing data to a topic or channel
- Reading from a (KasprTable) and performing an action, such as making a POST request to an HTTP endpoint
- Triggering scheduled business processes
- Generating synthetic data
- Tasks provide a flexible way to perform background operations without blocking other processes within the app.
Tasks provide a powerful way to handle asynchronos operations in a kaspr application, enabling a wide variety of background processing needs.
Web View
TODO
Channel
A channel (KasprChannel) is an in-memory buffer or queue used for sending and receiving messages within a local application (process) only. In Kaspr, channels function similarly to Kafka topics, enabling communication between agents within the same app. However, unlike Kafka topics, messages in channels do not persist across application restarts, meaning they are temporary and are lost when the app is restarted.