Architecture of Kafka Connect
Kafka Connect for HPE Ezmeral Data Fabric Streams has three major models in its design: the connector model, the worker model, and the data model.
Connector Model
A connector is defined by specifying a Connector class and configuration options to control what data is copied and how to format it.
- Each Connector instance is responsible for defining and updating a set of Tasks that actually copy the data.
- Kafka Connect manages the Tasks; the Connector is only responsible for generating the set of Tasks and indicating to the framework when they need to be updated.
- Source and Sink Connectors/Tasks are distinct in the API, which keeps the API for each as simple as possible; a minimal code sketch of this split follows the list below.
- Source - Source tasks ingest data from data storage systems and stream the data to HPE Ezmeral Data Fabric Streams.
- Sink - Sink tasks stream data from HPE Ezmeral Data Fabric Streams to other storage systems.
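The split between a Connector and its Tasks maps directly onto the Kafka Connect Java API. The following is a minimal sketch of a hypothetical source connector (the class names FileLineSourceConnector and FileLineSourceTask, the topic example-topic, and the offset fields are illustrative, not part of the product): the Connector only plans the work by producing per-task configurations, while each Task does the actual copying and attaches a source offset to every record.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/** Hypothetical connector: the Connector only plans work; Tasks copy the data. */
public class FileLineSourceConnector extends SourceConnector {
    private Map<String, String> config;

    @Override
    public void start(Map<String, String> props) {
        this.config = props;              // remember the user-supplied configuration
    }

    @Override
    public Class<? extends Task> taskClass() {
        return FileLineSourceTask.class;  // the Task implementation the framework will run
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Generate one configuration per Task; the framework decides where Tasks run.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            configs.add(config);
        }
        return configs;
    }

    @Override public void stop() { }
    @Override public ConfigDef config() { return new ConfigDef(); }
    @Override public String version() { return "0.1"; }
}

/** Each Task streams data and records a source offset for every message. */
class FileLineSourceTask extends SourceTask {
    private long lineNumber = 0;

    @Override public void start(Map<String, String> props) { }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Map<String, Object> sourcePartition = Collections.singletonMap("file", "example.txt");
        Map<String, Object> sourceOffset = Collections.singletonMap("line", lineNumber++);
        SourceRecord record = new SourceRecord(
                sourcePartition, sourceOffset,
                "example-topic",                 // destination stream topic
                Schema.STRING_SCHEMA, "a line of data");
        return Collections.singletonList(record);
    }

    @Override public void stop() { }
    @Override public String version() { return "0.1"; }
}
```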
- JDBC Source Connector
The Kafka JDBC source connector is used to stream data from relational databases into HPE Ezmeral Data Fabric Streams topics. The JDBC Source Connector for HPE Ezmeral Data Fabric Streams supports integration with Hive 2.1. An illustrative configuration sketch follows the connector descriptions below.
- JDBC Sink Connector
The Kafka JDBC sink connector is used to stream data from HPE Ezmeral Data Fabric Streams topics to any relational database that has a JDBC driver.
- HDFS Sink Connector
The Kafka HDFS sink connector is used to stream data from HPE Ezmeral Data Fabric Streams to the file system. By default, the resulting data is written to the file system in Avro format; Parquet files can also be written.
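As an illustration of how one of these connectors is configured, the sketch below assembles the kind of property map a JDBC source connector instance typically takes. This is a sketch only: the property names (connector.class, connection.url, mode, incrementing.column.name, topic.prefix), the connector class io.confluent.connect.jdbc.JdbcSourceConnector, and the /sample-stream:jdbc- topic prefix are assumptions borrowed from common JDBC source connector conventions and should be checked against the connector documentation for your release.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative configuration for a JDBC source connector instance (assumed property names). */
public class JdbcSourceConfigExample {
    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        // Connector class and parallelism.
        props.put("connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector");
        props.put("tasks.max", "2");
        // Where to read from and how to detect new rows (assumed property names).
        props.put("connection.url", "jdbc:mysql://db-host:3306/inventory?user=connect&password=secret");
        props.put("mode", "incrementing");
        props.put("incrementing.column.name", "id");
        // Each table is copied into a stream topic named <topic.prefix><table>.
        props.put("topic.prefix", "/sample-stream:jdbc-");

        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```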
Worker Model
A Kafka Connect for HPE Ezmeral Data Fabric Streams cluster consists of a set of Worker processes, which are containers that execute Connectors and Tasks. A Worker is a JVM process with a REST API that is able to execute streaming tasks; a sketch of creating a connector through this REST API follows the list below.
- Workers automatically coordinate with each other to distribute work and provide scalability and fault tolerance.
- The Workers distribute work among any available processes, but they are not responsible for managing those processes.
- Any process management strategy can be used for Workers: cluster management tools like YARN or Mesos, configuration management tools like Chef or Puppet, or direct management of process lifecycles.
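Because every Worker exposes the standard Kafka Connect REST interface, connectors are normally created and managed with HTTP calls against any Worker in the cluster; the Workers then distribute the resulting Tasks among themselves. The sketch below posts a connector definition to a Worker's REST endpoint. The host and port (localhost:8083), the connector name, and the configuration body are placeholders; POST /connectors is the standard Kafka Connect REST call for creating a connector.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Submit a connector definition to any Worker; the cluster distributes the Tasks. */
public class CreateConnectorExample {
    public static void main(String[] args) throws Exception {
        // Connector name plus its configuration, in the standard Connect REST format.
        // The connector class and properties here are placeholders.
        String body = """
                {
                  "name": "jdbc-source-example",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "tasks.max": "2",
                    "connection.url": "jdbc:mysql://db-host:3306/inventory?user=connect&password=secret",
                    "topic.prefix": "/sample-stream:jdbc-"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // any Worker's REST endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```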
Data Model
Connectors copy streams of messages from a partitioned input stream to a partitioned output stream, where at least one of the input or output is always Kafka.
- Each of these streams is an ordered set of messages, where each message has an associated offset.
- The format and semantics of these offsets are defined by the Connector to support integration with a wide variety of systems; however, achieving certain delivery semantics in the face of faults requires that offsets be unique within a stream and that streams can seek to arbitrary offsets.
- Message contents are represented by Connectors in a serialization-agnostic format.
- Pluggable Converters are available for storing this data in a variety of serialization formats.
- Schemas are built in, allowing important metadata about the format of messages to be propagated through complex data pipelines. However, schema-free data can also be used when a schema is simply unavailable. A sketch of the schema API and a pluggable Converter follows this list.
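To make the last two points concrete, the sketch below builds a schema-aware message with the Connect data API and hands it to a pluggable Converter to produce serialized bytes. The JsonConverter used here ships with Apache Kafka Connect; the schema name, field names, and topic are illustrative. Substituting a different Converter (for example, an Avro converter) changes the serialization format without any change to connector code.

```java
import java.util.Collections;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.json.JsonConverter;

/** A schema-aware record serialized through a pluggable Converter. */
public class ConverterExample {
    public static void main(String[] args) {
        // Schema metadata travels with the message through the pipeline.
        Schema valueSchema = SchemaBuilder.struct().name("inventory.item")
                .field("id", Schema.INT64_SCHEMA)
                .field("description", Schema.STRING_SCHEMA)
                .build();

        Struct value = new Struct(valueSchema)
                .put("id", 42L)
                .put("description", "widget");

        // The Converter turns schema + value into bytes; swapping the Converter
        // changes the wire format without touching the connector.
        JsonConverter converter = new JsonConverter();
        converter.configure(Collections.singletonMap("schemas.enable", "true"), false);
        byte[] serialized = converter.fromConnectData("example-topic", valueSchema, value);

        System.out.println(new String(serialized));
    }
}
```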