Batch Data Replication in the Analytical Data Warehouse Landscape
Filling a data warehouse or data lake is generally the first major step toward making the analytical environment available for core functionality and end-user work. The effectiveness of this implementation determines both the cost and duration of the entire data warehouse project, as well as the timelines for delivering individual data services.
In this publication we share experience implementing batch loading of large data volumes into analytical warehouses and explain when batch loading is the right choice and when the online approach is more appropriate. We also describe how years of experience solving such tasks was embodied in the industrial data replication tool Alphyn Data Replicator.
What to Choose: Batch or Real-Time?
Over years of designing data integration and loading solutions, we have developed the following principle: if batch data loading is feasible -- choose it and don't chase minimum latency without real need for it. Real-time replication should only be used when it is justified from a cost and service level perspective.
User requirements should always be viewed with healthy skepticism. If asked directly, you will almost always hear: "Of course we want all data to load into our warehouse in real time." In practice this often turns out to be an unjustified "wish list" item. There are no rational reasons, regulatory processes, or reports that actually require the most current data -- and none are foreseen in the near future. And even if such processes exist somewhere, they typically involve a small set of objects, and the real-time requirement can be scoped to those objects only.
When to Consider Batch Integration
- Maintaining lag from the source of less than 15 minutes is not required;
- The source system (SS) can provide data in batch mode without impacting its performance -- extraction can occur from the primary database at any time, during a dedicated maintenance window, or a standby server with read access is available;
- The speed of data extraction from the source system in each iteration does not exceed the permissible lag;
- Loading is required not only from DBMS sources but also from integration file systems (flat files) or via FTP protocol;
- Bulk loading configuration is needed (entire databases/schemas without manual per-object setup), for example for initial loads or periodic full schema/database copies;
- Basic data quality checks (duplicates, null values in keys) are required as part of the loading process;
- A mode is required that other tools don't support, or a heterogeneous replication scenario is being considered -- for example, S3<->HDFS, Greenplum<->HDFS, HDFS<->HDFS, etc.
Contraindications for Batch Replication -- When to Use Real-Time and CDC
- Source lag of less than 15 minutes must be maintained;
- The SS cannot deliver data in batch mode with the required SLA (performance limitations, critical tasks affected by batch extraction, no read replica, etc.);
- The DBMS lacks mechanisms for efficient batch data export (including to files);
- Incremental extraction of an object is needed but logical increment cannot be identified;
- There are non-logged physical row deletions;
- The SS is supported by Change Data Capture (CDC) tools and it is possible to work directly with change logs. CDC operation does not create additional load on the source DBMS;
- Data transformations in the transfer stream from source to sink are required -- for example, online enrichment, lookups with conditions;
- Data loading into sinks that do not support batch import is required.
Important: Keep in mind that the sink may have its own constraints. A sink may not tolerate online changes or even simple online inserts, in which case both approaches must be combined with different tools. Even when choosing the online approach, an initial batch load is often necessary, as is a periodic corrective batch replication for reconciliation. This is the so-called lambda architecture approach, where data is loaded in real time but is periodically rewritten for a specific interval in batch mode. Batch rewriting is also commonly used for data quality correction.
Alphyn Data Replicator
A Brief History
Ten years ago we first encountered the task of massive, reliable, heterogeneous replication from multiple DBMS systems into one target without using specialized commercial ETL tools. The client's requirements were: structure synchronization with type transformation without quality loss; complete schema migration; and incremental operation mode.
The target system was Vertica. A pilot prototype was successful, but the project didn't materialize as the client abandoned Vertica in the target architecture. The idea sat in a drawer for a couple of years, until Hadoop-based data lake projects began to appear. That's when everything designed came to fruition and ideas were validated in serious practice. Some solutions from projects of that era still carry terabytes of data every day.
In the time since, neither open-source projects nor commercial products emerged that adequately addressed this task. Alphyn Data Replicator is not a repackaging or adaptation of any open-source project. It is a tool designed and built from scratch, meeting the requirements described above and based on solutions proven by years of project work.
Build Your Own or Choose a Ready Solution?
A difficult choice facing system owners and potential clients: should you invest in developing your own data warehouse loading solution, or select a ready-made tool? Assuming the ready tool meets all technical and functional requirements.
First and foremost, consider development timelines -- even a simple loading implementation without incremental mode, without metadata change tracking, and without data quality checks will require a qualified team and considerable time. Alphyn Data Replicator as a ready tool allows you to begin data replication to the warehouse immediately after installation. As soon as the DWH infrastructure is in place, you can immediately provide data for analysis to users and services. Your own best solution -- one that accounts for all the details of your unique source systems -- can be developed later.
Alphyn Data Replicator Functional Capabilities
When working in a Big Data environment, the ELT (Extract, Load, Transform) paradigm is typically followed -- processing data in the environment where it is stored. In practice this means the replication tool implementing E and L should perform extraction from the source and loading to the sink. This is the core functional baseline.
Alphyn Data Replicator supports the following source data extraction modes:
- Full extraction mode -- Snapshot;
- Logical increment selection:
- Increment extraction from the source by any scalar deterministic function;
- Time window extraction with lower and upper bounds;
- Partition range capture from a partitioned source, including automatic range or specific partition list determination by the tool;
- Data extraction based on "manual" selection conditions.
For each extraction iteration, the tool builds extract queries based on the last successful session with the object. Data Replicator also supports Pre-SQL and Post-SQL processing, which is useful when data for selection needs to be prepared before extraction and when "hygiene" cleanup is needed afterward.
Data extraction can run in multi-threaded mode, where multiple data exchange sessions are created simultaneously on the source system, each reading its own range. Range definition and the number of sessions are determined automatically. Data Replicator first samples the source to estimate the total volume, calculate the number of sessions based on the resources allocated to the engine performing the data selection, and it ensures the total number of sessions does not exceed a configured limit to prevent DDoS-ing the source system. If the source system is distributed, the tool can balance requests across nodes in round-robin fashion. For each source, data exchange availability windows can also be defined by schedule.
In a normal landscape, dozens or even hundreds of objects are loaded simultaneously. Task formation therefore accounts for priorities, since source system resources for extraction may be limited while the number of objects and volume of extracted data can greatly exceed them. Data Replicator has an internal scheduler that forms task queues according to priorities configured by the administrator between sources and systems.
Data Application in the Target Warehouse
Alphyn Data Replicator was created not merely as a tool to bring data to the warehouse, but as a means to rapidly create a primary storage layer or Operational Data Store (ODS). Data delivery to the target layer is performed according to the selected scenario.
Available scenarios:
- Simple insert (append only);
- SCD Type 1 (SCD1) mode -- new rows are inserted, changed rows are updated;
- SCD1 with physical deletion handling at the source within the captured data range. Data Replicator compares source and sink and searches for deleted rows. In the sink, depending on configuration, either deletes those rows or marks them with a logical deletion flag;
- SCD4 history preservation mode -- for each source table, two tables are created in the sink:
- SCD1 table reflecting the current state of the source;
- HIST-satellite storing the full change history.
SCD4 mode is especially useful when the source system does not preserve change history or periodically deletes it on its side. History is needed for retrospective data analysis or for history reconstruction in the DWH in analytical layers and data marts.
The extraction and application process can be synchronous or asynchronous -- for cases where multiple extraction sessions should be "collapsed" into a single application session.
Solution Architecture
Alphyn Data Replicator is essentially a framework-type tool. Its operating principle is to generate executable code in all systems participating in data exchange, based on internal metadata created by built-in methods. The generated code is sent for execution to other frameworks, DBMS systems, and processing engines. The solution's role is to orchestrate all code generation processes and monitor task execution. In this architectural approach, the tool itself requires very few system resources -- others do the heavy lifting for it.
Example 1: "Loading data from a relational DBMS into Greenplum"
- Data extraction from the source can be done via the PXF framework or via the extraction engine;
- The extraction engine extracts data and delivers it to Greenplum, where data is applied to the target table according to the selected scenario. The tool generates executable code for both extraction and application on Greenplum's SQL dialect with appropriate parameters and manages its execution;
- After data application, vacuum is triggered if needed and statistics are collected.
The extraction engine can run in standalone mode, in YARN Hadoop, or in a Kubernetes environment -- for example when populating the Alphyn Lakehouse platform.
Example 2: "Loading data from Teradata into a Lakehouse (S3 object storage)"
- Data extraction occurs through Teradata Parallel Transporter (TPT). Data Replicator generates TPT jobs, manages their launch and operation, and monitors writing to S3 Storage;
- After writing to object storage, Data Replicator applies data according to the selected update scenario on one of the Lakehouse-supported SQL engines. The tool generates executable code with the required parameters and manages its execution;
- After application, statistics are collected on target objects.
When writing to HDFS or S3 targets, the Apache Iceberg table format is supported in addition to traditional file formats.
Alphyn Data Replicator itself is a Kubernetes containerized application, placing it in the cloud-ready class. PostgreSQL is used as the internal metadata store.
Supported Sources
The following connectors are currently shipped with the tool:
- Oracle;
- MS SQL Server;
- Postgres;
- SAP IQ;
- SAP ASE;
- MySQL;
- SAP HANA;
- Teradata;
- Greenplum;
- MariaDB;
- sFTP.
In most cases, adding a new connector for a relational or MPP DBMS does not require significant effort -- it is sufficient to write queries to the source data dictionary per the tool's requirements, add data type conversion, and select the appropriate driver. If data extraction from a new source cannot be implemented through standard interface reads, a code generation module for a third-party framework must be added.
Cross-Cluster Replication
Cross-cluster data replication enables not only fast data movement between different role clusters of the system, but also organizing a fault-tolerant leg in the infrastructure -- especially when the DR cluster is located in a separate data center. Meanwhile the sink always retains the ability to work with data on the target receiving ongoing change commits.
Between object stores or Hadoop systems, file exchange occurs with metadata catalog invalidation. Incremental change updates are also available for this exchange -- and if the source cluster has many small files, they are repacked to target size during transfer to the backup leg. When using the Iceberg table format, the Iceberg snapshot on the sink side is invalidated.
For Hadoop systems not using Iceberg format but requiring ACID compliance, Data Replicator has a built-in change isolation service based on HDFS snapshot switching. This guarantees the continuity of user transactions and isolation of changes on the receiving side.
Available cross-cluster replication targets and directions:
- Greenplum <-> S3;
- Greenplum <-> HDFS;
- HDFS <-> S3;
- S3 <-> S3;
- HDFS <-> HDFS;
- Greenplum <-> Greenplum.
Cross-cluster replication can be organized in two modes. The first guarantees data delivery from the source system simultaneously to two or more independent targets. Data is extracted from the SS, written to both clusters of the target system, and upon completion applied to both clusters according to the selected scenario. This approach guarantees a fixed consistent state of primary source data in case of failover from the production cluster to the standby. The source system is queried exactly once.
The second mode involves copying derived data between clusters. Copying can be scheduled or triggered via API. In practice this works as follows: an ETL tool updates data in a calculated data mart and calls the API to copy changes to the target object on the backup target.
Using both cross-cluster replication modes of Alphyn Data Replicator essentially implements the "Double ETL" fault-tolerant principle out of the box:
- Source system is queried once;
- Data is written simultaneously to both legs;
- Derived data (analytical layers, data marts) are calculated in the production system;
- Derived data changes are synchronized to the backup leg;
- The backup leg is always in sync with the primary production system for both primary and derived data;
- Lag depends only on the constraints of the data transfer channel between data centers;
- When roles are swapped between PROD and DR, Data Replicator can also reverse the replication direction.
Data Quality Controls
All internal processes for creating derived data within the warehouse, or user-built processes, must operate on data they can trust from a quality perspective. Built-in data quality checks at the loading stage are the best way to earn that trust. Alphyn Data Replicator implements the following checks:
- Quantitative reconciliation between source and sink. Especially relevant when using logical increment mode but the source can receive rows retroactively. In this case Data Replicator can automatically re-capture the data range from the source accounting for late-arriving rows;
- Primary key uniqueness. Not all target sink systems have built-in uniqueness constraints and checks, and a primary key cannot always be defined on the sink side. Sometimes the source system has no unique key at all -- in this case a deduplication mode is available on the sink side;
- Mandatory field population check. Checks for NULL values in fields designated as mandatory;
- Source schema mutation check. Checks for DDL convergence between source and sink. Depending on the type of change and tool configuration, three scenarios are possible:
- Changes are inherited from source to sink;
- Changes are ignored with notification in logs;
- Loading processes are stopped when changes require deliberate administrator intervention.
Integration Capabilities
In most cases, populating the primary storage layer (Raw Data or ODS) is only the first step in the processing and transformation pipeline when deploying a data warehouse, followed by numerous dependency-linked processes. Typically the DWH landscape has an orchestrator managing the order of all loading and processing steps from source extraction to materialization in the target storage layer or derived data export to a consumer system. For integration with external orchestrators and schedulers, two ready documented interface types are available: the procedural Data Replicator API and a REST API.
Graphical Interface
Alphyn Data Replicator has an intuitive user graphical interface. The core idea behind the GUI implementation was to make the data loading creation and management function accessible to business users first, and then to help administrators create, configure, schedule, run, and maintain data replication processes. The GUI reduces qualification requirements for personnel working with the system.
All GUI interaction with the tool's backend occurs through the REST API, so the choice of interaction method always remains with end users: manage via REST services, work with the GUI, use the built-in native API, use the command line, or use all methods simultaneously depending on the situation.
Development Roadmap
Alphyn Data Replicator has a development roadmap formed and adjusted according to user requests and market trends. Currently we are implementing an export module to deliver derived data from the warehouse to consumer systems. Data Replicator continues to expand the list of supported source systems. Kafka is being added in the near term.
We are expanding the list of target Big Data engines responsible for applying data according to the selected scenario. During 2025 we plan to add additional SQL engines within the Alphyn Lakehouse platform and adapt efficient data application scenarios for them. New source capture modes with greater variability for reducing load during data extraction are planned. We also plan to add the ability to install extensions on the source system side to standardize tool communication with sources through a standard tool API.
Right now we are improving Teradata integration capabilities using the Native Object Store (NOS) functionality. The product team is already testing multi-apply replication, where data is extracted from the source system once but can be applied to multiple heterogeneous target systems -- for example, extract from Oracle and deliver to three different systems simultaneously: Hadoop, Greenplum, and S3 Lakehouse.