The AWS EMR Migration Strategy
- The infrastructure demands of a data-driven company are a moving target that varies with limited predictability and control.
- At the same time, the company was overwhelmed by the effort and data engineering resources required to self-manage a large-scale distributed database platform.
As a solution, Mactores devised a strategic plan to:
- Migrate existing Apache HBase systems running on the Cloudera platform to a fully-managed AWS EMR platform integrated with the AWS S3 storage solutions
- Migrate all tables and associated processes from the Cloudera cluster to Amazon EMR
- Reconfigure protocol buffers and the Apache HBase client in the existing MapReduce jobs to support Apache HBase on Amazon EMR
- Migrate all MapReduce jobs to a separate transient Amazon EMR cluster and initiate the cluster from a Jenkins workflow
- Migrate HBase 1.4 to HBase 2.2.6 on Amazon EMR 6.2.0
Journey to a Fully Managed AWS EMR HBase Platform
Assess
The Assessment phase involves building a solid business case and an action plan based on the prior assessment and evaluation. This phase combines the people, process, and technology required to adopt and execute distributed database capabilities within a secure, automated, and efficient AWS environment. This phase's overarching goals include:
- Mobilizing teams and preparing for the Amazon EMR migration. The transition was designed to be user-centric, reducing the learning curve and transforming teams to maximize the value of the cloud environment.
- Defining and automating policies: security, operations, and compliance.
- Running cloud-based Hadoop platform in production capacity to improve performance across several benchmarks and metrics.
- AWS EMR Architecture with S3 Storage: Configured 3 x r6g.12xlarge Master nodes and 9 x r6g.12xlarge core nodes running Region Servers, with additional nodes added through autoscaling.
- Configurations: Pre-validated EMR-specific configurations applied to Apache HBase, Amazon S3 Region Server, WAL, BlockCache, and Memstore.
- Data Migration: Cloudera to S3 HBase migration with a single snapshot one week before the cut-off date, followed by incremental periodic updates until final migration.
- Stress Testing: The YCSB framework was used to gather performance benchmarks throughout the migration process.
Figure 1: Final deployment architecture for the migration from self-managed HBase on a Hadoop cluster to fully managed Amazon EMR HBase.
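The architecture and configuration items above can be sketched as an EMR cluster request. This is a minimal illustration, not the configuration Mactores actually used: the `hbase` and `hbase-site` classifications are the documented way to put the HBase root directory on S3, while the bucket path and cluster name are hypothetical, and IAM roles, networking, and autoscaling rules are left out.

```python
import json

# Hypothetical S3 location; the source does not name the real bucket.
HBASE_ROOT_DIR = "s3://example-hbase-bucket/hbase-root"


def hbase_on_s3_configurations(root_dir):
    """EMR configuration classifications that place the HBase root directory on S3."""
    return [
        {"Classification": "hbase", "Properties": {"hbase.emr.storageMode": "s3"}},
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
    ]


def instance_groups():
    """Node layout taken from the text: 3 master and 9 core nodes, all r6g.12xlarge."""
    return [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": "r6g.12xlarge", "InstanceCount": 3},
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": "r6g.12xlarge", "InstanceCount": 9},
    ]


def cluster_request(root_dir=HBASE_ROOT_DIR):
    """Build a dict shaped like the arguments of an EMR RunJobFlow request."""
    return {
        "Name": "hbase-on-s3",  # hypothetical cluster name
        "ReleaseLabel": "emr-6.2.0",
        "Applications": [{"Name": "HBase"}],
        "Configurations": hbase_on_s3_configurations(root_dir),
        "Instances": {
            "InstanceGroups": instance_groups(),
            # A long-running HBase cluster stays up when no steps are queued.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


if __name__ == "__main__":
    print(json.dumps(cluster_request(), indent=2))
```

The dictionary is only built and printed here; submitting it would additionally require service and instance roles and a subnet, which depend on the target account.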
Migrate
The Migrate phase extends the action tasks from the Assess phase and applies them to migrate data workloads at scale. This ongoing and iterative process covers the process, technology, and design best practices for Amazon EMR migration and prepares the organization for a fully managed AWS data platform offering. The following tasks were involved in the migration process:
- AWS EMR Architecture for MapReduce Jobs: 1 x m5.2xlarge Master node, 5 x m5.4xlarge core nodes of Region Servers, and autoscaling of core nodes for MapReduce Jobs.
- EMR Configurations: Amazon EMR-specific HDFS and MapReduce configurations based on the pre-validated Apache Hadoop platform previously running on the Cloudera cluster.
- Workload Migration: From MapReduce to Amazon EMR v5.30.0 by recompiling the MapReduce code.
- Performance Testing: Writing tests and executing MapReduce Jobs on Amazon EMR cluster with Amazon EMR HBase for benchmark comparisons.
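The transient MapReduce cluster described in the tasks above could be requested along the following lines. This is a sketch under stated assumptions: the dict follows the shape of an EMR RunJobFlow request (for example, what a Jenkins job might pass to boto3's `run_job_flow`), the JAR location, main class, and cluster name are hypothetical, and roles, networking, and the core-node autoscaling policy are omitted.

```python
def mapreduce_step(name, jar, main_class, args):
    """Describe one MapReduce job as an EMR step."""
    return {
        "Name": name,
        # Shut the cluster down if the job fails, since it exists only for this run.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": jar, "MainClass": main_class, "Args": list(args)},
    }


def transient_cluster_request(steps):
    """Build a request for a cluster that terminates once its steps have finished."""
    return {
        "Name": "mapreduce-transient",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.2xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.4xlarge", "InstanceCount": 5},
            ],
            # False makes the cluster transient: it ends after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": list(steps),
    }


if __name__ == "__main__":
    step = mapreduce_step(
        name="example-job",
        jar="s3://example-bucket/jobs/example-job.jar",  # hypothetical path
        main_class="com.example.ExampleJob",             # hypothetical class
        args=["--date", "2021-01-01"],
    )
    request = transient_cluster_request([step])
    print(request["Name"], len(request["Steps"]))
```

Keeping the request as plain data makes it easy to review in version control next to the Jenkins job definition before anything is launched.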
The HBase RegionServer core nodes installed on the EC2 cluster offer improved read performance by caching data and using efficient in-memory filters. The HBase Write Ahead Log (WAL) offers durable write performance for data stored in the on-cluster HDFS. Similarly, the Task Nodes write the HBase WAL requests to HDFS running on the Core Nodes, while the Amazon EMR File System implementation stores persistent data on the S3 storage platform with strong read-after-write consistency.
The following options for migrating table data were evaluated:
- Snapshot: Create Snapshot from source → Export Snapshot to S3 bucket of the EMR cluster → Restore table from S3 bucket
- Evaluation: This option was the easiest to use and required a low overall runtime, but the performance and table availability improvements were minimal.
- Table Exports: Create Snapshot from source → Export snapshot to S3 bucket → Import snapshot from the S3 bucket → Clone Snapshot
- Evaluation: Easy to use, but limited performance and availability improvements at the cost of the high overall runtime.
- Copy Table: Use CopyTable to transfer individual tables from the source to the target cluster.
- Evaluation: A complex task that requires the highest overall runtime but offers significant performance and table availability improvements.
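The snapshot and CopyTable options above map onto standard HBase tooling: the shell commands `snapshot` and `clone_snapshot`, and the `ExportSnapshot` and `CopyTable` MapReduce tools. The sketch below only assembles the command lines and does not run them. Table names, the S3 path, the ZooKeeper address, and the mapper count are hypothetical, and the use of a CopyTable time window for incremental updates is an assumption, since the source does not say how the incremental updates were performed.

```python
def snapshot_shell_cmd(table, snapshot):
    """HBase shell statement that snapshots a table on the source cluster."""
    return f"snapshot '{table}', '{snapshot}'"


def export_snapshot_cmd(snapshot, dest_root, mappers=16):
    """argv for ExportSnapshot, copying a snapshot to the target HBase root dir."""
    return [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", dest_root,
        "-mappers", str(mappers),
    ]


def clone_snapshot_shell_cmd(snapshot, table):
    """HBase shell statement that materializes a table from a snapshot on the target."""
    return f"clone_snapshot '{snapshot}', '{table}'"


def copy_table_cmd(table, peer_zk, start_ms=None, end_ms=None):
    """argv for CopyTable; the optional time window limits the copy to recent cells."""
    cmd = ["hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable"]
    if start_ms is not None:
        cmd.append(f"--starttime={start_ms}")
    if end_ms is not None:
        cmd.append(f"--endtime={end_ms}")
    # peer.adr has the form <zk quorum>:<zk client port>:<znode parent>
    cmd += [f"--peer.adr={peer_zk}", table]
    return cmd


if __name__ == "__main__":
    print(snapshot_shell_cmd("example_table", "example_table_snap"))
    print(" ".join(export_snapshot_cmd("example_table_snap",
                                       "s3://example-hbase-bucket/hbase-root")))
    print(clone_snapshot_shell_cmd("example_table_snap", "example_table"))
    print(" ".join(copy_table_cmd("example_table", "zk.example.internal:2181:/hbase",
                                  start_ms=1609459200000)))
```

Building argument lists rather than shell strings keeps quoting issues out of the picture if the commands are later handed to a process runner.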
Modernize
The modernization phase largely covers the implementation plan from the Assessment phase and action tasks from the Migrate phase. The modernization phase aims to achieve the design goals of high scalability; an efficient and highly compatible open systems architecture; and a managed-service offering with maximal visibility, transparency, and control over the underlying infrastructure. These goals were achieved by implementing the following modernization processes:
- Migration: Each table is migrated from Cloudera HBase 1.4 to Amazon EMR running HBase 2.2.6.
- Validation: Data consistency and data integration are validated and guaranteed across all MapReduce Jobs and other applications.
- Support: Ensure performance improvements, compatibility, and integration of data workloads and apps interacting with the HBase Client and protobuf for the HBase 2.2.6 implementation on Amazon EMR clusters.
- Strategy: Devising a strategic plan to perform the cutover with the Flipboard team, streamlining the transition process, and ensuring that the company can maximize Amazon EMR performance improvements immediately following the cutover.
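The source does not describe how data consistency was validated. As one possible illustration of the Validation item above, the sketch below compares a source and a target table by cell count and an order-independent digest. The `(row key, column, value)` tuples are assumed to come from full scans of each cluster, and the sample data is made up.

```python
import hashlib


def table_fingerprint(cells):
    """Return (cell count, hex digest) for an iterable of (row, column, value) bytes.

    Per-cell SHA-256 hashes are combined with XOR, so the digest does not
    depend on scan order. The count is returned alongside it as a second check.
    """
    count = 0
    combined = 0
    for row, column, value in cells:
        h = hashlib.sha256()
        for part in (row, column, value):
            # Length-prefix each field so (b"ab", b"c") differs from (b"a", b"bc").
            h.update(len(part).to_bytes(4, "big"))
            h.update(part)
        combined ^= int.from_bytes(h.digest(), "big")
        count += 1
    return count, f"{combined:064x}"


def tables_match(source_cells, target_cells):
    """True when both scans yield the same cell count and digest."""
    return table_fingerprint(source_cells) == table_fingerprint(target_cells)


if __name__ == "__main__":
    source = [(b"row1", b"cf:a", b"1"), (b"row2", b"cf:a", b"2")]
    target = list(reversed(source))  # same cells, different scan order
    print(tables_match(source, target))
```

A digest like this flags that two tables differ but not where; a row-level comparison would still be needed to locate a mismatch.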
Performance Comparison and Results
- Cluster 1: EMR with HDFS storage mode (3 m5.2xlarge Master nodes & 15 i3.8xlarge core nodes)
- Cluster 2: EMR with S3 storage mode (3 r5a.8xlarge Master nodes & 15 r5a.8xlarge core nodes)
- Cluster 3: EMR with S3 storage mode (3 r6g.8xlarge Master nodes & 15 r6g.8xlarge core nodes)
- Cluster 4: EMR with S3 storage mode (3 r6g.12xlarge Master nodes & 9 r6g.12xlarge core nodes)
The first cluster was chosen to simulate the configurations from the Cloudera HBase cluster. Mactores enabled Flipboard to realize the following improvements by migrating from Cloudera HBase clusters to a fully managed Amazon EMR HBase data platform:
| Cluster | Throughput (Ops/sec) | Average Read Latency | Average Write Latency |
|---|---|---|---|
| Cluster 1 | 73117 | 6128 | 34769 |
| Cluster 2 | 115021 | 3449 | 22497 |
| Cluster 3 | 219114 | 2047 | 11516 |
| Cluster 4 | 219045 | 2336 | 11282 |
| Cluster | Read Requests | Write Requests |
|---|---|---|
| EC2 Based Cloudera Cluster | 100,000 | 55,000 |
| Cluster 4 on AWS EMR | 300,000 | 150,000 |
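To make the benchmark figures above easier to compare, the following sketch expresses each cluster relative to Cluster 1 and the EMR request counts relative to the Cloudera cluster. The numbers are copied from the two tables; the function and field names are illustrative only, and the latency unit is left unspecified because the source does not state it.

```python
# (throughput ops/sec, average read latency, average write latency) per cluster
YCSB_RESULTS = {
    "Cluster 1": (73117, 6128, 34769),
    "Cluster 2": (115021, 3449, 22497),
    "Cluster 3": (219114, 2047, 11516),
    "Cluster 4": (219045, 2336, 11282),
}

# (read requests, write requests)
REQUESTS = {
    "EC2 Based Cloudera Cluster": (100_000, 55_000),
    "Cluster 4 on AWS EMR": (300_000, 150_000),
}


def relative_to_baseline(results, baseline="Cluster 1"):
    """Throughput as a multiple of the baseline, latencies as percent change."""
    base_tp, base_read, base_write = results[baseline]
    summary = {}
    for name, (tp, read, write) in results.items():
        summary[name] = {
            "throughput_x": tp / base_tp,
            "read_latency_change_pct": (read - base_read) / base_read * 100,
            "write_latency_change_pct": (write - base_write) / base_write * 100,
        }
    return summary


def request_ratios(requests, old, new):
    """Read and write request counts of `new` as multiples of `old`."""
    old_read, old_write = requests[old]
    new_read, new_write = requests[new]
    return new_read / old_read, new_write / old_write


if __name__ == "__main__":
    for name, row in relative_to_baseline(YCSB_RESULTS).items():
        print(name, {key: round(value, 2) for key, value in row.items()})
    print(request_ratios(REQUESTS, "EC2 Based Cloudera Cluster", "Cluster 4 on AWS EMR"))
```

Negative latency percentages indicate a reduction relative to Cluster 1.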
Conclusion
Flipboard was therefore able to achieve a 30% performance improvement on the Amazon EMR platform while also realizing the benefits of a fully managed cloud service. The company is no longer overwhelmed by the complex and resource-intensive tasks of self-managing a big data platform as it scales its business across a growing user base. Mactores is already pursuing further modernization steps, planning to migrate tables to DynamoDB and reevaluating new business use cases with the Amazon EMR platform.