Mactores Blog

Migrate High Volume Cloudera HBase Data Platform | Mactores

Written by Bal Heroor | Aug 16, 2022 5:00:00 PM
The social media company Flipboard started with a mission to create a user-centric news curation platform that connects every user with the most relevant news stories from across the Internet. The idea was driven by the growing need for a unified data platform that identifies and aggregates the most engaging stories published across various online media channels.
 
From a user perspective, news aggregation is a dynamic and rapidly evolving problem - every user demands convenient access to the most interesting news stories based on ever-changing preferences. From a technology perspective, the solution involves a unified and managed serverless cloud platform capable of running large-scale distributed big data workloads, integrated with open-source Big Data frameworks, and ready to run advanced AI applications at scale.
 
Flipboard currently has over 100 million active users generating vast volumes of data. The company's existing Cloudera HBase platform could not scale to meet its big data processing demands. On average, the platform saw Read requests of 300k per second and Write requests of around 120k per second.
 
The Cloudera cluster, which ran on AWS EC2, hosted 5,600 HBase regions, and processed 40TB of data on AWS EBS, was entirely self-managed. The growing demands for big data processing, throughput performance, and reliable business analytics overwhelmed the team managing the Cloudera platform with limited technology capabilities.
 
The cloud migration and modernization process streamlined Flipboard's distributed database capabilities, allowing the social media platform to support user spikes at scale, maximize throughput performance and prepare to expand the user base exponentially.
 
Let's take an in-depth view of the cloud migration and data platform modernization process for Flipboard:
 

The AWS EMR Migration Strategy

Flipboard had originally adopted the Cloudera HBase platform to realize the goals of a high-speed distributed database, scalable Hadoop operations, and the flexibility to manage vast volumes of data. The company took advantage of the variety of AWS Infrastructure as a Service (IaaS) capabilities and developed a platform that best satisfied its technology demands for the foreseeable future.
 
In doing so, the company identified two key challenges:
 
  • The infrastructure demands of a data-driven company are a moving target that varies with limited predictability and control.
  • At the same time, the company was overwhelmed by the effort and data engineering resources required to self-manage a large-scale distributed database platform effectively.

As a solution, Mactores devised a strategic plan to:

  • Migrate existing Apache HBase systems running on the Cloudera platform to a fully-managed AWS EMR platform integrated with the AWS S3 storage solutions
  • Migrate all tables and associated processes from the Cloudera cluster to Amazon EMR
  • Reconfigure protocol buffers and Apache HBase client from the existing MapReduce jobs to support Apache HBase on Amazon EMR
  • Migrate all MapReduce Jobs to a separate transient Amazon EMR cluster and initiate the cluster from Jenkins workflow
  • Migrate HBase 1.4 to HBase 2.2.6 on Amazon EMR 6.2.0

Journey to a Fully Managed AWS EMR HBase Platform

The migration and data platform modernization process started off with a thorough assessment and evaluation of the company's current state of cloud readiness. This approach allowed Mactores to establish a migration plan that was well aligned with Flipboard's long-term business goals and end-user expectations.
 
The assessment also identified gaps and opportunities throughout the migration journey. The result was a high stakeholder commitment and motivated teams from Flipboard, ready to maximize the value potential of the most advanced AWS EMR capabilities to power advanced big data workloads on the managed cloud platform. The migration process was accelerated by following an iterative approach across three phases:
 

Assess

The Assessment phase involves building a solid business case and an action plan based on the prior assessment and evaluation. This phase combines the people, process, and technology required to adopt and execute distributed database capabilities within a secure, automated, and efficient AWS environment. This phase's overarching goals include:


  • Mobilizing teams and preparing for the Amazon EMR migration. The transition was designed to be user-centric, reducing the learning curve and transforming teams to maximize the value of the cloud environment.
  • Defining and automating policies: security, operations, and compliance.
  • Running the cloud-based Hadoop platform at production capacity to improve performance across several benchmarks and metrics.


To realize these goals, Mactores performed the following cloud migration, provisioning, and management tasks:

  • AWS EMR Architecture with S3 Storage: Configured 3 x r6g.12xlarge Master nodes and 9 x r6g.12xlarge core nodes running the RegionServers, plus additional nodes for autoscaling.
  • Configurations: Pre-validated EMR-specific configurations applied to Apache HBase, covering Amazon S3 storage, RegionServers, WAL, BlockCache, and MemStore.
  • Data Migration: Cloudera to S3 HBase migration with a single snapshot one week before the cut-off date, followed by incremental periodic updates until final migration.
  • Stress Testing: The YCSB framework was used to gather performance benchmarks throughout the migration process.
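
As a rough illustration of the architecture and configuration tasks above, the following Python sketch builds the kind of configuration and instance-group definitions that Amazon EMR accepts for HBase on S3. The `hbase.emr.storageMode` and `hbase.rootdir` settings are the documented EMR properties for S3 storage mode; the bucket path and function names are hypothetical, and the actual values used in the Flipboard deployment are not given here.

```python
import json

# Hypothetical bucket path; the real S3 location is not stated in this article.
HBASE_ROOT_DIR = "s3://example-hbase-bucket/hbase-root"


def hbase_on_s3_configurations(root_dir: str) -> list:
    """Return EMR configuration classifications that place the HBase root directory on S3."""
    return [
        {
            "Classification": "hbase",
            "Properties": {"hbase.emr.storageMode": "s3"},
        },
        {
            "Classification": "hbase-site",
            "Properties": {"hbase.rootdir": root_dir},
        },
    ]


def hbase_instance_groups() -> list:
    """Instance groups matching the node counts and instance types quoted above."""
    return [
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": "r6g.12xlarge", "InstanceCount": 3},
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": "r6g.12xlarge", "InstanceCount": 9},
    ]


if __name__ == "__main__":
    print(json.dumps(hbase_on_s3_configurations(HBASE_ROOT_DIR), indent=2))
    print(json.dumps(hbase_instance_groups(), indent=2))
```

Tuning of the WAL, BlockCache, and MemStore would add further `hbase-site` properties to the same list; those values are workload-specific and are left out of this sketch.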

Figure 1: Shows the final deployment architecture to perform the migration from self-managed HBase on a Hadoop Cluster to Fully Managed Amazon EMR HBase.

Migrate

The Migrate phase extends the action tasks from the Assess phase and applies them to migrate data workloads at scale. This ongoing and iterative process covers the process, technology, and design best practices for Amazon EMR migration and prepares the organization for a fully managed AWS data platform offering. The following tasks were involved in the migration process:


  • AWS EMR Architecture for MapReduce Jobs: 1 x m5.2xlarge Master node, 5 x m5.4xlarge core nodes of Region Servers, and autoscaling of core nodes for MapReduce Jobs.
  • EMR Configurations: Amazon EMR-specific HDFS and MapReduce configurations based on the pre-validated Apache Hadoop platform previously running on the Cloudera cluster.
  • Workload Migration: From MapReduce to Amazon EMR v5.30.0 by recompiling the MapReduce code.
  • Performance Testing: Writing tests and executing MapReduce Jobs on Amazon EMR cluster with Amazon EMR HBase for benchmark comparisons.
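
A transient cluster of the shape described above could be requested roughly as follows. This is a minimal sketch, assuming the request is later passed to the EMR `RunJobFlow` API (for example via `boto3.client("emr").run_job_flow(**request)`) from a Jenkins step. The cluster name, jar path, main class, and the default IAM role names are assumptions for illustration, not details from the Flipboard project.

```python
def transient_mapreduce_cluster_request(job_jar: str, main_class: str) -> dict:
    """Build a RunJobFlow-style request for a cluster that terminates once its step finishes."""
    return {
        "Name": "transient-mapreduce",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.2xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.4xlarge", "InstanceCount": 5},
            ],
            # False makes the cluster transient: it shuts down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "mapreduce-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["hadoop", "jar", job_jar, main_class],
                },
            }
        ],
        # Assumed default EMR roles; a real account may use custom roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = transient_mapreduce_cluster_request(
        "s3://example-bucket/jobs/job.jar",  # hypothetical jar location
        "com.example.ExampleJob",            # hypothetical main class
    )
    print(request["ReleaseLabel"], request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

Keeping the request as a plain dictionary makes it easy to review in version control alongside the Jenkins workflow; autoscaling rules for the core nodes would be added separately.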

The HBase RegionServer core nodes installed on the EC2 cluster offer improved Read performance by caching data and using efficient in-memory filters. The HBase Write Ahead Log (WAL) offers durable write performance for data stored in the on-cluster HDFS. Similarly, the Task Nodes write the HBase WAL requests in HDFS running on the Core Nodes with the Amazon EMR File System implementation providing convenient storage of persistent data on the S3 storage platform for strong read-after-write consistency.


Specifically, Mactores evaluated three data migration options and recommended an optimal mix for various MapReduce Jobs. The following options were exercised:
Figure 2: Shows the various options available to migrate data from the existing self-managed HBase on Hadoop Cluster to Fully Managed HBase on Amazon EMR

  • Snapshot: Create Snapshot from source → Export Snapshot to S3 bucket of the EMR cluster → Restore table from S3 bucket
      • Evaluation: This option was the easiest to use and required a low overall runtime, but the performance and table availability improvements were minimal.
  • Table Exports: Create Snapshot from source → Export snapshot to S3 bucket → Import snapshot from the S3 bucket → Clone Snapshot
      • Evaluation: Easy to use, but limited performance and availability improvements at the cost of the high overall runtime.
  • Copy Table: Use CopyTable to transfer individual tables from the source to the target cluster.
      • Evaluation: A complex task that requires the highest overall runtime but offers significant performance and table availability improvements.
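
The snapshot and CopyTable options above map onto standard HBase tools. The sketch below only assembles the command lines, using HBase's `ExportSnapshot` and `CopyTable` utilities; the snapshot name, table name, bucket, ZooKeeper address, and mapper count are placeholders, and the exact flags used in the Flipboard migration are not documented here.

```python
def export_snapshot_command(snapshot: str, destination: str, mappers: int = 16) -> list:
    """Command that copies an existing HBase snapshot to another root directory (e.g. on S3)."""
    return [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", destination,
        "-mappers", str(mappers),
    ]


def incremental_copy_command(table: str, peer_address: str,
                             start_ms: int, end_ms: int) -> list:
    """CopyTable command that transfers only cells written in [start_ms, end_ms)."""
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--starttime={start_ms}",
        f"--endtime={end_ms}",
        f"--peer.adr={peer_address}",
        table,
    ]


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    print(" ".join(export_snapshot_command(
        "example_snapshot", "s3://example-hbase-bucket/hbase-root")))
    print(" ".join(incremental_copy_command(
        "example_table", "zk1.example.internal:2181:/hbase",
        1660000000000, 1660600000000)))
```

A single full snapshot followed by time-bounded CopyTable runs is one way to combine the options: the snapshot moves the bulk of the data, and each incremental run covers only the window since the previous one.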


Modernize

The modernization phase largely covers the implementation plan from the Assessment phase and action tasks from the Migrate phase. The modernization phase aims to achieve the design goals of high scalability; an efficient and highly compatible open systems architecture; and a managed-service offering with maximal visibility, transparency, and control over the underlying infrastructure. These goals were achieved by implementing the following modernization processes:


  • Migration: Each table is migrated from Cloudera HBase 1.4 to Amazon EMR running HBase 2.2.6.
  • Validation: Data consistency and data integration are validated and guaranteed across all MapReduce Jobs and other applications.
  • Support: Ensure performance improvements, compatibility, and integration of data workloads and apps interacting with the HBase Client and protobuf for the HBase 2.2.6 implementation on Amazon EMR clusters.
  • Strategy: Devising a strategic plan to perform the cutover with the Flipboard team, streamlining the transition process, and ensuring that the company can maximize Amazon EMR performance improvements immediately following the cutover.
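
One simple form the validation step could take is a per-table row-count comparison between the source and target clusters. The sketch below assumes the counts have already been collected, for instance with HBase's `RowCounter` MapReduce job; the function name and table names are hypothetical, and matching row counts are only a coarse signal rather than proof of cell-level consistency.

```python
def find_row_count_mismatches(source_counts: dict, target_counts: dict) -> dict:
    """Return {table: (source_count, target_count)} for every table whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(table)
        dst = target_counts.get(table)
        if src != dst:
            mismatches[table] = (src, dst)
    return mismatches


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    source = {"example_articles": 1200, "example_users": 300}
    target = {"example_articles": 1200, "example_users": 299}
    print(find_row_count_mismatches(source, target))
```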


Performance Comparison and Results

Mactores conducted stress testing to discover performance bottlenecks across the network using the Yahoo Cloud Serving Benchmark (YCSB) across a set of common workloads. The tests covered three operation mixes (100% Read, 100% Write, and a 50/50 mix of Read and Write operations) on the following clusters:

  • Cluster 1: EMR with HDFS storage mode (3 m5.2xlarge Master nodes & 15 i3.8xlarge core nodes)
  • Cluster 2: EMR with S3 storage mode (3 r5a.8xlarge Master nodes & 15 r5a.8xlarge core nodes)
  • Cluster 3: EMR with S3 storage mode (3 r6g.8xlarge Master nodes & 15 r6g.8xlarge core nodes)
  • Cluster 4: EMR with S3 storage mode (3 r6g.12xlarge Master nodes & 9 r6g.12xlarge core nodes)
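
The three operation mixes can be expressed as YCSB `CoreWorkload` property files. The sketch below generates them in Python under a few assumptions: writes are modeled as updates, the record and operation counts are placeholders, and the workload class name follows recent YCSB releases (older releases use the `com.yahoo.ycsb` package prefix). The actual workload files used in these tests are not published here.

```python
def workload_properties(read_proportion: float, write_proportion: float,
                        record_count: int = 1_000_000,
                        operation_count: int = 1_000_000) -> str:
    """Render a YCSB CoreWorkload properties file for a given read/write mix."""
    if abs(read_proportion + write_proportion - 1.0) > 1e-9:
        raise ValueError("read and write proportions must sum to 1.0")
    props = {
        "workload": "site.ycsb.workloads.CoreWorkload",
        "recordcount": record_count,        # placeholder dataset size
        "operationcount": operation_count,  # placeholder run length
        "readproportion": read_proportion,
        "updateproportion": write_proportion,  # assumption: writes are updates
        "scanproportion": 0,
        "insertproportion": 0,
        "requestdistribution": "zipfian",
    }
    return "".join(f"{key}={value}\n" for key, value in props.items())


# The three mixes named in the text above.
WORKLOADS = {
    "read_only": workload_properties(1.0, 0.0),
    "write_only": workload_properties(0.0, 1.0),
    "mixed": workload_properties(0.5, 0.5),
}

if __name__ == "__main__":
    print(WORKLOADS["mixed"])
```

Each rendered file would then be passed to the YCSB client with `-P`, along with the HBase binding that matches the cluster's HBase version.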


The first cluster was chosen to simulate the configurations from the Cloudera HBase cluster. Mactores enabled Flipboard to realize the following improvements by migrating from Cloudera HBase clusters to a fully managed Amazon EMR HBase data platform:



 
             Throughput (Ops/sec)   Average Read Latency   Average Write Latency
Cluster 1    73117                  6128                   34769
Cluster 2    115021                 3449                   22497
Cluster 3    219114                 2047                   11516
Cluster 4    219045                 2336                   11282

Mactores further compared the Cloudera cluster with the AWS EMR Cluster 4 configurations and found 3X improved performance across Read and Write requests:

 

 
                              Read Requests   Write Requests
EC2 Based Cloudera Cluster    100,000         55,000
Cluster 4 on AWS EMR          300,000         150,000
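
The improvement factors implied by the reported figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section; the variable names are illustrative, and latencies are compared as plain ratios because their unit is not stated.

```python
def improvement_factor(before: float, after: float) -> float:
    """Ratio of the larger-is-better metric after migration to its value before, rounded to 2 places."""
    return round(after / before, 2)


# YCSB results: Cluster 1 (HDFS mode, simulating Cloudera) vs. Cluster 4 (S3 mode).
throughput_gain = improvement_factor(73117, 219045)
# For latency, lower is better, so the ratio is inverted (before / after).
read_latency_reduction = round(6128 / 2336, 2)
write_latency_reduction = round(34769 / 11282, 2)

# Request rates: EC2-based Cloudera cluster vs. Cluster 4 on AWS EMR.
read_request_gain = improvement_factor(100_000, 300_000)
write_request_gain = improvement_factor(55_000, 150_000)

if __name__ == "__main__":
    print("Throughput gain:", throughput_gain)
    print("Read latency reduction:", read_latency_reduction)
    print("Write latency reduction:", write_latency_reduction)
    print("Read request gain:", read_request_gain)
    print("Write request gain:", write_request_gain)
```

On these figures the read-request gain is exactly 3X, while the write-request gain is closer to 2.7X, so "3X" is best read as an approximate headline number.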

 

Conclusion

Flipboard was, therefore, able to achieve roughly 3X performance improvements across Read and Write requests on the Amazon EMR platform while also realizing the benefits of a fully-managed cloud service. The company is no longer overwhelmed by the complex and resource-intensive tasks of self-managing a big data platform as it scales its business across a growing user base. Mactores is already pursuing further modernization steps: planning to migrate tables to DynamoDB and reevaluating new business use cases with the Amazon EMR platform.

 
If you want to know your path forward in achieving highly scalable performance along with the successful deployment of Apache HBase on Amazon EMR, let's talk.