Elastic MapReduce Essentials:
- Amazon EMR is a managed service that deploys a cluster of EC2 instances running the Hadoop big data framework.
- EMR is used to analyze and process vast amounts of data.
EMR also supports other distributed frameworks, such as (see the launch sketch after this list):
- Apache Spark
- HBase
- Presto
- Flink
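As a rough sketch, assuming the boto3 SDK: these frameworks are requested via the Applications parameter when a cluster is launched. The cluster name, region, release label, and instance settings below are illustrative placeholders, not a definitive configuration.
```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # hypothetical region

response = emr.run_job_flow(
    Name="example-cluster",            # hypothetical name
    ReleaseLabel="emr-6.15.0",         # pick a current EMR release
    Applications=[                     # frameworks installed alongside Hadoop
        {"Name": "Spark"},
        {"Name": "HBase"},
        {"Name": "Presto"},
        {"Name": "Flink"},
    ],
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance/service roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])            # cluster ID, e.g. j-XXXXXXXXXXXXX
```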
General EMR Workflow:
- Data stored in S3, DynamoDB, or Redshift is sent to EMR.
- The data is mapped to a "cluster" of Hadoop Master/Slave nodes for processing.
- Computations (code written by the developer) are used to process the data.
- The processed data is then reduced to a single output data set (see the toy example after this list).
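To illustrate the map-then-reduce pattern itself, here is a toy word count in plain Python; on EMR, Hadoop distributes these two phases across the cluster nodes rather than running them in one process.
```python
from collections import Counter
from functools import reduce

# Stand-ins for the 128 MB chunks the data is split into.
chunks = ["big data big", "data processing", "big processing"]

# Map phase: each chunk is processed independently (in parallel on EMR).
mapped = [Counter(chunk.split()) for chunk in chunks]

# Reduce phase: per-chunk results are aggregated into one output set.
totals = reduce(lambda a, b: a + b, mapped)
print(totals)  # Counter({'big': 3, 'data': 2, 'processing': 2})
```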
Other Important EMR Facts:
- You (the admin) have the ability to access the underlying operating system.
- You can add user data to EC2 instances launched into the cluster via bootstrapping (see the sketch after this list).
- EMR takes advantage of parallel processing for faster processing of data.
- You can resize a running cluster at any time, and you can deploy multiple clusters.
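A minimal sketch of both ideas, assuming boto3; all IDs, bucket names, and script paths are hypothetical placeholders.
```python
import boto3

emr = boto3.client("emr")

# Bootstrapping: scripts that run on each EC2 instance at launch,
# passed via BootstrapActions when the cluster is created.
bootstrap = [{
    "Name": "install-deps",                     # hypothetical action name
    "ScriptBootstrapAction": {
        "Path": "s3://my-bucket/bootstrap.sh",  # hypothetical script in S3
        "Args": [],
    },
}]
# (BootstrapActions=bootstrap would be passed to emr.run_job_flow.)

# Resizing a running cluster: change an instance group's instance count.
emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",                # hypothetical cluster ID
    InstanceGroups=[{
        "InstanceGroupId": "ig-XXXXXXXXXXXX",   # hypothetical group ID
        "InstanceCount": 6,                     # new size for the group
    }],
)
```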
EMR Slave Nodes:
There are two types of slave nodes:
Core node:
- A slave node with software components that run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster.
- The core nodes do the "heavy lifting" with the data.
Task node:
- A slave node with software components that only run tasks; it does not store data in HDFS.
- Task nodes are optional (see the instance-group sketch below).
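The node roles map directly onto EMR's instance-group configuration. A sketch of the Instances structure passed to run_job_flow, assuming boto3; instance types and counts are illustrative.
```python
# MASTER / CORE / TASK are the roles EMR recognizes for instance groups.
instance_groups = [
    {"Name": "Master", "InstanceRole": "MASTER",  # master node
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE",      # runs tasks AND stores HDFS data
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "Task", "InstanceRole": "TASK",      # runs tasks only; optional
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
]
# Passed as Instances={"InstanceGroups": instance_groups, ...} to run_job_flow.
```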
EMR Map Phase:
- Mapping is a function that defines how the large data file is split up for processing.
- During the mapping phase, the data is split into 128 MB "chunks".
- The larger the instances used in your EMR cluster, the more chunks you can map and process at the same time.
- If there are more chunks than nodes/mappers, the remaining chunks queue for processing (see the sketch after this list).
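A back-of-the-envelope illustration of that queueing behavior, with made-up numbers (file size and mapper count are hypothetical):
```python
import math

# Hypothetical numbers: a 10 GB input split into 128 MB chunks,
# processed by a cluster with 20 available mapper slots.
file_size_mb = 10 * 1024
chunk_size_mb = 128
mapper_slots = 20

chunks = math.ceil(file_size_mb / chunk_size_mb)  # 80 chunks
waves = math.ceil(chunks / mapper_slots)          # chunks queue into 4 waves
print(chunks, waves)  # 80 4
```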
EMR Reduce Phase:
- Reducing is a function that aggregates the processed chunks back into a single output data set.
- Reduced data needs to be stored in a service like S3, as data processed by the EMR cluster is not persistent (see the sketch below).
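A minimal sketch of persisting reduced output, assuming boto3; the bucket and key names are hypothetical. HDFS on the cluster disappears when the cluster terminates, so results are written out to S3.
```python
import boto3

s3 = boto3.client("s3")
reduced_output = "big\t3\ndata\t2\nprocessing\t2\n"  # toy result from the reduce phase

s3.put_object(
    Bucket="my-emr-results",       # hypothetical bucket
    Key="wordcount/output.tsv",    # hypothetical key
    Body=reduced_output.encode("utf-8"),
)
```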