Apache Shuffle on YARN
You can disable Direct Shuffle and enable Apache Shuffle by modifying configuration options in the yarn-site.xml and mapred-site.xml files. This page describes how to configure Apache Shuffle for MapReduce applications.
The shuffle phase in Hadoop transfers the mappers' intermediate output to the reducers. Direct Shuffle increases the load on file system disks. You can enable Apache Shuffle to reduce that load.
Configuration for Apache Shuffle
Add the following property to the yarn-site.xml file:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
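In stock Apache Hadoop, each name listed in yarn.nodemanager.aux-services is normally paired with a property that names the class implementing that service. The fragment below is a minimal sketch of that pairing for mapreduce_shuffle. It assumes your distribution does not already define the class in its default configuration, so check the existing yarn-site.xml and defaults before adding it.

```xml
<!-- Sketch only: pairs the aux-service name with its implementing class. -->
<!-- Assumption: the class property is not already set by your distribution's defaults. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```

Auxiliary services are loaded when the NodeManager starts, so a change to this property typically takes effect only after the NodeManager on each node is restarted.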
Add the following properties to the mapred-site.xml file:
<property>
<name>mapreduce.job.shuffle.provider.services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
<value>org.apache.hadoop.mapreduce.task.reduce.Shuffle</value>
</property>
<property>
<name>mapreduce.job.map.output.collector.class</name>
<value>org.apache.hadoop.mapred.MapTask$MapOutputBuffer</value>
</property>
<property>
<name>mapred.ifile.outputstream</name>
<value>org.apache.hadoop.mapred.IFileOutputStream</value>
</property>
<property>
<name>mapred.ifile.inputstream</name>
<value>org.apache.hadoop.mapred.IFileInputStream</value>
</property>
<property>
<name>mapred.local.mapoutput</name>
<value>true</value>
</property>
<property>
<name>mapreduce.task.local.output.class</name>
<value>org.apache.hadoop.mapred.YarnOutputFiles</value>
</property>
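The property elements above are fragments, not complete files. As a rough sketch of where they go, Hadoop site files wrap every property in a single configuration root element. The layout below shows only the first property from the list and assumes an otherwise empty mapred-site.xml. In an existing file, add the properties inside the configuration element that is already there instead of creating a second one.

```xml
<?xml version="1.0"?>
<!-- Sketch of the overall mapred-site.xml layout; only one property is shown. -->
<configuration>
  <property>
    <name>mapreduce.job.shuffle.provider.services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Add the remaining Apache Shuffle properties listed above here, one property element each. -->
</configuration>
```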