Managing Chunk Size
Describes the considerations for managing the chunk size for map tasks.
Files in the HPE Ezmeral Data Fabric filesystem are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important:
- Smaller chunk sizes produce larger numbers of map tasks, which can lower performance due to task-scheduling overhead.
- Larger chunk sizes require more memory to sort the map task output, which can crash the JVM or add significant garbage-collection overhead. HPE Ezmeral Data Fabric can deliver a single stream at upwards of 300 MB per second, making it possible to use larger chunks than in the stock Hadoop implementation. Generally, it is wise to set the chunk size between 64 MB and 256 MB; the worked example after this list illustrates the tradeoff.
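To make the tradeoff concrete, here is a rough worked example with hypothetical numbers: a 10 GB (10,240 MB) input file split into 256 MB chunks yields 40 map tasks, while the same file split into 64 MB chunks yields 160 map tasks, four times the scheduling overhead for the same amount of data.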
Chunk size is set at the directory level. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any file written by a Hadoop application, whether through the file APIs or over NFS, uses the chunk size specified by the settings for the directory where the file is written. If you change a directory's chunk size settings after writing a file, the file keeps the old settings. Further writes to the file use the file's current chunk size.
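For example, you can check the chunk size currently in effect for a directory and its files with hadoop mfs -ls, which includes the chunk size in its listing (the path here is hypothetical):
hadoop mfs -ls /user/alice/data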
If the chunk size is set to zero (0), when an application makes a request for the block size, HPE Ezmeral Data Fabric returns 1073741824 (1 GB); however, hadoop mfs commands continue to display 0 for the chunk size.
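For example (the path is hypothetical):
hadoop mfs -setchunksize 0 /projects/test
After this command, an application that queries the block size for files under /projects/test sees 1 GB, but hadoop mfs -ls still displays a chunk size of 0.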
Configuring Chunk Size
Chunk size also affects parallel processing and random disk I/O during MapReduce applications. A higher chunk size means less parallel processing, because there are fewer map inputs and therefore fewer mappers. A lower chunk size improves parallelism, but results in higher random disk I/O during the shuffle because there are more map outputs. Set the io.sort.mb parameter to a value between 120% and 150% of the chunk size.
- For most purposes, set the chunk size to the default 256 MB and set the value of the io.sort.mb parameter to the default 380 MB.
- On very small clusters or nodes with limited RAM, set the chunk size to 128 MB and set the value of the io.sort.mb parameter to 190 MB.
- If application-level compression is in use, the io.sort.mb parameter should be at least 380 MB.
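As a sketch of how one of these values might be applied to a single job, assuming the job's driver uses Hadoop's ToolRunner so that -D properties are honored (the jar name, class name, and paths are hypothetical):
hadoop jar myapp.jar com.example.WordCount -Dio.sort.mb=380 /input /output
To apply a value cluster-wide instead, set io.sort.mb in mapred-site.xml; note that under MRv2/YARN the equivalent property is named mapreduce.task.io.sort.mb.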
If you use Drill to read or write Parquet files, set the store.parquet.block-size parameter in Drill so that the Parquet block size is the same as the chunk size in the HPE Ezmeral Data Fabric filesystem. See Configuring the Parquet Block Size for more information.
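For example, assuming a 268,435,456-byte (256 MB) chunk size, the Drill option can be set from a Drill session as follows; use ALTER SYSTEM instead of ALTER SESSION to make the change for all sessions:
ALTER SESSION SET `store.parquet.block-size` = 268435456;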
Setting Chunk Size
You can set the chunk size for a given directory in two ways:
- Change the ChunkSize attribute in the .dfs_attributes file at the top level of the directory.
- Use the command hadoop mfs -setchunksize <size> <directory>.
For example, if the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test, you can set the chunk size to 268,435,456 bytes by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting ChunkSize=268435456. To accomplish the same thing from the hadoop shell, use the following command:
hadoop mfs -setchunksize 268435456 /mapr/my.cluster.com/projects/test
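To verify the change, you can list the directory again with hadoop mfs -ls, or re-read the .dfs_attributes file over NFS; both should now reflect ChunkSize=268435456:
hadoop mfs -ls /mapr/my.cluster.com/projects/test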