hadoop distcp
The hadoop distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.
Syntax
hadoop [Generic Options] distcp
[-p[erbugp]]
[-i]
[-log <logdir>]
[-m <num_maps>]
[-overwrite]
[-update]
[-f <URI list>]
[-filelimit <n>]
[-sizelimit <n>]
[-delete]
<source>
<destination>
Parameters
Command Options
The following command options are supported for the hadoop
distcp
command:
Parameter | Description
---|---
<source> | Specify the source URL.
<destination> | Specify the destination URL.
-blocksperchunk <number-of-blocks-per-chunk> | Number of blocks per chunk. When specified, this option splits files into chunks to copy the files in parallel. If the option is set to a positive value, files with more blocks than this value are split into chunks of <number-of-blocks-per-chunk> blocks to be transferred in parallel and reassembled at the destination. By default, <number-of-blocks-per-chunk> is 0 and files are transmitted in their entirety without splitting.
-p[erbugp] | Preserve file status (for example, replication, block size, user, group, and permission).
-i | Ignore failures. As explained below, this option keeps more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map does not cause the job to fail before all splits are attempted.
-log <logdir> | Write logs to <logdir>.
-m <num_maps> | Maximum number of simultaneous copies. Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. See Map Sizing.
-overwrite | Overwrite the destination. If a map fails and -i is not specified, all the files in the split, not only those that failed, are recopied. This option also changes how source and destination paths are interpreted; see Overwriting files between clusters.
-update | Overwrite if the size of the source file differs from the size of the destination file. See Updating files between clusters.
-f <URI list> | Use the list at <URI list> as the source list. This is equivalent to listing each source on the command line. The value of <URI list> must be a fully qualified URI.
-filelimit <n> | Limit the total number of files to be <= n. See Symbolic Representations.
-sizelimit <n> | Limit the total size to be <= n bytes. See Symbolic Representations.
-delete | Delete files that exist in the destination but not in the source.
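For example, the following command sketches how several of these options combine; the cluster names, paths, map count, and log directory are illustrative only:
$ hadoop distcp -m 20 -log maprfs://cluster2/tmp/distcp-logs \
    maprfs://cluster1/data maprfs://cluster2/backup/data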
Generic Options
The hadoop distcp command supports the following generic options: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>.
For more information on generic options, see Generic Options.
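As a sketch of how a generic option combines with distcp, the following command sets a configuration property with -D; the job name and the cluster paths shown here are only examples:
$ hadoop distcp -D mapreduce.job.name=nightly-backup \
    maprfs://cluster1/foo maprfs://cluster2/bar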
Symbolic Representations
The parameter <n> in -filelimit and -sizelimit can be specified with a symbolic representation. For example,
- 1230k = 1230 * 1024 = 1259520
- 891g = 891 * 1024^3 = 956703965184
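For instance, both limits can be given symbolically in a single command; the values and paths below are arbitrary and only illustrate the syntax:
$ hadoop distcp -filelimit 200k -sizelimit 50g \
    maprfs://cluster1/foo maprfs://cluster2/bar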
Map Sizing
The hadoop distcp command attempts to size each map comparably so that each copies roughly the same number of bytes. Note that files are the finest level of granularity, so increasing the number of simultaneous copiers (that is, maps) may not always increase the number of simultaneous copies or the overall throughput.

If -m is not specified, distcp attempts to schedule work for min(total_bytes / bytes.per.map, 20 * num_task_trackers), where bytes.per.map defaults to 256 MB.
Tuning the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth is recommended for long-running and regularly run jobs.
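As a worked example of the default formula (the numbers are hypothetical): copying 200 GB on a cluster with 10 task trackers yields min(200 GB / 256 MB, 20 * 10) = min(800, 200) = 200 maps, so the task-tracker term is the limiting factor. To request a different level of parallelism, pass -m explicitly; the paths here are illustrative:
$ hadoop distcp -m 100 maprfs://cluster1/foo maprfs://cluster2/bar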
Examples
For all of the examples below, the cluster name must be specified in the mapr-clusters.conf configuration file.
Basic inter-cluster copying
The hadoop distcp command is most often used to copy files between clusters:
hadoop distcp maprfs://cluster1/foo \
maprfs://cluster2/bar
The command in the example expands the namespace under /foo on cluster1 into a temporary file, partitions its contents among a set of map tasks, and starts a copy on each NodeManager node from cluster1 to cluster2. Note that the hadoop distcp command expects absolute paths.
Only those files that do not already exist in the destination are copied over from the source directory.
Updating files between clusters
Use the hadoop distcp -update command to synchronize changes between clusters.
$ hadoop distcp -update maprfs://cluster1/foo maprfs://cluster2/bar/foo
Files in the /foo subtree are copied from cluster1 to cluster2 only if the size of the source file is different from the size of the destination file. Otherwise, the files are skipped over.
Note that using the -update option changes how distributed copy interprets the source and destination paths, making it necessary to add the trailing /foo subdirectory on the second cluster.
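A minimal sketch of the difference in path interpretation (the paths are illustrative, and /bar is assumed to already exist on cluster2): without -update, the source directory itself is created under the destination, whereas with -update the contents of the source directory are copied to the destination path, which is why the trailing /foo is needed above.
$ hadoop distcp maprfs://cluster1/foo maprfs://cluster2/bar
# results in maprfs://cluster2/bar/foo/...
$ hadoop distcp -update maprfs://cluster1/foo maprfs://cluster2/bar/foo
# copies the contents of /foo into maprfs://cluster2/bar/foo/...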
Overwriting files between clusters
By default, distributed copy skips files that already exist in the destination directory, but you can overwrite those files using the -overwrite option. In this example, multiple source directories are specified:
$ hadoop distcp -overwrite maprfs://cluster1/foo/a \
maprfs://cluster1/foo/b \
maprfs://cluster2/bar
As with the -update option, the -overwrite option changes the way that the source and destination paths are interpreted by distributed copy: the contents of the source directories are compared to the contents of the destination directory, and the distributed copy aborts in case of a conflict.
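A hypothetical layout that would trigger such a conflict with the command above (the file names are invented):
#   maprfs://cluster1/foo/a/part-0000
#   maprfs://cluster1/foo/b/part-0000
# Both files would map to maprfs://cluster2/bar/part-0000, so the copy aborts.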
Intra-cluster copying of files and directories
The hadoop distcp command can be used to copy files and directories in a cluster to another directory in the same cluster.
- Copy a file into an existing target directory:
$ hadoop distcp /test/file.log /test/dir1
# verify the result of the distcp command with the hadoop fs -ls command
$ hadoop fs -ls -R /test
drwxr-xr-x   - username username  1 2022-10-14 10:37 /test/dir1
-rw-r--r--   3 username username 15 2022-10-14 10:37 /test/dir1/file.log
-rwxr-xr-x   3 username username 15 2022-10-14 10:29 /test/file.log
- Copy a file and a directory to an existing target directory:
$ hadoop distcp /test/file.log /test/dir1 /test/dir2
# verify the result of the distcp command with the hadoop fs -ls command
$ hadoop fs -ls -R /test
drwxr-xr-x   - username username  1 2022-10-14 10:37 /test/dir1
-rw-r--r--   3 username username 15 2022-10-14 10:37 /test/dir1/file.log
drwxr-xr-x   - username username  2 2022-10-14 10:40 /test/dir2
drwxr-xr-x   - username username  1 2022-10-14 10:40 /test/dir2/dir1
-rw-r--r--   3 username username 15 2022-10-14 10:40 /test/dir2/dir1/file.log
-rw-r--r--   3 username username 15 2022-10-14 10:40 /test/dir2/file.log
-rwxr-xr-x   3 username username 15 2022-10-14 10:29 /test/file.log
Migrating Data from HDFS to file system
The hadoop distcp command can be used to migrate data from an HDFS cluster to a file system where the HDFS cluster uses the same version of the RPC protocol as that used by Data Fabric. For a discussion, see Copying Data from Apache Hadoop.
$ hadoop distcp namenode1:50070/foo maprfs:///bar
You must specify the IP address and HTTP port (usually 50070) for the namenode on the HDFS cluster.