About Release 7.9.0
This site contains documentation for HPE Ezmeral Data Fabric release 7.9.0, including installation, configuration, administration, and reference content, as well as content for the associated ecosystem components and drivers.
Glossary
Definitions for commonly used terms in HPE Ezmeral Data Fabric environments.
- .dfs_attributes
  A special file in every directory, for controlling the compression and chunk size used for the directory and its subdirectories.
- .rw
  A special mount point in the root-level volume (or read-only mirror) that points to the writable original copy of the volume.
- .snapshot
  A special directory in the top level of each volume, containing all the snapshots for that volume.
- access control expression (ACE)
  A Boolean expression that defines a combination of users, groups, or roles that have access to an object stored natively, such as a directory, file, or HPE Ezmeral Data Fabric Database table.
- access control list (ACL)
  A list of permissions attached to an object. An ACL specifies users or system processes that can perform specific actions on an object.
- access policy
  An ACL or policy in JSON format that describes user access. Grants accounts and IAM users permissions to perform resource operations, such as putting objects in a bucket. You associate access policies with accounts, users, buckets, and objects.
- account
  Relates to Object Store. An account is a unique administrative unit that owns buckets, policies, and users. A default account exists automatically upon installation of Object Store. You cannot create IAM users in the default account and applications cannot access buckets in the default account. An administrator can create accounts and then create IAM users and buckets in those accounts. Applications can then access buckets as the IAM users in those accounts.
- accountable entity (AE)
  In the Control System, a user or group whose use of a volume can be subject to quotas. Using the Control System, you can set or modify quotas that limit the space used by all the volumes owned by an accountable entity.
- accounting entity (AE)
  In the CLI, a user or group whose use of a volume can be subject to quotas. Using the CLI, you can set or modify quotas that limit the space used by all the volumes owned by the accounting entity.
- administrator
  A user or users with special privileges to administer the cluster or cluster resources. Administrative functions can include managing hardware resources, users, data, services, security, and availability.
- advisory quota
  An advisory disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the advisory quota, an alert is sent.
- air gap
  Physical isolation between a computer system and unsecured networks. To enhance security, air-gapped computer systems are disconnected from other systems and networks.
- application containers
  Lightweight, stand-alone executables that include everything needed to run an application. Application containers are typically available for Linux and Windows applications.
- binary table
  Key-value and columnar database with HBase API. Supports Apache HBase tables and databases and also provides a native implementation of the HBase API for optimized performance on the Data Fabric platform.
- bitmask
  A binary number in which each bit controls a single toggle.
- bucket
  Container for objects. Access policies control user and application access to buckets.
- chunk
  Files in the file system are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, use the chunk size specified by the settings for the directory where the file is written.
- client node
  A node that runs the mapr-client, can reach every cluster node, and is used to access the cluster. Also referred to as an "edge node." Client nodes and edge nodes are NOT part of a Data Fabric cluster.
- cluster admin
  The Data Fabric user.
- cluster node
  A node that is part of a Data Fabric cluster. Cluster nodes can be used for data, compute, or both data and compute.
- coalesce
  The interval of time during which READ, WRITE, or GETATTR operations on one file from one IP address or UID are logged only once for a particular operation, if auditing is enabled.
- composite ID
  A unique, internal integer that maps to a security policy or set of security policies. A composite ID is stored with a resource instead of a security policy to optimize storage space.
- compute node
  A compute node is used to process data using a compute engine (for example, YARN, Hive, Spark, or Drill). A compute node is by definition a Data Fabric cluster node.
- container
  The unit of shared storage in a Data Fabric cluster. Every container is either a name container or a data container.
- container location database (CLDB)
  A service, running on one or more Data Fabric nodes, that maintains the locations of services, containers, and other cluster information.
- core
  The minimum complement of software packages required to construct a Data Fabric cluster. These packages include mapr-core, mapr-core-internal, mapr-cldb, mapr-apiserver, mapr-fileserver, mapr-zookeeper, and others. Note that ecosystem components are not part of core.
- custom resource (CR)
  In Kubernetes, the plan or blueprint for building and maintaining an application. Custom resources are specified as .yaml files.
- custom resource definition (CRD)
  In Kubernetes, a list of valid fields that defines the shape of a custom resource (CR).
- data-access gateway
  A service that acts as a proxy and gateway for translating requests between lightweight client applications and the Data Fabric cluster.
- data compaction
  A process that enables users to remove empty or deleted space in the database and to compact the database to occupy contiguous space.
- data container
  One of the two types of containers in a Data Fabric cluster. Data containers typically have a cascaded configuration (master replicates to replica1, replica1 replicates to replica2, and so on). Every data container is either a master container, an intermediate container, or a tail container depending on its replication role.
- Data Fabric
  A collection of nodes that work together under a unified architecture, along with the services or technologies running on that architecture. A fabric is similar to a Linux cluster. Fabrics help you manage your data, making it possible to access, integrate, model, analyze, and provision your data seamlessly.
- Data Fabric administrator
  The "Data Fabric user." The user that cluster services run as (typically named mapr or hadoop) on each node.
- Data Fabric gateway
  A gateway that supports table and stream replication. The Data Fabric gateway mediates one-way communication between a source Data Fabric cluster and a destination cluster. The Data Fabric gateway also applies updates from JSON tables to their secondary indexes and propagates Change Data Capture (CDC) logs.
- Data Fabric user
  The user that cluster services run as (typically named mapr or hadoop) on each node. The Data Fabric user, also known as the "Data Fabric admin," has full privileges to administer the cluster. The administrative privilege, with varying levels of control, can be assigned to other users as well.
- data node
  A data node has the function of storing data and always runs FileServer. A data node is by definition a Data Fabric cluster node.
- desired replication factor
  The number of copies of a volume that should be maintained by the Data Fabric cluster for normal operation.
- developer preview
  A label for a feature or collection of features that have usage restrictions. Developer previews are not tested for production environments, and should be used with caution.
- disk space balancer
  The disk space balancer is a tool that balances disk space usage on a cluster by moving containers between storage pools. Whenever a storage pool is over 70% full (or a threshold defined by the cldb.balancer.disk.threshold.percentage parameter), the disk space balancer distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all of the disks in the node is similar.
- disktab
  A file on each node, containing a list of the node's disks that have been configured for use by the file system.
- Docker containers
  The application containers used by Docker software. Docker is a leading proponent of OS virtualization using application containers ("containerization").
- Domain
  Relates to Object Store. A domain is a management entity for accounts and users. The number of users, the amount of disk space, number of buckets in each of the accounts, total number of accounts, and the number of disabled accounts are all tracked within a domain. Currently, Object Store only supports the primary domain; you cannot create additional domains. Administrators can create multiple accounts in the primary domain.
- domain user
  Relates to Object Store. A domain user is a cluster security principal authenticated through AD/LDAP. Domain users only exist in the default account. Domain users can log in to the Object Store UI with their domain username and password.
- dump file
  A file containing data from a volume for distribution or restoration. There are two types of dump files: full dump files containing all data in a volume, and incremental dump files that contain changes to a volume between two points in time.
- Ecosystem Pack (EEP)
  A selected set of stable, interoperable, and widely used components from the Hadoop Ecosystem that are fully supported on the Data Fabric platform.
- edge cluster
  A small-footprint edition of the HPE Ezmeral Data Fabric designed to capture, process, and analyze IoT data close to the source of the data.
- edge node
  A node that runs the mapr-client, can reach every cluster node, and is used to access the cluster. Also referred to as a "client node." Client nodes and edge nodes are NOT part of a Data Fabric cluster.
- entity
  A user or group.
- epoch
  A sequence number that identifies all copies that have the latest updates for a container. The larger the number, the more up-to-date the copy of the container. The CLDB uses the epoch to ensure that an out-of-date copy cannot become the master for the container.
- filelet
  A filelet, also called an fid, is a 256 MB shard of a file. A 1 GB file, for instance, comprises the following filelets: 64 KB (primary fid) + (256 MB - 64 KB) + 256 MB + 256 MB + 256 MB.
- file system
  The NFS-mountable, distributed, high-performance HPE Ezmeral Data Fabric data-storage system.
- full dump file
  A dump file containing all data in a volume. See dump file.
- gateway node
  A node on which a mapr-gateway is installed. A gateway node is by definition a Data Fabric cluster node.
- global namespace (GNS)
  The data plane that connects HPE Ezmeral Data Fabric deployments. The global namespace is a mechanism that aggregates disparate and remote data sources and provides a namespace that encompasses all of your infrastructure and deployments. Global namespace technology lets you manage globally deployed data as a single resource. Because of the global namespace, you can view and run multiple fabrics as a single, logical, and local fabric. The global namespace is designed to span multiple edge nodes, on-prem data centers, and clouds.
- HBase
  A distributed storage system, designed to scale to a very large size, for managing massive amounts of structured data.
- HPE Ezmeral Data Fabric
  A software-as-a-service (SaaS) platform for the hybrid enterprise with data distributed from edge to core to cloud. The federated global namespace integrates files, objects, tables, and streaming data and offers consumption-based pricing. Far-flung deployments run in a single, logical view no matter where the data is located.
- HPE Ezmeral Data Fabric – Customer Managed
  A platform for data-driven analytics, ML, and AI workloads that also serves as a secure data store and provides file storage, NoSQL databases, object storage, and event streams. The patented file-system architecture was designed and built for performance, reliability, and scalability.
- HPE Ezmeral Data Fabric Edge
  A small-footprint edition of the HPE Ezmeral Data Fabric designed to capture, process, and analyze IoT data close to the source of the data. Also referred to as an "edge cluster."
- heartbeat
  A signal sent by each FileServer and NFS node every second to provide information to the CLDB about the node's health and resource usage.
- IAM users
  Relates to Object Store. An IAM (Identity and Access Management) user represents an actual user or an application. An administrator creates IAM users in an Object Store account and assigns access policies to them to control user and application access to resources in the account.
- incremental dump file
  A dump file that contains changes to a volume between two points in time. See dump file.
- Installer
  A program that simplifies installation of the HPE Ezmeral Data Fabric. The Installer guides you through the process of installing a cluster with Data Fabric services and ecosystem components. You can also use the Installer to update a previously installed cluster with additional nodes, services, and ecosystem components. In addition, you can use the Installer to upgrade a cluster to a newer core version if the cluster was installed using the Installer or an Installer Stanza.
- Installer node
  The node on which you run the Installer program. The Installer node can be a node in the cluster that you plan to install, or it can be a node that is not part of the cluster. If the Installer node is not one of the nodes in the cluster to be installed, certain prerequisites must be met.
- Kubernetes Interfaces for Data Fabric
  A set of Docker containers that provide persistent storage for Kubernetes objects through the file system. Once the Docker containers are installed, both a Kubernetes FlexVolume Driver and a Kubernetes Dynamic Volume Provisioner are available for static and dynamic provisioning of Data Fabric storage.
- log compaction
  A process that purges messages previously published to a topic partition, retaining the latest version.
- MAST Gateway
  A gateway that serves as a centralized entry point for all the operations that need to be performed on tiered storage.
- minimum replication factor
  The minimum number of copies of a volume that should be maintained by the Data Fabric cluster for normal operation. When the replication factor falls below this minimum, re-replication occurs as aggressively as possible to restore the replication level. If any containers in the CLDB volume fall below the minimum replication factor, writes are disabled until aggressive re-replication restores the minimum level of replication.
- mirror
  A read-only physical copy of a volume.
- MOSS
  MOSS is the acronym for Multithreaded Object Store Server.
- name container
  A container in a Data Fabric cluster that holds a volume's namespace information and file chunk locations, and the first 64 KB of each file in the volume.
- Network File System (NFS)
  A protocol that allows a user on a client computer to access files over a network as though they were stored locally.
- node
  An individual physical or virtual machine in a cluster.
- NodeManager (NM)
  A data service that works with the ResourceManager to host the YARN resource containers that run on each data node.
- object
  File and metadata that describes the file. You upload an object into a bucket. You can then download, open, move, or delete the object.
- Object Store
  Object and metadata storage solution built into the HPE Ezmeral Data Fabric. Object Store efficiently stores data for fast access and leverages the capabilities of the patented HPE Ezmeral Data Fabric file system for performance, reliability, and scalability.
- operator
  In Kubernetes, a way to install and manage an application. Kubernetes operators handle not just application installation, but also the entire application lifecycle, including complex upgrades. An operator consists of a combination of two real Kubernetes objects: a controller and a custom resource.
- Persistent Application Client Container (PACC)
  A Docker-based application container image that includes a container-optimized Data Fabric client. The PACC provides seamless access to cluster services, including the file system, HPE Ezmeral Data Fabric Database, and HPE Ezmeral Data Fabric Streams. The PACC makes it fast and easy to run containerized applications that access data in the cluster.
- policy server
  The service that manages security policies and composite IDs.
- quota
  A disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the quota, no more data can be written.
- recovery point objective (RPO)
  The maximum allowable data loss as a point in time. If the recovery point objective is two hours, then the maximum allowable amount of data loss that is acceptable is two hours of work.
- recovery time objective (RTO)
  The maximum allowable time to recovery after data loss. If the recovery time objective is five hours, then it must be possible to restore data up to the recovery point objective within five hours.
- replication factor
  The number of copies of a volume.
- replication role
  The replication role of a container determines how that container is replicated to other storage pools in the cluster.
- replication role balancer
  The replication role balancer is a tool that switches the replication roles of containers to ensure that every node has an equal share of master and replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers).
- re-replication
  Re-replication occurs whenever the number of available replica containers drops below the number prescribed by that volume's replication factor. Re-replication may occur for a variety of reasons including replica container corruption, node unavailability, hard disk failure, or an increase in replication factor.
- ResourceManager (RM)
  A YARN service that manages cluster resources and schedules applications.
- role
  The service that the node runs in a cluster. You can use a node for one, or a combination of the following roles: CLDB, JobTracker, WebServer, ResourceManager, Zookeeper, FileServer, TaskTracker, NFS, and HBase.
- secret
  A Kubernetes object that holds sensitive information, such as passwords, tokens, and keys. Pods that require this sensitive information reference the secret in their pod definition. Secrets are the method Kubernetes uses to move sensitive data into pods.
- secure by default
  The HPE Ezmeral Data Fabric platform and supported ecosystem components are designed to implement security unless the user takes specific steps to turn off security options.
- security policy
  A classification that encapsulates security controls on your data. Controls include which users have authorization to access and modify the data, whether to audit data operations, and whether to protect data in motion with wire-level encryption.
- schedule
  A group of rules that specify recurring points in time at which certain actions occur.
- snapshot
  A read-only logical image of a volume at a specific point in time.
- storage pool
  A unit of storage made up of one or more disks. By default, Data Fabric storage pools contain two or three disks. For high-volume reads and writes, you can create larger storage pools when initially formatting storage during cluster creation.
- stripe width
  The number of disks in a storage pool.
- super group
  The group that has administrative access to the Data Fabric cluster.
- super user
  The user that has administrative access to the Data Fabric cluster.
- tagging
  The operation of applying a security policy to a resource.
- ticket
  In the Data Fabric platform, a file that contains keys used to authenticate users and cluster servers. Tickets are created using the maprlogin or configure.sh utilities and are encrypted to protect their contents. Different types of tickets are provided for users and services. For example, every user who wants to access a cluster must have a user ticket, and every node in a cluster must have a server ticket.
- ticket secret
  A Kubernetes secret that contains a ticket.
- volume
  A tree of files and directories grouped for the purpose of applying a policy or set of policies to all of them at once.
- Warden
  A Data Fabric process that coordinates the starting and stopping of configured services on a node.
- WORM
  Write Once Read Many
- YARN resource containers
  A unit of memory allocated for use by YARN to process each map or reduce task.
- ZooKeeper
  ZooKeeper is a coordination service for distributed applications. It provides a shared hierarchical namespace that is organized like a standard file system.
- Fabric
  Fabric is an alternate term for a cluster.
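
The chunk and filelet entries above give concrete numbers: valid chunk sizes are multiples of 65,536 bytes with a 256 MB default, and a 1 GB file is laid out as 64 KB + (256 MB - 64 KB) + 256 MB + 256 MB + 256 MB. The following Python sketch reproduces that arithmetic. The function and constant names are hypothetical and are not part of any Data Fabric API. Rejecting zero or negative chunk sizes, and extending the 1 GB pattern to other file sizes, are assumptions of this sketch rather than documented behavior.

```python
from typing import List

KB = 1024
MB = 1024 * KB
GB = 1024 * MB

CHUNK_GRANULARITY = 65_536      # chunk sizes are multiples of this (chunk entry)
DEFAULT_CHUNK_SIZE = 256 * MB   # default chunk size (chunk entry)
PRIMARY_FID_SIZE = 64 * KB      # size of the primary fid (filelet entry)
FILELET_SIZE = 256 * MB         # filelet size (filelet entry)


def is_valid_chunk_size(size_bytes: int) -> bool:
    """Return True if size_bytes is a multiple of 65,536 bytes.

    Assumption: zero and negative values are rejected here; the glossary
    entry does not say how they are treated.
    """
    return size_bytes > 0 and size_bytes % CHUNK_GRANULARITY == 0


def filelet_sizes(file_size: int) -> List[int]:
    """Split a file of file_size bytes using the pattern in the filelet entry.

    The first piece is the 64 KB primary fid, the second is the rest of the
    first 256 MB, and each later piece is up to 256 MB.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    sizes = []
    offset = 0
    boundary = PRIMARY_FID_SIZE
    while offset < file_size:
        end = min(boundary, file_size)
        sizes.append(end - offset)
        offset = end
        # After the primary fid, boundaries fall on 256 MB multiples.
        if boundary == PRIMARY_FID_SIZE:
            boundary = FILELET_SIZE
        else:
            boundary += FILELET_SIZE
    return sizes


if __name__ == "__main__":
    print(is_valid_chunk_size(DEFAULT_CHUNK_SIZE))
    print([size // KB for size in filelet_sizes(GB)])  # sizes in KB
```

Calling `filelet_sizes(GB)` yields five pieces whose sizes sum to exactly 1 GB, matching the breakdown in the filelet entry.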

storage pool

A unit of storage made up of one or more disks. By default, Data Fabric storage pools contain two or three disks. For high-volume reads and writes, you can create larger storage pools when initially formatting storage during cluster creation.

NOTE

A storage pool is the combined storage capacity of one or more storage devices. Storage devices can be anything from a very small disk drive to large arrays of disk drives (each containing 20-30 drives).

A storage pool is created to make a large capacity (gigabytes, terabytes, or petabytes) available, from which users are allocated the amount of storage they require.

For example, you can combine 10 hard disk drives of 4 TB each, for a total of 40 TB. You can then either use the 40 TB directly as a single device or divide it into smaller capacities, such as 100 GB or 1 TB, and provide them to different users.
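
The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of pooling and dividing capacity: the function names and user names are hypothetical, decimal units (1 TB = 1000 GB) are an assumption, and the sketch does not model how Data Fabric actually allocates space.

```python
GB_PER_TB = 1000  # assumption: decimal units, 1 TB = 1000 GB


def pool_capacity_gb(disk_sizes_gb):
    """Total capacity of a pool is the sum of its disks' capacities."""
    return sum(disk_sizes_gb)


def allocate(capacity_gb, requests_gb):
    """Grant each request in order while enough capacity remains.

    Returns a (granted, remaining_gb) tuple; requests that do not fit
    are skipped.
    """
    granted = {}
    remaining = capacity_gb
    for user, size in requests_gb.items():
        if size <= remaining:
            granted[user] = size
            remaining -= size
    return granted, remaining


if __name__ == "__main__":
    # Ten 4 TB drives combined into one pool.
    capacity = pool_capacity_gb([4 * GB_PER_TB] * 10)
    # Hypothetical users receiving 100 GB and 1 TB.
    granted, remaining = allocate(capacity, {"user_a": 100, "user_b": 1 * GB_PER_TB})
    print(capacity, granted, remaining)
```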

A storage pool can be created even after a cluster/fabric has been created. For example, you can add several unused disks to the distributed filesystem at the same time, to form a new storage pool.

If a disk that is part of a storage pool fails and needs to be replaced, all data located in the storage pool is lost, including the data stored on the other disks in the storage pool. Data reconstruction or re-replication then takes longer than it would if the failed disk were the only disk in the storage pool.

When data is replicated, data is available from other copies in other storage pools.

When erasure coding is used, the missing data blocks are reconstructed from other blocks with the help of parity blocks. The data block reconstruction operation is resource-intensive and takes time. During the operation, files that have missing blocks are not available. An application that tries to read such files is either suspended or gets an I/O error during the read operation. Avoid large storage pools for erasure coding by using storage labels.
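
To make the idea of rebuilding a missing block from the surviving blocks and parity concrete, here is a minimal Python sketch that uses a single XOR parity block. This is the simplest possible parity scheme and is shown only as an illustration; it is not a description of the erasure coding scheme Data Fabric uses, and all function names are hypothetical. Single XOR parity can recover only one lost block per group.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute a single parity block over the data blocks."""
    return xor_blocks(data_blocks)


def reconstruct(blocks, parity):
    """Rebuild the one missing block (marked as None) from the rest."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity recovers exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    restored = list(blocks)
    restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = make_parity(data)
    damaged = [data[0], None, data[2]]  # simulate a lost block
    print(reconstruct(damaged, parity) == data)
```

Every surviving block in the group must be read to rebuild the missing one, which hints at why reconstruction is resource-intensive.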


HPE Ezmeral Data Fabric – Customer-Managed 7.9.0 Documentation
Abstract	This site contains documentation for the customer-managed platform of the HPE Ezmeral Data Fabric version 7.9.0 including installation, configuration, administration, and reference content, as well as content for the associated bundled ecosystem components and drivers.
Published	April 2025
Edition	7.9.0
Topic last updated	2024-08-26