Metadata JSON
Each App Store image includes metadata contained inside a JSON file that specifies various interface and configuration options. Some of this metadata is visible in the App Store screen and/or the Create Cluster and Create New Job screens, and is described in the Interface Metadata section of this article. The interface metadata, along with other application configuration metadata, is contained inside the Catalog JSON file that is described in the Catalog JSON File section of this article.
Interface Metadata
The interface-related metadata included for use by the App Store interface consists of the following information:
- Basic image information: This information is visible in the App
Store screen and includes:
- App name: Name of the application.
- Description: Short description of the application.
- Logo: Image file that displays in the App Store screen.
- Hovering the mouse over an application tile expands the tile to display the
following additional information:
- Long Description: Longer description of the application.
- Version: Image version and optional build number.
- Root disk size (Local): Root disk size required for running the image on-premises.
- Root disk size (EC2): Root disk size required for running the image on an EC2 instance.
- Distro ID: Unique identifier for the image.
- Category: Category of Big Data application provided by the image.
- Additional metadata determines the options that will be available in the
Create New Cluster screen when a new cluster is created. This
includes:
- Cluster group name: Type(s) of cluster (such as Hadoop, Spark, and/or Kafka) on which the image can run.
- Node flavor limits: Role type specific node flavor(s) required to run the application, which will be based on the CPU, RAM, and storage requirements of the application.
- Node count limits: Number of role-specific and/or Edge nodes required to run the application.
Catalog JSON File
This article uses the CDH 5.4.3 with Cloudera Manager Catalog entry as an
example for explaining the HPE Ezmeral Runtime Enterprise Catalog (App
Store) entry JSON properties. The cdh54CM.json
file is
located in the /opt/bluedata/catalog/entries/system
directory.
Catalog entry properties can be broadly grouped by purpose, as follows:
Identification
The identification
blob appears as follows:
"distro_id": "cdh54CM", "label": { "name": "CDH 5.4.3 with Cloudera Manager", "description": "CDH 5.4.3 with MRv1/YARN and HBase support. Includes Pig, Hive, Hue and Spark." }, "version": "2.0.1", "epic_compatible_versions": ["3.4"], "categories": [ "Hadoop", "HBase" ],
In this blob:
- distro_id is a unique identifier for either a Catalog entry or a versioned set of Catalog entries. It represents a particular application or application-framework setup as created and maintained by a particular author or organization. The HPE Ezmeral Runtime Enterprise interface and API currently allow only one Catalog entry with a given distro ID to be installed for use at any given time. Each distro ID corresponds to one "tile" in the Images tab of the App Store screen. HPE Ezmeral Runtime Enterprise may also reference the distro ID when determining appropriate Add-On image entries that can be added to a cluster, because an add-on may have a distro ID requirement.
- The label property contains the following parameters:
  - name, which is the "short name" of the Catalog entry. The Catalog API does not allow entries with different distro IDs to share the same name.
  - description, which is a longer, more detailed blurb about the entry.
- version is a discriminator between multiple Catalog entries that share the same distro ID. It is expected to adhere to a simple pattern of digits separated by dots in the format a.b.c, where:
  - a.b is the version number, such as the 2.0 in "version": "2.0.1". You may assign any version you want to the Catalog entry. This version represents iterations of this Catalog entry; it does not necessarily represent the version of any software deployed in a cluster. For example, you may have a CDH 5.4 Catalog entry that you deploy as Version 1.0 followed by 1.1, 2.0, etc. HPE Ezmeral Runtime Enterprise installs the newest available version of a given distro ID when instructed to install or upgrade that distro ID.
  - c is the optional build number, such as the 1 in "version": "2.0.1". App Workbench stores the first value used for c when the distro ID is created. Future versions of the same distro ID automatically increment the build number based on the last value stored in the system, provided that you do not change the c value in the JSON file. In this example, the first-ever build of the distro ID will be version 2.0.1, the next version will be 2.0.2, and so forth. Manually entering a new build number that is equal to or less than the stored build value will not have any effect until you change the version number by modifying the a and/or b values, such as by moving from version 2.0.1 to version 2.1.1 or 3.0.1. Manually entering a new build number that is higher than the stored build value will set the build number to that new value. For example, if the stored build value is 5 and you enter a build number that is less than or equal to 5, then the next build number will be 6; however, if the stored build value is 5 and you enter a build number of 10, then the next build number will be 10 and will increment from there.
- epic_compatible_versions lists the HPE Ezmeral Runtime Enterprise versions where this Catalog entry may be used. An asterisk (*) may be used in a version string as a wildcard.
- categories is a list of strings used by the HPE Ezmeral Runtime Enterprise interface to group Catalog entries during cluster creation. These values appear in the Select Cluster Type pull-down menu.
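The version and build-number rules above can be sketched in Python. The helper names here are illustrative, not part of any HPE Ezmeral Runtime Enterprise tooling, and the wildcard match is assumed to follow shell-style globbing:

```python
# Illustrative sketch of the version/build-number and compatibility rules
# described above; not actual App Workbench code.
import fnmatch
import re

def parse_version(version):
    """Split an "a.b.c" version string into ((a, b), c)."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if not m:
        raise ValueError(f"expected digits in a.b.c form, got {version!r}")
    a, b, c = (int(g) for g in m.groups())
    return (a, b), c

def next_build(stored_build, entered_build):
    """Next build number c for an unchanged a.b version."""
    if entered_build > stored_build:
        return entered_build      # a higher manual value takes effect
    return stored_build + 1       # otherwise the stored value auto-increments

def is_compatible(platform_version, epic_compatible_versions):
    """Match a platform version against entries that may contain * wildcards."""
    return any(fnmatch.fnmatch(platform_version, pat)
               for pat in epic_compatible_versions)
```

For example, `parse_version("2.0.1")` yields `((2, 0), 1)`, `next_build(5, 3)` yields `6` while `next_build(5, 10)` yields `10`, and `is_compatible("3.5", ["3.*"])` is true.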
Components
The components
blob appears as follows:
"image": { "checksum": "b07e8cfea8a9c1a6cdc6990b1da29b9f", "import_url": "http://s3.amazonaws.com/bluedata-vmimages/Cloudera-CDH-CM-5.4.3-v2.tgz" }, "setup_package": { "checksum": "7560c8841c1400e0e4a4ba3dac1ba8d7", "import_url": "http://s3.amazonaws.com/bluedata-vmimages/cdh5-cm-setup.tgz" },
In this blob:
- image is a property that identifies the location of the image used to launch virtual nodes for this Catalog entry. In HPE Ezmeral Runtime Enterprise (EPIC) versions 2.0 and above, this will be an image for launching a Docker container. This location can be specified in either of two ways:
  - import_url, which is the http (not https) URL from which the image can be downloaded. This must be accompanied by checksum, which is the MD5 checksum of the image. This method is used for normal Catalog entry distribution. The image will be downloaded into the images download cache directory when the entry is installed, and the downloaded image may be automatically deleted in certain garbage-collection situations when the Catalog entry is not in use and not present in any Catalog feed.
  - source_file, which names an image file already present in the local images directory (/opt/bluedata/catalog/images/). Only the file name is necessary, not the complete path. No checksum is provided in this case. This method is used for either development or site-local entries. In this case, HPE Ezmeral Runtime Enterprise will never automatically download the designated image file.
- setup_package is similar to the image property, except that it refers to the configuration-scripts package that runs inside the launched virtual node. In this case, the download cache directory is /opt/bluedata/catalog/guestconfig.
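As a rough illustration of how an import_url/checksum pair is meant to be used, the following sketch verifies a downloaded file against its MD5 checksum. The helper name is hypothetical:

```python
# Illustrative sketch: verify a downloaded image or setup package against the
# MD5 checksum declared in the Catalog entry JSON.
import hashlib

def verify_md5(path, expected_checksum):
    """Return True if the file at `path` has the expected MD5 checksum."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large image tarballs are not held in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_checksum.lower()
```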
Services
The services
blob appears as follows:
"services": [ { "id": "hbase_master", "exported_service": "hbase", "label": { "name": "HMaster" }, "endpoint" : { "url_scheme" : "http", "port" : "60010", "path" : "/", "is_dashboard" : true } }, { "id": "hbase_worker", "label": { "name": "HRegionServer" }, "endpoint" : { "url_scheme" : "http", "port" : "60030", "path" : "/", "is_dashboard" : true } }, { "id": "hbase_thrift", "label": { "name": "HBase Thrift service." } }, ... ],
In this example, services
is a list of service objects. The
defined services will be referenced by other elements of this JSON file to determine
which services are active on which nodes within the cluster. That information will
then be used to:
- Present clickable Dashboard links in the HPE Ezmeral Runtime Enterprise interface.
- Determine which dependent nodegroups (Add-On Images) can be attached to the cluster.
- Trigger NAT port mapping for the service, if appropriate.
- Optionally be referenced by the setup scripts that run within the virtual node.
Setup scripts also use service identifiers to register those services with vAgent, so that necessary services can be properly started and restarted along with the virtual node. Setup scripts can also choose to wait for a vAgent-registered service to be active on a node in order to coordinate multi-node setup across the cluster. A common example of such a registered service is sshd.
In this blob:
- id is an identifier that must be unique within the scope of this JSON file. It is used by other objects in this file to reference this service. It is also used in the setup scripts when composing a key for registering a service with vAgent, or when waiting on a registered service to start.
- exported_service is an optional property that has an agreed-by-convention value for a service that is referenced from outside the cluster. This property can have an optional qualifiers list of descriptive qualifiers for that exported service, again with agreed-by-convention values. qualifiers may only be defined if exported_service is defined.
- label uses the same format as the entry's label:
  - name, which briefly describes the service. This property is currently used only when composing clickable service-dashboard links in the HPE Ezmeral Runtime Enterprise interface; however, it is required for all services.
  - description, which is an optional property with more details.
- endpoint describes the network endpoint of the service. Its properties are:
  - auth_token, a Boolean indicating whether (true) or not (false) the endpoint requires an authentication token.
  - is_dashboard, a Boolean indicating whether this is a URL that can (and should) be viewed from a web browser, such as in the HPE Ezmeral Runtime Enterprise interface.
  - url_scheme, port, and path, which are used to compose a service URL. These properties have the following constraints: url_scheme must be defined if is_dashboard is true; port must be defined; path is optional.
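These constraints can be illustrated with a small sketch that composes a service URL from an endpoint object. The function is hypothetical and assumes the node's address is already known:

```python
# Illustrative sketch: compose a service URL from an `endpoint` object,
# enforcing the constraints on url_scheme, port, and path described above.
def service_url(endpoint, host):
    if endpoint.get("is_dashboard") and "url_scheme" not in endpoint:
        raise ValueError("url_scheme is required when is_dashboard is true")
    if "port" not in endpoint:
        raise ValueError("port is required")
    scheme = endpoint.get("url_scheme", "http")  # assumed fallback
    path = endpoint.get("path", "")              # path is optional
    return f"{scheme}://{host}:{endpoint['port']}{path}"
```

Applied to the hbase_master endpoint above with a node address of `node1`, this yields `http://node1:60010/`.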
- "
endpoint
object triggers the
creation of a NAT port mapping for this service, if HPE Ezmeral Runtime Enterprise is running inside an EC2 instance. Node Roles
The node_roles
blob appears as follows:
"node_roles": [ { "id": "controller", "cardinality": "1", "anti_affinity_group_id": "CM", "min_cores": "4", "min_memory": "12288" }, { "id": "standby", "cardinality": "1", "anti_affinity_group_id": "CM" }, { "id": "arbiter", "cardinality": "1", "anti_affinity_group_id": "CM" }, { "id": "worker", "cardinality": "1+" } ],
In this example, node_roles
is a list of objects describing roles
that may be deployed for this Catalog entry. Each role is a particular configuration
instantiated from the entry's virtual node image and configured by the setup
scripts. The configuration associated with a particular role is broadly left up to
the setup scripts, and thus varies widely from entry to entry; however, there are
certain constraints and semantics associated with specific roles in the current
HPE Ezmeral Runtime Enterprise release (for non-Add-On entries):
- The allowed roles are controller, worker, standby, and arbiter. If applicable, these roles will be created using the Master Node Flavor specified in the HPE Ezmeral Runtime Enterprise interface when you create the cluster.
- To support job submission to a cluster from the HPE Ezmeral Runtime Enterprise interface, the cluster must include a controller-role node. If the cluster also includes a standby-role node, then that standby will be tried as an alternate target for job submission if the Controller node is unresponsive.
- Worker-role nodes (if applicable) will be created using the Worker Node Flavor specified in the HPE Ezmeral Runtime Enterprise interface when you create the cluster.
- Only the worker role is allowed to have scale-out cardinality (see below); the worker role MUST have scale-out cardinality.
- The Worker Count in the cluster creation interface covers the total number of worker, standby, and arbiter nodes. Cluster expansion will increase the number of worker nodes.
The properties of each role object are:
- id is an identifier that must be unique within the scope of this JSON file. It is used by other objects in this file to reference this role. It is also used by HPE Ezmeral Runtime Enterprise as described above, and may also be referenced by the setup scripts.
- cardinality describes the number of nodes in this role that will be deployed, if/when this role is selected to be used in a cluster. If the cardinality string consists of just an integer, then a fixed number of nodes will be deployed for this role. If the cardinality string is an integer followed by +, then a variable number of nodes may be deployed in this role, with the integer as the minimum. This kind of value is referred to as a "scale-out" cardinality.
- anti_affinity_group_id, if it has a specified value, causes nodes deployed from this role and/or from any other role with the same anti_affinity_group_id to be placed on different physical hosts. If this constraint cannot be satisfied, then the cluster creation/expansion will be rejected. Anti-affinity is typically used to reduce the physical resources shared by a set of nodes, making it less likely for a single physical fault to affect them all. This constraint only applies to nodes within a given cluster; anti-affinity is not enforced among nodes from different clusters.
- min_cores is an optional property that specifies the minimum number of virtual cores that must be provided by the flavor used to deploy this role.
- min_memory is an optional property that specifies the minimum memory size that must be met by the flavor used to deploy this role.
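The cardinality and flavor-minimum rules can be sketched as follows. These helpers are illustrative only, not the platform's actual validation code:

```python
# Illustrative sketch of node_roles validation: cardinality strings and the
# optional min_cores / min_memory flavor constraints.
import re

def parse_cardinality(card):
    """Return (minimum, scale_out) for a cardinality string like "1" or "1+"."""
    m = re.fullmatch(r"(\d+)(\+?)", card)
    if not m:
        raise ValueError(f"bad cardinality: {card!r}")
    return int(m.group(1)), m.group(2) == "+"

def count_allowed(card, n):
    """Check whether deploying `n` nodes satisfies the cardinality string."""
    minimum, scale_out = parse_cardinality(card)
    return n >= minimum if scale_out else n == minimum

def flavor_ok(role, flavor):
    """Check a flavor (dict with cores/memory) against a role's minimums."""
    return (flavor["cores"] >= int(role.get("min_cores", 0)) and
            flavor["memory"] >= int(role.get("min_memory", 0)))
```

With the controller role from the example above, a flavor with 8 cores and 16384 memory passes, while one with 2 cores fails; `count_allowed("1+", 5)` is true because the worker role has scale-out cardinality.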
Configuration
The config blob appears as follows:
"config": {
    "selected_roles": [ ... ],
    "node_services": [ ... ],
    "config_meta": { ... },
    "config_choices": [ ... ]
},
The remainder of the JSON file describes which node roles will be deployed into the cluster, and which services will be present on any node with a given role. This information may depend on choices provided by the UI/API user when they are creating the cluster.
- selected_roles lists the IDs of roles that will be deployed.
- node_services lists the IDs of services that will be present on nodes of a given role, if that role is deployed.
- config_meta is a set of string key/value pairs that can be referenced by the setup scripts.
- config_choices lists both the choices available to the UI/API user and the possible selections for each choice. This is a potentially recursive data structure, in that a selection may include another config object, which in turn may contain selected_roles/node_services/config_meta/config_choices properties.
This structure means that the top-level selected_roles, node_services, and config_meta property values will apply regardless of any user-provided input about choice selections. User-provided input may then have consequences such as activating additional roles and/or services in the cluster, and/or adding more elements to the config_meta key-value store.
For example, in the CDH 5.4.3 JSON:
- There is a top-level mrtype choice, with valid selections mrv1 and yarn.
- If yarn is selected for the mrtype choice, then:
  - The controller and worker roles are selected for deployment.
  - The yarn_rm and job_history_server services are selected to be present on the controller role node.
  - The yarn_nm service is selected to be present on the worker role nodes.
  - The yarn_nm service is also selected to be present on the standby and arbiter role nodes.
  - The yarn_ha choice is enabled, with valid selections true and false. If true is selected for yarn_ha, then:
    - The controller, standby, arbiter, and worker roles must be defined.
    - The zookeeper service is selected to be present on the controller, standby, and arbiter role nodes.
    - The yarn_rm and hdfs_rm services are selected to be present on the standby role node.
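One way to picture how these nested config objects combine is a small recursive resolver. This is an illustrative sketch under the structure described above, not the platform's implementation:

```python
# Illustrative sketch: resolve a (possibly recursive) config object against a
# set of user choice selections, collecting roles, services, and config_meta.
def resolve_config(config, selections):
    """`selections` maps choice id -> selected value, e.g. {"mrtype": "yarn"}."""
    roles = set(config.get("selected_roles", []))
    services = {}
    meta = dict(config.get("config_meta", {}))
    for ns in config.get("node_services", []):
        services.setdefault(ns["role_id"], set()).update(ns["service_ids"])
    for choice in config.get("config_choices", []):
        chosen = selections.get(choice["id"])
        for sel in choice.get("selections", []):
            if sel["id"] == chosen and "config" in sel:
                # A matching selection activates its nested config object.
                sub_roles, sub_services, sub_meta = resolve_config(
                    sel["config"], selections)
                roles |= sub_roles
                for rid, svcs in sub_services.items():
                    services.setdefault(rid, set()).update(svcs)
                meta.update(sub_meta)  # real entries forbid key conflicts
    return roles, services, meta
```

For a miniature config mimicking the mrtype example, selecting yarn would activate the controller and worker roles plus the services attached to them by the nested config.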
Selected Roles
The selected_roles blob appears as follows:
"selected_roles": [ "controller", "standby", "arbiter", "worker" ],
The value of the selected_roles
property is a list of role IDs.
The example shown above is taken from the choice selection that activates HBase
support.
At the top level of this Catalog entry, the selected_roles property is an empty list; no roles at all will be activated unless the user provides some input (choice selections). This is a valid arrangement and reflects the fact that, for this Catalog entry, some choices must be made before any usable application framework can be provided in this cluster. By contrast, some other Catalog entries have roles and services that are always selected.
Node Services
The node_services
blob appears as follows:
"node_services": [ { "role_id": "controller", "service_ids": [ "ganglia", "ganglia_api", "ssh", "gmetad", "gmond", "httpd" ] }, { "role_id": "standby", "service_ids": [ "ssh", "gmond" ] }, { "role_id": "arbiter", "service_ids": [ "ssh", "gmond" ] }, { "role_id": "worker", "service_ids": [ "ssh", "gmond" ] } ],
Each element of this list is a node_services
object that describes
the services available on a given role. The role may or may not be selected; this
data structure simply indicates that if a certain role is selected (according to
choice selections), then these are the services a node with that role will provide.
The top-level node_services
in this example Catalog entry are all
of the ancillary services that don't depend on choices like HBase support or MR
type.
The properties of each node_services
object are:
- role_id references the value of the id property of a node_role object defined within this same Catalog entry JSON.
- service_ids is a list of id values of service objects defined within this same Catalog entry JSON.
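Because role_id and every entry in service_ids must reference objects defined elsewhere in the same file, a Catalog entry author may want a quick cross-reference check. The following sketch is illustrative only:

```python
# Illustrative sketch: check that every role_id and service id used in a
# Catalog entry's node_services is defined in node_roles / services.
def check_node_services(entry):
    role_ids = {r["id"] for r in entry.get("node_roles", [])}
    service_ids = {s["id"] for s in entry.get("services", [])}
    errors = []
    for ns in entry.get("config", {}).get("node_services", []):
        if ns["role_id"] not in role_ids:
            errors.append(f"unknown role_id: {ns['role_id']}")
        for sid in ns["service_ids"]:
            if sid not in service_ids:
                errors.append(f"unknown service id: {sid}")
    return errors
```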
Config Metadata
The config_meta blob appears as follows:
"config_meta": { "streaming_jar": "/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar", "impala_jar_version": "0.1-SNAPSHOT", "cdh_major_version": "CDH5", "cdh_full_version": "5.4.3", "cdh_parcel_version": "5.4.3-1.cdh5.4.3.p0.6", "cdh_parcel_repo": "http://archive.cloudera.com/cdh5/parcels/5.4.3" },
In this example, config_meta
is a key-value store. These values are
only used by the scripts in the guest package and are thus completely opaque to
HPE Ezmeral Runtime Enterprise. These values may be referenced during
node setup. For example, the streaming_jar
value is conventionally
referenced by the script that runs Hadoop Streaming jobs.
Choice selections may cause the definition of multiple config_meta
lists that together form the KV store visible to the in-guest scripts. To avoid
confusion, key conflicts are not allowed. For example, it is legal for mutually
exclusive choice selections to define different values for a key, but it is not
legal for the same key to be defined more than once when composing the KV store that
results from a particular set of choice selections.
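The no-conflict rule for composing the KV store can be sketched as follows (illustrative only):

```python
# Illustrative sketch: compose the config_meta KV store from the blobs
# activated by a set of choice selections, rejecting duplicate keys.
def merge_config_meta(blobs):
    merged = {}
    for blob in blobs:
        for key, value in blob.items():
            if key in merged:
                # The same key may not be defined twice for one set of
                # choice selections.
                raise ValueError(f"config_meta key defined twice: {key!r}")
            merged[key] = value
    return merged
```

Mutually exclusive selections never appear in the same blob list, so they may define the same key without triggering this error.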
Config Choices
The config_choices blob appears as follows:
"config_choices": [ { "id": "hbase", "type": "boolean", "label": { "name": "HBase" }, "selections": [ { "id": false }, { "id": true, "config": { ... } } ] }, { "id": "mrtype", "type": "multi", "label": { "name": "MR Type" }, "selections": [ { "id": "mrv1", "label": { "name": "MRv1" }, "config": { ... } }, { "id": "yarn", "label": { "name": "YARN" }, "preferred": true, "config": { "selected_roles": [ "controller", "worker" ], "node_services": [ ... ], "config_choices": [ { "id": "yarn_ha", "type": "boolean", "label": { "name": "YARN and HDFS High Availability" }, "selections": [ { "id": false }, { "id": true, "config": { ... } } ] } ], "config_choices":[ { "label": { "name": "CLouderaManagerServer" }, "type": "string", "id": "clouderamanager-server" } ] } } ] }, }
This blob lists the choices available to the API/UI user when creating a cluster.
Each choice has some number of valid selections (either Boolean or multiple-choice)
that can be provided to satisfy that choice. A given selection can then contain a
nested config
, as described previously.
In this example, one choice describes whether or not to activate HBase support. Another describes the choice between using MRv1 or YARN. If YARN is selected, then there is a further choice as to whether to activate cluster High Availability.
Each of these choices activates certain roles for deployment and selects certain services to be present on nodes of given roles.
This structure is fairly generic; however, HPE Ezmeral Runtime Enterprise constrains the choices to those currently defined among the various Catalog entries provided as part of the HPE Ezmeral Runtime Enterprise release. Please contact Hewlett Packard Enterprise support if you wish to define choices in a Catalog entry that you are authoring.
The properties of each choice object are:
- id is a choice identifier. It can be referenced by the setup scripts (which can see all choice selections made for cluster creation). Each selection object must contain an id property that is the selection value. The possible values for this property are limited to the set of choices present in the Catalog provided with the HPE Ezmeral Runtime Enterprise release.
- type describes the selection value type. This property may have one of the following values:
  - boolean: Selection values are either true or false. This selection type does not require a label.
  - multi: Selection values are a defined set of strings. This selection type must have a label object that describes the selection. This object includes a required name and an optional description, which will be used by future HPE Ezmeral Runtime Enterprise versions to drive various interface behaviors.
  - string: Alphanumeric characters.
- selections lists the valid selections for this choice. A selection may include an optional preferred property. If this is set to true, the HPE Ezmeral Runtime Enterprise interface will default to this selection value when presenting the choice. A selection may contain an optional nested config object that describes the configuration activated by the selection.
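As a final sketch, this is how an interface might resolve the default selection for a choice using the preferred flag. Falling back to the first listed selection is an assumption for illustration, not documented behavior:

```python
# Illustrative sketch: pick the default selection value for a choice object,
# honoring the optional `preferred` property described above.
def default_selection(choice):
    for sel in choice["selections"]:
        if sel.get("preferred"):
            return sel["id"]
    # Assumed fallback when no selection is marked preferred.
    return choice["selections"][0]["id"]
```

Applied to the mrtype choice above, this returns "yarn", since that selection carries `"preferred": true`.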