Application Support
HPE Ezmeral Runtime Enterprise provides powerful support for Artificial Intelligence (AI) and Machine Learning (ML) applications (see Artificial Intelligence and ML/DL Workloads for additional information). It also includes pre-configured, ready-to-run versions of major Hadoop distributions, such as Cloudera (CDH), Hortonworks (HDP), and MapR (CDP), as well as recent versions of Spark standalone, Kafka, and Cassandra. Other distributions, services, commercial applications, and custom applications can be easily added to an HPE Ezmeral Runtime Enterprise deployment, as described in App Store. Some of the Big Data, AI, and ML application services that are supported out-of-the-box include:
- CAFFE2: CAFFE2 is a deep learning framework, based on CAFFE (Convolutional Architecture for Fast Feature Embedding), that has been merged into PyTorch.
- Cloudera Manager (for CDH): Cloudera Manager provides a real-time view of CDH clusters, including the nodes and services running, in a single console. It also includes a full range of reporting and diagnostic tools to help optimize performance and utilization.
- Flume: Flume-NG is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of server log data. It is robust and fault tolerant with many failover and recovery mechanisms. It uses a simple extensible data model that allows one to build online analytic applications.
- HBase: HBase is a distributed, column-oriented data store that provides random, real-time read/write access to very large data tables (billions of rows and millions of columns) on a Hadoop cluster. It is modeled after Google's BigTable system.
- MapReduce: MapReduce assigns segments of an overall job to each Worker, and then reduces the results from each back into a single unified set (see the word-count sketch after this list).
- Sqoop: Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores, such as relational databases. It facilitates importing data from a relational database, such as MySQL or Oracle DB, into a distributed filesystem like HDFS, transforming the data with Hadoop MapReduce, and then exporting the result back into an RDBMS.
- GraphX (for Spark): GraphX works seamlessly with graphs and collections by combining Extract/Transform/Load (ETL), exploratory analysis, and iterative graph computation within a single system. The Pregel API allows you to write custom iterative graph algorithms.
- Hive: Hive facilitates querying and managing large amounts of data stored on distributed storage. This application provides a means for applying structure to this data and then running queries using the HiveQL language. HiveQL is similar to SQL.
- JupyterHub: JupyterHub is a multi-user server that provides a dedicated single-user Jupyter Notebook server for each user in a group.
- Kafka: Kafka allows a single cluster to act as a centralized data repository that can be expanded with zero downtime. It partitions and spreads data streams across a cluster of machines to deliver data streams beyond the capability of any single machine (see the producer/consumer sketch after this list).
- Kubeflow: A platform for developing and deploying ML systems; it is the ML toolkit for Kubernetes.
- MLlib: MLlib is Spark's scalable machine learning library that contains common learning algorithms, utilities, and underlying optimization primitives (see the pipeline sketch after this list).
- Oozie: Oozie is a workflow scheduler system for managing Hadoop jobs that specializes in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs.
- Pig: Pig is a language developed by Yahoo that allows for data flow and transformation operations on a Hadoop cluster.
- PyTorch: Open-source ML library based on the Torch library and used for applications such as computer vision and natural language processing.
- Spark SQL: Spark SQL is a Spark module designed for processing structured data. It includes the DataFrames programming abstraction and can also act as a distributed SQL query engine. This module can also read data from an existing Hive installation (see the DataFrame and SQL sketch after this list).
- SparkR: SparkR is an R package which provides a lightweight front end for using Spark from R.
- Spark Streaming: Spark Streaming is an extension of the core Spark API that enables fast, scalable, and fault-tolerant processing of live data streams (see the streaming word-count sketch after this list).
- TensorFlow: Open-source framework for running ML, deep learning, and other statistical and predictive analytics workloads.
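For MapReduce, the following is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer exchange tab-separated key/value pairs over standard input and output. The script name and invocation are illustrative assumptions, not part of any HPE Ezmeral Runtime Enterprise configuration.

```python
#!/usr/bin/env python3
# wordcount_streaming.py -- minimal Hadoop Streaming style word count (illustrative).
# Used as both the mapper ("map" argument) and the reducer ("reduce" argument).
import sys

def mapper():
    # Emit "<word>\t1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```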
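For Kafka, the sketch below produces and consumes JSON messages using the open-source kafka-python client; the broker address (broker:9092) and topic name (events) are placeholder assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

# Produce JSON-encoded messages to the "events" topic.
producer = KafkaProducer(
    bootstrap_servers="broker:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "alice", "action": "login"})
producer.flush()

# Consume the same topic from the beginning of its retained log.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="broker:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```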
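For MLlib, here is a small pipeline sketch that fits a logistic regression model with the DataFrame-based spark.ml API; the column names and toy training rows are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 1.3, 1), (0.1, 1.2, 0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector, then fit the model.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("label", "prediction").show()
spark.stop()
```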
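For Spark SQL, a short DataFrame and SQL sketch follows; the input path /data/events.json is a placeholder, and enableHiveSupport() is included only to show how the session can read from an existing Hive installation.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets the session read tables from an existing Hive metastore.
spark = (SparkSession.builder
         .appName("spark-sql-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Load semi-structured data into a DataFrame (path is a placeholder).
events = spark.read.json("/data/events.json")

# Register the DataFrame as a temporary view and query it with SQL.
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT user, COUNT(*) AS actions
    FROM events
    GROUP BY user
    ORDER BY actions DESC
""").show()

spark.stop()
```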
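For Spark Streaming, the streaming word-count sketch below reads lines from a TCP socket using the DStream API; the host and port are placeholders (for example, text fed by `nc -lk 9999`).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Read lines from a TCP socket (host and port are placeholders).
lines = ssc.socketTextStream("localhost", 9999)

# Count words within each micro-batch and print the result.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```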