Artificial Intelligence and ML/DL Workloads

Enterprises are increasingly turning to AI to solve complex problems, conduct research, and maintain or boost their competitive advantages in the marketplace. AI and machine learning (ML)/deep learning (DL) technologies have moved into the mainstream with a broad range of data-driven enterprise applications: credit card fraud detection, stock market prediction for financial trading, credit risk modeling for insurance, genomics and precision medicine, disease detection and diagnosis, natural language processing (NLP) for customer service, autonomous driving and connected car IoT use cases, and more.

A typical distributed ML/DL workflow may look something like this:

  1. The model is conceptualized.
  2. The model is built in one or more sandbox/custom environment(s) that require access to data and model storage.
  3. Subsequent versions are created that may add libraries and/or features and that require rerunning the model.
  4. The model is saved and deployed, and any API endpoints are published.
  5. Measurements to determine model efficacy occur both in real time and in batch feedback loops. This feedback is used to continue conceptualizing the model.
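
To make steps 2 through 5 concrete, the following is a minimal Python sketch using scikit-learn purely as an illustrative framework; the file path, version tag, and feedback step are assumptions, and a real deployment would publish the saved model behind an API endpoint:

```python
# A minimal, self-contained sketch of steps 2-5 above, using scikit-learn
# as a stand-in for whichever ML/DL framework a team actually chooses.
# The artifact path and version tag are illustrative assumptions.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 2: build the model in a sandbox environment with access to data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 3: subsequent versions rerun training; tagging artifacts by version
# keeps runs comparable.
MODEL_VERSION = "v2"

# Step 4: save the model for deployment behind an API endpoint.
joblib.dump(model, f"model-{MODEL_VERSION}.joblib")

# Step 5: batch feedback loop -- measure efficacy on held-out data and feed
# the result back into the next round of conceptualization.
print(f"{MODEL_VERSION} accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```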

Needs

Enterprises wanting to deploy distributed ML/DL infrastructures typically have some or all of the following needs:

  • Role-based access control to some or all of the following:
    • ML/DL tools, such as TensorFlow, H2O, MXNet, BigDL for Spark, Caffe, and Spark MLlib.
    • Common “big” and “small” data frameworks, such as Kafka, HDFS, HBase, Spark, model storage, and workflow management.
    • Data science notebooks, such as Jupyter, RStudio, and Zeppelin.
    • Various related analytics, business intelligence (BI), and ETL tools.
  • Choice of modeling techniques.
  • Ability to build, share, and iterate.
  • Reproducibility (see the sketch following this list).
  • Easy scaling for testing on actual data sets.
  • Support for varying roles and actions.
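
Reproducibility in particular depends on controlling every source of randomness and recording the exact environment a run used. A minimal sketch, assuming Python with NumPy as the numerical library; the seed value is an arbitrary illustration:

```python
# A minimal reproducibility sketch: seed every random number generator the
# pipeline touches and record the library versions the run depended on.
# The SEED value and choice of libraries are illustrative assumptions.
import random
import sys

import numpy as np

SEED = 42
random.seed(SEED)      # Python's built-in RNG
np.random.seed(SEED)   # NumPy's global RNG

# Recording exact versions lets a later run reconstruct the environment.
print(f"python={sys.version.split()[0]} numpy={np.__version__} seed={SEED}")
```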

Challenges

Some of the key challenges enterprises face when looking to build, deploy, and operationalize their ML/DL pipelines to meet the needs described above include:

  • Traditional analytics tools were built to process structured data in databases. AI use cases built on ML/DL tools instead require a large and continuous flow of data that is typically unstructured.
  • Data scientists and developers may have built and designed their initial ML/DL algorithms to operate in a single-node environment (e.g., on a laptop, virtual machine, or cloud instance) but need to parallelize execution in a multi-node distributed environment, as sketched in the example following this list.
  • Enterprises cannot meet their AI use case requirements using the data processing capabilities and algorithms of a single ML/DL tool. They need to use data preparation techniques and models from multiple open source and/or commercial tools.
  • Data science teams are increasingly working in more collaborative environments where the workflow for building distributed ML/DL pipelines spans multiple different domain experts.
  • Many ML/DL deployments use hardware acceleration such as GPUs to improve processing capabilities. These are expensive resources, and this technology can add to the complexity of the overall stack.
  • ML/DL technologies and frameworks are different from existing enterprise systems and traditional data processing frameworks.
  • ML/DL stacks are complex because they require multiple software and infrastructure components, along with version compatibility and integration across those components.
  • Assembling all of the required systems and software is time consuming, and most organizations lack the skills to deploy and wire together all of these components.
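To illustrate the single-node-to-distributed challenge, one common pattern is to wrap model construction in a distribution strategy scope while leaving the training call itself unchanged. The sketch below assumes TensorFlow 2.x and uses a placeholder model and dataset; in a real multi-node run, each worker would also need a TF_CONFIG environment variable describing the cluster:

```python
# A hedged sketch of moving a single-node Keras model under a distribution
# strategy, assuming TensorFlow 2.x. Without TF_CONFIG set, this falls back
# to a single worker; the model and data here are placeholders.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Everything that creates model state moves inside the strategy scope;
# the training call itself is unchanged from the single-node version.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.rand(1024, 20).astype("float32")
y = (np.random.rand(1024) > 0.5).astype("float32")
model.fit(X, y, epochs=1, batch_size=64)
```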

The HPE Ezmeral Runtime Enterprise Solution for ML/DL

HPE Ezmeral Runtime Enterprise goes beyond Application Support by leveraging the inherent infrastructure portability and flexibility of containers to support distributed AI for both ML and DL use cases. The separation of compute and storage for Big Data and ML/DL workloads is one of the key concepts behind this flexibility, because organizations can deploy multiple containerized compute clusters for different workflows (e.g. Spark, Kafka, or TensorFlow) while sharing access to a common data lake. This also enables hybrid and multi-cloud HPE Ezmeral Runtime Enterprise deployments, with the ability to mix and match on- and/or off-premises compute and storage resources to suit each workload. Further, compute resources can be quickly and easily scaled and optimized independent of data storage, thereby increasing flexibility and improving resource utilization while eliminating data duplication and reducing cost.
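To illustrate the decoupling idea, the sketch below shows two independent jobs, each of which could run on its own containerized compute cluster, sharing one data lake without duplicating data. The bucket name and paths are hypothetical assumptions, and reading s3:// URLs with pandas presumes the s3fs package is installed and credentials are configured:

```python
# A minimal sketch of compute/storage separation: an ETL job and a training
# job run on separate compute clusters but read the same shared data lake.
# The bucket and paths are hypothetical; s3:// access assumes s3fs.
import pandas as pd

DATA_LAKE = "s3://example-shared-data-lake"  # hypothetical shared store

def etl_job():
    # Runs on one containerized compute cluster.
    raw = pd.read_parquet(f"{DATA_LAKE}/raw/transactions.parquet")
    raw.dropna().to_parquet(f"{DATA_LAKE}/curated/transactions.parquet")

def training_job():
    # Runs on a different compute cluster, scaled independently, reading
    # the curated output without keeping a private copy of the data.
    features = pd.read_parquet(f"{DATA_LAKE}/curated/transactions.parquet")
    print(f"training on {len(features)} rows")
```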

Some of the key ML/DL features and benefits that HPE Ezmeral Runtime Enterprise provides include:

  • Container-based automation: HPE Ezmeral Runtime Enterprise creates virtual clusters that each contain one or more containers. Containers are now widely recognized as a fundamental building block to simplify and automate deployments of complex application environments, with portability across on-premises infrastructure and public cloud services.
  • Deployment of ML/DL workloads: HPE Ezmeral Runtime Enterprise can be used to deploy distributed ML/DL environments such as TensorFlow, Caffe2, H2O, BigDL, and Spark MLlib. This allows organizations embarking on AI initiatives to quickly spin up multi-node ML/DL sandbox environments for their data science teams. If an existing data lake is available, they can also easily and securely tap into it to build and deploy their ML/DL pipelines.
  • Rapid and reproducible provisioning: Users can spin up new, fully-provisioned distributed ML/DL applications in multi-node containerized environments on any infrastructure, whether on-premises or in the cloud, using CPUs, GPUs, or both. These fully-configured environments can be created in minutes via either RESTful APIs or a few mouse clicks in the HPE Ezmeral Runtime Enterprise web interface (see the sketch following this list). IT teams can ensure enterprise-grade security, data protection, and performance with elasticity, flexibility, and scalability in a multi-tenant architecture. The template feature allows organizations to preserve specific cluster configurations for reuse at any time with just a few mouse clicks. HPE Ezmeral Runtime Enterprise publishes service endpoint lists for each virtual node/cluster.
  • Decoupling of compute and storage resources: As described above, the separation of compute from storage allows organizations to reduce costs by scaling these infrastructure resources independently while leveraging their existing storage investments in file, block, and object storage to extend beyond their petabyte-scale HDFS clusters. HPE Ezmeral Runtime Enterprise allows secure integrations with distributed file systems including HDFS, NFS, and S3 for storing data and ML/DL models, including pass-through security from the compute clusters.
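
As a concrete illustration of the RESTful provisioning path mentioned above, the following is a minimal Python sketch. The host, endpoint path, payload fields, and token handling are hypothetical placeholders rather than the product's documented API; consult the HPE Ezmeral Runtime Enterprise API reference for the actual calls:

```python
# A hedged sketch of programmatic cluster provisioning over a REST API.
# The endpoint, payload fields, and auth header are hypothetical
# illustrations, not the documented HPE Ezmeral Runtime Enterprise API.
import requests

CONTROLLER = "https://ezmeral-controller.example.com"  # hypothetical host
TOKEN = "session-token-from-login"                     # hypothetical auth

cluster_spec = {
    "name": "tf-sandbox",
    "image": "tensorflow",   # assumed catalog image name
    "worker_count": 4,
    "gpus_per_worker": 1,
}

resp = requests.post(
    f"{CONTROLLER}/api/v1/clusters",   # hypothetical endpoint
    json=cluster_spec,
    headers={"X-Auth-Token": TOKEN},
    timeout=30,
)
resp.raise_for_status()
print("provisioned:", resp.json())
```

In practice, the template feature described above plays the same role as the cluster_spec dictionary here: a saved, reusable cluster configuration that can be reapplied at any time.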