Creating Spark Applications

Describes how to create and submit Spark applications using HPE Ezmeral Unified Analytics Software.

Prerequisites

  • Must have a main application file (for example, a compiled JAR file for Java or Scala applications).
  • Must know the runtime dependencies of your application that are not built into the main application file.
  • Must know your application arguments.

Create a Spark Application

Complete the following steps to create and submit a Spark application:
  1. Sign in to HPE Ezmeral Unified Analytics Software.
  2. In the left navigation bar, select one of the following options:
    • Click the Analytics icon and then click Spark Applications.
    • Click the Tools & Frameworks icon, navigate to the Spark Operator tile on the Analytics tab, and click Open.
  3. Click Create Application on the Spark Applications screen. Navigate through each step within the Create Spark Application wizard.

    The following list describes each step in the wizard:

    Application Details
    Create an application or upload a preconfigured YAML file.
    • YAML FILE - When you select Upload YAML, you can upload a preconfigured YAML file from your local system. Click Select File to upload the YAML file. The fields in the wizard are populated with the information from the YAML file.
    • Name - Enter the application name.
    • Description - Enter the application description.
    Configure Spark Application
    Configure the Spark application:
    • Type - Select the application type from Java, Scala, Python, or R.
    • Source - Select the location of the main application file from User Directory, Shared Directory, S3, or Other. HPE Ezmeral Unified Analytics Software preconfigures Spark applications and Livy sessions so that both the <username> and shared volumes are mounted to the driver and executor runtimes. For additional details, see the Selecting the Location of the Main Application File section below.
    • File Name - Manually enter the location and file name of the application for the S3 and Other sources. For example:
      s3a://apps/my_application.jar
      For User Directory and Shared Directory, click Browse, and then browse to and select the file.
      NOTE
      Ensure the extension of the main application file matches the selected application type. The extension must be .py for Python, .jar for Java and Scala, and .r for R applications.
    • Class Name - Enter the main class of the application for Java or Scala applications.
    • Arguments - Click + Add Argument to add input parameters as required by the application.
    NOTE
    • To refer to data in mounted folders from application source code, use the file:// scheme.
      If a Spark application reads a file from the shared or user volume and takes the path to that file as an application argument, the argument has the form file://[mount-path]/path/to/input/file. For example:
      User Directory: file:///mounts/<user-name>-volume/
      Shared Directory: file:///mounts/shared-volume/
    Dependencies

    To add dependencies required to run your applications, select a dependency type from excludePackages, files, jars, packages, pyfiles, or repositories, and enter the value of the dependency. To add more than one dependency, click Add Dependency.

    For example:
    • Enter the package names as the values for the excludePackages dependency type.
    • Enter the locations of files, for example, s3://<path-to-file> or local://<path-to-file>, as the values for files, jars, pyfiles, or repositories.
    Driver Configuration
    Configure the number of cores, the core limit, and memory. The number of cores must be less than or equal to the core limit. See Configuring Memory for Spark Applications.
    When fields in this step are left blank, the following default values are used:
    • Number of Cores: 1
    • Core Limit: unlimited
    • Memory: 1g
    Executor Configuration
    Configure the number of executors, the number of cores, the core limit, and memory. The number of cores must be less than or equal to the core limit. See Configuring Memory for Spark Applications.
    When fields in this step are left blank, the following default values are used:
    • Number of Executors: 1
    • Number of Cores per Executor: 1
    • Core Limit per Executor: unlimited
    • Memory per Executor: 1g
    Schedule Application
    To schedule a Spark application to run at a certain time, toggle Schedule to Run. You can configure the frequency interval and set the concurrency policy, successful run history limit, and failed run history limit. Set the Frequency Interval in one of two ways:
    1. To choose from predefined intervals, select Predefined Frequency Interval and click Update to open a dialog with the predefined intervals.
    2. To set your own interval, select Custom Frequency Interval. The Frequency Interval field accepts any of the following values:
      • A CRON expression with the following fields:
        • Field 1: minute (0–59)
        • Field 2: hour (0–23)
        • Field 3: day of the month (1–31)
        • Field 4: month (1–12, JAN - DEC)
        • Field 5: day of the week (0–6, SUN - SAT)
        • Example: 0 1 1 * *, 02 02 ? * WED, THU
      • Predefined macro
        • @yearly
        • @monthly
        • @weekly
        • @daily
        • @hourly
      • Interval using @every <duration>
        • Units: nanosecond (ns), microsecond (us, µs), millisecond (ms), second (s), minute (m), and hour (h).
        • Example: @every 1h, @every 1h30m10s
    Review
    Review the application details. To change the application configuration, click the pencil icon in a section to navigate to the corresponding step. To edit the application configuration as YAML in the GUI, click Edit YAML; you can use the editor to add configuration options that are not available through the wizard. To apply the changes, click Save Changes. To cancel the changes, click Discard Changes. For a sample application manifest, see the example after this procedure.
  4. To submit the application, click Create Spark Application on the bottom right of the Review step.
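
For reference, the following is a minimal sketch of the kind of manifest the wizard builds and that you can upload as a YAML file or modify through Edit YAML. It assumes the standard Spark Operator SparkApplication fields; the name, image, file locations, class name, and argument paths shown here are placeholders, and the exact values that HPE Ezmeral Unified Analytics Software generates depend on your deployment.

  # Hypothetical example only; adjust names, image, and paths for your environment.
  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: my-spark-app                    # Name from the Application Details step
  spec:
    type: Scala                           # Type: Java, Scala, Python, or R
    mode: cluster
    sparkVersion: 3.2.0                   # Placeholder version
    image: <spark-image>                  # Placeholder; normally set by the wizard
    mainClass: com.example.MyApp          # Class Name (Java and Scala applications only)
    mainApplicationFile: s3a://apps/my_application.jar    # File Name
    arguments:
      - file:///mounts/shared-volume/input/data.csv       # Argument referring to the shared volume
    deps:                                 # Dependencies step
      jars:
        - s3a://apps/libs/extra.jar
    driver:                               # Driver Configuration (defaults shown)
      cores: 1
      memory: 1g
    executor:                             # Executor Configuration (defaults shown)
      instances: 1
      cores: 1
      memory: 1g

If you enable Schedule to Run, the Spark Operator typically represents the application as a ScheduledSparkApplication that wraps a template like the one above and adds schedule, concurrencyPolicy, successfulRunHistoryLimit, and failedRunHistoryLimit fields corresponding to the Schedule Application settings.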

Results:

The Spark application is created and either runs immediately or waits to run at its scheduled time. You can view it on the Spark Applications screen.

Selecting the Location of the Main Application File

Use one of the following methods to select the location of the main application file:

Uploading Files to the User and Shared Directories

To upload files to the user and shared directories:
  1. Open HPE Ezmeral Unified Analytics Software in a different browser tab or window.
  2. In the left navigation bar, select Data Engineering → Data Sources and then select the Data Volumes tab.
  3. On the Data Volumes tab, select your user directory or the shared directory.

    The following image shows an example of a user (bob) directory and a shared directory:

    If you do not see your user directory or the shared directory, contact your administrator.

  4. Click Upload to upload the Spark application files to the user/ or shared/ directory.
  5. Return to the browser you were working in with the Configure Spark Application wizard.
  6. Click Browse and navigate to the location where you uploaded the files.
  7. Select the Spark application files.

Using S3

When you select S3 as the Source, the S3 Endpoint, Secret, and File Name fields appear. Enter values in these fields as follows:
• S3 Endpoint
• Secret
• File Name - Enter the location and name of the Spark application file.
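
The S3 Endpoint and Secret that you supply ultimately reach Spark as S3 connection settings. The wizard manages this for you, so the following is only a hedged illustration of how equivalent hand-written settings might look in the spec section of a SparkApplication manifest, using the standard Hadoop S3A property names; the endpoint URL and key values are placeholders.

  # Fragment of a SparkApplication spec; illustrative only.
  spec:
    sparkConf:
      spark.hadoop.fs.s3a.endpoint: https://s3.example.com:9000   # Placeholder for the S3 Endpoint value
      spark.hadoop.fs.s3a.access.key: <access-key>                # In practice, supplied through the Secret
      spark.hadoop.fs.s3a.secret.key: <secret-key>                # rather than written in plain text
    mainApplicationFile: s3a://apps/my_application.jar            # File Name field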

Using Other

Select Other as the Source to reference other locations of the application file.

For example, to refer to a main application file or dependency files that are located inside the Spark image, use the local:// scheme.

local:///opt/mapr/spark/spark-3.2.0/examples/jars/spark-examples_2.12-3.2.0.16-eep-810.jar
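
In a manifest, such an in-image path goes directly into the main application file field, and dependencies can use the same scheme. The following is a minimal sketch reusing the example JAR above; the mainClass shown is the standard SparkPi example class and is illustrative only.

  # Fragment of a SparkApplication spec; illustrative only.
  spec:
    mainApplicationFile: local:///opt/mapr/spark/spark-3.2.0/examples/jars/spark-examples_2.12-3.2.0.16-eep-810.jar
    mainClass: org.apache.spark.examples.SparkPi   # Class Name for this example JAR
    deps:
      jars:
        - local://<path-to-file>                   # Dependency JARs can also use the local:// scheme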