Creating Spark Applications
Describes how to create and submit Spark applications using HPE Ezmeral Unified Analytics Software.
Prerequisites
- Must have a main application file (for example, a compiled JAR file for Java or Scala).
- Must know the runtime dependencies of your application that are not built into the main application file.
- Must know your application arguments.
Create a Spark Application
- Sign in to HPE Ezmeral Unified Analytics Software.
- In the left navigation bar, select one of the following options:
- Click the Analytics icon and click Spark Applications on the left navigation bar of the HPE Ezmeral Unified Analytics Software screen.
- Click the Tools & Frameworks icon on the left navigation bar. Navigate to the Spark Operator tile under the Analytics tab and click Open.
- Click Create Application on the Spark Applications screen. Navigate through each step of the Create Spark Application wizard.
The following list describes each step in the wizard and the corresponding instructions:
Application Details
Create an application or upload a preconfigured YAML file.
- YAML File - When you select Upload YAML, you can upload a preconfigured YAML file from your local system. Click Select File to upload the YAML file. The fields in the wizard are populated with the information from the YAML file.
- Name - Enter the application name.
- Description - Enter the application description.
Configure Spark Application
Configure the Spark application:
- Type - Select the application type: Java, Scala, Python, or R.
- Source - Select the location of the main application file from User Directory, Shared Directory, S3, or Other. HPE Ezmeral Unified Analytics Software preconfigures Spark applications and Livy sessions so that both the <username> and shared volumes are mounted to the driver and executor runtimes. For additional details, see the Selecting the Location of the Main Application File section below.
- File Name - Manually enter the location and file name of the application for the S3 and Other sources. For example: s3a://apps/my_application.jar
For User Directory and Shared Directory, click Browse, and then browse to and select the files.
NOTE: Ensure that the extension of the main application file matches the selected application type. The extension must be .py for Python, .jar for Java and Scala, and .r for R applications.
- Class Name - Enter the main class of the application for Java or Scala applications.
- Arguments - Click + Add Argument to add input parameters as required by the application.
NOTE: To refer to data in mounted folders from application source code, use the file:// schema. If a Spark application reads a file from the shared or user volume and takes the path to that file as an application argument, the argument takes the form file://[mount-path]/path/to/input/file. For example:
User Directory: file:///mounts/<user-name>-volume/
Shared Directory: file:///mounts/shared-volume/
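As an illustration of how these settings fit together, the following is a rough, hypothetical sketch of the source and argument fields in a Spark Operator application definition. The field names type, mainClass, mainApplicationFile, and arguments come from the Spark Operator SparkApplication spec; the class name, file name, and paths are placeholders, not values from this documentation.
type: Scala
mainClass: com.example.WordCount                                              # hypothetical main class
mainApplicationFile: "file:///mounts/shared-volume/apps/my_application.jar"   # main application file on the shared volume
arguments:
  - "file:///mounts/shared-volume/data/input.txt"                             # input file passed as an application argument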
Dependencies
To add the dependencies required to run your application, select a dependency type from excludePackages, files, jars, packages, pyfiles, or repositories, and enter the value of the dependency. To add more than one dependency, click Add Dependency.
For example:
- Enter package names as the values for the excludePackages dependency type.
- Enter the locations of files, for example, s3://<path-to-file> or local://<path-to-file>, as the values for the files, jars, pyfiles, or repositories dependency types.
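In the application YAML, these dependency types typically correspond to the Spark Operator deps block. A minimal sketch, assuming that mapping; all locations and package coordinates below are illustrative.
deps:
  jars:
    - "s3a://libs/extra-library.jar"              # illustrative jar location
  packages:
    - "org.apache.spark:spark-avro_2.12:3.2.0"    # illustrative Maven coordinate
  excludePackages:
    - "com.example:unwanted-package"              # illustrative package to exclude
  repositories:
    - "https://repo.example.com/maven2"           # illustrative additional repository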
Driver Configuration
Configure the number of cores, core limits, and memory. The number of cores must be less than or equal to the core limit. See Configuring Memory for Spark Applications. When boxes in this wizard are left blank, the default values are set. The default values are as follows:
- Number of Cores: 1
- Core Limit: unlimited
- Memory: 1g
Executor Configuration
Configure the number of executors, number of cores, core limits, and memory. The number of cores must be less than or equal to the core limit. See Configuring Memory for Spark Applications. When boxes in this wizard are left blank, the default values are set. The default values are as follows:
- Number of Executors: 1
- Number of Cores per Executor: 1
- Core Limit per Executor: unlimited
- Memory per Executor: 1g
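Assuming the wizard emits the standard Spark Operator driver and executor blocks, the defaults above might look roughly like the following sketch (coreLimit is omitted here because the default is unlimited):
driver:
  cores: 1
  memory: "1g"
executor:
  instances: 1
  cores: 1
  memory: "1g"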
Schedule Application
To schedule a Spark application to run at a certain time, toggle Schedule to Run. You can configure the frequency interval and set the concurrency policy, successful run history limit, and failed run history limit. Set the Frequency Interval in one of two ways:
- To choose from predefined intervals, select Predefined Frequency Interval and click Update to open a dialog with predefined intervals.
- To set a custom frequency interval, select Custom Frequency Interval. The Frequency Interval accepts any of the following values:
  - CRON expression with five fields:
    - Field 1: minute (0–59)
    - Field 2: hour (0–23)
    - Field 3: day of the month (1–31)
    - Field 4: month (1–12, JAN - DEC)
    - Field 5: day of the week (0–6, SUN - SAT)
    - Example: 0 1 1 * * or 02 02 ? * WED,THU
  - Predefined macro: @yearly, @monthly, @weekly, @daily, or @hourly
  - Interval using @every <duration>:
    - Units: nanosecond (ns), microsecond (us, µs), millisecond (ms), second (s), minute (m), and hour (h)
    - Example: @every 1h or @every 1h30m10s
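Scheduled applications are typically represented by the Spark Operator ScheduledSparkApplication resource. The following is a hedged sketch of how the settings in this step (frequency interval, concurrency policy, and run history limits) might map to that resource; the name and values are illustrative, and the YAML produced by the wizard can differ.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: my-scheduled-app                 # illustrative name
spec:
  schedule: "@every 1h30m"               # CRON expression, predefined macro, or @every interval
  concurrencyPolicy: Allow
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 1
  template:
    type: Python                         # remaining fields match a regular SparkApplication spec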
Review
Review the application details. Click the pencil icon in each section to navigate to the specific step and change the application configuration. To open an editor and change the application configuration using YAML in the GUI, click Edit YAML. You can use the editor to add extra configuration options that are not available through the application wizard. To apply the changes, click Save Changes. To cancel the changes, click Discard Changes.
To submit the application, click Create Spark Application on the bottom right of the Review step.
Results:
The Spark application is created and will immediately run, or will wait to run at its scheduled time. You can view it on the Spark Applications screen.
Selecting the Location of the Main Application File
Use one of the following methods to select the location of the main application file:
Uploading Files to the User and Shared Directories
To upload files to the user and shared directories:
- Open HPE Ezmeral Unified Analytics Software in a different browser window or tab.
- In the left navigation bar, select Data Engineering → Data Sources and then select the Data Volumes tab.
- On the Data Volumes tab, select your user directory or the shared directory. For example, a user named bob sees a bob directory alongside the shared directory. If you do not see your user directory or the shared directory, contact your administrator.
- Click Upload to upload the Spark application files to the user/ or shared/ directory.
- Return to the browser you were working in with the Configure Spark Application wizard.
- Click Browse, and navigate to the location where you uploaded the files.
- Select the Spark application files.
Using S3
- S3 Endpoint - Enter an S3 endpoint for direct access to an external S3 data source, or enter an S3 endpoint for access through the S3 proxy in Unified Analytics, as described in Getting the Data Source Name and S3 Proxy Endpoint URL. For an explanation of direct access versus S3 proxy access, see Configuring a Spark Application to Access External S3 Object Storage.
- Secret
  - For HPE-Curated Spark and Spark OSS images, you can enter access-token in the Secret field. For information about the access-token secret, see Auth Tokens.
  - For HPE-Curated Spark images, you can leave the Secret field empty if you include spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider" in the Spark configuration (see the sketch at the end of this section).
  - Alternatively, you can generate a secret, as described in Configuring a Spark Application to Directly Access Data in an External S3 Data Source and Configuring a Spark Application to Access Data in an External S3 Data Source through the S3 Proxy Layer.
- File Name - Enter the location and name of the Spark application file.
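When you leave the Secret field empty for an HPE-Curated Spark image, the credentials provider mentioned above can be supplied through the Spark configuration, for example in the Edit YAML editor. The following is a sketch that assumes the standard sparkConf block; only the credentials provider line comes from this documentation, and the endpoint and path-style entries are illustrative assumptions that depend on your object store.
sparkConf:
  spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.spark.s3a.EzSparkAWSCredentialProvider"
  spark.hadoop.fs.s3a.endpoint: "https://s3.example.com:9000"    # illustrative external S3 endpoint
  spark.hadoop.fs.s3a.path.style.access: "true"                  # often required for non-AWS object stores; verify for yours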
Using Other
Select Other as the data source, to reference other locations of the application file.
For example, to refer to main application files and dependency files, or to refer to
a file inside the specific Spark image, use the local://
schema.
local:///opt/mapr/spark/spark-3.2.0/examples/jars/spark-examples_2.12-3.2.0.16-eep-810.jar
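In the application YAML, such a path would typically appear as the value of the main application file field, for example (a sketch reusing the path above; mainApplicationFile is the Spark Operator field name):
mainApplicationFile: "local:///opt/mapr/spark/spark-3.2.0/examples/jars/spark-examples_2.12-3.2.0.16-eep-810.jar"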