Spark

Describes how to identify and debug Spark issues.

Spark History Server

Long-running Spark applications exceed the disk quota for the Spark History Server.

Repeatedly running long-running Spark applications generates a large volume of event logs in the Spark History Server event log directory. When the logs exceed the disk quota, other Spark applications can fail. Monitor log sizes and manage disk space to mitigate this issue.

Workaround

To prevent the Spark History Server event log directory from exceeding its disk quota, modify the Spark History Server configuration options as follows (a consolidated example follows this list):

To periodically clean up event logs from storage, set:
spark.history.fs.cleaner.enabled true
To delete job log files older than the specified value, set:
spark.history.fs.cleaner.maxAge 1d
Here, job log files that are older than 1d are deleted by the filesystem history cleaner.
To specify how often the filesystem job history cleaner checks for files to delete, set:
spark.history.fs.cleaner.interval 12h
Here, the filesystem job history cleaner checks every 12h for files to be deleted.
To enable event log rolling based on size, set:
spark.eventLog.rolling.enabled true
Rolling is disabled by default.
To specify the maximum size of the event log file before it rolls over, set:
spark.eventLog.rolling.maxFileSize 128m
The default is 128 MB.
To specify the maximum number of non-compacted event log files to retain, set:
spark.history.fs.eventLog.rolling.maxFilesToRetain 2
By default, all event log files are retained. To compact older event logs, reduce the value. The minimum value accepted is 1.
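For reference, a minimal sketch of how these settings might look when grouped together, for example in the spark-defaults.conf read by the Spark History Server (where these properties are applied depends on your deployment; the values are the examples from the list above):

  # History Server side: periodically delete old event logs
  spark.history.fs.cleaner.enabled true
  spark.history.fs.cleaner.maxAge 1d
  spark.history.fs.cleaner.interval 12h
  # Application side: roll event logs by size so that they can be compacted
  spark.eventLog.rolling.enabled true
  spark.eventLog.rolling.maxFileSize 128m
  # History Server side: number of non-compacted rolled files to retain
  spark.history.fs.eventLog.rolling.maxFilesToRetain 2

Note that the spark.history.* properties are read by the Spark History Server, while the spark.eventLog.rolling.* properties take effect on the applications that write the event logs.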
NOTE
Compaction tries to exclude events that point to outdated data, such as the following events:
  • Events for finished jobs and the related stage/task events
  • Events for terminated executors
  • Events for finished SQL executions and the related job/stage/task events
Discarded events are not displayed in the Spark History Server UI.
NOTE
If the disk quota is full, contact HPE Support for assistance.

Spark Operator

Spark application submission hangs or fails.
If the Spark application submission hangs or fails, check the state of the submission pod (example commands follow this list).
  • If the pod is in the Pending state, wait for more resources to become available.
  • If the pod is in the Failed state, collect the pod logs and contact HPE Support.
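For example, assuming you have kubectl access to the cluster and that <namespace> and <submission-pod-name> are placeholders for your own values, you can inspect the submission pod like this:

  # Check the STATUS column for the submission pod
  kubectl get pods -n <namespace>
  # For a Pending or Failed pod, the Events section shows the reason
  kubectl describe pod <submission-pod-name> -n <namespace>
  # Save the pod logs, for example to attach to an HPE Support case
  kubectl logs <submission-pod-name> -n <namespace> > submission-pod.log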
Spark application hangs in the Submitted or Running state.

If the Spark application hangs in the Submitted or Running state, check the state of the driver pod (example commands follow this list).

  • If the driver pod is in the ContainerCreating state, check the pod events.
    • If the image is downloading, wait until the download completes.
    • If the reason is FailedMount, identify which volume is missing.
      • By default, all Spark workload submitter pods are preconfigured to mount system volumes such as the Spark History Server PVC, the user PVC, and the shared PVC.
      • If the problem is with a system volume, contact HPE Support.
  • If the driver pod is in the Running state, check whether the executor pods are also in the Running state. Executor pods are sometimes in the Pending state because of a lack of resources; in this case, wait for resources to become available.
  • For other reasons, collect the driver and executor pod logs and contact HPE Support.
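A minimal sketch of these checks, again assuming kubectl access and placeholder names; the spark-role label is the one Spark on Kubernetes applies by default, so adjust it if your deployment labels pods differently:

  # Review the driver pod events (image pull progress, FailedMount reasons, and so on)
  kubectl describe pod <driver-pod-name> -n <namespace>
  # Check whether the executor pods are Running or Pending
  kubectl get pods -n <namespace> -l spark-role=executor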
Spark application fails.
If the Spark application fails, collect the driver pod logs (see the sketch after this list).
  • If the container fails before running the application code, contact HPE Support, because there is a problem with the image.
  • If the container fails while the application is running, check the exception in the driver log:
    • For a functional exception (for example, NullPointerException), review the application source code.
    • For a non-functional exception (for example, OutOfMemoryError), increase the memory allocation for the driver pod, review the application source code, or both.
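For example, assuming kubectl access and placeholder names, you can save the driver log and, for an OutOfMemoryError, resubmit with more driver memory. spark.driver.memory is a standard Spark property; how you pass it depends on how you submit the application, and spark-submit below is only one option:

  # Save the driver log for analysis or for an HPE Support case
  kubectl logs <driver-pod-name> -n <namespace> > driver.log
  # Resubmit with a larger driver memory allocation (example value)
  spark-submit --conf spark.driver.memory=4g <rest of your submit command>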

Livy

Livy session creation fails.

If Livy session creation fails, create a Livy session with the default configuration and run it (a minimal example follows this list).

  • If it runs successfully, review the configuration of the failed Livy session for configuration issues.
  • If it fails, collect the Livy server pod logs and driver pod logs (if available) and contact HPE Support.
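As a minimal sketch, if your deployment exposes the Livy REST API directly, you can create a default session and poll its state as follows; the endpoint URL is a placeholder, and any authentication headers or TLS options depend on your deployment:

  # Create a PySpark session with no extra configuration
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"kind": "pyspark"}' \
    https://<livy-server>/sessions
  # Check the session state (a healthy session reaches "idle")
  curl -s https://<livy-server>/sessions/<session-id>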
Livy session hangs in the Starting state.

If the Livy session hangs in the Starting state, verify that the driver pod is not stuck in the Pending or ContainerCreating state.

Livy statement run hangs or fails.
If the Livy statement run hangs or fails (a sketch for retrieving statement details follows this list):
  • Analyze the error message and fix the statement. For detailed error information, go to the Livy Server UI.
  • Verify that the executor pods are not in the Pending state. Executor pods must be available for Livy statements to run.
  • For other reasons, collect driver and executor pod logs and contact HPE Support.
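For example, if your deployment exposes the Livy REST API directly, you can retrieve statement states and error output as follows (the endpoint URL, session ID, and statement ID are placeholders):

  # List the statements of a session, including their states
  curl -s https://<livy-server>/sessions/<session-id>/statements
  # Retrieve a single statement, including its output or error traceback
  curl -s https://<livy-server>/sessions/<session-id>/statements/<statement-id>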
Livy session disappears.

No action is required. This is expected behavior for idle sessions, which Livy removes automatically after the configured session timeout.

Executor pod logs are not available for interactive sessions.

When you create a Spark interactive session with spark.log.level set to INFO and then submit Livy statements, the executor pod logs are not available after the session completes.

Workaround:

To resolve this issue, follow these steps:

  1. Create a custom log4j properties file named custom_log4j.properties in the shared volume (local:///mounts/shared-volume/custom_log4j.properties).
  2. Copy the following content, taken from the log4j2.properties file of the Livy session driver pod, into the custom_log4j.properties file.
    #
    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements.  See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License.  You may obtain a copy of the License at
    #
    #    http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #
    
    # Set everything to be logged to the console
    rootLogger.level = WARN
    rootLogger.appenderRef.stdout.ref = console
    
    # In the pattern layout configuration below, we specify an explicit `%ex` conversion
    # pattern for logging Throwables. If this was omitted, then (by default) Log4J would
    # implicitly add an `%xEx` conversion pattern which logs stacktraces with additional
    # class packaging information. That extra information can sometimes add a substantial
    # performance overhead, so we disable it in our default logging config.
    # For more information, see SPARK-39361.
    appender.console.type = Console
    appender.console.name = console
    appender.console.target = SYSTEM_ERR
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
    
    # Set the default spark-shell/spark-sql log level to WARN. When running the
    # spark-shell/spark-sql, the log level for these classes is used to overwrite
    # the root logger's log level, so that the user can have different defaults
    # for the shell and regular Spark apps.
    logger.repl.name = org.apache.spark.repl.Main
    logger.repl.level = WARN
    
    logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
    logger.thriftserver.level = WARN
    
    # Settings to quiet third party logs that are too verbose
    logger.jetty1.name = org.sparkproject.jetty
    logger.jetty1.level = WARN
    logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
    logger.jetty2.level = WARN
    logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
    logger.replexprTyper.level = INFO
    logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
    logger.replSparkILoopInterpreter.level = INFO
    logger.parquet1.name = org.apache.parquet
    logger.parquet1.level = WARN
    logger.parquet2.name = parquet
    logger.parquet2.level = WARN
    
    # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
    logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
    logger.RetryingHMSHandler.level = ERROR
    logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
    logger.FunctionRegistry.level = ERROR
    logger.HiveConf.name = org.apache.hadoop.hive.conf.HiveConf
    logger.HiveConf.level = ERROR
    
    # SPARK-327: Settings to suppress the unnecessary warning message from MultiMechsAuthenticationHandler
    logger.MultiMechsAuthenticationHandler.name = org.apache.hadoop.security.authentication.server.MultiMechsAuthenticationHandler
    logger.MultiMechsAuthenticationHandler.level = ERROR
    logger.KerberosAuthHandler.name = org.apache.hadoop.security.authentication.server.KerberosAuthHandler
    logger.KerberosAuthHandler.level = ERROR
    
    # SPARK-575: Settings to suppress the unnecessary warning message from AuthenticationFilter
    logger.AuthenticationFilter.name = org.apache.hadoop.security.authentication.server.AuthenticationFilter
    logger.AuthenticationFilter.level = ERROR
    
    logger.NativeCodeLoader.name = org.apache.hadoop.util.NativeCodeLoader
    logger.NativeCodeLoader.level = ERROR
    logger.YarnClient.name = org.apache.spark.deploy.yarn.Client
    logger.YarnClient.level = ERROR
    logger.HiveUtils.name = org.apache.spark.sql.hive.HiveUtils
    logger.HiveUtils.level = ERROR
    logger.HiveMetastore.name = org.apache.hadoop.hive.metastore.HiveMetastore
    logger.HiveMetastore.level = ERROR
    logger.ObjectStore.name = org.apache.hadoop.hive.metastore.ObjectStore
    logger.ObjectStore.level = ERROR
    logger.SQLCompleter.name = org.apache.hive.beeline.SQLCompleter
    logger.SQLCompleter.level = ERROR
    
    # SPARK-945: Setting to suppress exception when non-cluster admin can not read ssl-server config
    logger.Configuration.name = org.apache.hadoop.conf.Configuration
    logger.Configuration.level = ERROR
    
    # Hide Spark netty rpc error when driver is finished
    logger.Dispatcher.name = org.apache.spark.rpc.netty.Dispatcher
    logger.Dispatcher.level = ERROR
    
    # For deploying Spark ThriftServer
    # SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
    appender.console.filter.1.type = RegexFilter
    appender.console.filter.1.regex = .*Thrift error occurred during processing of message.*
    appender.console.filter.1.onMatch = deny
    appender.console.filter.1.onMismatch = neutral
    # Hide fips specific properties initialization
    appender.console.filter.2.type = RegexFilter
    appender.console.filter.2.regex = .*org.bouncycastle.jsse.provider.PropertyUtils.*
    appender.console.filter.2.onMatch = deny
    appender.console.filter.2.onMismatch = neutral
  3. Add the following configuration to the custom_log4j.properties file.
    logger.SparkLogger.name = org.apache.spark
    logger.SparkLogger.level = INFO
  4. Create an interactive session, setting the Spark configuration key spark.executor.extraJavaOptions to the value -Dlog4j.configuration=file:/local:///mounts/shared-volume/custom_log4j.properties (a sketch of this request follows these steps). See Creating Interactive Sessions.
  5. Submit the Livy statements. See Submitting Statements.
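As a sketch, if you create the interactive session through the Livy REST API, the request might look like the following; the endpoint URL is a placeholder, and the conf value is the one from step 4:

  # Create an interactive session that points the executors at the custom log4j file
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{
          "kind": "pyspark",
          "conf": {
            "spark.executor.extraJavaOptions": "-Dlog4j.configuration=file:/local:///mounts/shared-volume/custom_log4j.properties"
          }
        }' \
    https://<livy-server>/sessions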

Result:

The logs for the executor pod are now available.