Spark
Describes how to identify and debug issues for Spark.
Spark History Server
- Long-running Spark applications exceed disk quotas for Spark History Server.
-
Repeatedly running long-running Spark applications generates a large volume of logs in the Spark History Server event log directory. This can exceed disk quotas, causing failures in other Spark applications. You must monitor log sizes and manage disk space to mitigate this issue.
Workaround
To prevent exceeding the disk quota for the Spark History Server event log directory, modify the Spark History Server configuration options as follows (a consolidated example follows the note below):
To periodically clean up event logs from storage, set:
spark.history.fs.cleaner.enabled true
To delete job log files older than the specified value, set:
spark.history.fs.cleaner.maxAge 1d
Here, job log files that are older than 1d are deleted by the filesystem history cleaner.
To specify the frequency for the filesystem job history cleaner to check for files to be deleted, set:
spark.history.fs.cleaner.interval 12h
Here, the filesystem job history cleaner checks every 12h for files to be deleted.
To enable event log rolling based on size, set:
spark.eventLog.rolling.enabled true
This option is deactivated by default.
To specify the maximum size of the event log file before it rolls over, set:
spark.eventLog.rolling.maxFileSize 128m
The default is 128 MB.
To specify the maximum number of non-compacted event log files to retain, set:
spark.history.fs.eventLog.rolling.maxFilesToRetain 2
By default, all event log files are retained. To compact older event logs, reduce the value. The minimum accepted value is 1.
NOTE: Compaction tries to exclude events that point to outdated event log files, such as the following:
- Events for finished jobs and related stage/task events
- Events for terminated executors
- Events for finished SQL executions and related job/stage/task events
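Taken together, the options above form a small configuration fragment. The following is a minimal sketch in spark-defaults.conf style, using the example values from this section; where exactly you apply these properties depends on how Spark and the Spark History Server are deployed in your environment:
# Filesystem history cleaner
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.maxAge 1d
spark.history.fs.cleaner.interval 12h
# Event log rolling and compaction
spark.eventLog.rolling.enabled true
spark.eventLog.rolling.maxFileSize 128m
spark.history.fs.eventLog.rolling.maxFilesToRetain 2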
Spark Operator
- Spark application submission hangs or fails.
-
If the Spark application submission hangs or fails, check the state of the submission pod.
- If the pod is in the Pending state, wait for more resources to become available.
- If the pod is in the Failed state, collect the pod logs and contact HPE Support.
- Spark application hangs in the Submitted or Running state.
-
If the Spark application hangs in the Submitted or Running state, check the state of the driver pod.
- If the driver pod is in the ContainerCreating state, check the pod events.
  - If the image is downloading, wait until the image is downloaded.
  - For the FailedMount reason, identify which volume is missing.
    - By default, all Spark workload submitter pods are preconfigured to mount system volumes such as the Spark History Server PVC, the user PVC, and the shared PVC.
    - If the problem is with a system volume, contact HPE Support.
- If the driver pod is in the Running state, check whether the executor pods are in the Running state as well. Executor pods are sometimes in the Pending state due to a lack of resources; in that case, wait for the resources to become available.
- For other reasons, collect the driver and executor pod logs and contact HPE Support.
- Spark application fails.
-
If the Spark application fails, collect the driver pod logs.
- If the container fails before running the application code, contact HPE Support as there is a problem with the image.
- If the container fails while the application is running, check the exception in the driver log:
  - For a functional exception (for example, NullPointerException), review the application source code.
  - For a non-functional exception (for example, OutOfMemoryError), increase the memory allocation for the driver pod and/or review the application source code; see the sketch after this list.
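A minimal sketch of the non-functional case, assuming the driver memory is tuned through the standard Spark property spark.driver.memory; the 4g value is purely illustrative, not a recommendation:
# Illustrative only: raise the driver memory when the driver log shows OutOfMemoryError
spark.driver.memory 4g
Set this property in the Spark application configuration used for submission, then resubmit the application.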
Livy
- Livy session creation fails.
-
If the Livy session fails, create a Livy session with the default configuration and run it.
- If it runs successfully, check your failed Livy session for configuration issues.
- If it fails, collect the Livy server pod logs and driver pod logs (if available) and contact HPE Support.
- Livy session hangs in the Starting state.
-
Verify that the driver pod is not in the Pending or ContainerCreating state.
- Livy statement run hangs or fails.
-
If the Livy statement run hangs or fails:
- Analyze the error message and fix the statement. For detailed error message information, go to the Livy Server UI.
- Verify that the executor pods are not in the Pending state; executor pods must be available for Livy statements to run.
- For other reasons, collect driver and executor pod logs and contact HPE Support.
- Livy session disappears.
-
No action is required; this is expected behavior for idle sessions.
- Executor pod logs are not available for the interactive sessions.
-
When you create a Spark interactive session by setting spark.log.level as the key and INFO as the value and then submit Livy statements, the executor pod logs are not available after the session completes.
Workaround:
To resolve this issue, follow these steps:
- Create a custom log4j properties file named custom_log4j.properties in the shared volume (local:///mounts/shared-volume/custom_log4j.properties).
- Copy the following content of the log4j2.properties file of the driver pod of the Livy session to the custom_log4j.properties file:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
rootLogger.level = WARN
rootLogger.appenderRef.stdout.ref = console

# In the pattern layout configuration below, we specify an explicit `%ex` conversion
# pattern for logging Throwables. If this was omitted, then (by default) Log4J would
# implicitly add an `%xEx` conversion pattern which logs stacktraces with additional
# class packaging information. That extra information can sometimes add a substantial
# performance overhead, so we disable it in our default logging config.
# For more information, see SPARK-39361.
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex

# Set the default spark-shell/spark-sql log level to WARN. When running the
# spark-shell/spark-sql, the log level for these classes is used to overwrite
# the root logger's log level, so that the user can have different defaults
# for the shell and regular Spark apps.
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = WARN

logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
logger.thriftserver.level = WARN

# Settings to quiet third party logs that are too verbose
logger.jetty1.name = org.sparkproject.jetty
logger.jetty1.level = WARN
logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
logger.jetty2.level = WARN
logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.replexprTyper.level = INFO
logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.replSparkILoopInterpreter.level = INFO
logger.parquet1.name = org.apache.parquet
logger.parquet1.level = WARN
logger.parquet2.name = parquet
logger.parquet2.level = WARN

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.RetryingHMSHandler.level = ERROR
logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.FunctionRegistry.level = ERROR
logger.HiveConf.name = org.apache.hadoop.hive.conf.HiveConf
logger.HiveConf.level = ERROR

# SPARK-327: Settings to suppress the unnecessary warning message from MultiMechsAuthenticationHandler
logger.MultiMechsAuthenticationHandler.name = org.apache.hadoop.security.authentication.server.MultiMechsAuthenticationHandler
logger.MultiMechsAuthenticationHandler.level = ERROR
logger.KerberosAuthHandler.name = org.apache.hadoop.security.authentication.server.KerberosAuthHandler
logger.KerberosAuthHandler.level = ERROR

# SPARK-575: Settings to suppress the unnecessary warning message from AuthenticationFilter
logger.AuthenticationFilter.name = org.apache.hadoop.security.authentication.server.AuthenticationFilter
logger.AuthenticationFilter.level = ERROR
logger.NativeCodeLoader.name = org.apache.hadoop.util.NativeCodeLoader
logger.NativeCodeLoader.level = ERROR
logger.YarnClient.name = org.apache.spark.deploy.yarn.Client
logger.YarnClient.level = ERROR
logger.HiveUtils.name = org.apache.spark.sql.hive.HiveUtils
logger.HiveUtils.level = ERROR
logger.HiveMetastore.name = org.apache.hadoop.hive.metastore.HiveMetastore
logger.HiveMetastore.level = ERROR
logger.ObjectStore.name = org.apache.hadoop.hive.metastore.ObjectStore
logger.ObjectStore.level = ERROR
logger.SQLCompleter.name = org.apache.hive.beeline.SQLCompleter
logger.SQLCompleter.level = ERROR

# SPARK-945: Setting to suppress exception when non-cluster admin can not read ssl-server config
logger.Configuration.name = org.apache.hadoop.conf.Configuration
logger.Configuration.level = ERROR

# Hide Spark netty rpc error when driver is finished
logger.Dispatcher.name = org.apache.spark.rpc.netty.Dispatcher
logger.Dispatcher.level = ERROR

# For deploying Spark ThriftServer
# SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
appender.console.filter.1.type = RegexFilter
appender.console.filter.1.regex = .*Thrift error occurred during processing of message.*
# Hide fips specific properties initialization
appender.console.filter.1.regex = .*org.bouncycastle.jsse.provider.PropertyUtils.*
appender.console.filter.1.onMatch = deny
appender.console.filter.1.onMismatch = neutral
- Set the following configurations in the custom_log4j.properties file:
  logger.SparkLogger.name = org.apache.spark
  logger.SparkLogger.level = INFO
- Create an interactive session by setting the Spark configuration with spark.executor.extraJavaOptions as the key and -Dlog4j.configuration=file:/local:///mounts/shared-volume/custom_log4j.properties as the value. See Creating Interactive Sessions.
- Submit the Livy statements. See Submitting Statements.
Result:
The logs for the executor pod are now available.
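For reference, a minimal sketch of the resulting session setting in Spark property form, using the key and value from the steps above (the log4j configuration path is the shared-volume file created in the first step):
spark.executor.extraJavaOptions -Dlog4j.configuration=file:/local:///mounts/shared-volume/custom_log4j.properties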