Spark

Describes how to identify and debug Spark issues.

Spark History Server

Long-running Spark applications exceed the disk quota for the Spark History Server.

Repeatedly running long-running Spark applications generates a large volume of event logs in the Spark History Server event log directory. When the logs exceed the disk quota, other Spark applications can fail. Monitor log sizes and manage disk space to mitigate this issue.

Workaround

To prevent the Spark History Server event log directory from exceeding its disk quota, modify the Spark History Server configuration options as follows (a consolidated example follows this list):

To periodically clean up event logs from storage, set:
spark.history.fs.cleaner.enabled true
To delete job log files older than the specified value, set:
spark.history.fs.cleaner.maxAge 1d
Here, job log files that are older than 1d are deleted by the filesystem history cleaner.
To specify how often the filesystem job history cleaner checks for files to delete, set:
spark.history.fs.cleaner.interval 12h
Here, the filesystem job history cleaner checks every 12h for files to be deleted.
To enable event log rolling based on size, set:
spark.eventLog.rolling.enabled true
Rolling is disabled by default.
To specify the maximum size of the event log file before it rolls over, set:
spark.eventLog.rolling.maxFileSize 128m
The default is 128 MB.
To specify the maximum number of non-compacted event log files to retain, set:
spark.history.fs.eventLog.rolling.maxFilesToRetain 2
By default, all event log files are retained. To compact older event logs, reduce the value. The minimum value accepted is 1.
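For reference, a minimal sketch of how these settings might look when grouped together, for example in the spark-defaults.conf read by the Spark History Server (where these properties are applied depends on your deployment; the values are the examples from the list above):

  # History Server side: periodically delete old event logs
  spark.history.fs.cleaner.enabled true
  spark.history.fs.cleaner.maxAge 1d
  spark.history.fs.cleaner.interval 12h
  # Application side: roll event logs by size so that they can be compacted
  spark.eventLog.rolling.enabled true
  spark.eventLog.rolling.maxFileSize 128m
  # History Server side: number of non-compacted rolled files to retain
  spark.history.fs.eventLog.rolling.maxFilesToRetain 2

Note that the spark.history.* properties are read by the Spark History Server, while the spark.eventLog.rolling.* properties take effect on the applications that write the event logs.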
NOTE
Compaction tries to exclude events that point to outdated data, such as the following events:
  • Events for finished jobs and the related stage/task events
  • Events for terminated executors
  • Events for finished SQL executions and the related job/stage/task events
Discarded events are not displayed in the Spark History Server UI.
NOTE
If the disk quota is full, contact HPE Support for assistance.

Spark Operator

Spark application submission hangs or fails.
If the Spark application submission hangs or fails, check the state of the submission pod (example commands follow this list).
  • If the pod is in the Pending state, wait for more resources to become available.
  • If the pod is in the Failed state, collect the pod logs and contact HPE Support.
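For example, assuming you have kubectl access to the cluster and that <namespace> and <submission-pod-name> are placeholders for your own values, you can inspect the submission pod like this:

  # Check the STATUS column for the submission pod
  kubectl get pods -n <namespace>
  # For a Pending or Failed pod, the Events section shows the reason
  kubectl describe pod <submission-pod-name> -n <namespace>
  # Save the pod logs, for example to attach to an HPE Support case
  kubectl logs <submission-pod-name> -n <namespace> > submission-pod.log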
Spark application hangs in the Submitted or Running state.

If the Spark application hangs in the Submitted or Running state, check the state of the driver pod (example commands follow this list).

  • If the driver pod is in the ContainerCreating state, check the pod events.
    • If the image is downloading, wait until the download completes.
    • If the reason is FailedMount, identify which volume is missing.
      • By default, all Spark workload submitter pods are preconfigured to mount system volumes such as the Spark History Server PVC, the user PVC, and the shared PVC.
      • If the problem is with a system volume, contact HPE Support.
  • If the driver pod is in the Running state, check whether the executor pods are also in the Running state. Executor pods are sometimes in the Pending state because of a lack of resources; in this case, wait for resources to become available.
  • For other reasons, collect the driver and executor pod logs and contact HPE Support.
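A minimal sketch of these checks, again assuming kubectl access and placeholder names; the spark-role label is the one Spark on Kubernetes applies by default, so adjust it if your deployment labels pods differently:

  # Review the driver pod events (image pull progress, FailedMount reasons, and so on)
  kubectl describe pod <driver-pod-name> -n <namespace>
  # Check whether the executor pods are Running or Pending
  kubectl get pods -n <namespace> -l spark-role=executor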
Spark application fails.
If the Spark application fails, collect the driver pod logs (see the sketch after this list).
  • If the container fails before running the application code, contact HPE Support, because there is a problem with the image.
  • If the container fails while the application is running, check the exception in the driver log:
    • For a functional exception (for example, NullPointerException), review the application source code.
    • For a non-functional exception (for example, OutOfMemoryError), increase the memory allocation for the driver pod, review the application source code, or both.
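For example, assuming kubectl access and placeholder names, you can save the driver log and, for an OutOfMemoryError, resubmit with more driver memory. spark.driver.memory is a standard Spark property; how you pass it depends on how you submit the application, and spark-submit below is only one option:

  # Save the driver log for analysis or for an HPE Support case
  kubectl logs <driver-pod-name> -n <namespace> > driver.log
  # Resubmit with a larger driver memory allocation (example value)
  spark-submit --conf spark.driver.memory=4g <rest of your submit command>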

Livy

Livy session creation fails.

If Livy session creation fails, create a Livy session with the default configuration and run it (a minimal example follows this list).

  • If it runs successfully, review the configuration of the failed Livy session for configuration issues.
  • If it fails, collect the Livy server pod logs and driver pod logs (if available) and contact HPE Support.
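As a minimal sketch, if your deployment exposes the Livy REST API directly, you can create a default session and poll its state as follows; the endpoint URL is a placeholder, and any authentication headers or TLS options depend on your deployment:

  # Create a PySpark session with no extra configuration
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"kind": "pyspark"}' \
    https://<livy-server>/sessions
  # Check the session state (a healthy session reaches "idle")
  curl -s https://<livy-server>/sessions/<session-id>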
Livy session hangs in the Starting state.

If the Livy session hangs in the Starting state, verify that the driver pod is not stuck in the Pending or ContainerCreating state.

Livy statement run hangs or fails.
If the Livy statement run hangs or fails (a sketch for retrieving statement details follows this list):
  • Analyze the error message and fix the statement. For detailed error information, go to the Livy Server UI.
  • Verify that the executor pods are not in the Pending state. Executor pods must be available for Livy statements to run.
  • For other reasons, collect driver and executor pod logs and contact HPE Support.
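For example, if your deployment exposes the Livy REST API directly, you can retrieve statement states and error output as follows (the endpoint URL, session ID, and statement ID are placeholders):

  # List the statements of a session, including their states
  curl -s https://<livy-server>/sessions/<session-id>/statements
  # Retrieve a single statement, including its output or error traceback
  curl -s https://<livy-server>/sessions/<session-id>/statements/<statement-id>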
Livy session disappears.

No action is required. This is expected behavior for idle sessions, which Livy removes automatically after the configured session timeout.

Executor pod logs are not available for interactive sessions.

When you create a Spark interactive session with spark.log.level set to INFO and then submit Livy statements, the executor pod logs are not available after the session completes.

Workaround:

To resolve this issue, follow these steps:

  1. Create a custom log4j properties file named custom_log4j.properties in the shared volume (local:///mounts/shared-volume/custom_log4j.properties).
  2. Copy the following content, taken from the log4j2.properties file of the Livy session driver pod, into the custom_log4j.properties file.
    #
    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements.  See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License.  You may obtain a copy of the License at
    #
    #    http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #
    
    # Set everything to be logged to the console
    rootLogger.level = WARN
    rootLogger.appenderRef.stdout.ref = console
    
    # In the pattern layout configuration below, we specify an explicit `%ex` conversion
    # pattern for logging Throwables. If this was omitted, then (by default) Log4J would
    # implicitly add an `%xEx` conversion pattern which logs stacktraces with additional
    # class packaging information. That extra information can sometimes add a substantial
    # performance overhead, so we disable it in our default logging config.
    # For more information, see SPARK-39361.
    appender.console.type = Console
    appender.console.name = console
    appender.console.target = SYSTEM_ERR
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
    
    # Set the default spark-shell/spark-sql log level to WARN. When running the
    # spark-shell/spark-sql, the log level for these classes is used to overwrite
    # the root logger's log level, so that the user can have different defaults
    # for the shell and regular Spark apps.
    logger.repl.name = org.apache.spark.repl.Main
    logger.repl.level = WARN
    
    logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
    logger.thriftserver.level = WARN
    
    # Settings to quiet third party logs that are too verbose
    logger.jetty1.name = org.sparkproject.jetty
    logger.jetty1.level = WARN
    logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
    logger.jetty2.level = WARN
    logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
    logger.replexprTyper.level = INFO
    logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
    logger.replSparkILoopInterpreter.level = INFO
    logger.parquet1.name = org.apache.parquet
    logger.parquet1.level = WARN
    logger.parquet2.name = parquet
    logger.parquet2.level = WARN
    
    # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
    logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
    logger.RetryingHMSHandler.level = ERROR
    logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
    logger.FunctionRegistry.level = ERROR
    logger.HiveConf.name = org.apache.hadoop.hive.conf.HiveConf
    logger.HiveConf.level = ERROR
    
    # SPARK-327: Settings to suppress the unnecessary warning message from MultiMechsAuthenticationHandler
    logger.MultiMechsAuthenticationHandler.name = org.apache.hadoop.security.authentication.server.MultiMechsAuthenticationHandler
    logger.MultiMechsAuthenticationHandler.level = ERROR
    logger.KerberosAuthHandler.name = org.apache.hadoop.security.authentication.server.KerberosAuthHandler
    logger.KerberosAuthHandler.level = ERROR
    
    # SPARK-575: Settings to suppress the unnecessary warning message from AuthenticationFilter
    logger.AuthenticationFilter.name = org.apache.hadoop.security.authentication.server.AuthenticationFilter
    logger.AuthenticationFilter.level = ERROR
    
    logger.NativeCodeLoader.name = org.apache.hadoop.util.NativeCodeLoader
    logger.NativeCodeLoader.level = ERROR
    logger.YarnClient.name = org.apache.spark.deploy.yarn.Client
    logger.YarnClient.level = ERROR
    logger.HiveUtils.name = org.apache.spark.sql.hive.HiveUtils
    logger.HiveUtils.level = ERROR
    logger.HiveMetastore.name = org.apache.hadoop.hive.metastore.HiveMetastore
    logger.HiveMetastore.level = ERROR
    logger.ObjectStore.name = org.apache.hadoop.hive.metastore.ObjectStore
    logger.ObjectStore.level = ERROR
    logger.SQLCompleter.name = org.apache.hive.beeline.SQLCompleter
    logger.SQLCompleter.level = ERROR
    
    # SPARK-945: Setting to suppress exception when non-cluster admin can not read ssl-server config
    logger.Configuration.name = org.apache.hadoop.conf.Configuration
    logger.Configuration.level = ERROR
    
    # Hide Spark netty rpc error when driver is finished
    logger.Dispatcher.name = org.apache.spark.rpc.netty.Dispatcher
    logger.Dispatcher.level = ERROR
    
    # For deploying Spark ThriftServer
    # SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
    appender.console.filter.1.type = RegexFilter
    appender.console.filter.1.regex = .*Thrift error occurred during processing of message.*
    appender.console.filter.1.onMatch = deny
    appender.console.filter.1.onMismatch = neutral
    # Hide fips specific properties initialization
    appender.console.filter.2.type = RegexFilter
    appender.console.filter.2.regex = .*org.bouncycastle.jsse.provider.PropertyUtils.*
    appender.console.filter.2.onMatch = deny
    appender.console.filter.2.onMismatch = neutral
  3. Add the following configuration to the custom_log4j.properties file.
    logger.SparkLogger.name = org.apache.spark
    logger.SparkLogger.level = INFO
  4. Create an interactive session, setting the Spark configuration key spark.executor.extraJavaOptions to the value -Dlog4j.configuration=file:/local:///mounts/shared-volume/custom_log4j.properties (a sketch of this request follows these steps). See Creating Interactive Sessions.
  5. Submit the Livy statements. See Submitting Statements.
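As a sketch, if you create the interactive session through the Livy REST API, the request might look like the following; the endpoint URL is a placeholder, and the conf value is the one from step 4:

  # Create an interactive session that points the executors at the custom log4j file
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{
          "kind": "pyspark",
          "conf": {
            "spark.executor.extraJavaOptions": "-Dlog4j.configuration=file:/local:///mounts/shared-volume/custom_log4j.properties"
          }
        }' \
    https://<livy-server>/sessions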

Result:

The logs for the executor pod are now available.