Configuring Spark Logs

After you run a few Spark jobs, you’ll realize that Spark spits out a lot of logging messages. At first, we found this too distracting, so we turned off all Spark logs. But that was too heavy-handed — we always wanted to see some of the log messages, and of course, when we needed to debug something, we wanted everything.

So, we settled on the following configuration for log4j:

# make a file appender and a console appender
# Print the date in ISO 8601 format
log4j.appender.myConsoleAppender=org.apache.log4j.ConsoleAppender
log4j.appender.myConsoleAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.myConsoleAppender.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
log4j.appender.myFileAppender=org.apache.log4j.RollingFileAppender
log4j.appender.myFileAppender.File=spark.log
log4j.appender.myFileAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.myFileAppender.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

# By default, everything goes to console and file
log4j.rootLogger=INFO, myConsoleAppender, myFileAppender

# The noisier spark logs go to file only
log4j.logger.spark.storage=INFO, myFileAppender
log4j.additivity.spark.storage=false
log4j.logger.spark.scheduler=INFO, myFileAppender
log4j.additivity.spark.scheduler=false
log4j.logger.spark.CacheTracker=INFO, myFileAppender
log4j.additivity.spark.CacheTracker=false
log4j.logger.spark.CacheTrackerActor=INFO, myFileAppender
log4j.additivity.spark.CacheTrackerActor=false
log4j.logger.spark.MapOutputTrackerActor=INFO, myFileAppender
log4j.additivity.spark.MapOutputTrackerActor=false
log4j.logger.spark.MapOutputTracker=INFO, myFileAppender
log4j.additivity.spark.MapOutputTracker=false

This configuration file sends all the logs (from Spark and everything else) to a logfile, spark.log. That way we have the full logs if we need to debug later on. In addition, everything goes to the console except for a few Spark loggers that are a little too verbose. The Spark messages we do keep on the console are worth having. In particular, starting with Spark 0.6, the logs include a message telling you when it's running a job, and exactly where in your code it was triggered:

2012-11-01 13:00:10,647 [main] INFO  spark.SparkContext - Starting job: aggregate at MyAwesomeSourceFile.scala:86
2012-11-01 13:00:25,169 [main] INFO  spark.SparkContext - Job finished: aggregate at MyAwesomeSourceFile.scala:86, took 14.521604475 s
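
For reference, a log line like that would come from a driver program along these lines (a minimal sketch against the Spark 0.6-era API; the object name, master string, and data are made up):

import spark.SparkContext

object MyAwesomeApp {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "MyAwesomeApp")
    val data = sc.parallelize(1 to 1000000)

    // The call site of this aggregate is exactly what shows up in the
    // "Starting job: aggregate at ..." message above.
    val sum = data.aggregate(0L)(_ + _, _ + _)
    println("sum = " + sum)
  }
}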

To make use of this configuration, save it into a file named log4j.properties. If that file is on your classpath, log4j will pick it up automatically. In particular, with Spark you can just drop it into the conf directory, since that directory is already on the classpath by default. Or, you can load it explicitly from code:

 import org.apache.log4j.PropertyConfigurator

 // propertyFile is the path to the saved log4j configuration file
 PropertyConfigurator.configure(propertyFile)
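
If you go the programmatic route, it helps to configure log4j before the SparkContext is created, so that Spark's startup messages already follow these rules. Here is a minimal sketch; the properties file path, master argument, and object name are just placeholders:

import org.apache.log4j.PropertyConfigurator
import spark.SparkContext

object MyAwesomeJob {
  def main(args: Array[String]) {
    // Load our log4j settings before any Spark classes start logging
    PropertyConfigurator.configure("conf/log4j.properties")

    val sc = new SparkContext(args(0), "MyAwesomeJob")
    // ... the rest of the job ...
  }
}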

We should add that even though we send most of the Spark logs only to the logfile, those messages are still well worth reading. They are full of useful information: how many partitions are being used, the size of the serialized tasks, which nodes they are running on, progress as individual tasks finish, and so on. They are invaluable for more detailed debugging and optimization. Sometimes we even watch the full logs while a job is running with tail -f spark.log.
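
When we really do want everything, log4j also lets us turn the verbosity up at runtime without editing the properties file. A minimal sketch (the logger name and level here are just one example):

import org.apache.log4j.{Level, Logger}

// Turn the top-level "spark" logger up to DEBUG; loggers given explicit
// levels in the properties file above keep their own settings.
Logger.getLogger("spark").setLevel(Level.DEBUG)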

There are a few log messages that only make it into the logfile that we really wish we could also see on the console. For example, the sub-stages of a job get logged by the DAGScheduler:

2012-10-31 13:54:00,757 [DAGScheduler] INFO  spark.scheduler.DAGScheduler - Submitting Stage 1 (map at MyAwesomeSourceFile.scala:57), which has no missing parents
2012-10-31 13:54:00,785 [DAGScheduler] INFO  spark.scheduler.DAGScheduler - Submitting 52 missing tasks from Stage 1
...
2012-10-31 13:58:45,215 [DAGScheduler] INFO  spark.scheduler.DAGScheduler - Stage 1 (map at MyAwesomeSourceFile.scala:57) finished; looking for newly runnable stages
2012-10-31 13:58:45,216 [DAGScheduler] INFO  spark.scheduler.DAGScheduler - running: Set(Stage 2)
2012-10-31 13:58:45,217 [DAGScheduler] INFO  spark.scheduler.DAGScheduler - waiting: Set(Stage 0)
2012-10-31 13:58:45,217 [DAGScheduler] INFO  spark.scheduler.DAGScheduler - failed: Set()

However, the DAGScheduler emits so many other log messages that we'd rather keep all of them in the logfile. Maybe in the future Spark will move these particular messages to another class, so they're easier to filter from the rest? :)

We know that there isn’t one “right way” to configure your logging, but we hope this is at least a useful starting point for others. Let us know if you come up with any useful variants of this, or even if you decide on a completely different setup.