Robust Software

Tales of a code samurai

Logging

If there was one thing that I learnt in 2012 that I would want to convey to all the developers I know, it would be this:

Logging is about so much more than failures

I don’t know if it was just my experience but little to no emphasis was put on logging aside from handling exceptions. In the .NET world this boiled down to adding ELMAH to your project and then forgetting about it, with Rails it meant having nothing but the logs that you got out of the box. I’ve found that if that’s all you have, you’re really missing out.

Benefits of good logging

Bug investigations

I think that every system I’ve ever worked on has been distributed, at least in some form. It may have been as simple as scheduled processes working alongside a web site, or it may have been a truly distributed, message-driven system. Distributed systems can be hard to monitor and bugs can be particularly tricky to track down. The complex interactions dependent upon the state of different components can be a nightmare to reproduce.

Good logging virtually solves this problem. Given sufficient detail in your logs, you no longer have to theorise or piece together how something happened. Your logs will lay it out for you, plain as day.

Here’s a contrived example to demonstrate where better logging can help you:

1
2
3
4
5
6
7
8
9
10
11
12
13
module Background

  def self.execute(user)
    if user.privileged? or transient_condition?(user)
      something_dangerous(user)
    end
  end

  def self.transient_condition?(user)
    user.first_name.start_with?(random_character)
  end

end

transient_condition? is the most contrived part but imagine you’ve received a bug report a day later and multiple things have changed on their account and the entire system that transient_condition? references since that time. You’ll end up with the same effect.

So, why did something_dangerous get executed for our friend Bob yesterday? Where do you start? Check the database for their current state? How do you know they haven’t updated their profile since? Perhaps you use event sourcing, how easy is it for you to roll back everything in the system that could have had an effect to that point in time? What was that point in time?

You can see where I’m going so I won’t labour the point. It probably will, however, be laborious to investigate this bug report. This is where I would have been a year ago, piecing together transient bits of information and responding with sentences containing “probably”.

Here’s how I would write that same code today:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
module Background
  include Hatchet

  def self.execute(user)
    log.info "Entering execute(user:##{user})"

    do_something_dangerous = false

    if user.privileged?
      log.info "##{user} is privileged"
      do_something_dangerous = true
    elsif transient_condition?(user)
      log.info "##{user} satisfied transient condition"
      do_something_dangerous = true
    end

    if do_something_dangerous
      log.info "Doing something dangerous for ##{user}"
      something_dangerous(user)
    else
      log.info "Not doing something dangerous for ##{user}"
    end

    log.info "Exiting execute(user:##{user})"
  end

  def self.transient_condition?(user)
    log.info "Entering transient_condition?(user:##{user})"
    character = random_character
    log.info "Checking if first_name:##{user.first_name} starts with character:#{character}"
    result = user.first_name.start_with?(random_character)
    log.info "Exiting transient_condition?(user:##{user}) - return:#{result}"
    result
  end

end

Yes, this is much longer but it has the exact same logic. We’re capturing the context of each branch of logic so that we may be able to understand what happened after the fact. For me, this is the difference between clean code and production-ready code.

Now let’s go back to our bug report. Why did something_dangerous get executed for Bob yesterday? Let’s find out with a little grep-fu:

$ egrep -A 6 "Entering execute\(user:Bob" yesterday.log

13:53:01 INFO Background - Entering execute(user:Bob)
13:53:01 INFO Background - Entering transient_condition?(user:Bob)
13:53:01 INFO Background - Checking if first_name:Bob starts with character:B
13:53:01 INFO Background - Exiting transient_condition?(user:Bob) - return:true
13:53:01 INFO Background - Bob satisfied transient condition
13:53:01 INFO Background - Doing something dangerous for Bob
13:53:06 INFO Background - Exiting execute(user:Bob)

There’s our answer, Bob satisfied the transient condition. Now you can definitively respond to the bug report with the exact reason why something dangerous happened for Bob yesterday, regardless of the state of the system today. We can also see that doing something dangerous took around 5 seconds.

Which brings me on to a second benefit of logging.

Performance profiling

Particularly when working with bulk processes it can be extremely difficult to build up a test suite that truly reflects production. If you have a database involved you also have the effects of simultaneous queries being executed, the state of indexes, etc. The exact hardware can also have an effect. For all your efforts, sometimes there is no better place to profile code than in your production environment whilst working on real-world data.

However, you don’t want to attach a profiler to your production processes directly. That would mean too much access to a production system, it can often affect how the process executes, and if it’s a nightly task you may have to be up at 4am to watch it. None of those things are desirable.

If you invest in your logs you will not only get the information needed to troubleshoot things happening in your system, you’ll also get a good level of performance profiling too. Most logging frameworks can give you timestamps to millisecond precision, and in my experience that’s more than enough to narrow down which method or methods are taking the most time. This allows you to focus your efforts on where the problem lies in production, and avoids speculative optimisation based upon sterile tests.

Particularly with bulk processes I believe it is hard to log too much. They tend to be extremely reliant on state and can rarely be re-run after the fact, they often run overnight during your system’s quiet window, and they are often left untouched for months as their requirements tend to change much less frequently.

Good logs give you view into the behaviour of all your code. That insight really pays dividends for code that you rarely see and rarely touch.

Understanding unfamiliar code

Good logs give you insight into how a system works, beyond just looking at the code. It helps you understand how frequently a class is used, how frequently a method is called, how often the various branches in a method are hit, etc. Just looking at the code cannot give you that same insight.

Logs can be used to safely prune “probably dead” code or features. If you suspect a branch of logic can be removed, or that a feature is rarely used, you can simply add an additional log statement or raise the level of an existing one:

1
2
3
4
def perhaps_dead
  log.warn "DEPRECATED perhaps_dead called"
  ## does something
end

You can then review your logs a week or two later to see if any messages have been recorded. If there aren’t any messages you can be more confident in removing it. If there are several messages then you can keep it in place and remove the message without anyone ever being affected.

Good logs replace the need for a debugger, but better than that they provide a historic debugger. You can step through code executed days ago by following the messages that were logged as you look through the code, following the branches taken according to the next message logged. This is invaluable for answering questions like “how many people were affected by this bug?”

Additional context for errors

I’m pretty sure that everyone logs errors. However, they can often be pretty useless apart from telling you what line of code the failure came from. Think of the beloved null reference exception, if you have something like:

1
if left.id == right.id

If this throws a null reference exception which of left or right was null? Also, if this is 5 steps down a path of execution what are left and right in the first place?

Good logs give you the additional context of the state of the system before the error happened which can be vital in understanding why the error happened.

How to log well

Good logs are all about context. A good message answers the questions:

  • Where are you?
  • What state matters?
  • Why are you there?

The good news is that most decent loggers are class-based and so they deal with the majority of “where” on your behalf, you only have to add the method when it is useful. The difficult parts are the “what” and the “why”.

For the “what”, a trick I use when I am unsure is to imagine I have hit a breakpoint. I then imagine what variables I would be inspecting, are there particular properties of those variables that are interesting at that point in time? The things I would be inspecting with a debugger are the things I record in my log messages.

Much like a bug is a sign of a missing unit test, the need to use a debugger is the sign of a missing log message. Each time you reach for a debugger, ask yourself what log message would avoid you doing so and add it. If you aren’t sure, mindfully use the debugger and notice what you are looking at, then put the message in place.

You can’t attach a debugger to production, but you can read your logs. If the log message is useful to you now, it may well be useful to you in production some day.

Recording the “why” tends to be a bit more straight forward. You want every logical path in your code to result in a log statement, and that statement should record what was satisfied in order for you to hit that path. Our original example demonstrates this well.

I refer to this state of thinking as the maintainer’s mindset. Try and imagine yourself as the maintainer of the code, trying to resolve a production issue, and give yourself the information you would need to shorten the time to resolve it.

How to avoid being overwhelmed by messages

You are probably thinking you’ll start generating gigs and gigs of log files, and you may be right. However, remember that log messages can be written at various levels, and that you can configure your logger to discard particular messages based upon level or even the class they came from.

DEBUG, INFO, WARN, ERROR, and FATAL are pretty ubiquitous across all logging frameworks and they all convey a level of importance.

By default I have production loggers set up to record INFO or higher, meaning that DEBUG messages are discarded. This may make you wonder why you would leave them in, but if they were useful at one time they may be useful again. You can quickly and easily change your production configuration to record DEBUG messages, but you can’t quickly and easily add them across your code base.

Chosing the right level for a message

DEBUG

The finest level of detail. I rarely add a message at this level to begin with. I will usually start with the message at INFO and then downgrade it in a later release if it proves to have a low signal-to-noise level in production.

INFO

Something generally interesting but not something to be worried about. These should give enough information to understand what the system is doing at any point in time.

WARN

Not an error, but something worth highlighting to someone monitoring the logs. For example:

  • Handling a particularly rare edge case
  • Hitting deprecated code
  • Falling over to a secondary provider
  • Particularly slow response times

ERROR

Generally tied to an error though not always. For example, you might log an ERROR message when you encounter a state that can be handled gracefully but is not reasonable to expect. There was no error as such but the fact it happened deserves the additional attention a higher level demands.

If your logging framework allows, always pass through the error object if there is one. Logging frameworks can usually have their output configured to include the stack trace formatted well without you having to do it in your log message.

FATAL

Reserved for when the underlying process will be halted as a result. To that end I only use these on background processes as web servers should never die completely.

Frameworks

For .NET I’ve never seen a need to look beyond log4net, similarly for Java I’ve used log4j. For Ruby I couldn’t find a framework that did what I wanted so I created Hatchet.

Creating your own framework is rarely a good idea. In particular the parts that deal with writing messages are tough to get right and existing frameworks have been battle hardened. So while I had my reasons for creating Hatchet, the actual logging is performed by the logger in Ruby’s standard library.

Performance and configuration

It’s worth mentioning the performance hit of logging, and configuration come hand-in-hand with that. One of the main concerns I’ve heard for logging as I’ve described is “won’t it slow down the application?” The answer is obviously yes, you can’t do anything for free. However, the benefits I’ve outlined far outweigh the small cost of writing the logs if you get your configuration right.

The biggest mistake you can make in your log configuration is having everything written to a database. Doing so incurs the cost of an insert for every log message you write. In comparison, writing the message to a local disk is an order of magnitude or more faster. The performance penalty of going to the database can quickly add up. For example, I have switched some old SQL-backed logging code to log4net writing to local disk and saw a 10-20% performance increase from that change alone.

Costly calls should be reserved for infrequent events such as errors. Think of SaaS products such as Airbrake, they have the overhead of a network call but it is worth it to get notification of something as significant as an error. At Zopa we have errors sent to our HipChat channel as we feel the cost of the call is justified by the real-time notifications we receive for production issues.

Be aware that no message you log is free and configure the overhead incurred accordingly. 95% of your log messages will be WARN or lower, and that is the case that should be optimised.

Wrapping up

If in doubt log too much and log to a local disk. Having too much information about what your system is doing is significantly better than having too little. Logs can always be filtered, and message levels and log configurations can always be changed.

If you aren’t used to working with log files you will have to learn some new skills to query files rather than a database, but a decent log viewer like Ostros will make most things as easy if not easier than using SQL. Sometimes you can’t beat a bit of grep-fu, but those skills are useful any time you need to work with any kind of text file.

I hope I’ve conveyed the benefits of good logging and given some guidance on the practices to be able to produce good logs for your own code.

If you need any help getting started get in touch on Twitter (@gshutler) or via email.