Logging means to record runtime data as well as data on extreme situations like errors or crashes. While logging in Python or logging in Java use different libraries, there is lots of common ground. Obviously, there is a balance between over-reporting which stresses the infrastructure and can literally bury the important information in noise, and not enough logging which leaves you puzzling. Typically, there are several logging levels that define the amount of logging is done.
In the lifetime of an application (or a bug) there are three phases of logging:
- Development time: Here as a developer we have to provide logging into the mechanism we are using.
- Runtime: Here whoever runs the software (DevOps, yourself, your customer), collects the logs (and need an environment that provides for it)
- Using the logs: You will be presented with possible tons of data you have to wade through. Using tools for log aggregation might help to find the diamond in the mud or the needle in the haystack, or… you get what we mean.
Logging Best Practices:
Some best practices:
- Do not reinvent the wheel. There are tons of good logging frameworks out there. Make use of those!! Select one that supports your environments like OSes, runtime, app servers, etc. It is a Software Architecture (see 2.1.12) task to define which framework is to be used.
- There is system logging, aka your underlying infrastructure also collects logs. Make sure to have a good overview of all possible sources. For example, the system logging in Linux provides you with information about resources such as networking, memory, CPU which helps to identify time outs or resource depletion. Of course, there are system logging best practices that you can find in the operation guidelines for the infrastructure.
- Have logging guidelines (see for example Atlassian’s or Google Optimal Logging) to define a common understanding of what, when, and how to log. Again, this is Software Architecture. For example (taken from Atlassian’s guidelines) define the classical log levels:
- DEBUG Fine-grained information about what is going on within the system.
- INFO Announcements about the normal operation of the system - scheduled jobs running, services starting and stopping, significant user-triggered processes
- WARN Any condition that, while not an error in itself, may indicate that the system is running sub-optimally
- ERROR A condition that indicates something has gone wrong with the system In practice, two configurations are needed: a. Development / Debugging while in development and as verbose as needed. b. Production in production logging everything but DEBUG.
- Be aware of privacy! Be super careful logging anything that remotely could be PII (Personal Identifiable Information). The same is true for account credentials or secrets.
- When logging provides as much context as possible. In today’s Micro-Services, it is extremely important to be able to track a call through different services. Best, when you assign an id early on and hand it on through the API calls (System Architecture, see 2.1.12).
- Be aware that logging can lead to bugs in itself (like accessing properties of objects that are actually
null). Make sure your logging is resilient against the problems it shall report.
Logs in the Debugging Process:
To use logs in the debugging process, here are some ideas:
- Try to find the bug in the logs either by using timestamps or the symptoms. Obviously, everything later in the logs we can disregard (and reduce the data to be looked at).
- Next, try to further reduce the log data by sorting out systems that are not involved (like upstream systems, APIs that consume…). In our experience, follow the data flow inside of the application helps in this regard.
- Try to really understand the system state that triggered the error. If in doubt - or simply as a habit - try to bring the debug system into this state and see it reacting (and logging) as you expect. Again, use your finding to reduce the data you look at.
- The final goal is to have an understanding of how the data was flowing through your app and finally led to an issue. Why is this important? Because you want to fix the issue at the point within the architecture that actually makes sense and ensures proper handling.
- Sometimes you will find that everything is correct but the amount of data overwhelmed your system (“On my machine, it worked…” because your tests only have 5 data points and the real world 5 million). Now, there is a short time solution (add more oomph) and maybe a long term (and financially lucrative) solution: Redesign your algorithms. Remember the runtime complexity thingie? In the age of cloud computing, you pay for the CPU cycles you use. Make a business case and see if it makes sense to rebuild.
- Think about your tests. Do you need to change anything like increasing the load tests, have a network-detached test suite, etc?
- Sit back, take a sip of coffee, and ask yourself: Where else? Was this a one-off or is there a pattern? If there is a pattern, save you from future bugs and report them.
It is all about reducing the amount of data in the beginning. Then, try to replay and trigger the effect you expect to see (also in the logs). Than fix things in a smart way.