BearTrap.jpg

“Developers are drawn to complexity like moths to flame, frequently with the same result. Puzzle solving is fun, and developers are problem solvers. Who doesn’t like the rush of solving some incredibly complex problem? In large-scale software, though, removing accidental complexity while retaining the solution to the essential complexity is challenging.” Neal Ford

I have been coding in Clojure for almost 4 years now. I have been fighting with logging in Clojure for almost 4 years now. This is not something for which Clojure is solely to blame, but it is something that seems to often suffer the Lisp Curse.

This past year, while working in a field of green (ie, building Contexts), I was once again confronted with a “simple” problem in the realm of logging…namely we were spitting out all of our secrets. Figuring this was probably enough to block launching it publicly, my team took some time to dig deeper and hunt down the errant class writing all of our requests to stdout.

Yak Shaving: working on something tangentially related to your real goal.

Many yaks shaved later, we found ourselves pondering the very essence of what we at CircleCI wanted to get out of logging. We had a problem that we knew had many layers, but we also knew we could make it better. However, once we poked it, we found it very difficult to escape. Much like moths to the flame, or a bear in a trap.

Bear Trap: A problem that once poked, has a tendency to bite, and is difficult to escape. Typically these are problems you know you can make better, but as you dig in, you find out there are many layers to them, not least of which are your compatriots.

What follows is a case study of CircleCI’s recent “Bear Trap experience”: logging in a Clojure application. All of our code for this can be found here.

The Problems with Logging

There are a number of things that make CircleCI’s logging infrastructure really fascinating:

  • Distributed architecture
  • Distributed team
  • Over 400K instances of foreign code being executed a day (maybe your code!)
  • Clojure libraries
  • Maintaining Cloud, Enterprise, test CI, development, and individual developer environments

To do all of this, we rely on a few technologies:

All of these things compounded lead to non-trivial problems.

Difficult-to-mitigate mistakes

There is simply nothing better than seeing “AKIA…” crop up in your log lines. Or maybe better: “{‘cookies’ [‘password:guest’, …]” etc.

Whether we like it or not, we’re human. And when you work at a company of our size, with people in all the different places we do, and with both that lovely legacy monolith and those slim new-fangled microservices, there will be mistakes.

With our Contexts Service logging out all environment variables dumped into it via gRPC, we knew that we should take precautions for logging out credentials. To accomplish this, the team built out logging Hazard Prevention. This ultimately boiled down to a few simple regex-based filters in code that would <<REDACTED>> anything that could possibly be related.

We wanted to make sure that we were benefiting those who would ultimately use our tools, and make the reward for using them even greater. Instead of enforcing “better coding practices” or other such forms of newcomer shaming, we wanted to just make the tool better. To paraphrase the metaphor, try to offer people the carrot over the stick.

We wrote tests. We verified it worked. And during our internal testing we saw all of our credentials still being logged…(expletives).

Since these filters weren’t being implemented between the JVM running our Clojure app and the Container aggregating stdout to send files off to Elasticsearch, we needed to be sure that our code base was viewing all logging statements. Unsurprisingly, they were not.

About 5 different logging frameworks (all operating at the same time)

Clojure, running in the JVM, gains access to the vast wealth of the Java library and ecosystem. Things like communication over HTTP or cryptography don’t have to be written in full, as other people have already taken the time to provide these for you. This is often pointed to as one of the key reasons someone should use Clojure. It’s functional, in moderation.

While the Java community can be praised for a good many things (backwards compatibility, Apache, etc.), logging is probably not one of them. And since this is a problem in Java-land, it’s a problem in Clojure.

What we found in our codebases was that there were 6 frameworks writing to stdout:

some will note that LogBack isn’t on here. Luckily, it wasn’t one of the frameworks we found logging out, but it is one we added later (read below).

Ultimately, we wanted all of these to flow through just one logger. We wanted to be able to simplify our issue, narrowing the scope of our problem space. Instead of sticking your whole leg into the trap, why not just a toe?

The decision we went with at the time was Timbre. While “bridging” output from a given logger directly to Timbre don’t exist for many of the above, we don’t need them as almost every one of them has a way to forward all of its output to SLF4J. Clojure.Tools.Logging is most likely already using a logger which we are piping through SLF4J. And since SLF4J has a great bridge to Timbre, we’re done!

This is to say, all java-land logging is forwarded to SLF4J, then we send all of the logging SLF4J is doing to Timbre. Details are below.

To SLF4J

SLF4J provides a nice “dealing with legacy APIs” readme on their site: https://www.slf4j.org/legacy.html. Using this, you’ll get most things to talk through SLF4J.

While most of the above just require including some jars on your classpath, there are a couple that require more work:

So we thought ourselves pretty slick. However, using and indexing arbitrary log entries is no small chore, and CircleCI has…

Lots of logs

There’s no way around this one for us. We generate about 300 million lines of logs a day. That’s 1.5T. Using this data comes in many forms: our security team relies on these for auditing clients’ builds, our ops team for tracking down Bitcoin miners, and our developers for monitoring/observability . With so many people relying on these logs to record so much different information, the result is an immense amount of data. Often we have prs devoted to hunting out log lines that aren’t used anymore to reduce the data load we are carrying.

For all of us to be able to get the functionality we need out of this, we have to tackle a number of things, not least of which are time zones, varying deployments, and specialty fields for each use case.

So while tools like Logstash exist, and you can easily add “grok filters” for parsing fields you want out of your log lines, the coordination of making sure all of those various filters are still working requires a lot of hoops.

Or, you can use JSON as a transport.

Not wishing to deal with this in both our Cloud and Server offerings (or on everyone’s personal machines), we chose to pipe messages directly as JSON to Logstash. New fields that show up in the payload are auto-magically indexed, timestamps “just work”.

Best of all, with structured data, writing unit tests becomes a breeze…

Testable Logging

This section is short, but all of the above is testable.

Clojure Tools Logging did a great job with testability by using protocols and interfaces to allow developers to write a fake/mock/stub implementation of a logger that receives all messages.

Timbre does something similar with appenders. For our use case, we are able to register a test appender and to check the statements that were going to be logged.

Building our tooling out with testing as a requirement was easily one of the best things we could have possibly done. It’s been said a bajillion times before, but test early and test often. Coordination for a distributed team is non-trivial. If you care about something, you best write a test.

Set tests up before you change anything. If others wrote tests for you, steal them. If others snagged applicable test data for you, steal that too! Testing makes you an early consumer of your own work, and gives you a leg up on the pain that others might feel trying to use your solution.

The Release

Bike Shedding: debating easy to grasp, inconsequential items, instead of the meat of the matter.

After incorporating all of the above into a couple of our services and having good luck with it, we decided to let the rest of the team in on it. Early feedback from folks “beta-testing” was overly positive. The simplicity of setup, opinionated configuration, and the lack of knobs and switches which would never be used were touted as its best selling points. The internal release however, was almost immediately met with a lot of questions…

  • Why are we using Timbre, shouldn’t we be using X/Y/Z instead?
  • Why is this baked into a Microservice Framework? Don’t you know frameworks are the devil’s work? Shouldn’t this have been a standalone library instead?
  • I’m going to just build my own thing.

No good deed goes unpunished, and sharing software is just such a sin. As developers, it’s easy to write off external input as uninformed feel’pinions. But, it’s critical to remember that feedback is crucial in ideation lifecycle. It’s how we improve and progress.

The Second Wave

About 3 months after the logging work was “complete”, we realized we needed to be more proactive in our thinking about how common things should work across our ecosystem. We wanted give developers the necessary space for mindful thinking on these subjects, so CircleCI spun up a team devoted to building out tools in areas of cross-team concern. This team has so far been working really well (though we do miss the days of chucking things over the wall at SRE).

Now that the team is publishing the logging work, it seemed like a good time to highlight those things that didn’t work out:

Smaller Adoption Footprint

While the reaction many have to the notion of frameworks in the Clojure community is itself quite laughable, the desire for simple problems to have simple solutions is not.

Adopting and subsequently having to support a service framework to “solve logging” is absurd. Trying to solve our initial problem quickly ended up obfuscating the scope of our solution. Instead, we’ve moved our logging solution into its own library.

In retrospect, placing all of our work in our pre-existing repo made the development cycle fast, but the adoption cycle slow.

Finding a balance between these forces is really tricky, and is something that we’ll be trying to solve for until the end of time.

For those curious, I previously wrote about our “Service Utils” Framework here. One of our Backplane team’s first goals was to break up our “microservice framework” into a “microservice chassis” to make adoption simple and unarguably gradual. The hope is that as teams need more integrations with third party services (Rollbar, Postgres, etc.), they can grab one of these ready made chassis and snap it into place without much effort. It’ll provide more benefit to them than if they built something themselves, and the overhead of adding it will be tiny. Simple problem, simple solution, incremental change.

Regex-based hazard prevention

The team found that writing filters to remove things that are secret was cumbersome, and definitely not guaranteed to catch everything. Instead, their solution places the impetus of prevention upon those who are first introducing the security vector. If a new value is deemed to be secret, it will have a special Type that doesn’t have a .toString method. Instead, when .toString gets called, WARNING:REDACTED will be returned. Now people using the secrets will have to jump through hoops in order to accidentally log something.

(defprotocol SecretAccess
  (danger [this]))

(defrecord SecretWrapper
  [secret]
  SecretAccess
  (danger [this] secret)
  Object
  (toString [this] "<<REDACTED>>"))

(def x (SecretWrapper. "AKIAIOSFODNN7DAMNIT"))

x
=> #user.SecretWrapper{:secret "AKIAIOSFODNN7DAMNIT"}

(log/info x)
<<REDACTED>>
=> nil

This is a subtle difference between being proactive in our programming instead of reactive. Reactive is dealing with a problem after it’s already happened. Proactive is leveraging the ability to prevent a problem from happening in the first place. Ultimately, we’re expecting that there will be more iteration here and we’ll end up with a mesh of the two.

Using Clojure to define the logging configuration

While Timbre is still what many developers at CircleCI will use for logging in their future projects, its configuration in Clojure ultimately was viewed as a hindrance for forward progress. Since there isn’t native namespace filtering, more log messages than are necessary get passed around and filtered later.

Presently, we’re using LogBack and some embedded XML configuration file in some shared service containers. No configuration necessary from end developers, and more fine-grained control.

Happy Hacking

Finding small problems and making them incrementally better is blessed work. It’s is also often surprisingly difficult. We hope that continuously reflecting on how we work and why we made decisions will help us avoid Bear Traps in the future.

To summarize the learnings about Bear Traps above:

  • Find differing views and opinions early, weave developer critique into the ideation lifecycle
  • Reward people for using your code, don’t increase the barrier to entry
  • Try to simplify your problem while still solving something significant and meaningful
  • Coordinating a distributed team is non trivial, try to automate this (we use continuous integration and tests here at CircleICI, shocking I know)
  • Balance your Development cycle time vs your Adoption cycle time

As always, happy hacking!