I Solemnly Swear I am up to No Good
Platform Observability: What would you do with a map of all your IT resources, what they talked to, how, and why?
I’ve focused my career on helping engineers, their teams, and their organizations build better software faster. I built novel observability solutions (New Relic Browser), deep security tech to stop SQL and XSS injection (IMMUNIO, now part of Trend Micro), and a developer experience (DX) platform for the cloud (Stackery, now AWS Infrastructure Composer).
A common theme emerged from these efforts: It is really hard for engineering organizations to keep track of their IT resources, how they interact, and who accesses them. This post details the problem and provides a teaser of the “Platform Observability” solution I’m building to solve it.
New Relic’s El Dorado
I started as employee number ~80 at New Relic. My goal was to architect the first Browser Observability product. Six weeks before the public beta launch we realized we needed to sync with the Ops side of the org about our ingestion requirements. There was already a service, simply named “Beacon”, doing a tiny amount of browser telemetry ingestion (e.g. page load latency) for customers. We were extending it to ingest much more observability data (JS errors, XHR timing data, etc.). We needed to confirm that the service could handle this additional load.
But who owned “Beacon”? Thankfully, New Relic engineering fit on one floor of Portland’s Big Pink tower, so this only took a couple of days to nail down. Beacon had become a long-forgotten service because it just “worked” and hadn’t been touched for a while. The original code had been written by a few of the first five employees at the company, who now had titles like CTO and CPO. Getting answers to questions that normally might take a quick conversation between engineers became days of back and forth as the organization tried to remember long-forgotten details of the service. Thankfully we nailed down these details and launched on time and without issue. [1]
After I left New Relic in 2014 the company continued on its rapid startup-to-public-company ascent. But the problem of tracking IT resources and who owned them only grew. I’ve been told of offsites where key engineering leaders attempted to remember and diagram all the microservices that formed the New Relic platform. Inevitably these attempts ended in failure. Ward Cunningham, inventor of Wikis and a Principal Engineer at New Relic at the time, led an effort to solve this challenge. That effort, through a few fits and starts, became El Dorado, New Relic’s attempt to produce a map of all IT resources.
El Dorado was a massive undertaking. Unfortunately, only the frontend became open source. The backend was much larger, ingesting information from dozens of sources including git repos, project management tools, and HR systems. Whether or not the project uncovered untold riches is debatable, but it is clear that the problem of understanding New Relic’s IT footprint was painful enough to spend significant resources on.
Stackery’s Architecture Diagrams
In 2017 I co-founded Stackery as a DX platform for serverless cloud applications. A key feature was its visual drag-and-drop interface for editing your Infrastructure-as-Code (IaC) templates. While the editing functionality was useful, it was only near the end of our time as an independent company that we realized one of the most impactful benefits of our solution was simply being able to visualize the architecture embedded within IaC templates. One of our customers said to me: “We pay $20k/yr for Lucidchart just to be able to spend our Architects’ valuable time to maintain all our architecture diagrams by hand.” Stackery focused on individual serverless applications, but this customer was asking for more: automatic documentation of their entire IT footprint.
A (Marauder's) Map of your IT
The Marauder's Map was a magical document in the Harry Potter series. After waving one’s wand and saying, “I Solemnly Swear I am up to No Good”, the map would “unlock” to show everyone’s location, in real time, at Hogwarts School. What if we could do the same for organizations’ IT footprints?
This is what I’m working on. Archodex is the first Platform Observability solution. It efficiently observes traffic between your services, including encrypted HTTPS API requests and responses, picking out the important details that enable automated documentation and deep introspection of your IT footprint.
Let’s imagine an engineer comes to you and says: “We need to rotate the staging sprockets DB credentials.” Before you can do that, you need to know which services use those credentials so you can rotate them safely. This map shows you that the Kubernetes sprockets service running in the default namespace accesses the stg/sprockets/db_creds secret in the vault.acme.com HashiCorp Vault service. Now you have the information you need to make sure your credential rotation can be done safely. [2]
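To make the lookup behind that scenario concrete, here is a minimal sketch of the underlying idea: given a list of observed accesses, index them by resource and ask which principals touch a given secret. This is not Archodex's data model or API. The tuple layout, the identifier strings, and the function names are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical observations in the form (principal, action, resource).
# The identifier formats are invented for this sketch.
OBSERVED = [
    ("k8s://default/sprockets", "read", "vault.acme.com/stg/sprockets/db_creds"),
    ("k8s://default/widgets", "read", "vault.acme.com/stg/widgets/db_creds"),
    ("k8s://default/sprockets", "read", "vault.acme.com/stg/sprockets/api_key"),
]


def build_access_index(observations):
    """Group observed accesses by the resource that was accessed."""
    index = defaultdict(set)
    for principal, action, resource in observations:
        index[resource].add((principal, action))
    return index


def principals_using(index, resource):
    """Return the sorted, de-duplicated principals seen accessing a resource."""
    return sorted({principal for principal, _action in index.get(resource, set())})


if __name__ == "__main__":
    index = build_access_index(OBSERVED)
    secret = "vault.acme.com/stg/sprockets/db_creds"
    for principal in principals_using(index, secret):
        print(f"{principal} uses {secret}")
```

A lookup like this only covers accesses that were actually observed, which is why the caveat in the footnote still applies.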
Let’s imagine another scenario: A security engineer wants to understand more about who and what can access a secret. Here we see that the GitHub user @txase (me!) has read the prod DB secret in the vault. Let’s click the link to find out more details.
The image on the left above is shown first, and the image on the right appears when we click “Show Event Chains”. The Event Chain details tell us: I, GitHub user @txase, Assumed myself (a GitHub Actions detail) to Invoke the workflow .github/workflows/deploy.yml in the txase/archodex-demo-sprockets repo, which Read the prod/sprockets/db_creds secret. Archodex provides a rich level of detail to help engineering organizations understand their IT footprint, including how resources and people interact.
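One way to picture an event chain like the one described above is as an ordered list of (actor, action, target) links, where each link's actor is the previous link's target. The sketch below is purely illustrative: the `Event` class, the identifier prefixes, and the helper names are hypothetical and do not reflect how Archodex stores or exposes event chains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    actor: str
    action: str
    target: str


# Hypothetical identifiers; only the names from the post are real.
USER = "github-user:@txase"
WORKFLOW = "workflow:txase/archodex-demo-sprockets/.github/workflows/deploy.yml"
SECRET = "secret:prod/sprockets/db_creds"

CHAIN = [
    Event(USER, "Assumed", USER),
    Event(USER, "Invoked", WORKFLOW),
    Event(WORKFLOW, "Read", SECRET),
]


def is_connected(chain):
    """True if every event's actor is the target of the event before it."""
    return all(prev.target == cur.actor for prev, cur in zip(chain, chain[1:]))


def endpoints(chain):
    """Return (originating actor, final target) for a connected chain."""
    if not chain or not is_connected(chain):
        raise ValueError("event chain is empty or has a gap")
    return chain[0].actor, chain[-1].target


def describe(chain):
    """Render the chain as one readable line."""
    return " -> ".join(f"{e.actor} {e.action} {e.target}" for e in chain)


if __name__ == "__main__":
    who, what = endpoints(CHAIN)
    print(f"{who} ultimately reached {what}")
    print(describe(CHAIN))
```

Checking that the chain is connected before reading off its endpoints is the design choice that matters here: it is what lets a reader trust that the person at the start really is responsible for the access at the end.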
More to Come
This is just a tease of Archodex. You can find out more about planned functionality at its website, and you can request early access if you need to solve painful problems today. Current plans are to make Archodex free for individual and team-sized usage, available for folks to download and run themselves, and offered as a managed service. I’m polishing it up to share publicly. Stay tuned!
[1] I’m particularly proud of this launch. Even though we clearly marked the new features as “beta” functionality that should not be enabled on production sites, we had many customers click the button to do so. Services like Bitbucket (back then a substantial competitor of GitHub, even if its market share has diminished since) began sending our relatively large and deeply instrumenting JS agent to all their customers’ browsers. The agent and service worked flawlessly. In the first six months post-beta-launch we only had one issue to “fix”: making sure that if we were used in combination with a different company’s browser instrumentation agent the site wouldn’t break due to a bug in the other company’s code.
[2] I lie. This example actually does not give you all the info you need. There may be source repos with the DB credentials hardcoded within them. The secret may exist in multiple secret stores. Etc., etc. These will also be noted in Archodex system maps and records, but are not shown here in order to simplify the example scenario.