Two weeks in, and I already have a late weekly blog post! Time can really fly by when you stack up a decent amount on your plate. I'm going to be honest: there isn't really one thing in particular that I want to talk about this week. I just kind of did things. Nothing was too exciting, and no new problems particularly interested me.
Instead, I figured that in the spirit of Eevee I'd just write a weekly roundup of one of the things I worked on this week, and provide its current status.
Dimentio is my current work project (and thankfully they allow me to talk fully about it!). The tl;dr here is we're adopting Kafka at Instructure. Basically, our log pipeline for all of our apps is starting to creak under the weight of us writing >5 terabytes of data every day during the slow season for just access logs (let alone debug logs, VPC flow logs in AWS, and the like). I've been leading this rollout, and Dimentio is the next thing we have to solve.
Essentially, we want our Kafka clusters to have two traits that aren't necessarily inherent to just running a production-ready Kafka cluster:
- We want it to be extremely easy to add new nodes in case of unexpected capacity demands (this happens every so often when a client doesn't predict the right capacity due to chance, a failure on our part to communicate, or just some other minor mistake).
- We would love our cluster to auto-heal as much as possible.
Now, for the first point, most people solve this problem by running Kafka on some sort of Mesos/Kubernetes/etc. type platform. In Mesos you can just add more resources, and then rebalance the topics in Kafka using the CLI tools provided by Kafka. Essentially, with one CLI command you can just add new resources. It's extremely easy.
That being said, the Core Ops Team at Instructure doesn't have any Kube/Mesos/Nomad/etc. framework deployed right now. It's something we really, really want as we've been growing, but we don't want to make the wrong decision: we're still deciding what's best for us, and asking whether we want to replace our internal PaaS. With a rollout as fast as the one we're hoping for, to get Kafka out and relieve pain in our team, we don't want to adopt one of these giant frameworks for one tool. We also don't want to rush into choosing a framework. So that's a no-go.
Since we're deployed in AWS, it'd be nice to take advantage of the Auto Scaling Groups ("ASGs") provided by AWS. Unfortunately, we're not aware of any tool that currently watches an ASG for scale-up/scale-down events and properly scales Kafka (without needing to run a second CLI command after you've scaled the instances).
This is the first thing Dimentio does. It watches an SQS queue for scale-up/scale-down events, and properly rebalances the topics in Kafka. However, since we know we could be scaling up when things are running really hot, we don't want to just blindly rebalance. Rebalancing can cause a lot of contention, so Dimentio will also split up the rebalance into a bunch of small, manageable chunks.
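To make the chunking idea concrete, here's a rough sketch of splitting one big rebalance plan into small batches. This is not Dimentio's actual code: the class name `ChunkedRebalance`, the `chunk` helper, the batch size, and the `access-logs-N` partition names are all made up for illustration, and the step that would actually submit each batch to Kafka and wait for it to finish is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from Dimentio itself.
class ChunkedRebalance {

    /** Splits a full list of planned moves into batches of at most batchSize. */
    static <T> List<List<T>> chunk(List<T> moves, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < moves.size(); start += batchSize) {
            int end = Math.min(start + batchSize, moves.size());
            // Copy the view so each batch is independent of the original list.
            batches.add(new ArrayList<>(moves.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Each string stands in for "move this topic-partition to a new broker".
        List<String> plannedMoves = new ArrayList<>();
        for (int partition = 0; partition < 7; partition++) {
            plannedMoves.add("access-logs-" + partition);
        }
        for (List<String> batch : chunk(plannedMoves, 3)) {
            // A real tool would submit this batch to Kafka here and wait for it
            // to complete before starting the next one.
            System.out.println("would reassign: " + batch);
        }
    }
}
```

The point of keeping batches small is that only a few partitions are copying data between brokers at any one time, which matters most when the cluster is already running hot.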
As for the second requirement, there are quite a few solutions out there. However, I was really impressed with the idea behind DoctorKafka. Although I like the idea and how it does certain things, there are also certain things I don't like about it. I don't like that it requires a local agent when I already have local agents for DataDog and AWS metrics that can fill me in just fine. I also don't like that I can't specify configuration in something like Consul, which refreshes automatically. These aren't things wrong with DoctorKafka per se; they're just things that don't fit our specific deployment method as well.
So this is what Dimentio does. It listens for ASG scale-ups/scale-downs, rebalancing topics where necessary to give us a nice, even distribution. It will also employ DoctorKafka-esque techniques to ensure that when a broker goes down we rebalance correctly, and alert when necessary.
Really, though, I should say that is what it "will do". This week I started work on Dimentio, and got to a place I'm happy with for one week, considering I don't normally write Java code. Dimentio currently can:
- Connect to Consul, or read from the environment, for config values. It will refresh Consul values automatically every 5 minutes, or when the process is sent a SIGHUP on Linux/macOS (yay, we can be a daemon!)
- Connect to an SQS Queue with long polling, and deserialize Instance Scale Up/Down Events.
- Report metrics to StatsD.
- Log in JSON (only took adding 7 dependencies to get this right :groan: I really dislike logging in Java.)
- Build successfully, checking lint rules and running FindBugs.
- Run some automated tests, and some integration tests.
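As an illustration of the first item in that list, here's a minimal sketch of a config lookup that prefers a refreshable remote source and falls back to the environment. It's a sketch under assumptions, not Dimentio's real code: the `RefreshingConfig` class is hypothetical, a plain `Supplier` stands in for an actual Consul read, the "remote wins over environment" precedence is my assumption, and SIGHUP handling is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: the Supplier stands in for a real Consul client.
class RefreshingConfig {
    private final Supplier<Map<String, String>> remoteSource;
    private final Map<String, String> environment;
    // volatile so readers on other threads see a freshly swapped snapshot.
    private volatile Map<String, String> remoteSnapshot = Map.of();

    RefreshingConfig(Supplier<Map<String, String>> remoteSource,
                     Map<String, String> environment) {
        this.remoteSource = remoteSource;
        this.environment = environment;
        refresh();
    }

    /** Re-reads the remote source; keeps the previous snapshot if the read fails. */
    void refresh() {
        try {
            remoteSnapshot = Map.copyOf(remoteSource.get());
        } catch (RuntimeException e) {
            // Keep serving the last known values rather than crashing.
        }
    }

    /** Assumed precedence: remote value, then environment, then the default. */
    String get(String key, String defaultValue) {
        String value = remoteSnapshot.get(key);
        if (value == null) {
            value = environment.get(key);
        }
        return value != null ? value : defaultValue;
    }

    /** Calls refresh() on a fixed schedule using a daemon thread. */
    ScheduledExecutorService startAutoRefresh(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "config-refresh");
                thread.setDaemon(true);
                return thread;
            });
        scheduler.scheduleAtFixedRate(this::refresh, period, period, unit);
        return scheduler;
    }

    public static void main(String[] args) {
        Map<String, String> fakeRemote = new HashMap<>();
        fakeRemote.put("rebalance.batch.size", "5"); // made-up key
        RefreshingConfig config = new RefreshingConfig(() -> fakeRemote, System.getenv());
        config.startAutoRefresh(5, TimeUnit.MINUTES);
        System.out.println(config.get("rebalance.batch.size", "1"));
    }
}
```

Swapping a whole immutable snapshot on each refresh means a reader never sees a half-updated set of values, which keeps the lookup path free of locks.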
All in all, I think that's a good point to be at one week in, and I hope to take it even further next week.