Here are some cool links to other Stuff. Enjoy, or Don't. I'm not your boss.

Actionable Stacktraces in Rust

Recently I've been working on a fairly large Rust Project. It's actually been deployed for quite awhile now, but most of the time it requires a person who's written rust to even know where to start looking for errors.

Generally I see people who haven't touched rust before look at the Rust Backtrace log, and just go "Welp, I'm not sure what this is", or "Welp, I'll just try and grep our giant log files".

This makes a bit of sense seeing as how really at Instructure, we're used to writing either Ruby, JS, Python, or Go. Sure we've written/deployed in other languages before, but these are really our bread and butter.

Let's look at a standard ruby stackrace for us:

Couldn't find User with API id ''

[..]/lib/api.rb:26:in `api_find'
[..]/app/controllers/comm_messages_api_controller.rb:115:in `index'
[..]/vendor/bundle/ruby/2.3.0/gems/actionpack- `send_action'
[..]/vendor/bundle/ruby/2.3.0/gems/actionpack- `process_action'
[..]/vendor/bundle/ruby/2.3.0/gems/actionpack- `process_action'
[..]/vendor/bundle/ruby/2.3.0/gems/actionpack- `block in process_action'
[..]/vendor/bundle/ruby/2.3.0/gems/activesupport- `call'
[..]/vendor/bundle/ruby/2.3.0/gems/activesupport- `block (2 levels) in compile'
[..]/vendor/bundle/ruby/2.3.0/gems/activesupport- `call'
[..]/vendor/bundle/ruby/2.3.0/gems/activesupport- `block (2 levels) in around'
[..]/vendor/bundle/ruby/2.3.0/gems/activesupport- `block (2 levels) in halting'
[..]/app/controllers/application_controller.rb:352:in `block in enable_request_cache'
[..]/lib/temp_cache.rb:7:in `enable'
[..]/app/controllers/application_controller.rb:351:in `enable_request_cache'

It's honestly a pretty easy stacktrace to read.

We see it's unable to find a User with the API ID of an empty string, that gives us a surprising amount of context, let alone we also get some line numbers around that. Plus all errors are logged through Sentry, so we get even more context through line numbers/source code mapping to see the actual code producing the error.

Figuring out the state of what occurs when the error happens using tags, and descriptive error messages even we have to dig into the bowels of Rails, we can at least figure out what is going wrong. Even if it's immediately unclear how that state gets to be.

Let's compare that to the rust backtrace's we were getting:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NotPresent', /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/
stack backtrace:
   1:     0x5627d768b8aa - std::sys::imp::backtrace::tracing::imp::write::h3188f035833a2635
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/
   2:     0x5627d768f1df - std::panicking::default_hook::{{closure}}::h6385b6959a2dd25b
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
   3:     0x5627d768edde - std::panicking::default_hook::he4f3b61755d7fa95
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
   4:     0x5627d768f687 - std::panicking::rust_panic_with_hook::hf00b8130f73095ec
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
   5:     0x5627d768f4c4 - std::panicking::begin_panic::h6227f62cb2cdaeb4
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
   6:     0x5627d768f439 - std::panicking::begin_panic_fmt::h173eadd80ae64bec
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
   7:     0x5627d768f3c7 - rust_begin_unwind
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
   8:     0x5627d76baa8d - core::panicking::panic_fmt::h3b2d1e30090844ff
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/
   9:     0x5627d7120034 - core::result::unwrap_failed::h7badae956541c245
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/
  10:     0x5627d710647f - <core::result::Result<T, E>>::unwrap::h66c35e8e1208bb45
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libcore/
  11:     0x5627d714ea28 - anvil::main::h0d6a3a4699fe6d43
                        at [...]/crucible/anvil/src/
  12:     0x5627d769665a - __rust_maybe_catch_panic
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/
  13:     0x5627d768fdc6 - std::rt::lang_start::h65647f6e36cffdae
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
                        at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/
  14:     0x5627d71500c2 - main
  15:     0x7f77833bc82f - __libc_start_main
  16:     0x5627d70e2ca8 - _start
  17:                0x0 - <unknown>

Can you tell what's wrong here? Probably so, if you've written a bit of Rust before. That first line might give it away for you:

'called `Result::unwrap()` on an`Err` Value NotPresent'

Which means we're calling unwrap on something that isn't there. Unwrap being a decently common thing in rust. Generally it means you've written some code, that broken down looks like:

let potential_failure: Result<Success, Failure> = some_code_that_can_fail();
let item = potential_failure.unwrap();

What's happening here is you aren't checking 'potential_failure', and just blindly calling unwrap.

Unwrap panic's the program when it hasn't succeeded (or none for option types). However based on our stacktrace can you tell me where this error is actually happening?

If you've had to debug rust before you probably know the answer as well, and caught it pretty fast as being:

11:     0x5627d714ea28 - anvil::main::h0d6a3a4699fe6d43
                        at [...]/crucible/anvil/src/

This means in the anvil program/crate's main had the error, and the specific line that had the error was line 79 of Which if you looked at the code would correspond to:

let manager = ConnectionManager::<diesel::pg::PgConnection>::new(env::var("DATABASE_URL").unwrap());

Looking at this stacktrace it's pretty clear Rust stacktraces were not really built to be as friendly as something like ruby. Ruby within the first line gets you immediately where the error occurred, along with the friendly debug log of 'Could not find User ID'. With Rust since this occured without a descriptive error message like expect all we get is a vague "couldn't unwrap value".

So the question is how can we make these errors more friendly, and actually help people understand these errors so when they see them even people who haven't worked with Rust before, can at least get some context? That's been my goal for the past two weeks now, and I'm glad to say I've finally made decent enough progress that it's worth writing this post about:

  1. I've improved the actual error messages themselves, by using the great error-chain library. There's a nice post about debugging with rr HERE that turned me onto it, but I may decide to write more about error-chain itself at a later date.
  2. What I really want to talk about here, is I've worked on getting our Rust errors into Sentry to make for easier debugging. Through a crate I've created called sentry-rs.

Now if you've looked into this process before, you're probably wondering "you're own crate? Why not use the crate on the sentry repo, Sentry-Rust"? There are a couple reasons why this just didn't fit the bill for what I was looking for:

  • Despite being on the sentry github, I couldn't find it anywhere in the documentation for sentry itself (seems to be crowdsourced)?
  • The panic handler doesn't actually wait to finish posting to Sentry, meaning in many cases our errors were not getting to sentry.
  • We noticed it was using outdated crates. Like most of them were out of date, and with dependencies likehyper installing two versions in every project we wanted to use this with was not cool.
  • It was using manual serialization/parsing, that frequently broke for us on errors (although logging seemed to be fine).
  • Was using printlnfor logging instead of using the logging crate which although not a problem was different than all of our other logging.
  • Seemed to be installing (outdated) versions of time/chrono even though chrono provided all the functions time did, and more.

Really it just didn't fill us with confidence. It seemed to not be well kept with time, and just oddly not ready for a staging/beta level scale. Even though it had commits going to the project, it seemed keeping the project up to date wasn't apparent, and it'd be easier to recreate the functionality, and ensure it stayed maintained.

Plus we wanted to add things like context_lines/in_app context. So that way even if you haven't really worked with rust before, you could at least have that context of where the error was happening, and what was going on.

With our fork we started off pretty simple fixing a couple things that we're non-starters for us:

  • Stopped using both time, and chrono. Switched purely to chrono.
  • Naively made the panic handler wait for timeout (thread::sleep).
  • Removed the custom string parsing, and used serde_json instead.

We also had dropped the panic handler function, but overall we could now have a rust program panic!(), and still be notified in sentry, which was really the biggest use case for us. We don't use Sentry Logging that much, since we have better solutions in place for logs, and with rust forcing us to handle errors we don't really have exceptions that aren't panics, or things that we can't handle.

This was still really rough around the edges, but since then we've made some leaps to get to a client we're actually happy with (Though there's still work to be done). Our new client has several advantages from a code health perspective:

  • The crate is mostly documented, if not completely documented! We've worked hard to ensure people know how this crate does what it does.
  • Cut out both custom serialization, and parsing falling back to the great serde/serde_json libraries to handle the bulk of the work.
  • Provide automated tests, and examples on pretty much anything you could want to do with it.
  • Removed crappy timeout detection, and actually use channels to ensure our message gets posted to sentry.
  • Use the log crate to provide logging.
  • Made the structure modular, so we're not all stuffed in
  • Kept all dependencies up to date, and ensured extraneous dependencies aren't included.

That being said we've also added in a few features to make it useful for our non-rust users:

  • Added in context lines through the feature "sourcemap" which can read the files off disk if it's on the same box that built the Rust code. Since our AMI build process runs the cargo build --release, and we distribute that AMI we get context lines on every panic!().
  • Added the in_app flag so sentry by default shows us only the stacktrace that concerns our app, unless we switch to "full" which will then show us problems in things like: /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libpanic_unwind/

Thus our non rust users can see the code that caused the panic right inside of sentry as the first thing they see. For example that error before, now becomes:

What an error looks like in sentry.

What an error looks like in sentry.

From here we immediately can see "unwrap()" was called on env::var("DATABASE_URL"), which means the DATABASE_URL environment variable was not provided.

This creates something that even someone who doesn't really understand rust can almost immediately go "That's what's wrong", or at least get enough info to actually have a theory to check something that could really be it. We've created actionable stacktraces that make it much easier to find out what is going on, even for those who don't really know rust.

Even though the crate still has some things that need fixing up (some "bugs", and some features), we've now got it to a point where stacktraces are useful, even to those of us who don't write rust.

Iterating Fast in Rust

How Rust-Lang won my Heart