Unleashing the Mayhem CRS

In June, ForAllSecure participated in DARPA’s Cyber Grand Challenge (CGC) Qualification Event (CQE) 1. During the event our automated system tweeted its progress, and to continue the trend of openness, we decided to publish a writeup of some more details about our system. Our team, Thanassis Avgerinos, David Brumley, John Davis, Ryan Goulden, Tyler Nighswander, and Alex Rebert spent many thousands of hours on our system, and now that the CQE is over, we’re excited to give you a glimpse of its inner workings.

CGC Background

DARPA’s Cyber Grand Challenge aims to push the state of the art in automatic program analysis. Competitors are given several compiled x86 programs and asked to find inputs which crash the target programs, and to generate new versions which are secured against those crashes. If this sounds tough, it is–but there is another twist: competitors are completely automated systems.

All of the information on the scoring and rules for CGC is publicly available. Their rules document and FAQ provide a lot of insight into how the competition works, but we can summarize them quickly here:

  • The CGC platform is based on Linux, but it is modified slightly and named DECREE. The CGC binaries contain only 7 system calls (terminate, transmit, receive, fdwait, allocate, deallocate, and random)
  • A vulnerability is defined as the ability to cause a binary to die with a SIGSEGV, SIGBUS, or SIGILL. A Proof of Vulnerability (POV) is an input which demonstrates this behavior.
  • All distributed Challenge Binaries (CBs) have one or more vulnerabilities, and there are reference POVs written by the challenge authors to ensure this.
  • Each challenge may have a PCAP file associated with it to provide sample network interactions with the service to help our CRS.
  • All CBs have several tests which are performed to ensure they behave as expected, so competitors cannot simply remove functionality in order to protect the program. These tests are not available to competitors.

When a competitor patches a binary and submits a Replacement Binary (RB), they are judged on several things:

  • Functionality is measured by running the hidden tests on the binaries. Failing tests negatively affects a team’s score.
  • Security is measured by running the POVs created by the challenge authors on the RB. If no POVs are blocked, the score is 0.
  • Performance is measured by taking the execution overhead, memory overhead, and size overhead of the RB. The greater the overhead, the worse the score.
  • If any POVs are found by the competitor, their score on the binary is doubled. Therefore, if their original score was 0, a POV will not help.

In the qualification round held on June 3rd, “Cyber Reasoning Systems” (CRSs) across the country were given 131 binary programs. They had just 24 hours to automatically find bugs in and repair as many binaries as possible. The system developed at ForAllSecure secured a solid first place qualification, and we thought we would share some insights into how our system worked and how we fared in this fascinating competition.

Results Overview

In the interest of openness, the data from the qualification has been released so that anyone can perform their own analysis. The scoring information is available here and the data submitted by the teams (patches, POVs, etc) is available here.

The first thing we notice from the data is that ForAllSecure did very well! We qualified with plenty of breathing room, and our total score was quite high relative to other teams.

Final scores for teams playing in the DARPA CQE
Final scores for teams playing in the DARPA CQE

There are some oddities; for example, the 13th place team actually had a higher score than the 11th and 12th place teams. This is likely due to “diagnostic binaries”: DARPA included some binaries in the qualification event that did not count towards ranking, but did not release which programs they were.

How significant was the lead we had, exactly? We can quickly see that our score is greater than the sum of the next three teams combined. Realistically, though, if teams were working together, the scores wouldn’t simply add together. If instead all teams (2nd through 13th) joined together and took the best submission for each binary, the combined score would have been 206.12, compared with ForAllSecure’s 245.37. That means still would have gotten first against all teams naively working together! If, however, our hypothetical adversary took the best scoring patches and combined that with POVs found by any other team, the total score is 254.12, beating out our own score by about 3.5%. Of course, it’s also nice to see what would happen if everyone worked together. In that case, the end result would have been a whopping 325.83 points! The cool thing here is that this means teams are approaching this competition very differently. Although we performed well, there were still several problems in which other teams did better. This diversity suggests different approaches to the Cyber Grand Challenge which will ultimately mean better automatic bug finding and patching for everyone.

Scores of hypothetical super-teams and ForAllSecure
Scores of hypothetical super-teams and ForAllSecure

The next interesting thing worth looking at is the number of CBs each team was able to crash. It is worth noting this value isn’t necessarily the number of bugs found, as each program might have multiple bugs and multiple POVs. A team receives points as soon as one POV is given for a binary, and no additional points are given for finding more crashes, so 10 unique bugs in one CB will only result in one POV.

Total number of crashing binaries found by each team during the CQE
Total number of crashing binaries found by each team during the CQE

One fascinating thing here is how little finding bugs mattered! Although ForAllSecure did manage to crash more CBs than any other team, the second most successful team at finding bugs ended up in 9th place. Likewise, the 2nd and 4th place teams found very few bugs, but were still able to patch the binaries to run safely.

Keep in mind there were 131 total binaries. That means even though ForAllSecure managed to find bugs in 77, that’s only a little bit more than half of the binaries! As it turns out, many teams identified different vulnerabilities from each other–what was easy for one team to find wasn’t necessarily easy for others. Again, this says a lot for the diversity of approaches that teams are using.

Crashing binaries produced by each team. Note that there are many differences across crashing binaries between teams
Crashing binaries produced by each team. Note that there are many differences across crashing binaries between teams

This graph shows all of the binaries each team was able to crash. Although there are clearly commonalities across some teams, there are also a few differences between teams. Aggregating results from every team, a total of 96 binaries were crashed, meaning there were 19 binaries that some team was able to crash but ForAllSecure was not.

Of course, as we saw, defense was far more important than offense in this game. Without a performant patched binary, it was not possible to get points! So how did teams do at patching?

When a team finds a POV for a binary, their score on that program is doubled. We can easily calculate what the game results would have looked like if POVs had no scoring impact on the game.

What the CQE scores would look like if POVs did not directly affect scores
What the CQE scores would look like if POVs did not directly affect scores

The results are, well, pretty much the same. The qualifying teams would have been identical, with the only difference being the 6th and 7th place teams swapping positions. As we said before, defense was definitely the most important part of this game.

Additional analysis of teams’ defenses and patching scores can be found at the excellent blog post written by Trail of Bits.

Offense

Our bug finding system is composed of two primary components: the Mayhem symbolic executor and a directed fuzzer we call Murphy 2. These two components communicate through a PostgreSQL database by sharing testcases they find “interesting”, based on the coverage they achieve. By using Murphy and Mayhem together, we are able to boost both: the fuzzer is great at quickly finding shallow bugs, but fails on complex cases; Mayhem is good at generating deep paths in a program, but is not always fast enough to explore them all. Working together, we found that our symbolic executor was able to upload interesting test cases which are picked up by our fuzzer and transformed into crashing inputs. Looking at a graph of the test cases our system generated over time, there are several places where we can see a spike in, say, Mayhem test cases quickly leading to a spike a bit later in test cases from Murphy. This leads to a far more diverse set of test cases than either a fuzzer or symbolic execution engine could come up with alone, which of course increases our bug finding abilities.

Test case generation by Mayhem and Murphy throughout the entire CQE
Test case generation by Mayhem and Murphy throughout the entire CQE

Due to how tightly Mayhem and Murphy work together, it is a bit misleading to look at which vulnerability discovery engine reported flaws in the end. Although Murphy was responsible for more POVs in the end3, without Mayhem’s guidance it would never have explored many of those paths. However, for completeness we share these numbers as well:

Murphy Mayhem
65 Crashing CBs 31 Crashing CBs

Note that these two sum to greater than 77, the total number of binaries we were able to crash. Often each attack component would discover flaws in the same binary before our CRS decided to move on to other challenges.

When we find bugs, we triage through a variety of custom tools to determine where they crashed and send that information back to our patching infrastructure. Additionally, we run our patched binary against the POVs we developed to make sure our patch was actually successful.

Many times during testing our CRS discovered unintentional bugs in problems or did not recover all possible locations of an access violation from a bug. This causes a reduction in score, as our security performance is not measured by seeing how many POVs we can protect against, but rather how many POVs written by the original challenge author we can protect against. This means finding unintended vulnerabilities or patching closely related crashes might not help our score. Rather than continue to look for bugs in programs indefinitely, our CRS waits one hour after it has found the first valid POV. In that time it will continue fuzzing and symbolic execution to attempt to coax out additional crashes.

Defense

By far, the most important part of the Cyber Grand Challenge qualification scoring metrics was the ability to defend programs. Based on that, the majority of our development time was spent on developing our patching system. Like most teams, we are able to patch binaries without having actual examples of faulty inputs. By examining the program statically we can find locations that look questionable, and patch those areas to ensure no memory access violations occur. Of course, this can result in a large overhead, so we also developed a system which can take POVs and directly patch the access violations they trigger. This is much more performant, but it’s possible that we may miss some of the vulnerabilities, or not adequately patch all invalid memory accesses that a vulnerability might trigger.

We can definitely see we are not alone in having separate strategies for patching based on whether or not we discovered an POV.

Average patched binary score excluding POV bonus, separated based on whether the team had found a crash in the binary or not.
Average patched binary score excluding POV bonus, separated based on whether the team had found a crash in the binary or not.

This graph shows the average score of patched binaries, separated based on whether a POV was found or not. If a POV was found, the score was divided by 2, to account for the bonus received from finding an exploit. From this, it seems that every team shown aside from the 3rd and 9th place teams had separate patching strategies based on whether or not they were able to find a crash. The 8th place team, one place below the qualification cut-off, seems to have a strategy for patching only when a crash is found. We can also see that the ForAllSecure patches of each type also scored higher than those of any other team; in fact, our crash-agnostic patches scored about as well as the crash-aware patches of other teams.

The first method of patching we created was based on binary hot-patching techniques often seen in CTFs and other hacked-together systems. After identifying the potentially dangerous instruction, our CRS inserts a jump immediately before the instruction. This jump will redirect the program to a section of the program that we insert, which will execute a safety check and then the target (protected) instruction. This allows us to insert code without disturbing the functionality of the program when it behaves normally. In the case of an input which would normally crash the program, our safety check detects the error before it occurs and exits cleanly.

Simple example of a binary hot-patching technique
Simple example of a binary hot-patching technique

Unfortunately, this hot patching method is not good for performance. Jumping to distant parts of the program can clear the instruction cache, and adds a reasonable amount of overhead in tight loops. To deal with this, we developed a custom, lightweight recompilation framework. This framework allows us to easily insert patches without disturbing the control flow of the program. Although this method was often more performant, we were worried that our recompilation framework would not be able to safely handle every program, and so our CRS carefully tested each patched binary to decide on the correct one to submit.

Infrastructure

Before the qualification event, we didn’t know how many challenge binaries would be released. We knew only that the number would be greater than one hundred. For the two practice events leading up to the qualification round, ForAllSecure used “Major Tom”, a single 48-core machine with 128GB of RAM. Each component of our CRS ran inside virtual machines on this host. Unfortunately, this was not suitable for the qualification event both due to reliability (our office internet has been known to go down, and Pittsburgh weather sometimes results in power outages), and scalability.

Diagramatic view of our CQE infrastructure
Diagramatic view of our CQE infrastructure

In the end we settled for a cloud-based approach using both Google Compute Engine and Amazon Web Services. The overall architecture of our setup was something like this:

Both the Murphy and Mayhem clouds automatically distributed load across their instances. When our CRS was satisfied that a challenge binary had been examined enough, the CPUs working on it would start working on a new binary. In this way we were able to give more CPU time to the more difficult challenges. These components were also “self healing”. If liveness checks determined that a node was down, our system would automatically spread the node’s work elsewhere until the node rebooted.

The choice to use Google Compute Engine for Murphy was based primarily on our ability to run DECREE natively. The DECREE kernel is 32-bit, and while it would have been possible to modify it to compile for 64-bit, we chose to make the minimum possible changes to avoid inconsistencies with the official DARPA DECREE. The Xen-based virtualization setup on Amazon EC2 meant that only 32-bit EC2 instances could support DECREE, which in turn limited instances to a maximum of 2 cores each. Luckily GCE’s KVM-based setup allowed us to run our 32-bit guest on any instance we wanted, making scaling up much easier.

Captain Crunch was another physical machine we purchased. We chose the hardware based on our best guess of the hardware used internally by the CGC team. The most important part though was that this machine ran DECREE natively. This allowed us to score our binaries on “bare metal” in order to get the most accurate estimate of our scores and to ensure our POVs reproduced as we intended. This allowed us to decide which of our several patches and POVs to submit to DARPA.

Major Tom ran most of our patching framework. Although we needed far less horsepower for patching than it was capable of providing, we were also prepared to run backup instances of our database or either of our two attack frameworks on Major Tom in case of catastrophic failure. Likewise, Major Tom and Captain Crunch’s VMs had cloud based counterparts that could have been deployed had a hardware or internet failure occurred locally.

A multi-zone database instance running on Amazon RDS provided the communication framework for our CRS. All data was passed between components via our database to provide a robust and failure-tolerant system. Many of our components were redundant as well, so that a failure of, say, Murphy would not result in a failure of our entire system. If our CRS needed to, it was capable of making reasonable choices of which patches to submit without much information.

Conclusion

The DARPA Cyber Grand Challenge is an incredibly interesting project. ForAllSecure was very successful, but every team that participated developed novel tools and techniques for patching and bug discovery. We hope writing about our CRS will help others interested in automatic binary analysis and motivate others to publish some of their data as well. We have a long way to go before the final round of CGC in August 2016, but we are definitely in an excellent position for the event.


  1. DARPA loves acronyms, which has led to us using them over the past year. Apologies in advance, but here is a quick cheat sheet:
    CGC: Cyber Grand Challenge
    CRS: Cyber Reasoning System
    CQE: CGC Qualification Event
    CB: Challenge Binary, the binary given by DARPA for teams to patch and exploit
    POV: Proof Of Vulnerability, an input which causes a CB to crash with a SIGSEGV, SIGBUS, or SIGILL
    RB: Replacement Binary, the patched binary submitted by competitors corresponding to a CB 
  2. Named after John Davis’s cat, Murphy 
  3. We call these POVs Murphy’s Flaws 

8 thoughts on “Unleashing the Mayhem CRS

  1. Nice writeup and great work! It sounds like you guys have a similar approach to what we (team Shellphish at UC Santa Barbara) have, since the CQE, developed as “Driller”, down to the number of pwned bins! If any of you guys are at NDSS in a few weeks, you should check out our talk.

    Also, I hope you’re working on some new breakthrough before the CFE 😉

    Liked by 1 person

    1. After the CFE competition we did a quick count of lines of code and such. It looks like ~60k lines of Python, 25k lines of OCaml, 18k lines of C/C++, 3k lines of bash scripts (scary, I know), and 1k lines of assembly.

      Liked by 1 person

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s