Discover Hidden Secrets in Git Repos with Rust

Discover Hidden Secrets in Git Repos with Rust

Why you can't just delete passwords from a git repo

Table of Contents

  • Intro
  • Git Blobs
  • Secrets Test
  • Conclusion

Intro

It's no secret to we something, every once in a while, on occasion, accidently commit a password or api key into git. And it's not secret that instead of doing the right thing, we just delete the password or api key, recommit, and move on with our day only to forget it ever happened. Well, today we're going to find out how easy it is uncover that bad behavior and hopefully convince you to reconsider next time.

Git 101

For this task, we'll need to know a litle bit about how git works under the hood. While this isn't necessarily important for writing the code (you can 100% program this yourself by just Googling stuff and coping+pasting stackoverflow code), it'll for sure make the code easier to read and understand.

Git is a content-addressable filesystem. It means that at the core of Git is a simple key-value data store. What this means is that you can insert any kind of content into a Git repository, for which Git will hand you back a unique key you can use later to retrieve that content. (source)

What git really is under the hood is a database of objects (objects in this case refering to unstructured content like files). When you make changes to a project and commit those changes, the changed files are stored as objects along with a bunch of other fancy metadata git uses on top of that. Objects can be either tags, commits, trees, or blobs. Tags give labels to commits, commits contain metadata (e.g. timestamp, author, etc.), trees are like file and directory names that hold information about the project structure, blobs are the actual content of the files. Both of these object types are stored in git's object database (which you can see for yourself by looking in .git/objects )

Untitled.png

Git in Rust

It's no secret that git is written in C which means there is plenty of support for C library like libgit2. Luckly, there are rust bindings for libgit2 in the crate git2 which gives us a nice high level API with types like Repository, Commit, Object, Odb, etc. that we can leverage to search for secrets.

Secrets

An important note, when I say secrets, I mean any sensitive information that you wouldn't want to commit to a public repo (e.g. passwords, api keys, mother's maiden name, etc.). Obviously, we can't check for everything so for this, we'll be checking for a subset of secrets, namely:

[
  "AWS API Key",
  "Facebook Oauth",
  "Generic API Key",
  "Generic Secret",
  "GitHub",
  "Google (GCP) Service-account",
  "Google Oauth",
  "Heroku API Key",
  "PGP private key block",
  "Password in URL",
  "RSA private key",
  "SSH (DSA) private key",
  "SSH (EC) private key",
  "SSH (OPENSSH) private key",
  "Slack Token",
  "Slack Webhook",
  "Twilio API Key",
  "Twitter Oauth"
]

Projects Overview

With all the background knowledge in place, we can combine to create the basic flow for the project. We're going to be giving our utility a git repo as an input, we'll need to get the object database from the git repo, get all of the objects that are blobs, and check each blob for secrets. To check for secrets, we'll use regex and check for matches in the blob's content.

# Pseudocode
secrets = [...]
repo = some_git_repo
objectDB = repo.objects

for object in objectDB
    for secret in secrets 
        if object.type is Blob && secret in object.content
            print secret.type
        end
    end
end

The final source code can be found here: https://github.com/jonaylor89/JAZ

Setup

Before starting, go ahead and create a new rust projects with cargo.

cargo new secret_catcher

Scanning Git Blobs

Looking at our pseudocode, it looks like the first thing we need to do it figure our how to ingest a git repository, read its object DB, and somehow iterate over it. As mentioned above, for this, we'll be using the git2 crate so the first set is adding that to our Cargo.toml

 [package]
name = "secret_catcher"
version = "0.0.1"
authors = ["John Naylor <my@email.con>"]
edition = "2021"
description = "Find secrets hidden in commits"
repository = "https://github.com/jonaylor89/JAZ"
license = "MIT"
readme = "README.md"

[dependencies]
git2 = "0.13"

Moving to the code, let's start by injesting the git repo:

use git2::Repository;
use std::ffi::OsString;

fn main() {

    // Get path to git repo via command line args or assume current directory
    let repo_root: String = std::env::args_os()
                    .nth(1)
                    .unwrap_or_else(|| OsString::from(".");

    // Open git repo
    let repo = Repository::open(&repo_root).expect("Couldn't open repository");

    println!(
        "{} state={:?}",
        repo.path().display(),
        repo.state()
    );
}

To make things easier, we can test our code using the git repo for this project:

$ cargo run
/home/johannes/Repos/secrets_catcher/.git/ state=Clean

With that bit of code, we're using git2's Repository type to serial the git repo. What we need now it to extract the object DB from it. Fortunuately for us, git2 has some methods that'd make that trivial.

use git2::Repository;

fn main() {

    // Get path to git repo via command line args or assume current directory
    let repo_root: String = std::env::args_os()
                    .nth(1)
                    .unwrap_or_else(|| OsString::from(".");

    // Open git repo
    let repo = Repository::open(&repo_root).expect("Couldn't open repository");

    println!(
        "{} state={:?}",
        repo.path().display(),
        repo.state()
    );

        // Get object database from the repo
    let odb = repo.odb().unwrap();

    // Loop through objects in db
    odb.foreach(|oid| {
                println!("{}", oid);

        // Return true because the closure has to return a boolean
        true
    })
    .unwrap();
}

Gotta love when things are simple (can't beat Rust's no-cost abstraction)

The code above compiles to print all of the git objects to the console. By leveraging the method odb() for the Repository type, we can a git2 Odb type, that contains a very handle foreach() method. With the foreach() method, supply a closure that will soon contain the code for seeing if that git object contains any secrets.

For reabilitiy purposes, let's put the code for scanning the object into its own function:

odb.foreach(|&oid| {
        // Get the object from the object's id
    let obj = odb.read(oid).unwrap();

    // Look for secrets in the object
    scan_object(&obj);

    // Return true because the closure has to return a boolean
    true
})
.unwrap();

With the function looking something like this:

use git2::{ObjectType, Object, Oid, Repository};
use std::str::from_utf8;

...
fn main() {...}
...

fn scan_object(obj: &Object) {

        if obj.kind() != ObjectType::Blob {
            return;
        }

     // Check if the blob contains secrets
     // TODO: check for secrets in commit here!!!
}

Raw git objects require a little bit of preprocessing before we can check if they contain secrets as you can see above. First we use obj.kind() to see if we're dealing with a Blob because that's the only type of git object we care about. Any other object type can be discarded. Next, to extract the raw Blob string, we use the as_blob() method in conjection with the from_utf8() , found in the standard library, to finally get a raw Rust string containing the Blob information.

Again, for readability, let's put the code for actually determining if a Blob has a secret in it, into its own function.

fn scan_object(obj: &Object) {

        if obj.kind() != ObjectType::Blob {
            return;
        }

      // Check if the blob contains secrets
      if let Some(secrets_found) = find_secrets(obj.data()) {
          for bad in secrets_found {
              println!(
                  "object {} has a secret of type `{}`",
                  obj.id(),
                  bad
              );
          }
      }
}

// is_bad : if secrets are found in blob then they are returned as a vector, otherwise return None
fn find_secrets(blob: &str) -> Option<Vec<&'static str>> {
    None
}

In the updated scan_objects() function, we're passing the Blob content to a newly created function find_secrets() , which we'll be filling in later, and printing a short message to the console if that Blob did, in fact, contain secrets.

For a little bit of flair, I went ahead and added some color to the output of the print statements. This is competely optional but it definitely makes it look cooler and is a tad more pleasing to the eyes.

use ...;

const INFO: &str = "\x1B[32m[INFO]\x1B[0m"
const CRITICAL: &str = "\x1B[31m[CRITICAL]\x1B[0m"

fn main() {
    ...
    println!(
      "{} {} state={:?}",
      INFO,
      repo.path().display(),
      repo.state()
  );
    ...
}

fn scan_object(...) {
    ...
    println!(
     "{} object {} has a secret of type `{}`",
      CRITICAL,
      obj.id(),
      bad
  );
    ...
}

Finding Secrets with Regex

It seems that now, the only other thing that needs to be implemented is the find_secrets() function that gets called in out scan_objects() function. There are a few ways to go about doing this but for this project, let's use regex. We'll check every Blob we get against a set of regex which will tell us if there is a secret and what kind of secret it is. Every "hit" we get, we'll add that secret type to some kind of list and return that list at the end of the function. First, for the exact regex, I found a few other similar projects that use regex to find secrets and frankensteined them into an array of tuples which is our set of rules.

fn find_secrets(blob: &[u8]) -> Option<Vec<&'static str>> {
    const RULES: &[(&str, &str)] = &[
      ("Slack Token", "(xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})"),
      ("RSA private key", "-----BEGIN RSA PRIVATE KEY-----"),
      ("SSH (OPENSSH) private key", "-----BEGIN OPENSSH PRIVATE KEY-----"),
      ("SSH (DSA) private key", "-----BEGIN DSA PRIVATE KEY-----"),
      ("SSH (EC) private key", "-----BEGIN EC PRIVATE KEY-----"),
      ("PGP private key block", "-----BEGIN PGP PRIVATE KEY BLOCK-----"),
      ("Facebook Oauth", "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].{0,30}['\"\\s][0-9a-f]{32}['\"\\s]"),
      ("Twitter Oauth", "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].{0,30}['\"\\s][0-9a-zA-Z]{35,44}['\"\\s]"),
      ("GitHub", "[g|G][i|I][t|T][h|H][u|U][b|B].{0,30}['\"\\s][0-9a-zA-Z]{35,40}['\"\\s]"),
      ("Google Oauth", "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")"),
      ("AWS API Key", "AKIA[0-9A-Z]{16}"),
      ("Heroku API Key", "[h|H][e|E][r|R][o|O][k|K][u|U].{0,30}[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}"),
      ("Generic Secret", "[s|S][e|E][c|C][r|R][e|E][t|T].{0,30}['\"\\s][0-9a-zA-Z]{32,45}['\"\\s]"),
      ("Generic API Key", "[a|A][p|P][i|I][_]?[k|K][e|E][y|Y].{0,30}['\"\\s][0-9a-zA-Z]{32,45}['\"\\s]"),
      ("Slack Webhook", "https://hooks.slack.com/services/T[a-zA-Z0-9_]{8}/B[a-zA-Z0-9_]{8}/[a-zA-Z0-9_]{24}"),
      ("Google (GCP) Service-account", "\"type\": \"service_account\""),
      ("Twilio API Key", "SK[a-z0-9]{32}"),
      ("Password in URL", "[a-zA-Z]{3,10}://[^/\\s:@]{3,20}:[^/\\s:@]{3,20}@.{1,100}[\"'\\s]"),
  ];

    None
}

For implementing the actual regex matching, we'll need to get a little fancy since compiling regex is expensive.

It is an anti-pattern to compile the same regular expression in a loop since compilation is typically expensive. (It takes anywhere from a few microseconds to a few milliseconds depending on the size of the regex.) Not only is compilation itself expensive, but this also prevents optimizations that reuse allocations internally to the matching engines. (source)

To work around this issue, we'll be using Lazy from the library once_cell to ensure our regex is compiled only once.

(Thank you to Dr-Emann for the PR adding lazy regex compiling)

use once_cell::sync::Lazy;
use regex::bytes::RegexSet;

...

fn find_secrets(blob: &str) -> Option<Vec<String>> {
    const RULES: &[(&str, &str)] = [...];

        // Lazily compile regex set
        static REGEX_SET: Lazy<RegexSet> = Lazy::new(|| {
            RegexSet::new(RULES.iter().map(|&(_, regex)| regex)).expect("All regexes should be valid")
        });

        let matches = REGEX_SET.matches(blob);
        if !matches.matched_any() {
            return None
        }

    Some(matches.iter().map(|i| RULES[i].0).collect())
}

And with the find_secrets() function done, we can go ahead and test our code! Again, the easier way is to just test it against the git repo that you're using for the project:

Bildschirmfoto_vom_2021-10-29_13-15-04.png

Conclusion

At this point, we've succeeded at what we set out to create. I went ahead and scanned common testing repositories for this sort of thing like Plazmaz/leaky-repo and dijininja/leakyrepo. In general the program found all or most of the secrets. In the case of dijininja/leakyrepo it found a lot of RSA private keys which is acceptable but technically a misidentification. For Plazmaz/leaky-repo we find the majority of the keys although once again misidentify some. The decision to use rust makes performance really solid. A couple good extensions to this to help with that could be adding a thread pool in order to scan objects in parallel. In more professional code, it seems more idiomatic for the scan_objects() function to return some objects of objects including their results rather than just printing the one containing secrets. For example, it could be formatted something like this:

{
    objectID1: [
        secrets1,
        ...
    ],
    ...
}

In the end, this tool could work as a good starting point for something more sophisticated.

The final source code can be found here: https://github.com/jonaylor89/JAZ

Further Reading

Did you find this article valuable?

Support Johannes Naylor by becoming a sponsor. Any amount is appreciated!