Ultimate Guide to Removing Sensitive Data from Git History: Protect Your Codebase

Discover the definitive methods to remove sensitive data like API keys from Git history. Learn how to safeguard your repository, rewrite commit history, and prevent future leaks using best practices.

Unlike traditional credentials, secrets are meant to be distributed to developers, applications, and infrastructure systems. Adding more of these factors will inevitably make the number of secrets used in a development cycle increase, leading to a natural sprawling phenomenon: secrets start to appear hardcoded in source code. From an organization’s point of view, visibility and control over their distribution start to degrade. This is what secrets sprawl is all about – GitGuardian

Steps to Remove Historic Commits with Sensitive Data

1. Understand the Problem

When sensitive data is committed to a Git repository:

  • It remains in the commit history even if removed in later commits.
  • Anyone with access to the repository can recover the data by inspecting the history.

The solution involves rewriting the commit history to permanently remove sensitive data.
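
The recovery path described above can be seen in a throwaway repository. This is a minimal sketch that assumes git is installed; the file name config.txt and the key FAKE_KEY_123 are made up for illustration.

```shell
#!/bin/sh
# Demo: a secret removed in a later commit is still readable from history.
set -eu

repo=$(mktemp -d)          # throwaway repository, safe to delete afterwards
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

# Commit 1 adds the (fake) secret.
echo 'api_key=FAKE_KEY_123' > config.txt
git add config.txt
git commit -q -m "Add config"

# Commit 2 removes it from the current version of the file.
echo 'api_key=' > config.txt
git commit -q -am "Remove key"

# The previous commit still holds the old content, secret included.
recovered=$(git show HEAD~1:config.txt)
echo "$recovered"
```

The last command prints the original line with the key, even though the current config.txt no longer contains it.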


2. Prerequisites

  • Backup: Clone the repository and create a backup to prevent accidental data loss.
  • Access: Ensure you have the necessary permissions to rewrite history and force-push changes.
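
One way to take the backup is a mirror clone, which copies every branch, tag, and other ref into a bare repository. The following is only a sketch; the helper name backup_repo and the URL in the usage line are hypothetical.

```shell
#!/bin/sh
# backup_repo SOURCE DEST
# Create a bare mirror clone of SOURCE at DEST, including all refs.
backup_repo() {
    git clone --quiet --mirror "$1" "$2"
}
```

It could be called like this, with your own remote URL in place of the placeholder:

    backup_repo git@example.com:your-org/your-repo.git your-repo-backup.git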

3. Choose a Method for Removal

There are two primary tools for removing sensitive data from Git history:

A. Git Filter-Branch (Deprecated, but Widely Used)

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch path/to/your/file' \
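--prune-empty --tag-name-filter cat -- --all

  • --index-filter: Runs the quoted command against the index of every commit, without checking out the working tree, so the file is removed from each commit that contains it.
  • --ignore-unmatch: Lets git rm succeed on commits where the file does not exist.
  • --prune-empty: Removes commits that become empty after file removal.
  • --tag-name-filter cat: Updates tags to point to the rewritten commits while keeping their original names.
  • -- --all: Passes all refs to the rewrite, so every branch is processed.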

B. BFG Repo-Cleaner (Recommended)

BFG Repo-Cleaner is faster and simpler than git filter-branch.

  1. Install BFG:

    brew install bfg
    

    Or download the BFG JAR file and run it as java -jar bfg.jar in place of bfg (a Java runtime is required).

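  2. Use BFG to remove sensitive files. Note that --delete-files matches on the file name, not on a path within the repository. By default BFG also leaves the latest commit untouched, so delete the file in a normal commit first:

    bfg --delete-files your-file-name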
    

    Or to remove specific strings (like API keys):

    bfg --replace-text replace-patterns.txt
    

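    Format for replace-patterns.txt (one pattern per line; the text after ==> is the replacement, so a trailing ==> replaces the match with an empty string, and a line without ==> is replaced with ***REMOVED***):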

    YOUR_SECRET_KEY==>
    
  3. Clean up and repackage the repository:

    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    

4. Push the Changes

Force push the cleaned history to the remote repository:

git push origin --force --all
git push origin --force --tags

5. Notify Collaborators

  • Let collaborators know the history was rewritten.
  • Advise them to re-clone the repository rather than pull. Merging or pushing from an old clone would reintroduce the removed commits.

6. Invalidate Leaked Credentials

  • Rotate the API keys or sensitive credentials immediately, as they might have been compromised.
  • Use tools like GitGuardian to detect any leaks.
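
Alongside an external scanner, a known key can be searched for across the local history. The following is a rough check, assuming the exact key string is known; the function name secret_in_history is made up for this example. It relies on git log -S, which lists commits that add or remove occurrences of a string, and it only covers commits reachable from a ref.

```shell
#!/bin/sh
# secret_in_history SECRET
# Exit 0 if any commit reachable from any ref adds or removes SECRET,
# exit 1 otherwise. Run it from inside the repository to check.
secret_in_history() {
    git log --all --oneline -S"$1" | grep -q .
}
```

A possible use after the rewrite, with OLD_API_KEY standing in for the real value:

    if secret_in_history "OLD_API_KEY"; then echo "still in history"; else echo "clean"; fi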

Use Case Example

Imagine you accidentally commit an API key in a file named config.json. Here’s how to remove it:

  1. Identify the File: Check if the file is part of your history.

    git log --all -- config.json
    
  2. Remove the File Using BFG:

    bfg --delete-files config.json
    
  3. Replace the Key in Other Files: If the key was also copied into files you want to keep, create replace-patterns.txt:

    OLD_API_KEY==>
    

    Then run:

    bfg --replace-text replace-patterns.txt
    
  4. Repackage and Push:

    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git push origin --force --all
    git push origin --force --tags
    
  5. Rotate API Keys: Invalidate the old API key and issue a new one. Do this as soon as the leak is discovered, without waiting for the history rewrite to finish.


Best Practices to Avoid Leaking Sensitive Data

  1. Use .gitignore: Prevent sensitive files from being committed.
  2. Git Hooks: Use pre-commit hooks to scan for sensitive data.
  3. Environment Variables: Store sensitive data outside the repository.
  4. Secret Scanning: Enable tools like GitHub's secret scanning or GitGuardian.
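
As an illustration of the Git hooks item, here is a small sketch of a pre-commit hook. The patterns it looks for are assumptions chosen for the example, not an exhaustive rule set, and the helper name has_secret is hypothetical; a dedicated scanner will catch far more.

```shell
#!/bin/sh
# Sketch of a pre-commit hook: save as .git/hooks/pre-commit and make it
# executable. The pattern below is illustrative only.

# has_secret: read text on stdin, succeed if any line looks like a
# hardcoded credential such as "api_key = value" or "password: value".
has_secret() {
    grep -E -q -i '(api[_-]?key|secret|password)[[:space:]]*[=:][[:space:]]*[^[:space:]]+'
}

# Inspect only the lines added by the staged changes, skipping the
# "+++ b/file" header lines of the diff.
if git diff --cached -U0 2>/dev/null | grep '^+' | grep -v '^+++' | has_secret; then
    echo "Possible secret in staged changes; commit aborted." >&2
    exit 1
fi
```

A non-zero exit status from the hook makes git abort the commit, so the flagged change never enters the history in the first place.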

The source code is available on GitHub
© 2024 All rights reserved. Made with 🖤 by Umair Jibran.