Ultimate Guide to Removing Sensitive Data from Git History: Protect Your Codebase
Unlike traditional credentials, secrets are meant to be distributed to developers, applications, and infrastructure systems. Adding more of these factors will inevitably make the number of secrets used in a development cycle increase, leading to a natural sprawling phenomenon: secrets start to appear hardcoded in source code. From an organization’s point of view, visibility and control over their distribution start to degrade. This is what secrets sprawl is all about – GitGuardian
Steps to Remove Historic Commits with Sensitive Data
1. Understand the Problem
When sensitive data is committed to a Git repository:
- It remains in the commit history even if removed in later commits.
- Anyone with access to the repository can recover the data by inspecting the history.
The solution involves rewriting the commit history to permanently remove sensitive data.
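To make the problem concrete, here is a small sketch that builds a throwaway repository, commits a file, deletes it in a later commit, and then reads it back from history. The file name config.env and the key value are made up for the demo.

```shell
#!/bin/sh
# Demo: a file deleted in a later commit is still recoverable from history.
# Everything happens in a temporary directory; names and values are made up.

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

# One-off identity so the demo does not depend on global git config.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

printf 'API_KEY=hypothetical-secret-123\n' > config.env
g add config.env
g commit -q -m "Add config"

g rm -q config.env
g commit -q -m "Remove config"

# The file is gone from the working tree...
[ ! -e config.env ] && echo "working tree: config.env is gone"

# ...but the commit that added it is still reachable from the branch.
added_in=$(git log --all --diff-filter=A --format=%H -- config.env)
recovered=$(git show "$added_in:config.env")
echo "recovered from history: $recovered"
```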
2. Prerequisites
- Backup: Clone the repository and create a backup to prevent accidental data loss.
- Access: Ensure you have the necessary permissions to rewrite history and force-push changes.
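One possible way to take the backup is a mirror clone plus a single-file bundle of all refs. The function name backup_repo and the directory layout below are assumptions for illustration, not part of any tool.

```shell
# backup_repo <repo-url-or-path> <backup-dir>
# Makes a mirror clone (all branches and tags) and a bundle file of the same refs.
backup_repo() {
    src=$1
    mkdir -p "$2"
    dest=$(cd "$2" && pwd)   # absolute path, so it still resolves under git -C
    git clone --quiet --mirror "$src" "$dest/mirror.git" &&
    git -C "$dest/mirror.git" bundle create "$dest/backup.bundle" --all &&
    git -C "$dest/mirror.git" bundle verify "$dest/backup.bundle"
}
```

Called as, say, backup_repo /path/to/repo /path/to/backups, it leaves both copies in the backup directory. Keep them somewhere private: they still contain the secret.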
3. Choose a Method for Removal
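Two tools are commonly used to remove sensitive data from Git history. (The Git documentation now steers users away from git filter-branch toward git filter-repo, a separate tool not covered here.)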
A. Git Filter-Branch (Deprecated, but Widely Used)
git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch path/to/your/file' \
--prune-empty --tag-name-filter cat -- --all
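--index-filter: Runs the given command against the index of every commit. Here, git rm --cached --ignore-unmatch drops the file from each commit without checking out a working tree.
--prune-empty: Removes commits that become empty after file removal.
--tag-name-filter cat: Updates tags so they point at the rewritten commits.
--all: Rewrites all refs (branches and tags), not just the current branch.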
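git filter-branch keeps a copy of the pre-rewrite refs under refs/original/, so the old commits stay reachable until those refs are deleted. A minimal sketch for dropping them before the later cleanup step follows; the function name is an assumption.

```shell
# drop_filter_branch_backups
# Deletes every ref under refs/original/ in the current repository.
# Run this only after checking that the rewritten history looks right.
drop_filter_branch_backups() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
}
```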
B. BFG Repo-Cleaner (Recommended)
BFG Repo-Cleaner is faster and simpler than git filter-branch.
- Install BFG:
brew install bfg
Or download the BFG JAR file.
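- Use BFG to remove sensitive files. --delete-files matches on file name, not on a path inside the repository, so pass the bare name. BFG also leaves the contents of the latest commit untouched by default, so delete the file in a regular commit first:
bfg --delete-files your-file-name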
Or to remove specific strings (like API keys):
bfg --replace-text replace-patterns.txt
Format for replace-patterns.txt (one pattern per line; the text after ==> is the replacement, here an empty string):
YOUR_SECRET_KEY==>
- Clean up and repackage the repository so the old, now-unreachable objects are actually deleted:
git reflog expire --expire=now --all
git gc --prune=now --aggressive
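For the --replace-text option, BFG's documentation describes a few pattern forms beyond the one shown earlier. The sketch below writes a sample replace-patterns.txt; every value in it is a made-up placeholder, and the forms should be checked against the BFG version in use.

```shell
# Write a sample replace-patterns.txt into the current directory.
# Line 1: literal string, replaced with BFG's default marker (***REMOVED***).
# Line 2: literal string with an explicit replacement after ==>.
# Line 3: literal string replaced with the empty string.
# Line 4: regex: prefix, so the left-hand side is a regular expression.
cat > replace-patterns.txt <<'EOF'
hypothetical-password-1
hypothetical-password-2==>examplePass
hypothetical-password-3==>
regex:api_key=\w+==>api_key=
EOF
```

The explanations live in the shell comments rather than in the file because each line of the file is treated as a pattern.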
4. Push the Changes
Force push the cleaned history to the remote repository:
git push origin --force --all
git push origin --force --tags
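After the push, one way to check the result is to make a fresh clone and search its history for the secret. The helper below is a sketch with an assumed name; it relies on git log -S, which lists commits that add or remove a given string.

```shell
# secret_in_history <literal-string>
# Exit status 0 if any commit reachable from any ref adds or removes the string,
# non-zero otherwise. Run it inside a fresh clone of the cleaned remote.
secret_in_history() {
    [ -n "$(git log --all --format=%H -S"$1")" ]
}
```

For example, secret_in_history 'OLD_API_KEY' && echo "still present" flags a rewrite that missed something. It only sees commits reachable from refs in that clone, so a clean result says nothing about copies the hosting service may still cache.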
5. Notify Collaborators
- Let collaborators know the history was rewritten.
- Advise them to re-clone the repository rather than pull: merging or pushing from an old clone brings the removed commits back.
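Collaborators can check whether a local clone still holds pre-rewrite history if the maintainer shares the hash of one of the removed commits. The helper name is an assumption for illustration.

```shell
# clone_has_commit <sha>
# Exit status 0 if the current repository still contains that commit object.
clone_has_commit() {
    git cat-file -e "$1^{commit}" 2>/dev/null
}
```

A clone where this succeeds for a removed commit should be discarded and replaced with a fresh one.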
6. Invalidate Leaked Credentials
- Rotate any exposed API keys or credentials immediately, ideally before rewriting history: anything that reached a shared remote should be treated as compromised.
- Use tools like GitGuardian to detect any leaks.
Use Case Example
Imagine you accidentally commit an API key in a file named config.json. Here’s how to remove it:
- Identify the File: Check if the file is part of your history.
git log --all -- config.json
- Remove the File Using BFG:
bfg --delete-files config.json
- Replace the Key in File Contents: Deleting config.json does not help if the key was also pasted into other files. Create replace-patterns.txt:
OLD_API_KEY==>
Then run:
bfg --replace-text replace-patterns.txt
- Repackage and Push:
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push origin --force --all
git push origin --force --tags
- Rotate API Keys: Immediately invalidate the old API key and issue a new one.
Best Practices to Avoid Leaking Sensitive Data
- Use .gitignore: Prevent sensitive files from being committed.
- Git Hooks: Use pre-commit hooks to scan for sensitive data.
- Environment Variables: Store sensitive data outside the repository.
- Secret Scanning: Enable tools like GitHub's secret scanning or GitGuardian.
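As an illustration of the Git hooks item, here is a minimal sketch of a pre-commit check that scans staged additions for secret-looking assignments. The function names and the pattern list are assumptions, and a dedicated scanner will catch far more than this regular expression.

```shell
#!/bin/sh
# Sketch for .git/hooks/pre-commit (the file must be executable).

# staged_secrets: prints added lines in the index that look like secrets.
staged_secrets() {
    git diff --cached --unified=0 --no-color |
    grep -E '^\+[^+]' |
    grep -E -i '(api[_-]?key|secret|password|token)[[:space:]]*[:=]'
}

# pre_commit_check: returns 1 and reports the lines when something matches.
pre_commit_check() {
    hits=$(staged_secrets) || return 0   # no match: allow the commit
    printf 'Possible secret in staged changes:\n%s\n' "$hits" >&2
    return 1
}

# In the real hook file, finish the script with:
#   pre_commit_check || exit 1
```

A non-zero exit from the hook makes git abort the commit, so the flagged line never enters history.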