Illustration from the research paper: the methodology for collecting secrets proceeds in several phases, which ultimately makes it possible to identify secret information with a high degree of confidence.

GitHub and similar platforms for publishing open source code have become a standard tool for developers. However, a problem arises when that code works with authentication tokens, private API keys, or private cryptographic keys. For security, this data must be kept secret. Unfortunately, many developers add sensitive information to their code, which often leads to accidental leaks.
A group of researchers from North Carolina State University conducted a large-scale study of secret data leaks on GitHub. They scanned billions of files collected with two complementary methods:
- Real-time scanning of public GitHub commits over nearly six months
- A snapshot of publicly accessible repositories covering 13% of all repositories on GitHub, about 4 million repositories in total
The findings are discouraging. The researchers found that leaks are widespread, affecting more than 100,000 repositories. Even worse, thousands of new, unique “secrets” land on GitHub every day.
The table lists the APIs of popular services and the risks associated with leaking this information.

The overall statistics on the secrets found show that Google API keys are exposed most often. RSA private keys and Google OAuth IDs are also common. Notably, the vast majority of leaks occur in repositories with a single owner.
Secret | Total | Unique | % single-owner |
---|---|---|---|
Google API Key | 212,892 | 85,311 | 95.10% |
RSA Private Key | 158,011 | 37,781 | 90.42% |
Google OAuth ID | 106,909 | 47,814 | 96.67% |
General Private Key | 30,286 | 12,576 | 88.99% |
Amazon AWS Access Key ID | 26,395 | 4,648 | 91.57% |
Twitter Access Token | 20,760 | 7,953 | 94.83% |
EC Private Key | 7,838 | 1,584 | 74.67% |
Facebook Access Token | 6,367 | 1,715 | 97.35% |
PGP Private Key | 2,091 | 684 | 82.58% |
MailGun API Key | 1,868 | 742 | 94.25% |
MailChimp API Key | 871 | 484 | 92.51% |
Stripe Standard API Key | 542 | 213 | 91.87% |
Twilio API Key | 320 | 50 | 90.00% |
Square Access Token | 121 | 61 | 96.67% |
Square OAuth Secret | 28 | 19 | 94.74% |
Amazon MWS Auth Token | 28 | 13 | 100.00% |
Braintree Access Token | 24 | 8 | 87.50% |
Picatic API Key | 5 | 4 | 100.00% |
Total | 575,456 | 201,642 | 93.58% |
Real-time monitoring of commits allowed the researchers to determine how much sensitive information is removed from repositories shortly after it appears. It turned out that slightly more than 10% of secrets are deleted on the first day, and a few more percent over the following days. However, more than 80% of the private information is still in the repositories two weeks after it was added, and this share barely decreases afterwards.
Among the most notable leaks are the AWS account of a government agency in an Eastern European country, as well as 7,280 RSA private keys that give access to thousands of VPNs.
The study demonstrates that an attacker, even with minimal resources, can compromise many GitHub users and find a large number of secret keys. The authors note that many existing defenses are ineffective against this kind of secret harvesting. For example, tools like TruffleHog were only about 25% effective. The built-in GitHub limit on the number of API requests is also easy to work around.
However, many of the discovered secrets follow clear patterns that make them easy to search for. It is logical to assume that these same patterns can be used to monitor for leaks of secret information and to warn developers. Such mechanisms should probably be implemented on the server side, that is, on GitHub itself. The service could issue a warning at the moment of a commit.
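To make the idea of pattern-based detection concrete, here is a minimal Python sketch. It is not the researchers' code: the names `SECRET_PATTERNS` and `find_secrets` are hypothetical, and the three regular expressions are assumptions based on widely documented key formats, not the exact expressions used in the study.

```python
import re

# Illustrative patterns only: widely documented public key formats,
# assumed here for the example and not taken from the study itself.
SECRET_PATTERNS = {
    "Google API Key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "Amazon AWS Access Key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
    "RSA Private Key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return (secret_type, line_number, match) tuples for every hit in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for secret_type, pattern in SECRET_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((secret_type, line_number, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        'aws_key = "AKIAIOSFODNN7EXAMPLE"',  # placeholder key, not a real credential
        "timeout = 30",
        "-----BEGIN RSA PRIVATE KEY-----",
    ])
    for secret_type, line_number, value in find_secrets(sample):
        print(f"line {line_number}: possible {secret_type}: {value}")
```

A check like this could run in a pre-commit hook or on the server when a commit arrives. A match only marks a candidate: a real scanner would need further filtering to weed out placeholder and test values.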
GitHub recently introduced the Token Scanning feature, which scans repositories, searches for tokens, and notifies service providers of leaks. The provider can then revoke the exposed key. The authors believe that their research can help GitHub improve this feature and expand the number of participating providers.