
Leaks of secret information found in 100,000 repositories on GitHub


The secret collection methodology consists of several phases, which ultimately makes it possible to identify secrets with a high degree of confidence. Illustration from the research paper

GitHub and similar platforms for publishing open source code have become standard tools for developers. A problem arises, however, when that code works with authentication tokens, private API keys, or private cryptographic keys. For security, this data must be kept secret. Unfortunately, many developers add sensitive information directly to their code, which often leads to accidental leaks.

A group of researchers from North Carolina State University conducted a large-scale study of secret data leaks on GitHub. They scanned billions of files collected using two complementary methods.

The conclusions are disappointing. The researchers found not only that leaks are widespread, affecting more than 100,000 repositories, but also that thousands of new, unique “secrets” land on GitHub every day.

The table lists the APIs of popular services and the risks associated with leaking this information.



The overall statistics on the secrets found show that Google API keys are exposed most often. RSA private keys and Google OAuth IDs are also common. Notably, the vast majority of leaks occur in repositories with a single owner.

Secret                       Total      Unique     % one owner
Google API Key               212,892    85,311     95.10%
RSA private key              158,011    37,781     90.42%
Google OAuth ID              106,909    47,814     96.67%
General private key          30,286     12,576     88.99%
Amazon AWS Access Key ID     26,395     4,648      91.57%
Twitter access token         20,760     7,953      94.83%
EC private key               7,838      1,584      74.67%
Facebook access token        6,367      1,715      97.35%
PGP private key              2,091      684        82.58%
MailGun API Key              1,868      742        94.25%
MailChimp API Key            871        484        92.51%
Stripe Standard API Key      542        213        91.87%
Twilio API Key               320        50         90.00%
Square Access Token          121        61         96.67%
Square OAuth Secret          28         19         94.74%
Amazon MWS Auth Token        28         13         100.00%
Braintree Access Token       24         8          87.50%
Picatic API Key              5          4          100.00%
Total                        575,456    201,642    93.58%

Monitoring commits in real time allowed the researchers to determine how much sensitive information is removed from repositories shortly after it is committed. It turned out that slightly more than 10% of secrets are deleted on the first day and a few percent more over the following days, but more than 80% of the secrets are still in the repositories two weeks after being added, and this share barely decreases afterwards.

Among the most notable leaks are the AWS account of a government agency in an Eastern European country, as well as 7,280 RSA private keys that provide access to thousands of private VPNs.

The study demonstrates that an attacker, even with minimal resources, can compromise many GitHub users and find a large number of secret keys. The authors note that many existing protection methods are ineffective against the harvesting of secrets. For example, tools like TruffleHog are only about 25% effective at detecting them. GitHub's built-in API rate limit is also easy to work around.

However, many of the discovered secrets follow clear patterns that make them easy to find. It is logical to assume that these same patterns can be used to monitor for leaks and warn developers. Such mechanisms should probably be implemented on the server side, that is, on GitHub itself. The service could issue a warning directly at commit time.
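To illustrate how pattern-based detection of this kind might look, here is a minimal Python sketch. The three regular expressions rely on widely documented key formats (the `AKIA` prefix of AWS access key IDs, the `AIza` prefix of Google API keys, the PEM header of RSA private keys); they are assumptions chosen for illustration, not the exact rules used in the study or by GitHub. The names `SECRET_PATTERNS` and `find_secrets` are hypothetical.

```python
import re

# Illustrative patterns only: the prefixes and lengths are assumptions
# based on commonly documented key formats, not the study's exact rules.
SECRET_PATTERNS = {
    "Amazon AWS Access Key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Google API Key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "RSA private key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return (line_number, secret_type) pairs for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for secret_type, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, secret_type))
    return findings
```

A check like this could, in principle, run over the text of a commit before it is accepted and print a warning for each finding. A real scanner would need many more patterns and some way to filter out false positives such as placeholder or example keys.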

GitHub recently introduced the Token Scanning feature, which scans repositories for tokens and notifies service providers of leaks. The provider can then revoke the leaked key. The authors believe that their research can help GitHub improve this feature and expand the number of supported providers.

Source: https://habr.com/ru/post/445038/

