In the last post about analyzing open source development I mentioned that this one would be about massaging people's information to get unique identities for all the project contributors.
But before that, I would like to explore something different: how to get data from multiple repositories. What happens when I want data from all of a GitHub organization's or user's repositories?
The obvious answer would be:
1. Let’s get the list of repositories:
import requests

def github_git_repositories(orgName):
    # Search GitHub for the repositories belonging to the given organization
    query = "org:{}".format(orgName)
    page = 1
    repos = []
    r = requests.get('https://api.github.com/search/repositories?q={}&page={}'.format(query, page))
    items = r.json()['items']
    # The search API paginates results, so keep requesting pages until one comes back empty
    while len(items) > 0:
        for item in items:
            repos.append(item['clone_url'])
        page += 1
        r = requests.get('https://api.github.com/search/repositories?q={}&page={}'.format(query, page))
        items = r.json()['items']
    return repos
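For example, with a hypothetical organization name:

repos = github_git_repositories('some-organization')
print('{} repositories found'.format(len(repos)))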
2. And now, for each repository, run the code from the previous post to get a dataframe per repository in the list, and concatenate them all (see the sketch after this step) with:
df = pd.concat(dataframes)
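To make that step concrete, here is a minimal sketch, assuming a hypothetical repo_to_dataframe(clone_url) helper that stands in for the per-repository extraction code from the previous post:

import pandas as pd

def repo_to_dataframe(clone_url):
    # Hypothetical placeholder for the per-repository extraction code
    # from the previous post; assumed to return a DataFrame of commit data.
    raise NotImplementedError

dataframes = [repo_to_dataframe(url) for url in repos]
df = pd.concat(dataframes, ignore_index=True)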
For organizations or users with a few repositories, this works. But for those with hundreds of repositories, how long would it take to fetch and extract information one repository at a time?
Would there be a faster approach? Let's play with threads and queues…
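As a rough illustration of where this is heading (not necessarily the approach developed in the full post), here is a minimal sketch of fetching repositories in parallel with worker threads fed by a queue; the number of workers and the repo_to_dataframe helper are assumptions:

import threading
import queue

NUM_WORKERS = 8  # assumption: tune to the API rate limits and your machine

def worker(task_queue, results, lock):
    # Each worker pulls clone URLs from the queue and processes them
    # until the queue is exhausted.
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return
        df = repo_to_dataframe(url)  # hypothetical per-repository helper from above
        with lock:
            results.append(df)
        task_queue.task_done()

task_queue = queue.Queue()
for url in repos:
    task_queue.put(url)

results = []
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(task_queue, results, lock))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()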