Analyzing Open Source development (part 3)

In the last post about analyzing open source development, I mentioned that this one would be about massaging people's information to get unique identities for all the project contributors.

But before that, I would like to explore something different. How to get data from multiple repositories? What happens when I want data from a whole GitHub organization’s or user’s repositories?

The obvious answer would be:
1. Let’s get the list of repositories:


import requests

def github_git_repositories(orgName):
    """Return the clone URLs of the repositories found for a GitHub organization."""
    url = 'https://api.github.com/search/repositories'
    query = "org:{}".format(orgName)
    page = 1
    repos = []

    while True:
        # Let requests build and encode the query string
        r = requests.get(url, params={'q': query, 'page': page})
        # Error responses (e.g. rate limiting) carry no 'items' key, so stop there
        items = r.json().get('items', []) if r.ok else []
        if not items:
            break
        repos.extend(item['clone_url'] for item in items)
        page += 1

    return repos

2. And now, for each repository in the list, run the code from the previous post to get a dataframe, and concatenate them all with:


df = pd.concat(dataframes)
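
As a rough sketch of that second step, the loop could look like the following. The names `combine_repositories` and `repo_to_dataframe` are hypothetical: `repo_to_dataframe` stands in for whatever code from the previous post turns one repository URL into a dataframe, and the extra `repository` column is an assumption added here so rows can be traced back to their source.

```python
import pandas as pd

def combine_repositories(repo_urls, repo_to_dataframe):
    """Build one dataframe per repository and concatenate them.

    repo_to_dataframe is a hypothetical callable: it receives a clone URL
    and returns a pandas DataFrame for that repository.
    """
    dataframes = []
    for url in repo_urls:
        # assign() returns a copy, so the caller's dataframe is not modified
        df = repo_to_dataframe(url).assign(repository=url)
        dataframes.append(df)

    # pd.concat raises ValueError on an empty list, so handle that case
    if not dataframes:
        return pd.DataFrame()
    return pd.concat(dataframes, ignore_index=True)
```

With `ignore_index=True` the combined dataframe gets a fresh 0..n-1 index instead of repeating each repository's own row numbers.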

For organizations or users with a few repositories, this works. But for those with hundreds of repositories, how long would it take to fetch and extract the data one repository at a time?

Is there a faster approach? Let’s play with threads and queues…
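
One possible shape for the threads-and-queues idea, offered as a minimal sketch and not necessarily the code used in the full post: a fixed pool of worker threads pulls repository URLs from a queue and pushes results into a second queue. The names `analyze_repositories`, `fetch_repo` and `num_workers` are assumptions made for illustration; `fetch_repo` stands in for whatever per-repository work you need to run.

```python
import queue
import threading

def analyze_repositories(repo_urls, fetch_repo, num_workers=8):
    """Run fetch_repo(url) for every URL using a pool of worker threads.

    fetch_repo is a hypothetical callable supplied by the caller.
    Returns a dict mapping each URL to its result, or to the exception
    raised while processing it.
    """
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            url = tasks.get()
            if url is None:           # sentinel: no more work for this thread
                tasks.task_done()
                break
            try:
                results.put((url, fetch_repo(url)))
            except Exception as error:
                # One failing repository should not stop the others
                results.put((url, error))
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    for url in repo_urls:
        tasks.put(url)
    for _ in threads:                 # one sentinel per worker
        tasks.put(None)

    tasks.join()                      # wait until every task is processed
    for thread in threads:
        thread.join()

    collected = {}
    while not results.empty():
        url, value = results.get()
        collected[url] = value
    return collected
```

Threads fit here because the work is mostly waiting on the network and on disk, so the workers overlap their waiting time. Keep in mind that API rate limits still apply no matter how many workers you start.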

Analyzing Open Source development (part 1)

A simple analysis of open source development in public administrations is easy to do. This post describes the first steps needed to reproduce the results shown in the previous post.

We’ll learn how to use Perceval. It’s the tool responsible for data retrieval in GrimoireLab, the free, open source software framework for software development analytics.

Take some coffee or tea, and let’s start!

