
Star Wars, Good Food, and Astronomy: Tracking GitHub Traffic with a Serverless Architecture

Learn how to track, store, and analyze GitHub traffic metrics using AWS Lambda, Amazon DynamoDB, and Amazon S3.

Mahanth Jayadeva
Amazon Employee
Published Apr 25, 2023

At first glance, it may seem like Star Wars, good food, and astronomy have little in common. However, they all share a surprising connection to the world of GitHub metrics: clones, forks, and stars. These are some of the common metrics developers use to gain insight into the popularity and usage of their repositories. Let's delve into the world of GitHub metrics and explore how we can track and store them for analysis.

With the recent launch of the aws-resilience-hub-tools repository, I needed a way to track the traffic to the repository to get a better understanding of my customers’ needs. Like many others, I decided that metrics like stars, clones, forks, and views would allow me to get a pulse for the repository. Although GitHub provides this data, it is limited to the last 14 days. So I embarked on a journey to create a solution that would allow me to capture, store, and analyze this information using GitHub APIs and AWS serverless. Let's dive in!

“A picture is worth a thousand words,” they say, so let's look at how this solution works and then walk through how to put it all together.

Architecture
  1. We start with an Amazon EventBridge rule that periodically invokes an AWS Lambda function.
  2. The Lambda function makes API calls to GitHub to fetch traffic data.
  3. For persistence, the data is stored in an Amazon DynamoDB table.
  4. Data is then exported as a CSV file from the DynamoDB table and stored in an Amazon S3 bucket.
  5. The CSV file is then ingested into Amazon QuickSight for visualization.

Simple, powerful, and cost efficient!

We start with the Lambda function, which does most of the magic.

PyGithub is an awesome library that allows us to query the GitHub APIs using Python, which is one of the supported runtimes for Lambda functions. We start by importing boto3 (the AWS SDK for Python), the csv module, and PyGithub, and then authenticating to GitHub. Note that I'm using a personal access token here, but you can use other forms of authentication (though they might require additional code/configuration). The personal access token is stored as a parameter in AWS Systems Manager Parameter Store and retrieved dynamically when the Lambda function runs. Remember, hard-coding credentials into application code is bad practice and should be avoided.

import boto3
from github import Github
import csv

def lambda_handler(event, context):
    # Retrieve the personal access token from SSM Parameter Store
    access_token = get_ssm_parameter('arh-traffic-stats')

    # Authenticate with GitHub using the PyGitHub library
    g = Github(access_token)

    # Replace <owner> and <repo> with your repository information
    repo = g.get_repo("<owner>/<repo>")
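
The get_ssm_parameter call above is a small helper around the SSM GetParameter API; a version of it appears in the full listing later in the post. As a minimal sketch, it might look like the following, with WithDecryption=True added on the assumption that the token is stored as a SecureString (the flag is ignored for plain String parameters):

def get_ssm_parameter(param_name):
    # Fetch a value from AWS Systems Manager Parameter Store.
    # WithDecryption=True is an assumption: required for SecureString
    # parameters, ignored for plain String parameters.
    ssm = boto3.client('ssm')
    response = ssm.get_parameter(Name=param_name, WithDecryption=True)
    return response['Parameter']['Value']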

Next, we query GitHub for the data we need. PyGithub's get_views_traffic(), get_clones_traffic(), and get_top_paths() methods provide this data.

Additionally, we can get the count of stars and forks on the repo from the repo object.

traffic = repo.get_views_traffic()
clones = repo.get_clones_traffic()
stars = repo.stargazers_count
top_paths = repo.get_top_paths()
forks = repo.forks_count
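
Before persisting anything, it can be handy to print what these calls return. The following quick check is just a sketch for inspection (not part of the final function) and uses only the fields the rest of the post relies on:

# Quick inspection of the traffic data before persisting it.
# GitHub only returns roughly the last 14 days of views and clones.
print(f"Stars: {stars}, Forks: {forks}")

for view in traffic['views']:
    print(f"{view.timestamp:%Y-%m-%d}: {view.count} views")

for clone in clones['clones']:
    print(f"{clone.timestamp:%Y-%m-%d}: {clone.count} clones")

for path in top_paths:
    print(f"{path.path}: {path.count} views")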

We initialize the Boto client for DynamoDB and then persist the traffic data in the DynamoDB table after making sure we are not duplicating the data.

# Initialize DynamoDB client and specify table name (replace <table_name>)
dynamodb = boto3.client('dynamodb')
table_name = '<table_name>'

# Process the traffic data and store it in DynamoDB
for view in traffic['views']:
    timestamp = view.timestamp.strftime('%Y-%m-%d %H:%M:%S')
    views_count = view.count

    # Find the clone entry for the same day, if one exists
    clone = next((c for c in clones['clones'] if c.timestamp == view.timestamp), None)
    clones_count = clone.count if clone else 0

    # Check if the item already exists in DynamoDB
    response = dynamodb.get_item(TableName=table_name, Key={'timestamp': {'S': timestamp}})

    # If the item does not exist, put the item in DynamoDB
    if 'Item' not in response:
        item = {
            'timestamp': {'S': timestamp},
            'views': {'N': str(views_count)},
            'clones': {'N': str(clones_count)},
            'stars': {'N': str(stars)},
            'forks': {'N': str(forks)}
        }
        dynamodb.put_item(TableName=table_name, Item=item)
    else:
        print(f"Item with timestamp {timestamp} already exists in DynamoDB. Skipping.")

print('Data stored in DynamoDB')
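
The code above assumes the table already exists, with timestamp as its string partition key (the same key used by the get_item and put_item calls). If you haven't created it yet, a one-time setup sketch with on-demand billing might look like this; the table name is a hypothetical placeholder that should match table_name above:

# One-time table setup (hypothetical table name; match it to table_name above).
dynamodb.create_table(
    TableName='github-traffic-stats',
    AttributeDefinitions=[{'AttributeName': 'timestamp', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'timestamp', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST'  # on-demand capacity, nothing to provision
)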

Finally, we export the data from the table and create a CSV file. We create an additional CSV file for the top paths data, as it is not timestamped and doesn't make sense to persist in DynamoDB. The CSV files are then stored in an S3 bucket for analysis.

# Scan the full table, paginating if necessary
scan_response = dynamodb.scan(
    TableName=table_name
)

items = scan_response['Items']

while 'LastEvaluatedKey' in scan_response:
    scan_response = dynamodb.scan(
        TableName=table_name,
        ExclusiveStartKey=scan_response['LastEvaluatedKey']
    )
    items += scan_response['Items']

# Replace <bucket_name> with your bucket
bucket = '<bucket_name>'

# Write the historical views/clones/stars/forks data to a CSV file
with open('/tmp/views_clones_stars.csv', 'w', newline='') as file:
    fieldnames = ['timestamp', 'clones', 'views', 'stars', 'forks']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()

    for item in items:
        row = {'timestamp': item['timestamp']['S'], 'clones': item['clones']['N'], 'views': item['views']['N'], 'stars': item['stars']['N'], 'forks': item['forks']['N']}
        writer.writerow(row)

# Upload the CSV file to S3
s3 = boto3.resource('s3')
s3.Bucket(bucket).upload_file('/tmp/views_clones_stars.csv', 'views_clones_stars/views_clones_stars.csv')

# Write the top paths data (last 14 days only) to a separate CSV file
with open('/tmp/top_paths.csv', 'w', newline='') as file:
    fieldnames = ['path', 'views']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()

    for path in top_paths:
        row = {'path': path.path, 'views': path.count}
        writer.writerow(row)

s3.Bucket(bucket).upload_file('/tmp/top_paths.csv', 'top_paths/top_paths.csv')

Putting it all together, this is what the Lambda function looks like. When deploying it to Lambda, the PyGitHub library needs to be packaged with the function code, as it is not natively available in the Lambda runtime (for example, by installing it into the deployment package with pip install --target, or by attaching it as a Lambda layer).

import boto3
from github import Github
import csv

# Function to retrieve the GitHub token from SSM Parameter Store
def get_ssm_parameter(param_name):
    client = boto3.client('ssm')
    response = client.get_parameter(
        Name=param_name
    )
    return response['Parameter']['Value']

def lambda_handler(event, context):
    # Retrieve the personal access token from SSM Parameter Store (replace <parameter_name>)
    access_token = get_ssm_parameter('<parameter_name>')

    # Authenticate with GitHub using the PyGitHub library
    g = Github(access_token)

    # Replace <owner> and <repo> with your repository information
    repo = g.get_repo("<owner>/<repo>")
    traffic = repo.get_views_traffic()
    clones = repo.get_clones_traffic()
    stars = repo.stargazers_count
    top_paths = repo.get_top_paths()
    forks = repo.forks_count

    # Initialize DynamoDB client and specify table name (replace <table_name>)
    dynamodb = boto3.client('dynamodb')
    table_name = '<table_name>'

    # Process the traffic data and store it in DynamoDB
    for view in traffic['views']:
        timestamp = view.timestamp.strftime('%Y-%m-%d %H:%M:%S')
        views_count = view.count

        # Find the clone entry for the same day, if one exists
        clone = next((c for c in clones['clones'] if c.timestamp == view.timestamp), None)
        clones_count = clone.count if clone else 0

        # Check if the item already exists in DynamoDB
        response = dynamodb.get_item(TableName=table_name, Key={'timestamp': {'S': timestamp}})

        # If the item does not exist, put the item in DynamoDB
        if 'Item' not in response:
            item = {
                'timestamp': {'S': timestamp},
                'views': {'N': str(views_count)},
                'clones': {'N': str(clones_count)},
                'stars': {'N': str(stars)},
                'forks': {'N': str(forks)}
            }
            dynamodb.put_item(TableName=table_name, Item=item)
        else:
            print(f"Item with timestamp {timestamp} already exists in DynamoDB. Skipping.")

    print('Data stored in DynamoDB')

    # Scan the full table, paginating if necessary
    scan_response = dynamodb.scan(
        TableName=table_name
    )

    items = scan_response['Items']

    while 'LastEvaluatedKey' in scan_response:
        scan_response = dynamodb.scan(
            TableName=table_name,
            ExclusiveStartKey=scan_response['LastEvaluatedKey']
        )
        items += scan_response['Items']

    # Replace <bucket_name> with your bucket
    bucket = '<bucket_name>'

    # Write the historical views/clones/stars/forks data to a CSV file
    with open('/tmp/views_clones_stars.csv', 'w', newline='') as file:
        fieldnames = ['timestamp', 'clones', 'views', 'stars', 'forks']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()

        for item in items:
            row = {'timestamp': item['timestamp']['S'], 'clones': item['clones']['N'], 'views': item['views']['N'], 'stars': item['stars']['N'], 'forks': item['forks']['N']}
            writer.writerow(row)

    # Upload the CSV file to S3
    s3 = boto3.resource('s3')
    s3.Bucket(bucket).upload_file('/tmp/views_clones_stars.csv', 'views_clones_stars/views_clones_stars.csv')

    # Write the top paths data (last 14 days only) to a separate CSV file
    with open('/tmp/top_paths.csv', 'w', newline='') as file:
        fieldnames = ['path', 'views']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()

        for path in top_paths:
            row = {'path': path.path, 'views': path.count}
            writer.writerow(row)

    s3.Bucket(bucket).upload_file('/tmp/top_paths.csv', 'top_paths/top_paths.csv')

Using QuickSight, I can ingest this data and create two datasets. This requires creating a manifest file for each dataset that tells QuickSight where the data is stored. We use the following manifest files for the datasets.

For the dataset that will contain information on views, stars, forks, and clones (note that the S3 path to the CSV needs to be updated):

{
    "fileLocations": [
        {
            "URIPrefixes": [
                "s3://PATH_TO_views_clones_stars.csv"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "delimiter": ",",
        "containsHeader": "true"
    }
}

For the dataset that will contain information on top paths for the last two weeks (note that the S3 path to the CSV needs to be updated):

{
    "fileLocations": [
        {
            "URIPrefixes": [
                "s3://PATH_TO_top_paths.csv"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "delimiter": ",",
        "containsHeader": "true"
    }
}

After the datasets have been created, we create an analysis to define how the data is visualized. Once the analysis has been created, we need to make sure both our datasets are added. The analysis can then be used to publish a beautiful dashboard like the one below (at least in my humble opinion).

Dashboard

Finally, we add a trigger to the Lambda function. For this solution, it will be an EventBridge (CloudWatch Events) trigger, and we will use a rate expression to invoke the function every 13 days, just inside GitHub's 14-day traffic window so no days are missed.

trigger
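
The screenshot above shows the trigger configured in the console, but the same schedule can also be wired up programmatically. Here is a minimal boto3 sketch; the rule name and function ARN are hypothetical placeholders:

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Hypothetical names and ARNs; replace with your own.
rule_name = 'github-traffic-stats-schedule'
function_arn = 'arn:aws:lambda:us-east-1:111122223333:function:github-traffic-stats'

# Rate expression matching the post: run every 13 days so each run stays
# inside GitHub's 14-day traffic window.
rule_arn = events.put_rule(
    Name=rule_name,
    ScheduleExpression='rate(13 days)',
    State='ENABLED'
)['RuleArn']

# Point the rule at the Lambda function...
events.put_targets(
    Rule=rule_name,
    Targets=[{'Id': 'github-traffic-stats-lambda', 'Arn': function_arn}]
)

# ...and allow EventBridge to invoke it.
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId='allow-eventbridge-schedule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)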

Note that the entire solution has been built using AWS serverless technology: Lambda for compute, DynamoDB for the database, S3 for storage, QuickSight for analytics, and EventBridge for scheduling. This means there's no infrastructure to manage, and we only pay for what we use (almost all of this is included in the free tier). Since all the services used are managed by AWS, we get the added benefit of improved security and reliability: no security groups or NACLs to manage, and no multi-AZ configuration required, since the services are multi-AZ by default (yay, high availability!).

Remember, you cannot manage what you cannot measure. With all the awesome open-source projects available on GitHub, it's important to understand the impact of your hard work. We've explored a solution to track key GitHub metrics like clones, stars, and forks, store them beyond GitHub's 14-day limit, and analyze the data to gain insights. This approach can be expanded to track additional metrics such as issues, pull requests, and any other data that you may find useful.

So whether you're exploring a galaxy far, far away or tracking the performance of your code, the importance of metrics cannot be overstated. You too can use the power of serverless architectures to track the success of your repositories.


Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.