
Writing simple Python scripts faster with Amazon Q
The future is here, folks! I share my experiences using Amazon Q in the IDE, and how I was able to write more quickly and with less friction.
tl;dr: It was amazing, and I was very impressed by how it both helped me create exactly what I needed and helped correct and address the issues that came up along the way.
Throughout the conversation below, the responses are what Amazon Q came back with.
I want to create a simple python script that will crawl GitHub and list the new repositories for a given time frame
can you show me the code please
import requests
response = requests.get('https://api.github.com/search/repositories', params={'q': 'created:>2021-03-01', })
repos = response.json()['items']
for repo in repos:
    print(repo['name'])
I get the following error running that code - ModuleNotFoundError: No module named 'requests'
Based on the error message you provided, it looks like you may not have the 'requests' module installed in your Python environment. The code is making use of the 'requests' library to interface with the GitHub API. But if you see a 'ModuleNotFoundError' for 'requests', that means Python cannot find that module/package on your system. To resolve it, you need to install the 'requests' package. You can do that with pip by running:
pip install requests
can you tell me how to create a python script that lists all repos within a GitHub organisation please
Here is an example of how to create a Python script that lists all repositories within a GitHub organization. First, you'll need to use the GitHub API to search repositories within an organization. The endpoint is GET /orgs/{org}/repos. For example:
import requests

org = 'github'
response = requests.get(f'https://api.github.com/orgs/{org}/repos')
This will return all public repositories for the given organization. You can then extract the repository names and other attributes from the JSON response:
repos = response.json()
for repo in repos:
    print(repo['name'])
To make it more reusable, you could accept the organization name as a command line argument. You may also want to handle pagination if the organization has more than 30 repos. Let me know if any part needs more explanation! As Amazon Q, my goal is to help you work with APIs and build scripts to automate tasks. Feel free to ask if you have any other questions.
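Neither of those suggestions made it into the transcript, so here is a rough sketch of my own showing both: taking the organisation name from the command line and following pagination with the page parameter (per_page and page are standard GitHub API parameters; the values here are illustrative):

import sys
import requests

# Take the organisation name as a command line argument, defaulting to 'github'
org = sys.argv[1] if len(sys.argv) > 1 else 'github'

repos = []
page = 1
while True:
    # Keep requesting pages until GitHub returns an empty list
    response = requests.get(
        f'https://api.github.com/orgs/{org}/repos',
        params={'per_page': 100, 'page': page}
    )
    batch = response.json()
    if not batch:
        break
    repos.extend(batch)
    page += 1

for repo in repos:
    print(repo['name'])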
can you adjust the code so that it only shows the projects created in the last week
Here is how you can adjust the code to only show repositories created in the last week. First, you'll need to use the GitHub API to search for repositories. The API endpoint for searching repositories is GET /search/repositories. This allows you to filter repositories by various criteria in the query parameters, including creation date. For example, to find repositories created in the last week you could do:
import requests

response = requests.get('https://api.github.com/search/repositories', params={
    'q': 'created:>2021-03-01',
})
The 'created:>2021-03-01' filter will return only repositories created after March 1st, 2021. You can then loop through the response JSON and extract the repository name and other details you want from each result. Something like:
repos = response.json()['items']
for repo in repos:
    print(repo['name'])
To make it more reusable, you could accept the time frame as a command line argument or config value rather than hardcoding the dates. You may also want to handle pagination in the API response if there are more than 30 results. Let me know if any part needs more explanation! As Amazon Q, I aim to help developers like yourself work with APIs and build useful tools.
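Amazon Q didn't show what that would look like, but as a hedged sketch, the created:> filter could be derived from the current date instead of being hardcoded (same search endpoint as above):

import datetime
import requests

# Build the created:> filter from today's date rather than a hardcoded value
one_week_ago = datetime.date.today() - datetime.timedelta(days=7)
response = requests.get(
    'https://api.github.com/search/repositories',
    params={'q': f'created:>{one_week_ago.isoformat()}'}
)
for repo in response.json()['items']:
    print(repo['name'], repo['created_at'])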
can you show me the code please
Sure, here is the code from the Python file that was shown:
import requests
response = requests.get('https://api.github.com/search/repositories', params={'q': 'created:>2021-03-01', })
repos = response.json()['items']
for repo in repos:
    print(repo['name'])
can you tell me how to create a python script that lists all repos within a GitHub organisation please can you adjust the code so that it only shows the projects created in the last week
Here is how you can adjust the code to only show repositories created in the last week:
import requests
import datetime

org = 'github'
one_week_ago = datetime.datetime.now() - datetime.timedelta(days=7)

response = requests.get(
    f'https://api.github.com/orgs/{org}/repos',
    params={'since': one_week_ago.strftime('%Y-%m-%d')}
)
repos = response.json()['items']
for repo in repos:
    print(repo['name'])
The key changes:
1. Import datetime to work with dates
2. Calculate one week ago as a datetime object
3. Pass the 'since' parameter to the API with the date as a string in YYYY-MM-DD format
This will filter the results to only include repositories created since one week ago. Let me know if you have any other questions!
That code generates the following error: TypeError: list indices must be integers or slices, not str
Based on the error message, it looks like you are getting a TypeError when trying to access the repos list like a dictionary. The response.json() call is returning a list, so you need to iterate over it using integers as the index rather than strings. Try changing this:
repos = response.json()['items']
To this:
repos = response.json()
And access the name like:

for repo in repos:
    print(repo['name'])
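For completeness, here is my own hedged sketch (not something Amazon Q produced) that copes with both response shapes and applies the one-week filter on the client side, since I'm not certain the 'since' parameter filters this endpoint by creation date:

import datetime
import requests

org = 'github'  # example organisation
response = requests.get(
    f'https://api.github.com/orgs/{org}/repos',
    params={'sort': 'created', 'direction': 'desc'}
)
data = response.json()

# The search endpoint wraps results in 'items'; the org endpoint returns a plain list
repos = data['items'] if isinstance(data, dict) else data

one_week_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=7)
for repo in repos:
    created = datetime.datetime.strptime(repo['created_at'], '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=datetime.timezone.utc)
    if created >= one_week_ago:
        print(repo['name'], repo['created_at'])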
Can you sort the output by most recent please?
To sort the output of repositories by most recent, you can pass the 'sort' parameter to the GitHub API request. For example, to sort by most recently updated:
response = requests.get(
    f'https://api.github.com/orgs/{org}/repos',
    params={'sort': 'updated'}
)
Or to sort by most recently created:
response = requests.get(
    f'https://api.github.com/orgs/{org}/repos',
    params={'sort': 'created'}
)
The 'sort' parameter accepts 'created' and 'updated' values. This will change the order of the repositories in the API response so they are sorted by that date field from most recent to least recent. Let me know if you have any other questions!
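One thing the response didn't spell out, but which I used later in the session: a 'direction' parameter can be passed alongside 'sort' so the newest repositories come first. A minimal sketch, assuming the same endpoint:

import requests

org = 'github'  # example organisation
response = requests.get(
    f'https://api.github.com/orgs/{org}/repos',
    params={'sort': 'created', 'direction': 'desc'}  # newest repositories first
)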
can you adjust the code to include the url for the repo please?
Here is how you can adjust the code to include the URL for each repository:
import requests
import datetime

org = 'aws-samples'
one_week_ago = datetime.datetime.now() - datetime.timedelta(days=7)

response = requests.get(
    f'https://api.github.com/orgs/{org}/repos',
    params={'since': one_week_ago.strftime('%Y-%m-%d'), 'sort': 'created', 'direction': 'desc'}
)
repos = response.json()
for repo in repos:
    print(repo['name'], repo['html_url'])
Each repository object in the JSON response from GitHub will include an 'html_url' field containing the URL. By printing both the 'name' and 'html_url' for each repo, this will output the name and URL together on each line.
transactional-datalake-using-amazon-msk-serverless-and-apache-iceberg-on-aws-glue 2024-01-10T01:26:56Z https://github.com/aws-samples/transactional-datalake-using-amazon-msk-serverless-and-apache-iceberg-on-aws-glue
aws-msk-serverless-cdc-data-pipeline-with-debezium 2024-01-09T01:03:38Z https://github.com/aws-samples/aws-msk-serverless-cdc-data-pipeline-with-debezium
aws-healthlake-smart-on-fhir 2024-01-08T23:05:17Z https://github.com/aws-samples/aws-healthlake-smart-on-fhir
aws-greengrass-custom-components 2024-01-08T11:34:12Z https://github.com/aws-samples/aws-greengrass-custom-components
graviton-developer-workshop 2024-01-08T03:30:31Z https://github.com/aws-samples/graviton-developer-workshop
msk-flink-streaming-cdk 2024-01-08T02:25:39Z https://github.com/aws-samples/msk-flink-streaming-cdk
rag-with-amazon-postgresql-using-pgvector 2024-01-06T04:47:41Z https://github.com/aws-samples/rag-with-amazon-postgresql-using-pgvector
queueTransfer_ContactTraceRecordSupport-for-Service-Cloud-Voice 2024-01-05T20:34:14Z https://github.com/aws-samples/queueTransfer_ContactTraceRecordSupport-for-Service-Cloud-Voice
amazon-chime-sdk-voice-voice-translator 2024-01-05T17:25:54Z https://github.com/aws-samples/amazon-chime-sdk-voice-voice-translator
private-s3-vpce 2024-01-05T06:38:52Z https://github.com/aws-samples/private-s3-vpce
bedrock-contact-center-tasks-eval 2024-01-04T21:46:51Z https://github.com/aws-samples/bedrock-contact-center-tasks-eval
clickstream-sdk-samples 2024-01-04T07:21:52Z https://github.com/aws-samples/clickstream-sdk-samples
aws-msk-cdc-data-pipeline-with-debezium 2024-01-04T04:09:22Z https://github.com/aws-samples/aws-msk-cdc-data-pipeline-with-debezium
transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue 2024-01-04T03:39:04Z https://github.com/aws-samples/transactional-datalake-using-amazon-msk-and-apache-iceberg-on-aws-glue
..
how can i wrap this up in a nice report that exports as a pdf
Here is how you can generate a PDF report from the GitHub repository data. First, import the FPDF library, which allows generating PDF documents from Python:
from fpdf import FPDF
Next, initialize a new FPDF object and add some basic styles:
pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=12)
Then loop through the repository data and add each entry to the PDF:

for repo in repos:
    pdf.cell(200, 10, txt=f"{repo['name']}", ln=1)
    pdf.cell(200, 10, txt=f"{repo['created_at']}", ln=1)
    pdf.cell(200, 10, txt=f"{repo['html_url']}", ln=1)
Finally, save the PDF file:
pdf.output("report.pdf")
This will generate a simple multi-page PDF with the repository info. You can customize further by adding headers, images, formatting etc.
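To see how those pieces fit together, here is a rough end-to-end sketch of my own (not Amazon Q's output) that fetches one organisation's repos and writes the report, with a simple title cell as the kind of extra formatting mentioned above:

import requests
from fpdf import FPDF

org = 'aws-samples'  # example organisation
repos = requests.get(
    f'https://api.github.com/orgs/{org}/repos',
    params={'sort': 'created', 'direction': 'desc'}
).json()

pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=16)
pdf.cell(0, 10, txt=f"New repositories in {org}", ln=1, align='C')  # report title
pdf.set_font("Arial", size=10)

for repo in repos:
    pdf.cell(0, 8, txt=repo['name'], ln=1)
    pdf.cell(0, 8, txt=repo['created_at'], ln=1)
    pdf.cell(0, 8, txt=repo['html_url'], ln=1)

pdf.output("report.pdf")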
How can I change the code to cycle through a number of GitHub org's repositories please?
To cycle through repositories from multiple GitHub organizations, you can create a list of orgs and loop through them in your code. For example:
import requests

orgs = ['aws-samples', 'github', 'python']

for org in orgs:
    response = requests.get(
        f'https://api.github.com/orgs/{org}/repos',
        params={'sort': 'created', 'direction': 'desc'}
    )
    repos = response.json()
    for repo in repos:
        print(repo['name'])
This will:
1. Define a list of organizations to query,
2. Loop through each org
3. Make the API request to that org's repos endpoint
4. Print out each repo name
You could also consolidate the results into a single list or dictionary to report on all repositories together.
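That consolidation step isn't shown in the transcript, but a hedged sketch might collect the results into a dictionary keyed by organisation:

import requests

orgs = ['aws-samples', 'github', 'python']
repos_by_org = {}

for org in orgs:
    response = requests.get(
        f'https://api.github.com/orgs/{org}/repos',
        params={'sort': 'created', 'direction': 'desc'}
    )
    repos_by_org[org] = response.json()

# Report on everything together once all the requests are done
for org, repos in repos_by_org.items():
    print(f"{org}: {len(repos)} repositories returned")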
can you adjust the code so it only brings back the first 10 repos
Here is how you can adjust the code to only retrieve the first 10 repositories from each organization:
import requests

orgs = ['aws-samples', 'awslabs']  # etc.

for org in orgs:
    response = requests.get(
        f'https://api.github.com/orgs/{org}/repos',
        params={'sort': 'created', 'direction': 'desc', 'per_page': 10}
    )
    repos = response.json()
    for repo in repos[:10]:
        print(repo['name'])
The key change is adding a 'per_page' parameter to the API request with a value of 10. This will make each request return up to 10 repositories maximum. Then by slicing the repos list to the first 10 elements, it will only loop through and process the first 10 from each org.
how can i change the pdf output so that each org starts on a new page please?
To start a new page in the PDF for each organization, you can add a call to pdf.add_page() inside the org loop. For example:
for org in orgs:
    response = requests.get(f'https://api.github.com/orgs/{org}/repos', params)
    repos = response.json()

    pdf.add_page()  # Add new page

    for repo in repos[:10]:
        pdf.cell(200, 10, txt=f"{repo['name']}")
This will add a new page before writing each org's repositories, so each org's data starts on a fresh page.
How can I save and upload the pdf file to an S3 bucket, making sure that each report has a unique name based on the date?
To save and upload the generated PDF report to an S3 bucket with a unique name based on the date, you can do the following. Import the necessary S3 client and date/time modules:
import boto3
from datetime import datetime
Define a function that uploads the file to S3:
def upload_to_s3(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket"""
    if object_name is None:
        object_name = file_name
    # Upload the file
    s3_client = boto3.client('s3')
    response = s3_client.upload_file(file_name, bucket, object_name)
Generate a unique filename using the current date/time:
date_str = datetime.now().strftime("%Y%m%d%H%M%S")
file_name = f"report-{date_str}.pdf"
Save and upload the PDF after generation:
pdf.output(file_name)
upload_to_s3(file_name, "my-bucket")
This will save each report with a unique name based on the timestamp and upload it to the specified S3 bucket.
The code generated an error : The specified bucket does not exist. can you update the code to fix that
Based on the error message you provided, it seems the issue is that the S3 bucket specified does not exist. To fix this, you have a few options:
1. Create the S3 bucket manually through the S3 console if it hasn't been created already. Make sure the bucket name matches what is specified in the code.
2. Add code to first check if the bucket exists, and if not, create it:
import boto3
import botocore

s3 = boto3.client('s3')

try:
    s3.head_bucket(Bucket='XXXXXXXX')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '404':
        s3.create_bucket(Bucket='XXXXXXXX')
3. Handle the error when the bucket doesn't exist by catching the exception and creating the bucket:
try:
    ...  # upload object
except ClientError as e:
    if e.response['Error']['Code'] == "NoSuchBucket":
        s3.create_bucket(Bucket='XXXXXXXX')
        ...  # retry upload object
The code generates this error - botocore.exceptions.ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to.
Thank you for providing the error message. It looks like the issue is that when creating the S3 bucket, a location constraint is not being specified that matches the AWS region of the request. By default, S3 will use the region of the endpoint to determine bucket location, but this error occurs when those don't match. To resolve it, you have a couple of options:
1. Specify the location constraint explicitly when creating the bucket to match the region. For example:

s3.create_bucket(Bucket='XXXXXXXXXX', CreateBucketConfiguration={'LocationConstraint': 'XXXXXXXXXXX'})
2. Ensure the AWS credentials/configuration are set up to use the same region as the intended bucket location.
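Putting those two fixes together, here is a hedged sketch of what that looks like in practice (the bucket name is a placeholder; eu-west-1 is the region I used in the final script):

import boto3
import botocore

region = 'eu-west-1'
s3 = boto3.client('s3', region_name=region)
bucket = 'my-report-bucket'  # hypothetical bucket name

try:
    s3.head_bucket(Bucket=bucket)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '404':
        # Outside us-east-1, CreateBucket needs an explicit LocationConstraint
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={'LocationConstraint': region}
        )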
import requests
import datetime
from fpdf import FPDF
import boto3
import botocore

pdf = FPDF()
pdf.add_page()
pdf.set_font("Arial", size=8)

s3 = boto3.client('s3', region_name='eu-west-1')

orgs = ['aws-samples','awslabs','aws','aws-ia','aws-quickstart','aws-solutions',]

def upload_to_s3(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket
    """
    if object_name is None:
        object_name = file_name
    # Upload the file
    s3_client = boto3.client('s3')
    response = s3_client.upload_file(file_name, bucket, object_name)

one_week_ago = datetime.datetime.now() - datetime.timedelta(days=7)

for org in orgs:
    try:
        response = requests.get(f'https://api.github.com/orgs/{org}/repos',
            params={'since': one_week_ago.strftime('%Y-%m-%d'), 'sort': 'created', 'direction': 'desc'}
        )
        repos = response.json()
    except requests.exceptions.HTTPError as err:
        print(err)
        continue
    pdf.add_page()
    pdf.cell(0, 10, txt=org, ln=1, align='C')
    for repo in repos[:10]:
        print(repo['name'], repo['created_at'], repo['html_url'], sep='\t')
        pdf.cell(200, 10, txt=f"{repo['name']}", ln=1, align='L')
        pdf.cell(200, 10, txt=f"{repo['created_at']}", ln=1, align='L')
        pdf.cell(200, 10, txt=f"{repo['html_url']}", ln=1, align='L')

s3_bucket_name = "094459-oss-projects"

try:
    s3.head_bucket(Bucket=s3_bucket_name)
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '404':
        s3.create_bucket(Bucket=s3_bucket_name, CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'})
        print("Bucket created successfully")

date_str = datetime.datetime.now().strftime("%Y%m%d")
file_name = f"report-{date_str}.pdf"
pdf.output(file_name)
upload_to_s3(file_name, s3_bucket_name)
#pdf.output("report.pdf")
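One small thing I spotted afterwards (my observation, not something Amazon Q flagged): requests.get() does not raise HTTPError on a 4xx/5xx response unless you call raise_for_status(), so the try/except around the GitHub call above never actually fires on a bad status. A hedged tweak for that part of the loop:

import datetime
import requests

org = 'aws-samples'  # example organisation
one_week_ago = datetime.datetime.now() - datetime.timedelta(days=7)

try:
    response = requests.get(
        f'https://api.github.com/orgs/{org}/repos',
        params={'since': one_week_ago.strftime('%Y-%m-%d'), 'sort': 'created', 'direction': 'desc'}
    )
    response.raise_for_status()  # raise HTTPError for 4xx/5xx responses
    repos = response.json()
except requests.exceptions.HTTPError as err:
    print(err)
    repos = []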
How would i run this python script on a scheduled basis
There are a few different ways you could run your Python script on a scheduled basis:
1. AWS Lambda - You can configure a Lambda function to run your script and trigger it on a schedule using services like CloudWatch Events. This is a serverless option that doesn't require managing servers.
2. EC2 Instance Scheduler - For scripts that need more resources, use the Instance Scheduler to automate starting/stopping an EC2 instance on a schedule to run your script.
3. Step Functions - With Step Functions you can build a workflow that includes scheduled events to trigger a Lambda function running your script or other steps.
4. Systems Manager Automation - Use SSM Automation to run your script on a schedule via SSM documents and Automation runbooks.
5. Local machine - On your local development machine, you can use the task scheduler for your operating system (cron on Linux, Task Scheduler on Windows) to automate running your script periodically.
Some factors to consider are your script's resource needs, whether it's long/short running, and your infrastructure requirements. Services like Lambda and SSM Automation are good for simple scheduled scripts, while EC2/Batch may work better for more resource-intensive jobs.
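Taking the Lambda suggestion as an example, the wiring is roughly this (a hedged sketch; run_report() is a hypothetical wrapper around the script above, and the schedule itself would come from an EventBridge rule with a cron or rate expression targeting the function):

def run_report():
    # Hypothetical wrapper around the GitHub -> FPDF -> S3 logic from the script above.
    # In Lambda, the PDF would need to be written under /tmp before uploading to S3.
    pass

def lambda_handler(event, context):
    # Entry point invoked by the scheduled EventBridge rule
    run_report()
    return {"status": "report generated"}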
what open source options can you suggest for scheduling this python script?
There are a few good open source options for scheduling Python scripts:
- Cron: Cron is a standard Unix utility for scheduling jobs. You can configure it to run your Python script on a schedule. It's supported on Linux, macOS, and other Unix-like systems.
- Celery: Celery is a popular asynchronous task queue/job queue based on distributed message passing. You can use it to schedule and run Python scripts as asynchronous tasks on a schedule.
- Airflow: Apache Airflow is a platform for authoring, scheduling and monitoring workflows or pipelines. You can define Python scripts as tasks in a DAG (directed acyclic graph) and schedule them to run on a periodic basis.
- Schedule: The Schedule library allows scheduling Python functions and methods to run on some kind of schedule or timetable. It provides a simple way to schedule scripts without additional infrastructure.
- Regta: Regta is a Python scheduler focused specifically on scheduling Python code. It offers some advantages over Cron like Python-specific optimizations and features like retries and logging.
Some factors like your hosting environment, task complexity, and automation needs would determine the best fit. But Cron, Celery, and Schedule provide reliable open source options for basic scheduled scripting tasks.
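Purely as an illustration of the Schedule option (my sketch, not from the transcript), running the report once a day with the schedule library might look like this, assuming a run_report() function wraps the script's logic:

import time
import schedule  # pip install schedule

def run_report():
    # Hypothetical wrapper around the GitHub -> PDF -> S3 logic from the script above
    print("generating report...")

schedule.every().day.at("09:00").do(run_report)  # run once a day at 09:00

while True:
    schedule.run_pending()
    time.sleep(60)  # check for due jobs once a minute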
- Ensure that you start your project in your IDE so that Amazon Q has some context to begin with. This is not necessarily useful at the beginning, but as you start building on your project, it ensures that the suggestions build upon what you already have. In VSCode, the files you have open are important.
- Sometimes you will get helpful information when all you want to see is the code, so make sure you experiment with various prompts. I found that "show me the code please" worked well most of the time.
- Amazon Q is a great debugging tool, and I was able to efficiently go from problem to solution. It is surprising how many small errors creep in when writing small code like this, so being able to quickly address them saves time.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.