CloudGuruChallenge - Event Driven Python on AWS

After being asked by many friends about transitioning into a DevOps role with no background, I did some digging and stumbled upon the Cloud Resume Challenge. The book opens with a story about a plumber learning to code while dealing with literal shit, then gives the curious an open-ended challenge to complete.

After some further digging, I stumbled upon A Cloud Guru's Event-Driven Python challenge and decided to jump in as a way to overcome my fear of serverless functions and build some new skills.

Database

For the database, I wanted something lightweight and affordable. DynamoDB looked to be the best option, as I could quickly launch a table without having to manage credentials. Keeping the stack light and cost-efficient went hand in hand with going serverless. The main drawback of this decision shows up when setting up the dashboard: there are some extra steps required to connect AWS QuickSight to DynamoDB, which was a bit surprising for a native AWS tool.
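For illustration, launching an on-demand table with boto3 looks roughly like the sketch below. The table name and key schema are placeholders, not the ones used in the project.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table name and key schema -- stand-ins for the real ones.
dynamodb.create_table(
    TableName="etl_records",
    AttributeDefinitions=[{"AttributeName": "date", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "date", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # on-demand billing, no provisioned capacity to manage
)

# Block until the table is ready to accept writes.
dynamodb.get_waiter("table_exists").wait(TableName="etl_records")
```

On-demand billing also means nothing is charged for provisioned capacity while the table sits idle between job runs, which helps keep the stack cheap.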

Data Handling in Python

The data manipulation portion of this was new to me, even with a decent background in Python. For manipulating the data, I used the pandas library and leveraged dataframes to transform fields and verify the data was as expected. The awswrangler package was used to load new records into the database and update existing ones. I believe the write acts as a sync and will update existing records if the data differs, but I have not tested this for the sake of getting the project done. From a data engineering perspective it would be wise to verify the exact behaviour of the awswrangler put command. Once records are written, I compare the row count of the dataframe with the number of rows in DynamoDB to work out how many records were updated.
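The load-and-count flow looks roughly like the sketch below. The source URL handling, column cleanup and table name are simplified stand-ins rather than the project's exact code.

```python
import boto3
import pandas as pd
import awswrangler as wr

TABLE_NAME = "etl_records"  # hypothetical table name


def load_and_count(source_url: str) -> tuple[int, int]:
    """Load source data into DynamoDB and return (rows loaded, rows in table)."""
    # Pull the raw data into a dataframe and do some light cleanup/verification.
    df = pd.read_csv(source_url)
    df.columns = [col.strip().lower() for col in df.columns]
    df = df.dropna()

    # Write the dataframe to DynamoDB. The dataframe must contain the table's
    # key attributes; items sharing a key are overwritten.
    wr.dynamodb.put_df(df=df, table_name=TABLE_NAME)

    # Compare the dataframe's row count with the item count in the table.
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    table_count = table.scan(Select="COUNT")["Count"]  # ignores scan pagination for brevity
    return len(df), table_count
```

Since DynamoDB's PutItem replaces any item that shares the same primary key, put_df should behave like an upsert, which lines up with the sync behaviour described above, though it is worth confirming.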

Infrastructure As Code (IAC)

The IAC portion of this project was quite interesting, as I had to lean on Terraform extensively to package the Lambda application and deliver it to AWS. Since the repository was public (and since it's good practice anyway), I had to avoid publishing variable files to Git, as that would expose ARNs, email addresses and AWS credentials. This issue becomes even more apparent when hooking up the CI component of this project.

A notable problem was delivering third-party packages to Lambda. Thankfully, awswrangler provides a prebuilt Lambda layer that we can pull in without having to build any additional packages.

Notifications

Notifications were easy to hook up using an AWS SNS subscription. For flexibility and security, I passed the SNS topic to the application via an environment variable. Any time the job succeeds and updates records, an email is sent letting the user know how many records were added to the table.

records-updated-notification

Whenever the job fails, the recipient receives a notification containing the Python error message, which has proven very helpful when debugging permission or code issues.

failed-job-notification
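Roughly, the wiring looks like the sketch below; the environment variable name, message subjects and the run_etl stub are simplified stand-ins for the project's actual handler.

```python
import os
import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]  # hypothetical env var name


def run_etl() -> int:
    """Placeholder for the actual load/update logic; returns rows added."""
    return 0


def handler(event, context):
    """Lambda entry point: run the job and report the outcome over SNS."""
    try:
        rows_added = run_etl()
    except Exception as exc:
        # On failure, forward the Python error message to the subscriber.
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="ETL job failed",
            Message=f"The job failed with the following error:\n{exc}",
        )
        raise
    # On success, report how many records were added to the table.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="ETL job succeeded",
        Message=f"{rows_added} records were added to the table.",
    )
```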

CI/CD

For CI/CD, I leveraged GitHub Actions to deploy the Terraform changes. The workflow involves opening a pull request, which runs a terraform plan job. This way we can ensure our changes won't fail the plan before merging, and we can review the planned updates in the comment section of the pull request. Upon merging to master, the job runs a terraform apply and pushes our changes to AWS.

Follow this Terraform tutorial to learn how to use GitHub Actions to automate Terraform deployments.

Wrap Up

This was a great project for exploring the challenges a data engineer may face when setting up ETL pipelines. The most challenging parts were deploying the Lambdas to AWS and setting up a secure CI/CD pipeline. I can happily say this entire project has a monthly cost of ~$10, even with active testing. You can find the source code for the project on GitHub.

aws-billing-console