Use AWS Glue to make CloudTrail Parquet partitions
You can turn on CloudTrail logging with a single command, but how do you use the data for audits and automation? In this post, I’ll describe cloudtrail-parquet-glue, which makes CloudTrail logs efficiently Athena-searchable with minimal custom code (because “the best code is no code”) using AWS Glue.
First, why use Athena for CloudTrail logs?
- Athena uses familiar, expressive SQL and allows you to search large volumes of data efficiently. I find other CloudTrail data stores awkward to use (e.g. the CloudWatch Logs Insights query language).
- CloudTrail Athena queries can be automated and chained with other serverless technologies like Lambda to create security automation tailored to your AWS usage. You can build alerts that have reduced false positives and negatives, are enriched with additional data, and are presented to end-users in a way that makes sense to your organization.
- Athena tables are integrated with other AWS data tools like Amazon QuickSight.
- Athena can cross-reference CloudTrail logs with other forms of logs (e.g. IPs in access logs), making complicated data analysis scenarios easier.
- Having CloudTrail logs in the AWS data platform lets you tap into the benefits of a data lake architecture, as opposed to having your logs isolated in a third-party service.
If you want to use Athena to query CloudTrail, there are several alternatives. This post has two parts: a survey of existing “CloudTrail logs queryable in Athena” projects, and a description of my cloudtrail-parquet-glue project, which builds a Glue workflow to connect Athena to CloudTrail.
Existing approaches for querying CloudTrail logs with Athena
Several projects have tried to make Athena and CloudTrail work better together. In 2018, I demonstrated how Athena could query CloudTrail logs in S3 with Lambda-created partitions. In August 2019, GorillaStack published Query your CloudTrail like a pro with Athena, where they described the athena-cloudtrail-partitioner project. In October 2019, Duo published Introducing CloudTrail-Partitioner, which describes the cloudtrail-partitioner project. The projects differ in a few ways (AWS CDK rather than CloudFormation, different Lambda error handling) but are otherwise similar.
Instead of manually defining schema and partitions, you can use Glue Crawlers to automatically identify them. The AWS Labs athena-glue-service-logs project is described in the AWS blog post Easily query AWS service logs using Amazon Athena. This project uses an AWS Glue ETL (i.e. Spark) job to not only partition service logs automatically, but also convert them to the resource-friendly Parquet format.
I found two public GitHub repositories that demonstrate Glue Crawlers working on CloudTrail logs. terraform-auto-cloudtrail defines a very simple CloudTrail Glue Crawler in Terraform, whereas aws_cloudtrail_pipeline describes multiple Glue Crawlers to transform CloudTrail to Parquet using a hybrid of CloudFormation and Terraform.
Use Glue Crawlers to make CloudTrail logs queryable with Athena
There were several issues with the existing tools that I wanted to solve:
- Projects that use Lambda rather than Glue Crawlers to create partitions require error monitoring, code updates, and other tradeoffs associated with custom code.
- Partitioning alone isn’t enough: CloudTrail logs should also be converted to Parquet. Conversion reduces the cost and time of querying them with Athena, and combines the many small files that can cause problems with Athena.
- When you use AWS Control Tower, CloudTrail logs are sent to a separate S3 bucket in the Log Archive account. The logs are partitioned by account, integrity validation is enabled creating Digest folders, and the S3 bucket is shared with AWS Config. Existing approaches didn’t work well with running Athena on this bucket structure.
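To make the bucket layout concrete, here is a minimal Python sketch of how partition values could be read from a CloudTrail object key while ignoring Digest and Config objects in the same bucket. The function name, the regular expression, and the optional organization ID segment are my own illustration and not code from any of the projects above. It assumes the standard key layout of AWSLogs/<account>/CloudTrail/<region>/<year>/<month>/<day>/.

```python
import re
from typing import Optional

# Matches CloudTrail log keys; an organization ID segment (o-...) is optional.
# "CloudTrail-Digest/" and "Config/" keys do not match because the pattern
# requires the literal path segment "CloudTrail/" after the account ID.
CLOUDTRAIL_KEY = re.compile(
    r"AWSLogs/(?:o-[a-z0-9]+/)?(?P<account>\d{12})/CloudTrail/"
    r"(?P<region>[a-z0-9-]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
)


def partitions_from_key(key: str) -> Optional[dict]:
    """Return the five partition values for a CloudTrail log key, else None."""
    match = CLOUDTRAIL_KEY.search(key)
    return match.groupdict() if match else None
```

Because the pattern is searched rather than anchored, any bucket-level prefix in front of AWSLogs/ is tolerated.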
To solve for my constraints, I created the cloudtrail-parquet-glue Terraform module. It creates a Glue Workflow that maintains an Athena-queryable Parquet store for CloudTrail logs.
AWS Glue Workflows can be used to combine crawlers and ETL jobs into multi-step processes. The cloudtrail-parquet-glue Glue Workflow has three steps:
- The CloudTrailCrawler Crawler, which examines the CloudTrail logs in their native JSON format and creates a Glue table with schema and partitions.
- The CloudTrailToParquet ETL job, which converts the data made discoverable in the newly created Glue table to Parquet format.
- The CloudTrailParquet Crawler, which examines the CloudTrail logs in their Parquet format and creates a Glue table with schema and partitions.
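The module itself is written in Terraform, but the way the three steps chain together can be sketched in Python as the arguments for three Glue triggers: one scheduled trigger that starts the first crawler, and two conditional triggers that fire when the previous step succeeds. The workflow name, trigger names, and cron expression below are hypothetical, and the crawler and job names are guesses based on the step names above, so the real module may differ.

```python
def workflow_triggers(workflow: str = "CloudTrailWorkflow",
                      schedule: str = "cron(0 1 * * ? *)") -> list:
    """Build keyword arguments for three Glue triggers that chain the steps.

    All names and the schedule are illustrative assumptions.
    """
    return [
        {   # Step 1: start the raw-log crawler on a daily schedule.
            "Name": "StartCloudTrailCrawler",
            "WorkflowName": workflow,
            "Type": "SCHEDULED",
            "Schedule": schedule,
            "Actions": [{"CrawlerName": "CloudTrailCrawler"}],
        },
        {   # Step 2: run the ETL job once the raw-log crawler succeeds.
            "Name": "StartCloudTrailToParquet",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "CrawlerName": "CloudTrailCrawler",
                "CrawlState": "SUCCEEDED",
            }]},
            "Actions": [{"JobName": "CloudTrailToParquet"}],
        },
        {   # Step 3: crawl the Parquet output once the ETL job succeeds.
            "Name": "StartCloudTrailParquetCrawler",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": "CloudTrailToParquet",
                "State": "SUCCEEDED",
            }]},
            "Actions": [{"CrawlerName": "CloudTrailParquet"}],
        },
    ]
```

Each dictionary has the shape expected by the boto3 Glue client's create_trigger call. The sketch only builds the arguments, so it can be read and tested without AWS credentials.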
The resulting parquet_cloudtrail Glue data catalog table can be queried with Athena using five partitions: account, region, year, month, and day. The Workflow is configured to run daily so that new CloudTrail partitions can be discovered, converted, and made available.
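As a rough illustration of why the five partitions matter, the Python sketch below builds an Athena query that filters on all of them, so only one account, region, and day of data is scanned. The function name and the eventtime, eventsource, and eventname columns are assumptions, as is treating the partition values as zero-padded strings. Check the table schema the crawler actually produced before relying on it.

```python
import re
from datetime import date


def cloudtrail_query(account: str, region: str, day: date, event_name: str,
                     table: str = "parquet_cloudtrail") -> str:
    """Build a partition-pruned query for one account, region, and day."""
    # Validate inputs because they are interpolated directly into the SQL.
    if not re.fullmatch(r"\d{12}", account):
        raise ValueError("account must be a 12-digit AWS account ID")
    if not re.fullmatch(r"[a-z0-9-]+", region):
        raise ValueError("unexpected characters in region")
    if not re.fullmatch(r"[A-Za-z0-9]+", event_name):
        raise ValueError("unexpected characters in event name")
    if not re.fullmatch(r"\w+", table):
        raise ValueError("unexpected characters in table name")
    return (
        f"SELECT eventtime, eventsource, eventname "
        f"FROM {table} "
        f"WHERE account = '{account}' AND region = '{region}' "
        f"AND year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}' "
        f"AND eventname = '{event_name}'"
    )
```

The returned string could be passed to Athena from a Lambda function or a scheduled job, which is the kind of chaining described at the start of this post.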
When I started building this integration, I thought it would be straightforward to use AWS Glue Crawlers, but they are fairly “magic” and need to be configured with some trial and error. Deciding whether to have them combine multiple schemas into one, add new columns on schema change, and so on, was a long, tiresome process. I ended up relying on a combination of converting very loosely structured structs (e.g. responseelements) to strings and relationalizing other structs, like useridentity.
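The two struct-handling strategies can be illustrated in plain Python. This is not the Glue ETL job's code, just a sketch of the idea on a single event represented as a dictionary. The flatten_event name and the useridentity_<field> column naming are assumptions made for the example.

```python
import json


def _scalar_or_json(value):
    """Keep scalars as-is; serialize nested structures to a JSON string."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return value


def flatten_event(event: dict) -> dict:
    """Flatten one CloudTrail event into a row with a stable set of columns."""
    row = {}
    for key, value in event.items():
        if key == "useridentity" and isinstance(value, dict):
            # Relationalize: promote the struct's fields to top-level columns.
            for sub_key, sub_value in value.items():
                row[f"useridentity_{sub_key}"] = _scalar_or_json(sub_value)
        else:
            # Loosely structured structs such as responseelements become strings,
            # so their shape can vary per event without changing the schema.
            row[key] = _scalar_or_json(value)
    return row
```

Stringifying the loosely structured fields trades some query convenience for schema stability, since their contents can still be parsed at query time.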
Overall, my experience using AWS managed services has been that systems that rely on them tend to get better over time, or at the very least degrade less quickly. While AWS Glue isn’t the most well-trod ground in AWS land, it’s a service you can pay someone else to operate and maintain. And, once you’ve set it up, you can move on to other things.
Thoughts on using cloudtrail-parquet-glue and similar approaches? Hit me up on Twitter at @alsmola.