Setting up a Serverless Data Lake on AWS: A Step-by-Step Guide
Serverless computing has been a game-changer in how we architect, deploy, and manage our applications and services. However, this serverless trend isn’t confined to just application development. In the world of big data, serverless data lakes have also gained traction, allowing businesses to store, analyze, and visualize data without the overhead of managing any infrastructure.
In this article, I'll walk through how to set up a serverless data lake using AWS services such as S3, Glue, Athena, and QuickSight.
What is a Serverless Data Lake?
Before diving in, let’s quickly understand the concept of a data lake and its serverless variant.
A data lake is a centralized repository designed to store vast amounts of raw data, irrespective of its format. This can include structured, semi-structured, or unstructured data.
A serverless data lake means that every component of the data lake is fully managed by the cloud provider. There are no servers to provision or maintain, making it a highly scalable and cost-effective solution.
Step-by-Step Guide to Set Up a Serverless Data Lake
Our data lake flow comprises an S3 bucket, a Glue Crawler, a database table, Athena for querying, and QuickSight for visualization. We'll use the AWS CLI for most steps; the QuickSight setup is done in the console.
1. Setting up the S3 Bucket
We start by creating an S3 bucket that will store our data.
aws s3api create-bucket --bucket my-datalake-bucket --region us-west-1 \
  --create-bucket-configuration LocationConstraint=us-west-1
Now, let’s upload the sample data.
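If you don't have a data file handy, a small CSV like the one below works. The source/destination columns here are an assumption, chosen only to match the Glue table schema used in the CloudFormation template later in this article:

```shell
# Create a small sample CSV locally (hypothetical columns; any CSV works)
cat > datafile.csv <<'EOF'
source,destination
new-york,london
tokyo,paris
EOF
```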
aws s3 cp /path/to/your/datafile.csv s3://my-datalake-bucket/
2. Setting up AWS Glue
2.1 Creating a Glue Database
First, let’s create a new Glue database:
aws glue create-database --database-input '{"Name":"mydatalake_db","Description":"My datalake database."}'
2.2 Setting up Glue Crawler
Now, we create a Glue crawler to explore the data in our S3 bucket and create a table.
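The crawler assumes an IAM role (MyGlueServiceRole below). If that role doesn't exist yet, here's a minimal sketch of creating it; the inline S3 policy scope is an assumption, so adjust it to your bucket:

```shell
# Trust policy letting the Glue service assume the role
cat > glue-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "glue.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name MyGlueServiceRole \
  --assume-role-policy-document file://glue-trust.json

# Baseline Glue permissions via the AWS managed policy
aws iam attach-role-policy --role-name MyGlueServiceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Read access to the data bucket
aws iam put-role-policy --role-name MyGlueServiceRole \
  --policy-name datalake-s3-read \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-datalake-bucket",
                   "arn:aws:s3:::my-datalake-bucket/*"]
    }]
  }'
```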
aws glue create-crawler \
--name my-datalake-crawler \
--role MyGlueServiceRole \
--database-name mydatalake_db \
--table-prefix "dl_" \
--targets '{"S3Targets": [{"Path": "s3://my-datalake-bucket/"}]}'
Start the Crawler:
aws glue start-crawler --name my-datalake-crawler
Wait a few minutes; once the crawler finishes, you should see a new table (prefixed with "dl_") in the mydatalake_db database.
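Rather than guessing, you can poll the crawler from the CLI; when its state returns to READY, the run is done:

```shell
# READY means idle; RUNNING/STOPPING means a crawl is still in progress
aws glue get-crawler --name my-datalake-crawler \
  --query 'Crawler.State' --output text

# List the tables the crawler created
aws glue get-tables --database-name mydatalake_db \
  --query 'TableList[].Name' --output text
```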
3. Querying with Athena
Navigate to the Athena console. Select the database mydatalake_db and you should see the table created by Glue. Use the Athena console or AWS CLI to run standard SQL queries.
SELECT * FROM dl_my_table LIMIT 10;
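The same query can be run from the CLI. Athena needs an S3 location to write query output to; the result bucket name below is an assumption, so substitute your own:

```shell
# Kick off the query; capture the QueryExecutionId
QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT * FROM dl_my_table LIMIT 10;" \
  --query-execution-context Database=mydatalake_db \
  --result-configuration OutputLocation=s3://my-athena-results-bucket/ \
  --query 'QueryExecutionId' --output text)

# Check the state (SUCCEEDED/FAILED), then fetch the rows
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
  --query 'QueryExecution.Status.State' --output text
aws athena get-query-results --query-execution-id "$QUERY_ID"
```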
4. Visualization with QuickSight
4.1 Setting up QuickSight
If you haven’t set up QuickSight, you'll need to sign up for it in the AWS Console. Grant QuickSight permissions to access Athena and the required S3 bucket.
4.2 Visualize Athena Data
- Open QuickSight and choose "New Analysis".
- Choose "New Dataset" and then select Athena.
- Give it a name and select the database and table.
- Import the dataset.
- Once the dataset is imported, you can create various visualizations like tables, bar charts, pie charts, etc.
The full CloudFormation template for this setup (excluding the QuickSight resources) is below:
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  Env:
    Type: String
    Default: 'dev'
    AllowedValues:
      - 'dev'
      - 'prod'
  RawDataBucketName:
    Description: 'Bucket name'
    Type: String
    Default: 'raw-data'
  GlueDatabaseName:
    Description: 'Glue database name'
    Type: String
    Default: 'mydatalake-db'
  TableName:
    Type: String
    Description: Table name must not contain uppercase characters.
    Default: 'tbl_raw_data'
Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub ${RawDataBucketName}-${AWS::AccountId}-${AWS::Region}-${Env}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: 'AES256'
            BucketKeyEnabled: false
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
  AthenaQueryResultBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub query-result-${AWS::AccountId}-${AWS::Region}-${Env}
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: 'AES256'
            BucketKeyEnabled: false
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      LifecycleConfiguration:
        Rules:
          - Id: 'auto-delete'
            Status: 'Enabled'
            ExpirationInDays: 7
  AthenaWorkGroup:
    Type: AWS::Athena::WorkGroup
    DependsOn: AthenaQueryResultBucket
    Properties:
      Name: !Sub athena-work-group-${Env}
      RecursiveDeleteOption: true
      WorkGroupConfiguration:
        ResultConfiguration:
          OutputLocation: !Sub s3://${AthenaQueryResultBucket}/data
          EncryptionConfiguration:
            EncryptionOption: 'SSE_S3'
        EnforceWorkGroupConfiguration: true
        PublishCloudWatchMetricsEnabled: true
  GlueDatabase:
    Type: AWS::Glue::Database
    DependsOn: RawDataBucket
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref GlueDatabaseName
  GlueTable:
    Type: AWS::Glue::Table
    DependsOn: GlueDatabase
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref GlueDatabaseName
      TableInput:
        Name: !Ref TableName
        TableType: EXTERNAL_TABLE
        Parameters:
          skip.header.line.count: 1
          has_encrypted_data: false
          serialization.encoding: utf-8
          EXTERNAL: true
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
            - Name: source
              Type: string
            - Name: destination
              Type: string
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: !Sub 's3://${RawDataBucketName}-${AWS::AccountId}-${AWS::Region}-${Env}/data'
          SerdeInfo:
            Parameters:
              field.delim: ","
              serialization.format: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
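Assuming the template is saved as datalake.yaml, it can be deployed with a single command (the stack name is arbitrary):

```shell
aws cloudformation deploy \
  --template-file datalake.yaml \
  --stack-name serverless-datalake-dev \
  --parameter-overrides Env=dev
```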
Conclusion
With this setup, you now have a fully functional serverless data lake on AWS. As new data lands in S3, rerunning the Glue crawler refreshes the catalog, so Athena always queries the latest schema and data. You can then analyze the data with Athena and visualize it in QuickSight.
Remember to always monitor costs and be aware of any potential charges, especially when querying large datasets with Athena or storing significant amounts of data in S3.
© Waqar Ahmed.