Setting up a Serverless Data Lake on AWS: A Step-by-Step Guide

Serverless computing has been a game-changer in how we architect, deploy, and manage our applications and services. However, this serverless trend isn’t confined to just application development. In the world of big data, serverless data lakes have also gained traction, allowing businesses to store, analyze, and visualize data without the overhead of managing any infrastructure.

In this article, I'll walk through how to set up a serverless data lake using AWS services such as S3, Glue, Athena, and QuickSight.

What is a Serverless Data Lake?

Before diving in, let’s quickly understand the concept of a data lake and its serverless variant.

A data lake is a centralized repository designed to store vast amounts of raw data, irrespective of its format. This can include structured, semi-structured, or unstructured data.

In a serverless data lake, every component is fully managed by the cloud provider. There are no servers to provision, patch, or scale, and you pay only for the storage you use and the queries you run, making it a highly scalable and cost-effective solution.

Step-by-Step Guide to Set Up a Serverless Data Lake

Our data lake flow comprises an S3 bucket, a Glue crawler, a Glue Data Catalog database and table, Athena for querying, and QuickSight for visualization. We'll use the AWS CLI for all steps.

1. Setting up the S3 Bucket

We start by creating an S3 bucket to store our data. Bucket names are globally unique, so substitute your own. Note that outside of us-east-1, create-bucket also requires a location constraint:

aws s3api create-bucket --bucket my-datalake-bucket --region us-west-1 \
    --create-bucket-configuration LocationConstraint=us-west-1

Now, let’s upload the sample data.

aws s3 cp /path/to/your/datafile.csv s3://my-datalake-bucket/
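If you don't have a data file handy, a small two-column CSV is enough for this walkthrough. The column names below are illustrative; they match the source/destination columns defined in the CloudFormation template later in the article, and the header row is skipped by the table's skip.header.line.count setting:

```shell
# Create a small sample CSV; the first row is a header
# (illustrative values -- any simple CSV works)
cat > datafile.csv <<'EOF'
source,destination
JFK,LAX
SEA,ORD
EOF
```

Upload it with the aws s3 cp command above.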

2. Setting up AWS Glue

2.1 Creating a Glue Database

First, let’s create a new Glue database:

aws glue create-database --database-input '{"Name": "mydatalake_db", "Description": "My datalake database."}'

2.2 Setting up Glue Crawler

Now, we create a Glue crawler that scans the data in our S3 bucket and creates a table in the Data Catalog. The crawler needs an IAM role that Glue can assume and that has read access to the bucket.

aws glue create-crawler \
    --name my-datalake-crawler \
    --role MyGlueServiceRole \
    --database-name mydatalake_db \
    --table-prefix "dl_" \
    --targets '{"S3Targets": [{"Path": "s3://my-datalake-bucket/"}]}'
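The --role above must reference an existing IAM role. If you don't have one yet, a minimal version can be created like this (the role name and broad S3 read policy are illustrative; scope permissions down for production):

```shell
# Trust policy allowing the Glue service to assume the role
cat > glue-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "glue.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name MyGlueServiceRole \
    --assume-role-policy-document file://glue-trust.json

# AWS-managed policy with the baseline permissions Glue needs
aws iam attach-role-policy --role-name MyGlueServiceRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Read access to the data bucket (broad; restrict to your bucket in production)
aws iam attach-role-policy --role-name MyGlueServiceRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```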

Start the Crawler:

aws glue start-crawler --name my-datalake-crawler

Wait a few minutes; once the crawler finishes, you should see a new table (prefixed with "dl_") in the mydatalake_db database.
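Rather than guessing when the crawler is done, you can poll its state from the CLI. This sketch assumes default credentials and region are configured; the crawler returns to READY when it finishes:

```shell
# Poll until the crawler is back in the READY state
while true; do
  STATE=$(aws glue get-crawler --name my-datalake-crawler \
            --query 'Crawler.State' --output text)
  echo "Crawler state: $STATE"
  [ "$STATE" = "READY" ] && break
  sleep 15
done

# List the tables the crawler created
aws glue get-tables --database-name mydatalake_db \
    --query 'TableList[].Name' --output text
```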

3. Querying with Athena

Navigate to the Athena console, select the mydatalake_db database, and you should see the table created by Glue. You can run standard SQL queries against it from the console or the AWS CLI.

SELECT * FROM dl_my_table LIMIT 10;
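The same query can be run from the CLI. Athena executes asynchronously: you start the query, poll its status, then fetch the results. The output location below is an assumption; if you use a workgroup with an enforced result configuration (like the one in the CloudFormation template), you can omit --result-configuration and pass --work-group instead:

```shell
# Kick off the query and capture its execution ID
QUERY_ID=$(aws athena start-query-execution \
    --query-string 'SELECT * FROM dl_my_table LIMIT 10;' \
    --query-execution-context Database=mydatalake_db \
    --result-configuration OutputLocation=s3://my-datalake-bucket/athena-results/ \
    --query 'QueryExecutionId' --output text)

# Check the status (SUCCEEDED / RUNNING / FAILED)
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
    --query 'QueryExecution.Status.State' --output text

# Once it has succeeded, fetch the result rows
aws athena get-query-results --query-execution-id "$QUERY_ID"
```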

4. Visualization with QuickSight

4.1 Setting up QuickSight

If you haven’t set up QuickSight, you'll need to sign up for it in the AWS Console. Grant QuickSight permissions to access Athena and the required S3 bucket.

4.2 Visualize Athena Data

  1. Open QuickSight and choose "New Analysis".
  2. Choose "New Dataset" and then select Athena.
  3. Give it a name and select the database and table.
  4. Import the dataset.
  5. Once the dataset is imported, you can create various visualizations like tables, bar charts, pie charts, etc.

Get the full CloudFormation template here (QuickSight resources excluded):
AWSTemplateFormatVersion: 2010-09-09
Parameters:
    Env:
        Type: String
        Default: 'dev'
        AllowedValues:
          - 'dev'
          - 'prod'
    RawDataBucketName:
        Description: 'Bucket name' 
        Type: String
        Default: 'raw-data'
    GlueDatabaseName:
        Description: 'Glue database name'
        Type: String
        Default: 'mydatalake_db'
    TableName:
        Type: String
        Description: Table name must not contain uppercase characters.
        Default: 'tbl_raw_data'
Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
    Properties: 
      BucketName: !Sub ${RawDataBucketName}-${AWS::AccountId}-${AWS::Region}-${Env}
      BucketEncryption: 
          ServerSideEncryptionConfiguration: 
            - 
              ServerSideEncryptionByDefault: 
                  SSEAlgorithm: 'AES256'
              BucketKeyEnabled: false
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true

  AthenaQueryResultBucket:
    Type: AWS::S3::Bucket
    Properties: 
      BucketName: !Sub query-result-${AWS::AccountId}-${AWS::Region}-${Env}
      BucketEncryption: 
          ServerSideEncryptionConfiguration: 
            - 
              ServerSideEncryptionByDefault: 
                  SSEAlgorithm: 'AES256'
              BucketKeyEnabled: false
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      LifecycleConfiguration: 
        Rules: 
          - 
            Id: 'auto-delete'
            Status: 'Enabled'
            ExpirationInDays: 7

  AthenaWorkGroup:
      Type: AWS::Athena::WorkGroup
      DependsOn: AthenaQueryResultBucket
      Properties:
        Name: !Sub athena-work-group-${Env}
        RecursiveDeleteOption: true
        WorkGroupConfiguration:
          ResultConfiguration:
            OutputLocation: !Sub s3://${AthenaQueryResultBucket}/data
            EncryptionConfiguration: 
              EncryptionOption: 'SSE_S3'
          EnforceWorkGroupConfiguration: true
          PublishCloudWatchMetricsEnabled: true

  GlueDatabase:
      Type: AWS::Glue::Database
      DependsOn: RawDataBucket
      Properties: 
        CatalogId: !Ref AWS::AccountId  
        DatabaseInput:
          Name: !Ref GlueDatabaseName

  GlueTable:
      Type: AWS::Glue::Table
      DependsOn: GlueDatabase
      Properties:
        CatalogId: !Ref AWS::AccountId
        DatabaseName: !Ref GlueDatabaseName
        TableInput:
          Name: !Ref TableName
          TableType: EXTERNAL_TABLE
          Parameters:
            skip.header.line.count: '1'
            has_encrypted_data: 'false'
            serialization.encoding: 'utf-8'
            EXTERNAL: 'TRUE'
          StorageDescriptor:
            OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            Columns:
              - Name: source
                Type: string
              - Name: destination
                Type: string
            InputFormat: org.apache.hadoop.mapred.TextInputFormat
            Location: !Sub 's3://${RawDataBucketName}-${AWS::AccountId}-${AWS::Region}-${Env}/data'
            SerdeInfo:
              Parameters:
                field.delim: ","
                serialization.format: ","
              SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
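Assuming the template is saved locally (the file and stack names below are illustrative), it can be deployed with a single CLI command:

```shell
# Deploy the stack; deploy creates it if new, updates it otherwise
aws cloudformation deploy \
    --template-file datalake.yaml \
    --stack-name my-datalake-dev \
    --parameter-overrides Env=dev RawDataBucketName=raw-data
```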

Conclusion

With this setup, you now have a fully functional serverless data lake on AWS. As new data lands in S3, each Glue crawler run refreshes the table definitions, so Athena queries always see the latest schema. You can then analyze the data with Athena and visualize it in QuickSight.

Remember to always monitor costs and be aware of any potential charges, especially when querying large datasets with Athena or storing significant amounts of data in S3.

© Waqar Ahmed.