
Connecting AWS S3 with Snowflake: SnowPipe Auto-Ingest

Wayne / Devscover

Working at a big retail client in the UK, we had a problem: some very sensitive data sat in S3 buckets, and we needed to move that data into Snowflake without any person having access to it, or being able to see it.

Many people drew big diagrams involving multiple Lambdas, SQS, SNS and other AWS services to make this work; however, we started looking at a new Snowflake feature called “Auto-Ingest”.

This is how we set it up.

Setup

The setup for this may seem straightforward; however, there are some constraints within our organisation’s configuration that make this somewhat more complicated. Therefore this guide serves to detail how to set up an Auto-Ingest pipeline.

Firstly, it’s important to conceptualise how this works. In this setup, we are actually using the S3 bucket as a staging area. It’s referred to as an “external stage”.

We then create notifications via SQS from that bucket, and a pipe uses them to load new files into a table.

Table

To create the table, we first need to run:

USE SCHEMA GDPR_DATA;
CREATE OR REPLACE TABLE TestGDPRTable (DATA variant);

You can then confirm the table was created with:

SELECT * FROM GDPR_DATA.TestGDPRTable;

Stage

We then need to create the stage. This is essentially telling Snowflake to use an S3 bucket as a stage:

CREATE OR REPLACE STAGE TestGDPRStage
  url = 's3://gdpr-data-holding/'
  credentials = (aws_role = 'arn:aws:iam::0000:role/snowflake_access_role')
  encryption = (type = 'AWS_SSE_KMS' kms_key_id = 'aws/key');

The ARN here is for a role in our AWS account containing the bucket. The role doesn’t have to actually exist yet. You can confirm the stage was created with:

SHOW STAGES;

Note this stage won’t actually work until we have configured security, later on. If you are unable to create it at this point, wait until you get to the “Create A Role” section.

Pipe

We now create a pipe that transfers data from the stage to the table:

CREATE OR REPLACE PIPE GDPR_DATA.TestGDPRPipe auto_ingest = true as
  copy into GDPR_DATA.TestGDPRTable
  from @GDPR_DATA.TestGDPRStage
  file_format = (type = 'JSON');

As with the stage and table, we can confirm the pipe has been created with:

SHOW PIPES;

S3 Notification

At this point we need to create a notification on the S3 bucket that targets the SQS queue behind the aforementioned Snowflake pipe. To do this, log into AWS, go to your S3 bucket, and under Properties configure an Event Notification:

Name: Name of the event notification (e.g. Auto-ingest-UseCaseName).

Events: Select the ObjectCreate (All) option. We will need to use the prefix processed_<usecaseName>.

Send to: Select SQS Queue from the dropdown list.

SQS: Select Add SQS queue ARN from the dropdown list.

SQS queue ARN: Paste the SQS queue ARN (the notification_channel column) from the SHOW PIPES output.
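
If you’d rather not copy the ARN out by eye, here’s a minimal sketch that pulls it from the SHOW PIPES output instead. RESULT_SCAN reads the result of the previous statement; SHOW output column names are lowercase, so they must be quoted:

SHOW PIPES;

SELECT "notification_channel"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "name" = 'TESTGDPRPIPE';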

Security

Theoretically we can follow the guide to setting up security from Snowflake – https://docs.snowflake.net/manuals/user-guide/data-load-s3-config.html – however this doesn’t fully work, as even users with admin access on an account in our organisation do not have iam:UpdateAssumeRolePolicy permissions.

Therefore we must do the following, inside the AWS account with the bucket you are trying to use:

Create a Policy

Under IAM create a policy named snowflake_access with this JSON policy document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
              "s3:PutObject",
              "s3:GetObject",
              "s3:GetObjectVersion",
              "s3:DeleteObject",
              "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<bucket_name>/<prefix>/*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<bucket_name>",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "mydata/*"
                    ]
                }
            }
        }
    ]
}

If you don’t use a prefix, you can omit the prefix from the resource name and completely remove this section:

"Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "mydata/*"
                    ]
                }
            }

Create A Role

Now we create a role for the Snowflake user.

  • The type of trusted entity should be another AWS account. Specifically, Snowflake’s account ID: 520339422657
  • You must request an external ID; for now we must use the placeholder 0000 (we will replace it later).
  • When attaching a policy, ensure you attach the policy created in the step above.
  • The name you use for this must match the name of the role set in the stage above, so in this case “snowflake_access_role” will generate a role with this ARN:
arn:aws:iam::0000:role/snowflake_access_role

At this point if you were unable to create a stage earlier on, run the SQL for CREATE STAGE.

Update Role

Now we must update the role with details from the STAGE:

DESC STAGE TestGDPRStage;
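
The two values we’ll need shortly sit in the property and property_value columns of that output. A minimal sketch for picking them out, again using RESULT_SCAN over the previous statement:

DESC STAGE TestGDPRStage;

-- RESULT_SCAN reads the output of the immediately preceding statement
SELECT "property", "property_value"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "property" IN ('AWS_EXTERNAL_ID', 'SNOWFLAKE_IAM_USER');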

You would now normally edit the Trust Relationship of the role mentioned above, setting the values of:

  • sts:ExternalId to the value listed as AWS_EXTERNAL_ID in the describe stage output
  • AWS under Principal to the value of SNOWFLAKE_IAM_USER

However, even with Admin access you won’t be able to do this; therefore the best way is to create a new role, this time replacing the external ID of 0000 with the value of AWS_EXTERNAL_ID.

According to the guide, we should also change the AWS principal to the SNOWFLAKE_IAM_USER value; however, you can’t do this using the wizard, and therefore the trust will be granted to the Snowflake account root. This will, however, still work. You will then have a trust relationship policy document like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::520339422657:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "ORG_SFCRole=123_xxx"
        }
      }
    }
  ]
}

Finally

At this point you should be good to go. You may need to re-run the create pipe command, as you have recreated your stage.

Note that you SHOULD NOT re-create your stage, else you will have to go through the process of applying a role policy again, with the new output from that stage.

You can attempt to list all the files in your S3 bucket:

list @TestGDPRStage;

You can view the status of your pipe with:

SELECT SYSTEM$PIPE_STATUS('GDPR_DATA.TestGDPRPipe');
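
The status comes back as a JSON string, so PARSE_JSON() makes it easier to read. A small sketch pulling out two of the documented fields (executionState and pendingFileCount):

SELECT
    PARSE_JSON(SYSTEM$PIPE_STATUS('GDPR_DATA.TestGDPRPipe')):executionState::string   AS execution_state,
    PARSE_JSON(SYSTEM$PIPE_STATUS('GDPR_DATA.TestGDPRPipe')):pendingFileCount::number AS pending_files;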

The pipe will only pull files from your stage into the table when there is an SQS notification. This means that any files already in the bucket at the point of creating all this won’t be automatically pulled into the table. You can, however, run a command to start the process off for existing files:

ALTER PIPE GDPR_DATA.TestGDPRPipe REFRESH;

In my experience, it can take a fairly long time (1+ hour) for even small files to appear in the table. After a period of waiting, you can check that they have appeared with:

SELECT * FROM GDPR_DATA.TestGDPRTable;
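
If nothing appears, it’s worth checking whether the pipe has attempted any loads at all. A sketch using the INFORMATION_SCHEMA COPY_HISTORY table function (adjust the 24-hour window to suit):

SELECT file_name, last_load_time, status, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'GDPR_DATA.TESTGDPRTABLE',
    START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())));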

Good luck, and please let me know if this helped!

Additional Info

For our work, we simply wanted to get the JSON data from a file in S3 into a single column of a table. At this point we’d grant access on the table we created to another team, so they can move the data into whatever table/schema/format they wish using Snowflake’s PARSE_JSON() function.
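
For illustration, a minimal sketch of what that downstream step might look like. The field names here (customer_id, email) are made up; because the pipe loaded the files as JSON into a VARIANT column, path syntax works on it directly, and PARSE_JSON() is what you reach for when the JSON arrives as a plain string:

-- Hypothetical field names: customer_id and email are illustrative only.
SELECT
    DATA:customer_id::number AS customer_id,
    DATA:email::string       AS email
FROM GDPR_DATA.TestGDPRTable;

-- PARSE_JSON() converts a JSON string into a VARIANT:
SELECT PARSE_JSON('{"customer_id": 1}'):customer_id::number;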