Getting Started with AWS Data Pipeline and Loading S3 File Into MySQL


Getting started with AWS Data Pipeline

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.

AWS Data Pipeline sample workflow
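In a pipeline definition, that ordering is expressed with the dependsOn field on an activity. A minimal sketch, assuming two ShellCommandActivity objects (the IDs, commands, and the MyEc2Resource reference are placeholders, not part of the sample pipeline built later):

    {
      "objects": [
        {
          "id": "StageDataActivity",
          "name": "StageDataActivity",
          "type": "ShellCommandActivity",
          "command": "echo staging data",
          "runsOn": { "ref": "MyEc2Resource" }
        },
        {
          "id": "LoadDataActivity",
          "name": "LoadDataActivity",
          "type": "ShellCommandActivity",
          "command": "echo loading data",
          "runsOn": { "ref": "MyEc2Resource" },
          "dependsOn": { "ref": "StageDataActivity" }
        }
      ]
    }

Here LoadDataActivity only starts after StageDataActivity finishes successfully.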
Default IAM Roles

AWS Data Pipeline requires IAM roles to determine what actions your pipelines can perform and who can access your pipeline’s resources.

The AWS Data Pipeline console creates the following roles for you:

DataPipelineDefaultRole

DataPipelineDefaultResourceRole

DataPipelineDefaultRole:
 
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:List*",
            "s3:Put*",
            "s3:Get*",
            "s3:DeleteObject",
            "dynamodb:DescribeTable",
            "dynamodb:Scan",
            "dynamodb:Query",
            "dynamodb:GetItem",
            "dynamodb:BatchGetItem",
            "dynamodb:UpdateTable",
            "ec2:DescribeInstances",
            "ec2:DescribeSecurityGroups",
            "ec2:RunInstances",
            "ec2:CreateTags",
            "ec2:StartInstances",
            "ec2:StopInstances",
            "ec2:TerminateInstances",
            "elasticmapreduce:*",
            "rds:DescribeDBInstances",
            "rds:DescribeDBSecurityGroups",
            "redshift:DescribeClusters",
            "redshift:DescribeClusterSecurityGroups",
            "sns:GetTopicAttributes",
            "sns:ListTopics",
            "sns:Publish",
            "sns:Subscribe",
            "sns:Unsubscribe",
            "iam:PassRole",
            "iam:ListRolePolicies",
            "iam:GetRole",
            "iam:GetRolePolicy",
            "iam:ListInstanceProfiles",
            "cloudwatch:*",
            "datapipeline:DescribeObjects",
            "datapipeline:EvaluateExpression"
          ],
          "Resource": [
            "*"
          ]
        }
      ]
    }
DataPipelineDefaultResourceRole:
 
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:List*",
            "s3:Put*",
            "s3:Get*",
            "s3:DeleteObject",
            "dynamodb:DescribeTable",
            "dynamodb:Scan",
            "dynamodb:Query",
            "dynamodb:GetItem",
            "dynamodb:BatchGetItem",
            "dynamodb:UpdateTable",
            "rds:DescribeDBInstances",
            "rds:DescribeDBSecurityGroups",
            "redshift:DescribeClusters",
            "redshift:DescribeClusterSecurityGroups",
            "cloudwatch:PutMetricData",
            "datapipeline:*"
          ],
          "Resource": [
            "*"
          ]
        }
      ]
    }
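Pipeline objects pick up these roles through the role and resourceRole fields, usually set once on the Default object so every activity and resource inherits them. A minimal sketch (the log bucket URI is a placeholder):

    {
      "id": "Default",
      "name": "Default",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "pipelineLogUri": "s3://my-pipeline-logs/"
    }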
Error Message:

Unable to create resource for @EC2ResourceObj_2017-05-05T04:25:32 due to: No default VPC for this user (Service: AmazonEC2; Status Code: 400; Error Code: VPCIdNotSpecified; Request ID: bd2f3abb-d1c9-4c60-977f-6a83426a947d)

Resolution:

If you look at your VPCs, you will notice that no default VPC is configured. When AWS Data Pipeline launches an EC2 instance, it cannot work out which VPC to use on its own, so the subnet must be specified explicitly in the resource configuration.

SubnetId for the EC2 resource

Default VPC
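In the pipeline definition this corresponds to setting subnetId on the Ec2Resource object. A sketch, with placeholder subnet ID, instance type, and timeout:

    {
      "id": "EC2ResourceObj",
      "name": "EC2ResourceObj",
      "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "subnetId": "subnet-0123456789abcdef0",
      "terminateAfter": "2 Hours",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    }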
Build a Sample Data Pipeline to Load an S3 File into a MySQL Table:

Use cases for AWS Data Pipeline
Set up a sample pipeline in our development environment
Import a text file from an AWS S3 bucket into an Aurora instance
Send out notifications through SNS to [email protected]
Export / import the data pipeline definition

Prerequisites:

A MySQL (Aurora) instance
Access to invoke Data Pipeline with appropriate permissions
A target database and target table
An SNS topic set up with the right configuration

Steps to Follow:

Create the data pipeline and give it a name
Create the MySQL schema and table
Configure your EC2 resource (make sure the EC2 instance has access to the MySQL instance)
If the MySQL instance only allows certain IPs or a specific VPC, configure your EC2 resource in the same VPC or subnet
Configure the data source and the appropriate data format (notice this is a pipe-delimited file, not a CSV file)
Configure your SQL insert statement
Configure SNS notifications for the PASS / FAIL activity
Run your pipeline and troubleshoot any errors that occur


Data Pipeline JSON Definition:

AWS_Data_PipeLine_S3_MySQL_Defintion.json
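The linked file is not reproduced here, but a pipeline of this shape boils down to a handful of objects: an S3 data node with a data format, a MySQL data node carrying the insert statement, a copy activity that runs on the EC2 resource, and SNS alarms wired to onSuccess / onFail. A condensed sketch (the connection string, bucket path, topic ARN, and credentials are placeholders; it reuses the EC2ResourceObj shown earlier, and the UsersDataFormat object is shown under Errors Encountered below):

    {
      "objects": [
        {
          "id": "S3InputDataNode",
          "name": "S3InputDataNode",
          "type": "S3DataNode",
          "filePath": "s3://my-bucket/input/users_01.txt",
          "dataFormat": { "ref": "UsersDataFormat" }
        },
        {
          "id": "MySqlOutputDataNode",
          "name": "MySqlOutputDataNode",
          "type": "MySqlDataNode",
          "connectionString": "jdbc:mysql://my-aurora-endpoint:3306/ETLStage",
          "username": "etl_user",
          "*password": "etl_password",
          "table": "users_01",
          "insertQuery": "INSERT INTO users_01 (userid, username, firstname, lastname, city, state, email, phone, likesports, liketheatre, likeconcerts, likejazz, likeclassical, likeopera, likerock, likevegas, likebroadway, likemusicals) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
        },
        {
          "id": "S3ToMySqlCopyActivity",
          "name": "S3ToMySqlCopyActivity",
          "type": "CopyActivity",
          "input": { "ref": "S3InputDataNode" },
          "output": { "ref": "MySqlOutputDataNode" },
          "runsOn": { "ref": "EC2ResourceObj" },
          "onSuccess": { "ref": "SuccessAlarm" },
          "onFail": { "ref": "FailureAlarm" }
        },
        {
          "id": "SuccessAlarm",
          "name": "SuccessAlarm",
          "type": "SnsAlarm",
          "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-notifications",
          "subject": "S3 to MySQL load succeeded",
          "message": "The activity #{node.name} finished successfully.",
          "role": "DataPipelineDefaultRole"
        },
        {
          "id": "FailureAlarm",
          "name": "FailureAlarm",
          "type": "SnsAlarm",
          "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-notifications",
          "subject": "S3 to MySQL load failed",
          "message": "The activity #{node.name} failed.",
          "role": "DataPipelineDefaultRole"
        }
      ]
    }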

Create Table SQL:
 
    create table users_01 (
        userid integer not null primary key,
        username char(8),
        firstname varchar(30),
        lastname varchar(30),
        city varchar(30),
        state char(2),
        email varchar(100),
        phone char(14),
        likesports varchar(100),
        liketheatre varchar(100),
        likeconcerts varchar(100),
        likejazz varchar(100),
        likeclassical varchar(100),
        likeopera varchar(100),
        likerock varchar(100),
        likevegas varchar(100),
        likebroadway varchar(100),
        likemusicals varchar(100)
    );
 
Insert SQL:

    INSERT INTO `ETLStage`.`users_01`
        (`userid`, `username`, `firstname`, `lastname`, `city`, `state`,
         `email`, `phone`, `likesports`, `liketheatre`, `likeconcerts`,
         `likejazz`, `likeclassical`, `likeopera`, `likerock`, `likevegas`,
         `likebroadway`, `likemusicals`)
    VALUES
        (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);
Errors Encountered:

errorMessage: Quote character must be defined in record format

https://stackoverflow.com/questions/26111111/data-pipeline-error-on-a-template-from-rds-to-s3-copy

You can use “TSV” type as your custom format type and provide:

  • “Column separator” as pipe (|),
  • “Record separator” as newline (\n),
  • “Escape Char” as backslash (\) or any other character you want.
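In the pipeline definition that translates to a data format object along these lines (this is the UsersDataFormat object referenced in the sketch above; the column list simply mirrors the table columns):

    {
      "id": "UsersDataFormat",
      "name": "UsersDataFormat",
      "type": "TSV",
      "columnSeparator": "|",
      "recordSeparator": "\n",
      "escapeChar": "\\",
      "column": [
        "userid",
        "username",
        "firstname",
        "lastname",
        "city",
        "state",
        "email",
        "phone",
        "likesports",
        "liketheatre",
        "likeconcerts",
        "likejazz",
        "likeclassical",
        "likeopera",
        "likerock",
        "likevegas",
        "likebroadway",
        "likemusicals"
      ]
    }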

errorId: ActivityFailed:SQLException
errorMessage: No value specified for parameter
errorMessage: Parameter index out of range (1 > number of parameters, which is 0).
errorMessage: Incorrect integer value: 'FALSE' for column 'likesports' at row 1

Ensure the table column data types are set correctly. By default, MySQL does not convert the strings TRUE / FALSE into a boolean (TINYINT) value, so loading 'FALSE' into an integer column fails.

errorMessage: Parameter index out of range (1 > number of parameters, which is 0).
errorMessage for Load script: ERROR 1227 (42000) at line 1: Access denied; you need (at least one of) the LOAD FROM S3 privilege(s) for this operation

On Aurora MySQL this means the database user is missing the LOAD FROM S3 privilege. Grant it to the user (GRANT LOAD FROM S3 ON *.* TO 'user'@'%') and make sure the Aurora cluster has an IAM role attached that allows access to the S3 bucket.


