
Copy data from S3 to Aurora Postgres

Scenario 1: To copy data from S3 to Aurora Postgres (version 9 or later)

How? Use the aws_s3.table_import_from_s3 function to load the data from S3 into Aurora Postgres.
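The aws_s3 extension must be installed in the target database first (CASCADE pulls in the aws_commons dependency). A minimal sketch, run once as a privileged user:

-- Install the S3 import extension and its dependency
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;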

Steps:

A sample file with columns id, prefix and mstr_id is uploaded to S3.
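For example, the file can be uploaded with the AWS CLI (the bucket name and prefix below are placeholders matching the import call further down):

# Upload the sample CSV to the bucket/prefix used in the import
aws s3 cp part_file_00.csv s3://<bucket-name>/MSTR_DATA/part_file_00.csv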

Create the target table on Aurora Postgres with the required columns.

DROP TABLE IF EXISTS core.mstr;

CREATE TABLE core.mstr (
	id varchar(300) NULL,
	prefix varchar(300) NULL,
	mstr_id float8 NULL
);

Copy command to transfer the data from S3 to Aurora Postgres

SELECT aws_s3.table_import_from_s3(
   'core.mstr',
   'id,prefix,mstr_id',
   '(format csv, header true)',
   '<bucket-name>',
   'MSTR_DATA/part_file_00.csv',
   'us-east-2',
   '<access key>', '<secret key>'
);
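The function returns a summary line with the number of rows imported; a quick sanity check against the target table:

-- Verify the imported row count
SELECT count(*) FROM core.mstr;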


Note: If an IAM role with S3 access is attached to the Aurora cluster, the access and secret keys need not be specified.

SELECT aws_s3.table_import_from_s3(
   'core.mstr',
   'id,prefix,mstr_id', 
   '(format csv, header true)',
   '<bucket-name>',
   'MSTR_DATA/part_file_00.csv',
   'us-east-2'
);
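Attaching such a role is done outside of SQL; a sketch with the AWS CLI (the cluster identifier, account ID and role name are placeholders):

# Associate an S3-import role with the Aurora cluster (all identifiers are placeholders)
aws rds add-role-to-db-cluster \
    --db-cluster-identifier <cluster-identifier> \
    --role-arn arn:aws:iam::<account-id>:role/<s3-import-role> \
    --feature-name s3Import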


Note: The aws_s3.table_import_from_s3 function is not supported for Aurora Postgres versions earlier than 9.


Scenario 2: To copy data from S3 to Aurora Postgres (versions older than 9)

How? Use a Sqoop job to export the data from S3 to Aurora Postgres.

Steps:

Upload data files to S3.

Create the target table on Aurora Postgres with the required columns.

DROP TABLE IF EXISTS core.mstr;

CREATE TABLE core.mstr (
	id varchar(300) NULL,
	prefix varchar(300) NULL,
	mstr_id float8 NULL
);

Spin up an EMR cluster
  • Must have access to both S3 and Aurora Postgres
  • Check for connectivity (confirm whether the EMR cluster and RDS reside in the same VPC); see the sketch after this list:
    • dig <Aurora hostname>
    • nc -vz <Aurora hostname> 5432 (the command should report that the connection succeeded)
  • Make sure the security groups are properly assigned to the EMR cluster:
    • The RDS security group must allow traffic from the CORE node of the EMR cluster
    • Add the core node's security group as a source in the RDS security group
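Once the port check passes, an end-to-end test with the Postgres client confirms that credentials and security groups are in order (assuming the postgresql client is installed on the node; hostname and database below are placeholders, and the command prompts for the password):

# Placeholder endpoint and database; prompts for the password
psql -h <Aurora hostname> -p 5432 -U user_etl -d <database> -c "SELECT 1;"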

Sqoop Job

echo "Exporting MSTR table from S3 to Postgres"
echo "Sqoop Job in Progress"

sqoop export \
  --connect <jdbc URL> \
  --input-null-string 'NULL' \
  --input-null-non-string 'NULL' \
  --table mstr \
  --export-dir s3://<bucket_name>/MSTR/ \
  --username user_etl \
  -P \
  -- --schema test

echo "Sqoop job executed successfully"
echo "Data is now available on Aurora Postgres"

