Tuesday, July 23, 2019

Issues related to S3 COPY command for Redshift

We use the S3 COPY command to move data from S3 into a Redshift table. Make sure the schema for the Redshift table is created before running your COPY command. I'm writing this post to log all the COPY command errors I faced, which might help others save time.

The COPY command that I have used for a while:

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
NULL AS 'NULL'
EMPTYASNULL
delimiter ','
IGNOREHEADER 1;

Problem 1: Faced an issue with delimiters on a CSV file. The delimiter for a CSV file is the comma, but the problem with keeping a bare comma as the delimiter is that COPY fails with the error below when the data itself also contains commas.

Error Code:  1213, Error Description: Missing data for not-null field

AWS Documentation link that helped in solving this issue:
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#load-from-csv
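A small Python sketch (the sample values are mine) of why a bare comma delimiter fails on quoted CSV fields, and why a real CSV parse works, which is essentially what the csv option gives you on the Redshift side:

```python
import csv
import io

# A row whose second field contains a comma; in a proper CSV it is quoted.
line = '1,"red, large widget"'

# A naive split on ',' breaks the quoted field into two pieces,
# so the row appears to have the wrong number of columns.
naive = line.split(',')
print(len(naive))  # 3 fields instead of 2

# A CSV parser honors the quotes and recovers the two real fields.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['1', 'red, large widget']
```

The extra "field" produced by the naive split is what leaves a NOT NULL column with missing data.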

Solution 1.1: Since the issue is with the comma delimiter, I specified the file format as CSV and ignored the header row. This fixed the issue.

copy <Redshift Table Name>
from 's3://<location of the file>'
iam_role '<iam role arn>'
csv
IGNOREHEADER 1;

Solution 1.2: Using ESCAPE and REMOVEQUOTES should work as well:

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
delimiter ','
escape
REMOVEQUOTES
IGNOREHEADER 1;
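For this variant the input file itself must have embedded delimiters backslash-escaped. A minimal sketch of preparing such a line (the `escape_field` helper is my own illustration, not part of any Redshift tooling):

```python
def escape_field(value: str, delimiter: str = ',') -> str:
    # Escape the backslash itself first, then the delimiter, so the file
    # matches what COPY's ESCAPE option expects to unescape at load time.
    return value.replace('\\', '\\\\').replace(delimiter, '\\' + delimiter)

row = ['1', 'red, large widget']
print(','.join(escape_field(f) for f in row))
# 1,red\, large widget
```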


Problem 2: Redshift doesn't support materialized views (as of this writing), and views cannot be loaded with the COPY command. I tried copying views directly to Redshift from S3, which resulted in the error below:

Error: psycopg2.errors.InternalError_: Invalid table
DETAIL:
-----------------------------------------------
error:    Invalid table
code:     8001
context:  ************ is not a table. Views cannot be loaded with the COPY command.
query:    0
location: compression_analyzer.cpp:432
process:  padbmaster [pid=16256]
-----------------------------------------------

Solution 2.1: Create the view as a table on the source side and then COPY it, if you don't care whether it ends up as a view or a table on Redshift.

Solution 2.2: First COPY all the underlying tables, and then CREATE VIEW on Redshift.

FYI, problem with JSON format:

The format of the file you are trying to copy also plays an important role in COPY command issues. I'm using CSV files for most of the test runs because I hit a 4 MB per-object limit while dealing with the JSON format, and Redshift's JSON parsing gives you more problems to deal with. FYI, here is the error info I got related to JSON:


psycopg2.errors.InternalError_: The total size of the JSON object exceeds the max limit of 4194304 bytes
DETAIL:
----------------------------------------------
error: The total size of the JSON object exceeds the max limit of 4194304 bytes
code:  8001
----------------------------------------------
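One mitigation (my own sketch, not from the error message above) is to avoid shipping one giant JSON array: since the 4194304-byte limit applies to each JSON object, writing one object per line keeps every record well under it, and COPY with the json 'auto' option maps each object to a row.

```python
import json

# Each record is serialized as its own JSON object on its own line,
# so no single object approaches Redshift's 4 MB per-object limit.
MAX_JSON_BYTES = 4194304

records = [{"id": i, "name": f"item-{i}"} for i in range(3)]
lines = "\n".join(json.dumps(r) for r in records)

assert all(len(l.encode()) < MAX_JSON_BYTES for l in lines.splitlines())
print(lines)
```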


Problem 3: "Delimiter not found" while running the COPY command

Try FILLRECORD, which loads missing trailing columns as zero-length strings or NULLs:

copy <redshift table name>
from '<s3 file location>'
iam_role '<iam role arn>'
csv
IGNOREHEADER 1
FILLRECORD;
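A rough Python sketch of what FILLRECORD does server-side (the padding logic is my illustration, not Redshift's implementation): a record with fewer fields than the table has columns is padded out with NULLs.

```python
# Pad a short record to the table's column count with NULLs (None here).
def fill_record(row, width):
    return row + [None] * (width - len(row))

print(fill_record(['1', 'alice'], 4))
# ['1', 'alice', None, None]
```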


Problem 4: Issues with timestamps when the file is comma separated but you are not using the csv format option

Solution:

copy <redshift table name>
from 's3://location' 
iam_role 'iam role arn'
delimiter ','
escape
NULL as 'null'
REMOVEQUOTES
TIMEFORMAT 'auto';
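Before loading, it can help to check which timestamp layouts actually appear in the file; TIMEFORMAT 'auto' asks Redshift to detect the format at load time, and a quick local check like this hypothetical sketch (format list and helper are my assumptions) can confirm your data is parseable at all:

```python
from datetime import datetime

# Candidate layouts you expect in the file; extend as needed.
FORMATS = ['%Y-%m-%d %H:%M:%S', '%m/%d/%Y %H:%M:%S', '%Y-%m-%dT%H:%M:%S']

def parses(value: str) -> bool:
    # True if any candidate layout matches the value.
    for fmt in FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

print(parses('2019-07-23 10:30:00'))  # True
print(parses('not-a-timestamp'))      # False
```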
