Redshift COPY Command errors and how to solve them Part-2

Introduction

In my previous post I tried to demonstrate what type of error may occur, and how to solve it, when we want to COPY data from an S3 region that differs from the region of the Redshift cluster. In this post I will cover a couple more COPY command exceptions and some possible solutions. To get an idea about the sample source file and the structure of the Redshift target table, please have a look at the “Preparing the environment to generate the error” section of my previous blog post.

Redshift COPY command error description:

ERROR: Load into table ‘xxxx’ failed.  Check ‘stl_load_errors’ system table for details.
Invalid digit, Value ‘O’, Pos 0, Type: Integer

How the Redshift COPY command errors are produced:

Now, once again, to load data into the orders table, execute the following COPY command (assuming the S3 bucket and the Redshift cluster reside in the same region).

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' ;

This time we get another exception and the error message is:

ERROR: Load into table ‘xxxx’ failed.  Check ‘stl_load_errors’ system table for details.

This error is generic; to get the details we need to query the “stl_load_errors” system table. You can query this table directly, but the query below is a bit more helpful. It picks up the most recent COPY error message generated when a data loading operation was performed on the ‘orders’ table.

SELECT TOP 1
  le.starttime,
  le.filename,
  le.line_number,
  le.colname,
  le.type AS column_datatype,
  le.raw_line AS error_record_line_data,
  le.raw_field_value AS attempt_to_insert_value,
  le.err_reason
FROM stl_load_errors le
INNER JOIN
  (SELECT TOP 1 id
   FROM stv_tbl_perm
   WHERE name = 'orders') tp
ON le.tbl = tp.id
ORDER BY le.starttime DESC;

Here is the output of the query:
Figure: Copy error details
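If you have just executed the COPY command in the same session, a shorter alternative is to filter stl_load_errors by the query ID of that load. The sketch below assumes you can use Redshift's PG_LAST_COPY_ID() function, which returns the query ID of the most recently executed COPY command:

SELECT
  le.starttime,
  le.filename,
  le.line_number,
  le.colname,
  le.type AS column_datatype,
  le.raw_line AS error_record_line_data,
  le.raw_field_value AS attempt_to_insert_value,
  le.err_reason
FROM stl_load_errors le
WHERE le.query = PG_LAST_COPY_ID();   -- errors from the most recent COPY in this session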

Reason for the Redshift COPY command errors:

First of all, have a look at the “err_reason” column. It displays “Invalid digit, Value ‘O’, Pos 0, Type: Integer”. On its own, this message is not very helpful for understanding what actually went wrong, so let us examine the other column values of the query output.

Have a look at the “line_number” column. Its value indicates which line failed during the execution of the COPY command. Since “line_number” is 1 here, the COPY command failed while inserting the first line of the source file (s3://sourcedatainorig/order.txt).

Now we need to see what the first line of the source data file contains. You do not need to open the source text file to do this; the row data is available in the “error_record_line_data” column of the query output above. Its value is as follows:

“CustomerName   OrderID   OrderDate   City   UnitPrice   Quantity   ShipRegion”

After having a look at line #1 of the source file, we can see that it is the file header row. The COPY command failed to insert the header row’s values into the pertinent columns of the orders table.

Let us do a bit more analysis to find out which specific column of the Redshift orders table refused to store the value during data load. Have a look at the “colname”, “column_datatype” and “attempt_to_insert_value” columns of the query output once again. From these columns we can see colname=orderid, column_datatype=”int4” and attempt_to_insert_value=”OrderID”. It is now clear that the COPY command attempted to insert the character value “OrderID” into the integer-typed orderid column. As a result, the COPY command generated the “Invalid digit, Value ‘O’, Pos 0, Type: Integer” error and terminated the entire data loading operation.
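You can reproduce this conversion failure in isolation. The statement below is only an illustration of what COPY is effectively attempting for the orderid column; in Redshift it should fail with the same “Invalid digit” message:

SELECT CAST('OrderID' AS INT);   -- fails: Invalid digit, Value 'O', Pos 0, Type: Integer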

So we now know that the header row is responsible for this data loading failure. However, this is not the only cause of this type of error. Even beyond the header row, this error may occur whenever any column value in the source data file has a data type that is not compatible with (or not automatically convertible to) the corresponding destination column of the target Redshift table.

How to solve it:

Ignoring the first row (header row) of the source file during the COPY command.

If you get an error message like “Invalid digit, Value ‘O’, Pos 0, Type: Integer”, try executing your COPY command after eliminating the header row. Use the IGNOREHEADER parameter in your COPY command to ignore the first line of the data file. The COPY command will then look like this:

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' IGNOREHEADER 1;

When you use IGNOREHEADER 1, the COPY command ignores the top row of the file. Instead of 1, any integer value can be used to skip the top N rows, as shown below.
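For example, if your file hypothetically had two descriptive lines before the data rows, you could skip both of them like this:

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' IGNOREHEADER 2;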

Loading data using proper transformation logic that ignores unexpected, error-prone values.

You can use any of the popular third-party ETL tools to eliminate garbage values from your source data file.

Loading data into a temporary staging table in which all columns are of VARCHAR type.

If your source data file contains too many unexpected, error-prone or garbage values and you do not want to use any third-party ETL tool, you can try creating a temporary staging table with all VARCHAR columns. In that case, you should not get such error messages during data load. After a successful load, use custom SQL statements to clean the data and transfer it into the strongly typed main table.
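A minimal sketch of this approach is shown below. The staging table name, VARCHAR lengths and cleaning rule are illustrative only; adjust the casts and filters to match your own data:

CREATE TEMP TABLE orders_staging
(
customername     VARCHAR(100),
orderid          VARCHAR(100),
orderdate        VARCHAR(100),
city             VARCHAR(100),
unitprice        VARCHAR(100),
quantity         VARCHAR(100),
shipregion       VARCHAR(100)
);

COPY orders_staging FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t';

-- Clean and convert while moving rows into the strongly typed main table
INSERT INTO orders
SELECT
  customername,
  CAST(orderid AS INT),
  CAST(orderdate AS DATE),
  city,
  CAST(unitprice AS FLOAT),
  CAST(quantity AS INT),
  shipregion
FROM orders_staging
WHERE orderid ~ '^[0-9]+$';   -- keep only rows whose orderid is numeric

Because every staging column is VARCHAR, the header row (and any other garbage value) loads without error and can be filtered out afterwards with plain SQL.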

I hope the information above will be helpful.




Redshift COPY Command errors and how to solve them Part-1

Introduction:

Amazon Redshift is an optimized, highly scalable data warehouse system. It offers fast and robust performance even for very large numbers of records. In a typical data warehouse, loading data from various sources is always a challenging task. Redshift addresses this challenge through the COPY command, which is the recommended way to load huge volumes of data. Redshift is a comparatively new technology, and people face various Redshift COPY command errors during bulk data loads. In this post, I try to address one such error along with a probable solution.

Redshift COPY command error description:

[Error Code: 0, SQL State: XX000]  ERROR: S3ServiceException: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint

Reason for the Redshift COPY command errors:

The S3 bucket containing the source file is not in the same region as the Redshift cluster.

Preparing the environment to generate the Redshift COPY Command errors:

In order to demonstrate the COPY command and become acquainted with the different error messages, we will create a sample table like the one below:

CREATE TABLE orders
(
CustomerName     VARCHAR(50) NOT NULL,
orderid          INT NOT NULL,
orderdate        DATE NOT NULL,
city             VARCHAR(20) NOT NULL,
unitprice        FLOAT NOT NULL,
quantity         INT NOT NULL,
shipregion       VARCHAR(20) NULL
);

I have a sample text file with 100 records; you can download it from here. The file is tab delimited. I will upload this test file into an S3 bucket in order to generate and fix all possible COPY-related errors. The file looks like the one below:

Figure: Sample text file. 

How these Redshift COPY command errors are produced:

Now let us assume I have a cluster in the “us-east” region.
Figure 1: Redshift cluster region

But my source file resides in another (us-west) region.
Figure 2: S3 bucket region

Now I will execute the following COPY command to load the sample file into the Redshift table.

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' ;

I got the following error message:
Figure 3: S3ServiceException error

How to solve these Redshift COPY command errors:

  1. By creating the S3 bucket in the same region as the Redshift cluster.
  2. By specifying the REGION parameter in the COPY command.

By creating the S3 bucket in the same region as the Redshift cluster.

  1. Go to the Redshift console and find out your cluster region (see Figure 1). Assume my cluster is in the US-East region.
  2. Go to the S3 console and create a bucket in the same region.
  3. Upload your file to the newly created bucket.
  4. Modify the COPY command's source file path accordingly and execute it, as shown below.
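For example, if the newly created same-region bucket were named sourcedataineast (a hypothetical name), the command would simply point at it:

COPY orders FROM 's3://sourcedataineast/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' ;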

By specifying the REGION parameter in the COPY command.

If you really need to load data into a Redshift table from an S3 bucket that resides in another region, you need to specify the REGION parameter in your COPY command. To get the region name of the file in S3, follow the steps below:

  1. Go to S3 and open the bucket.
  2. Select the file (order.txt), right-click on it and choose Properties.
  3. Have a look at the “Link” text in the Properties pane, as shown below:

Figure 4: S3 region shown in the file's Link property

  4. Get the region (us-west-2) from the Link property and compose the COPY command like the one below, adding the “REGION” parameter.

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' REGION 'us-west-2' ;

The COPY command now executes without the “S3ServiceException” error. Still, the COPY command does not succeed. I will try to cover the error details and the possible resolution in the subsequent post.