Amazon Redshift – How to get actual data loading time of COPY command

In a couple of my earlier blog posts, I discussed some possible ways to solve Redshift COPY command errors and some steps to monitor an ongoing COPY command's progress. In today's post I will cover how to get the actual data loading time of the COPY command.

I mentioned in my "Monitoring data loading progress" post that when a COPY command is triggered, the actual data loading into the target Redshift table does not start immediately (especially for a newly created table). In other words, after executing the COPY command, if you monitor the running queries using the Redshift AWS Management Console or a custom SQL query, you will find that instead of running the actual COPY command, Redshift performs some analysis first. The following image shows an example of these pre-load analysis queries.
Figure 01: COPY analysis

These analyses include target table analysis, compression analysis and so on. When these analysis steps come to an end, the actual COPY command commences. See below:
Figure 02: Actual data loading (COPY is running)

Now, as an ETL or BI developer, your manager or client may sometimes ask you how much time it takes to load a certain big data file into a table. Generally, you will log in to the Redshift cluster, go to the "Loads" tab and get the details of the COPY command like below:
Figure 03: Data loading (COPY only) finish time.

The above image gives you some handy information about the COPY command: its start and end time, and the duration of the query. Generally you would hand this information to your client or manager. But this is only the duration of the actual data loading, not the entire execution time. To get the true COPY duration you need to include the COPY ANALYZE timing along with the actual COPY completion time.

From the console (see Figure 01) it is quite cumbersome to determine the COPY ANALYZE time, because the pre-COPY analysis is not completed in a single step; it is done in multiple steps. So, to get a more precise COPY ANALYZE timing, you need to aggregate the durations of all the steps, and for that you need to write a custom SQL statement against a Redshift system table. The following SQL statement returns the actual data loading time.

SELECT
  SUM("analyze_compression")      AS "analyze_compression",
  SUM("analyze_redshift_table")   AS "analyze_redshift_table",
  SUM("actual_copy_command")      AS "actual_copy_command",
  SUM("fetch_sample_data_of_target_table") AS "fetch_sample_data_of_target_table",
  SUM("transaction_commit")       AS "transaction_commit"
FROM(
  SELECT
   CASE
     WHEN t.text LIKE 'analyze compression phase%'
     THEN DATEDIFF(sec, starttime, endtime)
   END AS "analyze_compression",
   CASE
     WHEN t.text LIKE 'Analyze%'
     THEN DATEDIFF(sec, starttime, endtime)
   END AS "analyze_redshift_table",
   CASE
     WHEN t.text LIKE 'copy%'
     THEN DATEDIFF(sec, starttime, endtime)
   END AS "actual_copy_command",
   CASE
     WHEN t.text LIKE 'padb_fetch_sample%'
     THEN DATEDIFF(sec, starttime, endtime)
   END AS "fetch_sample_data_of_target_table",
   CASE
     WHEN t.text LIKE 'COMMIT%'
     THEN DATEDIFF(sec, starttime, endtime)
   END AS "transaction_commit"
 FROM svl_statementtext t
 WHERE t.xid=(SELECT xid
              FROM stl_query
              WHERE query=160249) -- Query ID is taken from Figure 03
) AS t;

The above query uses the "svl_statementtext" system view. This view contains a complete record of all the SQL commands that were executed during the data loading session. To get the metadata of the statements that ran within a single transaction, we filter on the "xid" column of the system view; "xid" is the transaction ID associated with the statement. Here, the xid is taken from another system table named "stl_query" using the query ID shown in the COPY completion log of the AWS Management Console ("Figure 03").
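If you do not have the query ID handy from the console, a possible alternative (a sketch, assuming the COPY statement text contains the target table name; "teststaging" below is only an example) is to look it up from "stl_query" directly:

```sql
-- Sketch: find the query ID and transaction ID of the most recent COPY
-- into a table named "teststaging" (adjust the pattern to your table)
SELECT q.query, q.xid, q.starttime, q.endtime
FROM stl_query q
WHERE q.querytxt ILIKE 'copy%teststaging%'
ORDER BY q.starttime DESC
LIMIT 1;
```

The returned "xid" can then be plugged into the aggregation query directly, skipping the console lookup.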

Now it is time to execute the above query and see the result. After executing the query, it returns the following output.
Figure 04: Actual data loading time of COPY command.

From the above image we found:

Analyze compression time: 49 seconds
Target table analysis time: 111 seconds
Actual COPY (data loading) time: 597 seconds
Fetch sample data (post-copy verification): 111 seconds
Transaction commit time: 0 seconds
Total: 850 seconds

If you compare the COPY-only (data loading) execution time in "Figure 03" with the "Actual COPY (data loading) time" in the table above, you will see the two values match: the image shows 9m 56s and the query output shows a COPY execution time of 597 seconds, which is effectively the same duration.

Finally, we have the actual data loading time of the COPY command, and the cumulative time is 14m 10s (850 seconds). Now you have a more precise and accurate data loading time. I hope this information is helpful.
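As a cross-check, instead of summing per-phase durations you can measure the wall-clock span of the whole transaction. This sketch reuses the same query ID from Figure 03; note it may differ slightly from the summed figure if there are idle gaps between statements:

```sql
-- Sketch: total elapsed time of the COPY transaction, measured from the
-- first statement's start to the last statement's end
SELECT DATEDIFF(sec, MIN(starttime), MAX(endtime)) AS total_seconds
FROM svl_statementtext
WHERE xid = (SELECT xid FROM stl_query WHERE query = 160249);
```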




Monitoring Redshift COPY command progress

The COPY command is the recommended way to load data from a source file into a Redshift table. The Redshift COPY command offers fast data loading along with several other facilities, and monitoring COPY command progress is one of them. This kind of load monitoring facility is unique compared to some other popular ETL tools. Such bulk load progress can be shown on an application dashboard to ensure a better user experience, especially while loading a large number of records into a Redshift table. COPY command progress can be monitored in various ways.

Monitoring Redshift COPY command progress using AWS Management Console

Redshift COPY command progress can easily be monitored using the AWS Management Console, which exposes a lot of information about an ongoing COPY command. The following steps describe how to monitor COPY command progress using the Management Console.

1. Go to Redshift cluster listing page.

ClusterListing

2. Select “Loads” tab from the Management Console
ClusterTabs

3. Find out the “Running” COPY command
RunningCopyCommand

4. Click on the "Load" column link, which will take you to the following COPY command details page.
RunningCopyDetail

This page shows the number of rows already loaded, the table and user name, and the overall percent completed. This information is really useful when loading big data files.

Monitoring Redshift COPY command progress using custom SQL

So far we have seen how to monitor a running COPY command's progress using the Redshift Management Console. But what will you do when you need to display this information on your company or client dashboard? No worries, Redshift has a system table that traces COPY command progress metadata: "stv_load_state".

"stv_load_state" stores information about a running COPY command temporarily. That means information can be found in this table only as long as the COPY command is in a running state; when the COPY command comes to an end, no information is available in the table for that particular command. The COPY command updates this table after loading every million records, so if your data file contains fewer than a million records, you are unlikely to get any information from this table.
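The simplest way to peek at this table is a direct select. The following minimal query (a sketch) lists any in-flight COPY along with its progress columns:

```sql
-- Sketch: show progress of any COPY currently updating stv_load_state
SELECT query, pct_complete, lines, current_file
FROM stv_load_state;
```

This returns one row per slice for each active load; joining it to the in-flight query text, as done next, makes the output easier to read.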

Here, to test COPY command progress, I have loaded a big data file into a table named "teststaging". The file has more than 15 million rows. In the S3 bucket, I have kept the file in compressed GZIP format; the file name is "bigdatafile.gz". To load the file into the staging table, I triggered the following COPY command.

copy teststaging from 's3://standard-seeddata/bigdatafile.gz' credentials 'aws_access_key_id=<your access key>;aws_secret_access_key=<your secret access key>' 
delimiter '\t' gzip;

In the client tool, I found that the COPY command was in progress. After a couple of seconds, to monitor what was going on, I executed the following query.

SELECT
 cpanalyze.query AS "queryid",
 cpanalyze.text AS querytext,
 cpanalyze.starttime,
 pct_complete,
 current_file,
 lines
FROM
 (
  SELECT
    i.query,
    i.text,
    i.starttime
  FROM
    STV_INFLIGHT i
  WHERE
    LOWER(i.text) LIKE 'copy%teststaging%'
  ) AS cpanalyze
LEFT OUTER JOIN
 (
  SELECT
   query,
   pct_complete,
   current_file,
   lines
 FROM
   stv_load_state
 WHERE
   bytes_loaded>0) AS cpload
 ON
   cpanalyze.query=cpload.query;

The query returns the following result:

copyanalyze

The query result shows that instead of running the actual COPY command, Redshift first analyzes the target table.

After a while (approximately 4-5 minutes later), I executed the above monitoring SQL query once again. Now I found that the actual COPY command was executing, and the pct_complete (% completed) column showed the progress value like below:

copyprogress

It shows that 8% of the load is already completed and that the COPY command has loaded 2000000 rows so far from the file.

Some time later, I executed the query again and found the load percentage was 80%, with 13000000 rows loaded. Finally, a couple of minutes later, I executed the query once more and did not get any output. As I said earlier, "stv_load_state" stores data only as long as the COPY command is in a running state, so the empty output indicates the COPY command has completed.

However, "stv_load_state" no longer returning rows does not by itself mean that the COPY command successfully committed the rows into the target Redshift table. To be sure that the COPY command finished loading data, we need to execute the following query:

SELECT
 l.query,
 RTRIM(l.filename),
 q.starttime,
 q.endtime
FROM
 stl_load_commits l,
 stl_query q
WHERE
 l.query=q.query
AND l.query=-- Put query Id here
AND EXISTS
 (
  SELECT
  *
  FROM
  stl_utilitytext
  WHERE
  xid=q.xid
  AND RTRIM("text")='COMMIT');

If the above query returns any records, it indicates that the COPY command completed successfully. For more information about this, have a look here.

Finally, from the above discussion we came to know that:

  1. After triggering a COPY command, the actual data load does not start immediately. Redshift performs some analysis on the target table first; after this analysis the actual data loading commences.
  2. We need to query the "stv_load_state" table to get an ongoing COPY command's progress.
  3. The COPY command writes to the stv_load_state table after loading every one million rows from the source file. So for a relatively small file with fewer than a million rows, COPY progress may not be monitorable through "stv_load_state".
  4. After a few progress monitoring outputs, if "stv_load_state" stops returning rows, that indicates the COPY command might have finished.
  5. To be sure whether the COPY command has actually finished, we need to run another query against the STL_LOAD_COMMITS, STL_QUERY and STL_UTILITYTEXT tables as described here.
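The two states can also be inspected in one go. The following sketch (the query ID is hypothetical) reports either in-flight progress or a finished load; per point 5 above, a row in stl_load_commits should still be paired with the COMMIT check before declaring success:

```sql
-- Sketch: 'running' rows come from stv_load_state while the COPY is active;
-- once it finishes, matching rows appear in stl_load_commits instead
SELECT 'running' AS state, query, pct_complete, lines
FROM stv_load_state
WHERE query = 160249            -- hypothetical query ID
UNION ALL
SELECT 'committed' AS state, query, 100 AS pct_complete, lines_scanned AS lines
FROM stl_load_commits
WHERE query = 160249;
```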



Redshift COPY Command errors and how to solve them Part-2

Introduction

In my previous post I tried to demonstrate what type of error may happen, and how to solve it, when we want to COPY data from an S3 region that differs from the region of the Redshift cluster. In this post I will cover a couple more COPY command exceptions and some possible solutions. To get an idea about the sample source file and the Redshift target table structure, please have a look at the "Preparing the environment to generate the error" section of my previous blog post.

Redshift copy command errors description:

ERROR: Load into table ‘xxxx’ failed.  Check ‘stl_load_errors’ system table for details.
Invalid digit, Value ‘O’, Pos 0, Type: Integer

How are Redshift copy command errors produced?

Now, once again, to load data into the orders table, execute the following COPY command (assuming the S3 bucket and Redshift cluster reside in the same region).

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' ;

This time we get another exception and the error message is:

ERROR: Load into table ‘xxxx’ failed.  Check ‘stl_load_errors’ system table for details.

This is a generic error; to get the details we need to query the "stl_load_errors" system table. You can query the table directly, but the query below is a bit more helpful: it picks up the most recent COPY error message generated by a data loading operation on the 'orders' table.

SELECT
  top 1 starttime,
  le.filename,
  le.line_number,
  le.colname,
  le.type AS column_datatype,
  le.raw_line AS error_record_line_data,
  le.raw_field_value AS attempt_to_insert_value ,
  err_reason
FROM stl_load_errors le
 INNER JOIN
  (SELECT TOP 1 id
   FROM STV_TBL_PERM tp
   WHERE name='orders') tp
 ON le.tbl=tp.id
ORDER BY le.starttime DESC

Here is the output of the query:
Copy Error
Figure: Copy error details

Reason of Redshift copy command errors:

First of all, have a look at the "err_reason" column. It displays "Invalid digit, Value 'O', Pos 0, Type: Integer". This error message alone is not that helpful for understanding what actually went wrong, so let us look at the other column values of the query output.

Have a look at the "line_number" column. It indicates that the error occurred when the COPY command attempted to insert the first line (value=1). In other words, the value of the "line_number" column indicates which line failed during the execution of the COPY command. As the "line_number" column value is 1, the COPY command failed to insert the first line of the source file (s3://sourcedatainorig/order.txt).

Now we need to see what value the first line of the source data file contains. To get this row's data, you do not need to open the source text file; you can get it from the "error_record_line_data" column of the query output above. The "error_record_line_data" column value is as follows:

“CustomerName   OrderID   OrderDate   City   UnitPrice   Quantity   ShipRegion”

Looking at line #1 of the source file, we can see that it is the file's header row. The COPY command failed to insert this header row's values into the corresponding columns of the orders table.

Let us do a bit more analysis to find out which specific column of the Redshift orders table refused to store a value during the data load. Have a look at the "colname", "column_datatype" and "attempt_to_insert_value" columns of the query output once again. From these columns we can see colname="orderid", column_datatype="int4" and attempt_to_insert_value="OrderID". It is now clear that the COPY command attempted to insert the character value "OrderID" into the integer-typed orderid column. As a result, the COPY command generated the "Invalid digit, Value 'O', Pos 0, Type: Integer" error and terminated the entire data loading operation.

So we now know that the header row is responsible for this data loading failure. But it is not the only cause of this type of error: besides the header row, if any column value of the source data file has a data type that is not compatible with (or not automatically convertible to) the target Redshift table's destination column, this error may occur.

How to solve it:

Ignoring the first row (header row) of the source file during COPY.

If you get an error message like "Invalid digit, Value 'O', Pos 0, Type: Integer", try executing your COPY command without the header row. Use the IGNOREHEADER parameter in your COPY command to ignore the first line of the data file. The COPY command will then look like below:

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' IGNOREHEADER 1;

With IGNOREHEADER 1, the COPY command will skip the top 1 row of the data file. Instead of 1, any integer value can be used to skip the top N rows.

Loading data using proper transformation logic that filters out unexpected, error-prone values.

You can use a popular third-party ETL tool to eliminate garbage values from your source data file.

Loading data into a temporary staging table whose columns are all of VARCHAR type.

If your source data file contains too many unexpected, error-prone or garbage values and you do not want to use a third-party ETL tool, you can try creating a temporary staging table with all VARCHAR columns. In that case, you are unlikely to get such an error message during the data load. After a successful load, use a custom SQL statement to clean the data and transfer it into the strongly typed main table.
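As a sketch of this approach (the staging table name, column widths and the regex filter below are my own illustration, not part of the original post):

```sql
-- Hypothetical all-VARCHAR staging table matching the orders layout
CREATE TABLE orders_staging
(
  customername VARCHAR(200),
  orderid      VARCHAR(200),
  orderdate    VARCHAR(200),
  city         VARCHAR(200),
  unitprice    VARCHAR(200),
  quantity     VARCHAR(200),
  shipregion   VARCHAR(200)
);

-- After COPYing the raw file into orders_staging, keep only rows whose
-- orderid is numeric (this skips the header row) and cast into the typed table
INSERT INTO orders
SELECT customername,
       orderid::INT,
       orderdate::DATE,
       city,
       unitprice::FLOAT,
       quantity::INT,
       shipregion
FROM orders_staging
WHERE orderid ~ '^[0-9]+$';
```

The trade-off is an extra pass over the data, but the load itself can no longer fail on type mismatches.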

I hope the information above will be helpful.




Redshift COPY Command errors and how to solve them Part-1

Introduction:

Amazon Redshift is an optimized, highly scalable data warehouse system. It offers fast and robust performance even for very large numbers of records. In a typical data warehouse, loading data from various sources is always a challenging task; Redshift addresses this challenge with the COPY command, the recommended way to load huge volumes of data. Redshift is a comparatively new technology, and people face various Redshift COPY command errors during bulk data loads. In this post, I have tried to address one such error along with a probable solution.

Redshift COPY Command errors description:

[Error Code: 0, SQL State: XX000]  ERROR: S3ServiceException: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint

Reason of the Redshift COPY Command errors:

The S3 bucket holding the COPY command's source file is not in the same region as the Redshift database cluster.

Preparing the environment to generate the Redshift COPY Command errors:

To demonstrate the COPY command and get acquainted with the different error messages, we will create a sample table like below:

CREATE TABLE orders
(
CustomerName     VARCHAR(50) NOT NULL,
orderid          INT NOT NULL,
orderdate        DATE NOT NULL,
city             VARCHAR(20) NOT NULL,
unitprice        FLOAT NOT NULL,
quantity         INT NOT NULL,
shipregion       VARCHAR(20) NULL
)

I have a sample text file with 100 records; you can download it from here. The file is tab delimited. I will upload this test file into an S3 bucket in order to generate and fix all possible COPY-related errors. The file looks like below:

Figure: Sample text file. 

How these Redshift COPY command errors are produced:

Now let us assume I have a cluster in the "us-east" region.
redshift cluster region
Figure: 1

But my source file resides in another (us-west) region.
s3 region
Figure: 2

Now, I will execute the following COPY command to load the sample file into the Redshift table.

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' ;

I got the following error message:
s3 exception
Figure 3

How to solve these Redshift COPY command errors:

  1. By creating the S3 bucket in the same region of the Redshift cluster.
  2. By specifying REGION parameter in the COPY command.

By creating the S3 bucket in the same region as the Redshift cluster.

  1. Go to the Redshift console and find out your cluster's region (see Figure 1). Assume my cluster is in the US-East region.
  2. Go to the S3 console and create a bucket in the same region.
  3. Put your file into the newly created bucket.
  4. Modify the COPY command's source file path accordingly and execute it.

By specifying REGION parameter in the COPY command.

If you really need to load data into a Redshift table from an S3 bucket that resides in another region, you need to specify the REGION parameter in your COPY command. To get the region name of a file in S3, follow the steps below:

  1. Go to S3 and open the bucket.
  2. Select the file (order.txt), right click on it and choose Properties.
  3. Have a look at the "Link" text in the properties pane on the right, like below:

s3 region
Figure: 4

  4. Get the region (us-west-2) from the link property and compose the COPY command with the "REGION" parameter like below:
COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' REGION 'us-west-2' ;

The COPY command now executes without the "S3ServiceException" error. Still, the COPY command does not succeed yet; I will try to cover the error details and possible resolutions in a subsequent post.