Amazon Redshift – How to get actual data loading time of COPY command

In my couple of former blog posts, I have discussed about some possible ways to solve Redshift COPY command error and also and some steps to monitor an ongoing COPY command progress. In today’s post I will cover how to actual data loading time of COPY command.

I have stated in my “Monitoring data loading progress” post that when a COPY command is triggered, actual data loading process does not start immediately into the target Redshift table (especially for a newly created table). In other words, after executing the COPY command, if you monitor the running queries using Redshift AWS Management Console or using custom SQL, you will find, instead of running the actual COPY command, Redshift does some analysis first. The following image shows some pre-data loading analysis query example.analyze
Figure 01: COPY analysis

These analysis includes target table analysis, compression analysis etc. When these analysis steps come to an end the actual COPY command commences.See below:copy runningFigure 02 : Actual data loading (COPY is running)

Now, as an ETL or BI developer, sometimes your manager or client may asks you to let them know how much time it takes to load a certain big data file into a table. Generally, you will login the Redshift cluster then go the “Load” tab and get the details of the COPY command like below:copy finish timeFigure 03 :Data loading (only COPY) finish time.

So, from the above image, you will get some handsome information about actual data loading time of COPY command. You will also get COPY start and end time and the duration of the query to determine actual data loading time of COPY command. Generally you will provide this information to your client or manager. But this is only the duration of actual data loading time not the entire query execution time. In order to get actual COPY duration you need to include the COPY ANALYZE timing along the actual COPY completion time.

From the CLI (see Figure 01) is it quite cumbersome to determine the COPY ANALYZE time. Because pre COPY analysis is not completed in a single step. The analysis process is done in multiple steps. So to get the more precise COPY ANALYZE timing, you need to aggregate the duration of all steps. And for that, you need to write a custom SQL statement using a Redshift system table to get the accurate data loading time. The following SQL statement returns the actual data loading time.

SELECT
SUM("analyze_compression")      AS "analyze_compression",
SUM("analyze_redshift_table")   AS "analyze_redshift_table",
SUM("actual_copy_command")      AS "actual_copy_command",
SUM("fetch_sample_data_of_target_table") AS "fetch_sample_data_of_target_table",
SUM("transaction_commit")       AS "transaction_commit"
FROM(
  SELECT
   CASE
     WHEN t.text LIKE 'analyze compression phase%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_compression",
   CASE
     WHEN t.text LIKE 'Analyze%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_redshift_table",
   CASE
     WHEN t.text LIKE 'copy%'
     THEN date_diff('sec',starttime,endtime )
     END AS "actual_copy_command_duration",
   CASE
     WHEN t.text LIKE 'padb_fetch_sample%'
     THEN date_diff('sec',starttime,endtime )
   END AS "fetch_sample_data_of_target_table",
   CASE
     WHEN t.text LIKE 'COMMIT%'
     THEN date_diff('sec',starttime,endtime )
   END AS "transaction_commit"
 FROM svl_statementtext t
 WHERE t.xid=(SELECT xid
              FROM stl_query
              WHERE query= 160249 - - Query ID is taken from Figure03)
) AS t

The above query uses “svl_statementtext ” system view. This view contains complete record of all of the SQL commands that have executed during the data loading session. In order to get the query metadata that run within a single transaction, we need to apply filter on xid column of the system view. “xid” is Transaction ID associated with the statement. Here, xid is taken from another system table named “stl_query” using the Query ID of COPY completion log of AWS Management Console (“Figure 03”).

Now, it is time to execute the above query and see the result. After executing the query, it returns the following output.actual data loading time of COPY commandFigure 04: Actual data loading time of COPY command.

From the above image we found:

Analyze compression time 49 seconds
Target table analysis time 111 seconds
Actual COPY (data loading) time 597 seconds
Fetch sample data (post copy verification) 111
Transaction commit time 0 seconds
Total 850 seconds

If you have a look on the only COPY command (data loading) execution time in “Figure 03” and the “Actual COPY (data loading) time” of the above table, you will see both of the time values are identical. From the image we see the time is 9m 56s and in the query output we find the COPY execution time is 597 seconds, that is also 9m 56s.

Finally, we got the actual data loading time of COPY command and the cumulative time is 14m 10s (850 seconds). Now you have more precise and accurate data loading time. I hope this information will be helpful.




This entry was posted in Amazon Redshift, Redshift COPY Command and tagged by Khorshed. Bookmark the permalink.

About Khorshed

Khorshed has 12 years of cumulative professional experience in areas of database administration, development, Business Intelligent, system appreciation, requirement analysis and software development (using .net technologies). Achieved MCITP in SQL Server in both SQL Server development and administration. He has vast knowledge in SQL Server, SSIS, SSRS, SSAS. He is also proficient in Amazon cloud services like Redshift, Data Pipeline, SNS, DynamoDB, EMR and many more.

2 thoughts on “Amazon Redshift – How to get actual data loading time of COPY command

Leave a Reply

Your email address will not be published. Required fields are marked *

thirteen − 6 =