In my couple of former blog posts, I have discussed about some possible ways to solve Redshift COPY command error and also and some steps to monitor an ongoing COPY command progress. In today’s post I will cover how to actual data loading time of COPY command.
I have stated in my “Monitoring data loading progress” post that when a COPY command is triggered, actual data loading process does not start immediately into the target Redshift table (especially for a newly created table). In other words, after executing the COPY command, if you monitor the running queries using Redshift AWS Management Console or using custom SQL, you will find, instead of running the actual COPY command, Redshift does some analysis first. The following image shows some pre-data loading analysis query example.
Figure 01: COPY analysis
These analysis includes target table analysis, compression analysis etc. When these analysis steps come to an end the actual COPY command commences.See below:Figure 02 : Actual data loading (COPY is running)
Now, as an ETL or BI developer, sometimes your manager or client may asks you to let them know how much time it takes to load a certain big data file into a table. Generally, you will login the Redshift cluster then go the “Load” tab and get the details of the COPY command like below:Figure 03 :Data loading (only COPY) finish time.
So, from the above image, you will get some handsome information about actual data loading time of COPY command. You will also get COPY start and end time and the duration of the query to determine actual data loading time of COPY command. Generally you will provide this information to your client or manager. But this is only the duration of actual data loading time not the entire query execution time. In order to get actual COPY duration you need to include the COPY ANALYZE timing along the actual COPY completion time.
From the CLI (see Figure 01) is it quite cumbersome to determine the COPY ANALYZE time. Because pre COPY analysis is not completed in a single step. The analysis process is done in multiple steps. So to get the more precise COPY ANALYZE timing, you need to aggregate the duration of all steps. And for that, you need to write a custom SQL statement using a Redshift system table to get the accurate data loading time. The following SQL statement returns the actual data loading time.
SELECT SUM("analyze_compression") AS "analyze_compression", SUM("analyze_redshift_table") AS "analyze_redshift_table", SUM("actual_copy_command") AS "actual_copy_command", SUM("fetch_sample_data_of_target_table") AS "fetch_sample_data_of_target_table", SUM("transaction_commit") AS "transaction_commit" FROM( SELECT CASE WHEN t.text LIKE 'analyze compression phase%' THEN date_diff('sec',starttime,endtime ) END AS "analyze_compression", CASE WHEN t.text LIKE 'Analyze%' THEN date_diff('sec',starttime,endtime ) END AS "analyze_redshift_table", CASE WHEN t.text LIKE 'copy%' THEN date_diff('sec',starttime,endtime ) END AS "actual_copy_command_duration", CASE WHEN t.text LIKE 'padb_fetch_sample%' THEN date_diff('sec',starttime,endtime ) END AS "fetch_sample_data_of_target_table", CASE WHEN t.text LIKE 'COMMIT%' THEN date_diff('sec',starttime,endtime ) END AS "transaction_commit" FROM svl_statementtext t WHERE t.xid=(SELECT xid FROM stl_query WHERE query= 160249 - - Query ID is taken from Figure03) ) AS t
The above query uses “svl_statementtext ” system view. This view contains complete record of all of the SQL commands that have executed during the data loading session. In order to get the query metadata that run within a single transaction, we need to apply filter on xid column of the system view. “xid” is Transaction ID associated with the statement. Here, xid is taken from another system table named “stl_query” using the Query ID of COPY completion log of AWS Management Console (“Figure 03”).
From the above image we found:
|Analyze compression time||49 seconds|
|Target table analysis time||111 seconds|
|Actual COPY (data loading) time||597 seconds|
|Fetch sample data (post copy verification)||111|
|Transaction commit time||0 seconds|
If you have a look on the only COPY command (data loading) execution time in “Figure 03” and the “Actual COPY (data loading) time” of the above table, you will see both of the time values are identical. From the image we see the time is 9m 56s and in the query output we find the COPY execution time is 597 seconds, that is also 9m 56s.
Finally, we got the actual data loading time of COPY command and the cumulative time is 14m 10s (850 seconds). Now you have more precise and accurate data loading time. I hope this information will be helpful.