Monitoring redshift copy command progress

COPY command is the recommended way to load data from source file into the Redshift table. Redshift COPY command offers fast data loading along with different facilities. Monitoring Redshift COPY command progress is one of them. This kind of file upload monitoring facility is unique in comparable to some other popular ETL tool. Such file bulk load progress can be showed to the application dashboard to ensure better user experience, especially while loading large number of records in Redshift table. COPY command progress can be monitored in various ways.

Monitoring Redshift COPY command progress using AWS Management Console

Redshift COPY command progress can easily be monitored using AWS Management Console. Through Management Console lots of information can be achieved regarding an ongoing COPY command. Management Console is really very informative in terms of monitoring COPY command progress. The following steps describe how to monitor COPY command progress using Management Console.

1. Go to Redshift cluster listing page.

ClusterListing

2. Select “Loads” tab from the Management Console
ClusterTabs

3. Find out the “Running” COPY command
RunningCopyCommand

4. Click on the “Load” column link that will take to the following COPY command details information page.
RunningCopyDetail

This page shows  number of rows already been loaded, the table and user name and the overall percent completed. These information are really useful at time of loading big data files.

Monitoring Redshift COPY command progress using custom SQL

At this moment ,we had an idea about how to the know running COPY command progress using Redshift Management Console. But what will you do when you need to display these information in your company or client dashboard? No worries, Redshift has a system table that traces COPY command progress metadata. The table is “stv_load_state”.

stv_load_state” stores the information of the running COPY command temporarily. That means, information can be found in that table as long as the COPY command is in running state only. When COPY command comes to an end, no information are available for that particular command in the table. The COPY command updates this table after loading every million records. So, if your table data file contains less than a million records, you are unlikely to get any information from this table.

Here, to test the COPY command progress, I have uploaded a big file data into a table named “teststaging”. The file has more than 15 million of rows. In the S3 bucket, I have kept the file in compressed GZIP format. The file name is “bigdatafile.gz”. To load the file into the staging table, I have triggered the following COPY command.

copy teststaging from 's3://standard-seeddata/bigdatafile.gz' credentials 'aws_access_key_id=<your access key>;aws_secret_access_key=<your secret access key>' 
delimiter 't' gzip;

In the client client tool, I have found the COPY command is in progress. After couple of seconds, to monitor the what is going on, I have executed the following query.

SELECT
 cpanalyze.query AS "queryid",
 cpanalyze.text AS querytext,
 cpanalyze.starttime,
 pct_complete,
 current_file,
 lines
FROM
 (
  SELECT
    i.query,
    i.text,
    i.starttime
  FROM
    STV_INFLIGHT i
  WHERE
    LOWER(i.text) LIKE 'copy%teststaging%'
  ) AS cpanalyze
LEFT OUTER JOIN
 (
  SELECT
   query,
   pct_complete,
   current_file,
   lines
 FROM
   stv_load_state
 WHERE
   bytes_loaded>0) AS cpload
 ON
   cpanalyze.query=cpload.query

The query returns the following result:

copyanalyze

Query result shows that instead of running actual COPY command, Redsift first ANALYZE the target table.

After a while (approximately 4-5 minutes later), I have executed the above monitoring SQL query once again, Now, I found the the actual COPY command is executing and the pct_complete (% completed) column shows the progress value like below:

copyprogress

It shows that that total 8% already completed and COPY command also uploads 2000000 rows so far from the file.

After sometimes, I executed the query and found the upload percentage is 80% and it managed to upload 13000000 rows. Finally, couple of minutes later, I executed the query once again and did not found any output. As I told earlier the “stv_load_state” stores data as long as the COPY command is in running state. So empty output indicates the COPY command is completed.

Not returning any rows in the “stv_load_state” or (getting rows to certain log tables) does not mean that COPY command successfully committed the rows into the target Redshift table. To be sure that COPY command finishes data loading, we need to execute the following query:

SELECT
 l.query,
 RTRIM(l.filename),
 q.starttime,
 q.endtime
FROM
 stl_load_commits l,
 stl_query q
WHERE
 l.query=q.query
AND l.query=-- Put query Id here
AND EXISTS
 (
  SELECT
  *
  FROM
  stl_utilitytext
  WHERE
  xid=q.xid
  AND RTRIM("text")='COMMIT');

If the above query returns any records, this indicates that COPY command is completed successfully. For more information about this have a look here.

Finally, from the above discussion we came to know that:

  1. After triggering COPY command, the actual data load does not start immediately. Redshift performs some analysis on the target table. After this analysis actual data loading commences.
  2. We need to query on “stv_load_state” table to get an on going COPY command progress.
  3. COPY command writes stv_load_state table after loading every one million rows from source file.So for relatively smaller file which total number of rows are less than a million, COPY progress may not be monitored using “stv_load_state” table for that file.
  4. After getting couple of progress monitoring output, if “stv_load_state” stops returning any rows, that indicates COPY command might have been finished.
  5. To be sure whether COPY command is actually finished or not we need to do another query in the STL_LOAD_COMMITS, STL_QUERY and STL_UTILITYTEXT table as described here.