Redshift COPY NOLOAD – Check data file validity without actually loading the data, facts and findings.

Redshift COPY command is the recommended and faster way to load data files from S3 to Redshift table. COPY has several parameters for different purposes. NOLOAD is one of them. When NOLOAD parameter is used in the COPY command, Redshift checks data file’s validity without inserting any records to the target table. About NOLOAD options, redshift developer guide says as follows:

Checks the validity of the data file without actually loading the data. Use the NOLOAD option to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD option is much faster than loading the data since it only parses the files.

As per above comments, if we need to check data file validity, NOLOAD option in COPY command offers faster performance. I have done some practical testing regarding this and in this post I am going to share my findings with you.

Now let us see how NOLOAD is used along with copy command?

A typical COPY command:

COPY testsales
FROM 's3://standard-seeddata/testsales_good.gz' 
credentials 'aws_access_key_id=;aws_secret_access_key=;' 
DELIMITER '\t' GZIP;

A COPY command with NOLOAD:

COPY testsales
FROM 's3://standard-seeddata/testsales_good.gz' 
credentials 'aws_access_key_id=;aws_secret_access_key=;' 
DELIMITER '\t' GZIP NOLOAD;

Environment preparation to check the data file validity.

To the check the validity of the data file, I have prepared 3 files and kept them in S3 location. Each file contains 100 million of rows. The files are like below:

SL # File Name Comments
1 testsales_good.gz This file does not contain any error records. The file can be uploaded successfully using copy command.
2 testsales_error.gz This file contains some error records at middle of the file. COPY command generates error during data load.
3 testsales_error_at_end.gz This file contains some error records at end of the file. COPY command generates error and fails to load this file.

In order to load the file I have created a table named “testsales” in Redshift using the command below:

CREATE TABLE testsales
(
	linenumber INTEGER,
	salesamount INTEGER,
	discount INTEGER
)

The table has 3 columns all are INTEGER type. In the error test files “salesamount” column contains error records like below:

Error Records in text file
Figure 01: Error records in text file

You can see, in the text file “salesamount” column contains dollar sign ($) end of the amount. As the column data type is Integer (not Currency), the table will generates error while inserting these values (1000$, 1500$).

Testing approaches to check data file validity:

Now you have an idea about the files and Redshift table. In order to check the validity of the data file and to know the impact of NOLOAD parameter, I have executed the COPY command considering the following scenario:

1. Using Good Data file:

  • Triggered COPY command to take the time without NOLOAD options. The command gets succeeded like below:COPY command using good data fileFigure 02: COPY Command using good data file (NOLOAD option not included)
  • Uploaded the data file using COPY command to take the time with NOLOAD option. This command also successfully uploaded 100 million records into the target table.COPY Command using good data file with NOLOAD optionFigure 03: COPY Command using good data file with NOLOAD option.

2. Using Error Data File which has error records at the middle of the file:

  • Tried to upload the error data file using COPY command without NOLOAD option. The COPY command generates error when attempted to load at line # 49939851 (error contains middle of the file) and rollback the data loading process. The following image shows that:Error data in middle without NOLOADFigure 04:Error data in middle – without NOLOAD.
  • Attempted to upload the data file using COPY command with NOLOAD option. This attempt also failed because the error records contains at middle of the file (line # 49939851). So no records are loaded in the table because only “COPY ANALYE”  command is executed. Have a look on the below image:Error data at the middle with NOLOADFigure 05:Error data in middle – with NOLOAD (COPY ANALYZE is executed)

3. Using Error Data File which has error records at the end of the file:

  • Tried to upload the error data file using COPY command without NOLOAD option. This time COPY command generates error at the end of the file (line number = 99996861). No records are loaded. The error message image is as follows:Error data in at the end without NOLOADFigure 06: Error data at the end – without NOLOAD
  • Attempted to upload the error data file (error records at the end of file) using COPY command with NOLOAD option. COPY command gets failed as usual at line 99996861 (near the maximum 100000000). The error image  is  like below:Error data in at the end with NOLOADFigure 07:Error data at the end – with NOLOAD (COPY ANALYZE is executed)

At this stage, COPY command execution is finished in different scenario to get an idea about the way to check the validity of the data file using NOLOAD option. In the above images, you might have been noticed the COPY “Run Time”. For example in Figure 06, you find “Run Time”= 58.84 seconds (58 seconds, 84 milliseconds). This “Run Time” is the COPY command. This is should not be considered as exact duration.

I have stated in my earlier posts (Redshift COPY command- How to get actual data loading time) that when a COPY command is triggered, actual data loading process does not start immediately. Redshift performs some analysis on the target redshift table. In order to get the actual data loading duration we need to execute a SQL script. The SQL script is present in that post. I have executed that SQL statement using the “Query Id” that is generated after COPY command execution for each scenario. For example in “Figure 05” we got a Query Id=222494 when tried to upload the error (error records at end of the source file) data file using COPY command without NOLOAD option.
So the SQL statement using Query Id=222494 is as follows:

SELECT
SUM("analyze_compression")      AS "analyze_compression",
SUM("analyze_redshift_table")   AS "analyze_redshift_table",
SUM("actual_copy_command")      AS "actual_copy_command",
SUM("fetch_sample_data_of_target_table") AS "fetch_sample_data_of_target_table",
SUM("transaction_commit")       AS "transaction_commit"
FROM(
  SELECT
   CASE
     WHEN t.text LIKE 'analyze compression phase%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_compression",
   CASE
     WHEN t.text LIKE 'Analyze%' or lower(t.text) LIKE 'copy analyze%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_redshift_table",
   CASE
     WHEN lower(t.text) LIKE 'copy%from%'
     THEN date_diff('sec',starttime,endtime )
     END AS "actual_copy_command",
   CASE
     WHEN t.text LIKE 'padb_fetch_sample%'
     THEN date_diff('sec',starttime,endtime )
   END AS "fetch_sample_data_of_target_table",
   CASE
     WHEN t.text LIKE 'COMMIT%'
     THEN date_diff('sec',starttime,endtime )
   END AS "transaction_commit"
 FROM svl_statementtext t
 WHERE t.xid=(SELECT xid
              FROM stl_query
              WHERE query= 222494)
) AS t

After executing the SQL statement for all Query IDs of all scenario, I prepared the following summary of the duration matrix.

noload matrixFigure 07: Duration matrix of different COPY attempt.

From the above information my observations are as follows:
For good file (no error records in file):

  1. For Good file “COPY Command using NOLOAD” is slower than “COPY without NOLOAD” parameter.

If source data file contains no error records, NOLOAD parameter performs slower than not to use the NOLOAD option in COPY command. For example, in the above table, if you have a look on the total duration of “Good File (No error records in file)”, you will find:

  • “COPY Command WITHOUT NOLOAD”, total duration = 69 seconds
  • “COPY Command WITH NOLOAD”, total duration = 96 seconds so “COPY Command WITH NOLOAD” takes 96-69=33 seconds more.

2.  For Good File, “COPY using NOLOAD” consumes more time for table analysis.

When “NOLOAD” option is used system spends much time (48 seconds) for table analysis (For example: COPY ANALYZE testsales). It also takes similar time (48 seconds) to execute the actual COPY command. Using NOLOAD in COPY command does not insert any records to the target table.

Error records at the middle or at the end of the file:

  1. In both cases, COPY command using NOLOAD performs faster. I took 24 seconds when NOLOAD is used and it took 31 seconds when NOLOAD is not used for the data file that has error records at the middle of the file. Similarly, for the file that contains error records at the end, 45 seconds are required with NOLOAD and 60 seconds without using NOLOAD parameter in the COPY command.
  2. When source data file contains error records and NOLOAD option is used to check the validity of the data file, Redshift only executes “COPY ANALYZE” command. As a result only table analysis is performed, no actual COPY command is executed. So “Actual Copy Command Duration” is showed “NULL” in the above duration matrix table.

Finally, from the above testing steps, the summarized findings are as follows:

  • For data file that contains no error records, COPY command with NOLOAD option took more time than executing only COPY command without NOLOAD.
  • For good data file, if NOLOAD option is used, both “COPY ANALYZE” and “COPY … FROM” command gets executed. Though “COPY … FROM” does not import any data but it consumes time.
  • For data file that contains error records, “COPY command without NOLOAD” executes both “COPY ANALYZE” and “COPY … FROM” command. In that case “COPY ANALYZE” duration is low but “COPY … FROM” continues exaction until reach towards the error line.
  • For error records containing file, “COPY command with NOLOAD” is faster. It only performs analysis on the target table by issuing “COPY ANALYZE” command. “COPY … FROM” command never gets executed.
  • Performance measurement to check the validity of data file  using “NOLOAD”, depends on where the error records exists in the data file. If error records exists beginning of the file, data file validity checking is finished earlier. In worst case scenario, if error records resides near end of the file, data file validity checking duration is much higher.

Special note:

Please be noted this test results, observations and findings related to “NOLOAD” option are very specific in a particular scenario. Test outcome may heavily vary for different data files, number of columns of the target table and data type of the table columns. So, if you want to check data file validity using “NOLOAD” option, please test it using your domain specific source data file in your own environment.