Amazon Redshift – Identity column SEED-STEP behavior with COPY command.

In my previous post, I demonstrated how an identity column behaves with different INSERT statements. In today’s post we will look at identity column SEED-STEP behavior with the COPY command: in other words, how identity values are generated when a file is loaded from S3 into a Redshift table that has an auto-increment identity column.

As we already know, an identity column does not guarantee that the values in the column will be consecutive in all cases. As stated in the earlier post, we often need consecutive or strictly sequential ID numbers; in that scenario, we need to take some additional steps to maintain the sequence of ID values and comply with the business need. In today’s post we will see in which cases identity column SEED-STEP with the COPY command does not generate consecutive identity values, and later in the post we will look at some techniques to maintain the order of the identity values.

Environment to test Identity column SEED-STEP behavior with COPY command.

A table like below:

CREATE TABLE custorders
(
linenumber 	 INT IDENTITY(1,1),
customername     VARCHAR(50) NOT NULL,
orderid          INT NOT NULL,
orderdate        TIMESTAMP NOT NULL,
city             VARCHAR(20) NOT NULL,
unitprice        FLOAT NOT NULL,
quantity         INT NOT NULL,
shipregion       VARCHAR(20) NULL
);

Two text files and a manifest file in the S3 bucket, used to populate the above table with the COPY command.

Figure 01: Test data and manifest files in S3.

The “order_2001.txt” file contains sales records for the year 2001, and order_2002.txt has sales data for the year 2002. To facilitate loading multiple data files in parallel with a single COPY command, I created a manifest file named “order.manifest”.

All testing was done in a single-node dw2.large cluster. Our environment is ready; now let us jump to the main test.

Identity column SEED-STEP behavior with COPY command while loading a single file.

In order to see the identity values generated in the above table, I executed the below COPY command to load the order_2002.txt file into the empty custorders table.

COPY custorders FROM 's3://standard-seeddata/order_2002.txt' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1; 

Let’s see the identity values generated by the COPY command by selecting the top 100 records.

SELECT TOP 100 * 
FROM custorders
ORDER BY
 linenumber;

The above select statement produced the below results:

Figure 02: Identity values generated with COPY command.

From the above image we find the following discrepancies in identity column SEED-STEP behavior with the COPY command.

  • Since we loaded the file into the empty custorders table, per the identity definition “linenumber INT IDENTITY(1,1)”, the first value of the linenumber column should be 1, then 2, then 3, and so on. But the first value inserted is, surprisingly, 4312.
  • Furthermore, according to the identity STEP, each subsequent row’s value should increase by one step, so the next value should be 4312+1=4313. But the values increase by 2 steps: the first row’s linenumber is 4312, the second row’s is 4314, the third 4316, and so on.

Why does the identity start from 4312 instead of 1?

To find the reason behind these issues with “Identity column SEED-STEP behavior with COPY command” I had a look at the Redshift developer guide, but did not find any clue. After a while I noticed that there are 2156 records in total in the source text file (order_2002.txt). The first generated identity value in the table is twice the total record count: 2156*2=4312. So why does this happen? To find the answer, I tried the COPY command with different parameters. Finally, I got a clue when I ran the COPY command with the “COMPUPDATE OFF” option. The COPY command is now as follows:

COPY custorders FROM 's3://standard-seeddata/order_2002.txt' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1
COMPUPDATE OFF;

And the above command produces the below identity values:
Figure 03: Identity values generated by COPY with the “COMPUPDATE OFF” parameter.

Now it produces a much better result, right? In the previous attempt the identity value started from an arbitrary 4312, but this time it at least starts from 1. Still, the identity values are incremented by 2 steps when they should be incremented by 1. We will discuss that issue later; first, let me explain why running the COPY command without the “COMPUPDATE OFF” parameter generated an initial identity value of 4312 instead of 1 in the first attempt (see Figure 02; 4312 is twice the total record count).
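To see the 2-step increments directly, we can compare each identity value with the previous one. The sketch below (using the example custorders table from this post) is one way to do it:

```sql
-- Sketch: show the gap between consecutive identity values.
-- A step_from_previous of 2 confirms the behavior described above.
SELECT linenumber,
       linenumber - LAG(linenumber) OVER (ORDER BY linenumber) AS step_from_previous
FROM   custorders
ORDER  BY linenumber
LIMIT  10;
```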

The “COMPUPDATE” parameter controls whether COPY automatically decides the optimal compression encodings for each column of the target table. If the COPY command is executed without the “COMPUPDATE” option, COPY applies automatic compression when loading into an empty target table. For this reason the COPY command completes in two passes: during the first pass, identity values are incremented internally, and in the second pass they are incremented further while the records are committed to the target table. This could be the reason why we did not get the initial identity value we expected.

Why does the identity value increment by 2 instead of 1?

If you look at both “Figure 02” and “Figure 03”, you will find the identity values are incremented by 2 instead of 1. I have a single node in the Redshift cluster, and the node has two slices. As per the Redshift developer guide, data is loaded into the slices in parallel when multiple files are loaded using a single COPY command, whereas COPY loads data sequentially when a single file is specified. During the data load, COPY increments the identity in such a manner that the identity values of one slice do not overlap with the values of another slice. For this purpose COPY generates identity values on an odd/even basis, like below:

For slice one: 1, 3, 5, … n
For slice two: 2, 4, 6, … n

Now let us do another test to validate this. As stated earlier in the “Environment” section (see Figure 01), we have three files: order_2001.txt contains sales data for the year 2001, order_2002.txt has sales data for the year 2002, and a manifest file is used to upload both files with a single COPY command. The manifest file looks like below:

{
  "entries": [    
    {"url":"s3://standard-seeddata/order_2001.txt", "mandatory":true},
    {"url":"s3://standard-seeddata/order_2002.txt", "mandatory":true}
  ]
}

And the COPY command is as follows:

COPY custorders FROM 's3://standard-seeddata/order.manifest' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1 manifest COMPUPDATE OFF;

After executing the above COPY command, I got results like below:
Figure 04: Identity values generated by COPY with manifest option.

Splendid! Identity column SEED-STEP behavior with the COPY command is now as expected: it generates an initial value of 1 and the subsequent values are incremented by 1 as well. There are no gaps among the identity values, and the per-slice identity allocation is also clear to us. Executing the following query gives more insight into the per-slice record and identity value distribution.

SELECT  slice,
         num_values  AS total_rows_per_slice,
         minvalue    AS min_identity_val,
         maxvalue    AS max_identity_val
 FROM   svv_diskusage
 WHERE  NAME = 'custorders'
         AND col = 0
 ORDER BY
         slice,col;

The above query generates the following output:
Figure 05: Per slice identity column distribution

From the above image, we get a clear idea of how Redshift determines slice-wise identity values. It seems COPY loads both files in parallel and allocates identity values as follows:

Loads “order_2002.txt” into slice one and allocates identities 1, 3, 5, … up to 4309
Loads “order_2001.txt” into slice two and allocates identities 2, 4, 6, … up to 4310

Finally, from the aforementioned test, we learned that loading files in parallel using the COPY command with the “COMPUPDATE OFF” option may generate the expected identity values, complying with identity column SEED-STEP behavior. But in a different scenario (a multiple-node cluster) the identity column SEED-STEP behavior with the COPY command might differ. In that case, the ROW_NUMBER window function can be used to generate sequential values. Another option is to ask the file provider (or client) to include a line number in the file if this is a strict business need; the file can then be loaded using the COPY command with the “EXPLICIT_IDS” parameter into an identity-enabled Redshift table.
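For the multi-node case mentioned above, a ROW_NUMBER-based query can derive gap-free sequence numbers at query time. The sketch below reuses the column names from the custorders example; treat it as illustrative rather than a fixed recipe:

```sql
-- Sketch: derive strictly sequential line numbers regardless of any
-- gaps the identity column may contain after a parallel COPY.
SELECT ROW_NUMBER() OVER (ORDER BY linenumber) AS seq_no,
       customername,
       orderid,
       orderdate
FROM   custorders;
```

If the business requirement is strict, the result of this query can be written to a new table (for example with CREATE TABLE AS) so the sequence is materialized rather than recomputed on every read.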

References:
1. https://forums.aws.amazon.com/thread.jspa?messageID=564576
2. http://docs.aws.amazon.com/redshift/latest/dg/r_SVV_DISKUSAGE.html




Redshift COPY NOLOAD – Check data file validity without actually loading the data, facts and findings.

The Redshift COPY command is the recommended and faster way to load data files from S3 into a Redshift table. COPY has several parameters for different purposes, and NOLOAD is one of them. When the NOLOAD parameter is used in the COPY command, Redshift checks the data file’s validity without inserting any records into the target table. About the NOLOAD option, the Redshift developer guide says:

Checks the validity of the data file without actually loading the data. Use the NOLOAD option to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD option is much faster than loading the data since it only parses the files.

As per the above description, if we need to check a data file’s validity, the NOLOAD option of the COPY command offers faster performance. I have done some practical testing on this, and in this post I am going to share my findings with you.

Now let us see how NOLOAD is used along with the COPY command.

A typical COPY command:

COPY testsales
FROM 's3://standard-seeddata/testsales_good.gz' 
credentials 'aws_access_key_id=;aws_secret_access_key=;' 
DELIMITER '\t' GZIP;

A COPY command with NOLOAD:

COPY testsales
FROM 's3://standard-seeddata/testsales_good.gz' 
credentials 'aws_access_key_id=;aws_secret_access_key=;' 
DELIMITER '\t' GZIP NOLOAD;

Environment preparation to check the data file validity.

To check the validity of the data file, I prepared three files and kept them in an S3 location. Each file contains 100 million rows. The files are as below:

SL #  File Name                  Comments
1     testsales_good.gz          Contains no error records; the file loads successfully with the COPY command.
2     testsales_error.gz         Contains some error records in the middle of the file; the COPY command generates an error during the load.
3     testsales_error_at_end.gz  Contains some error records at the end of the file; the COPY command generates an error and fails to load it.

In order to load the files, I created a table named “testsales” in Redshift using the command below:

CREATE TABLE testsales
(
	linenumber INTEGER,
	salesamount INTEGER,
	discount INTEGER
);

The table has 3 columns, all of INTEGER type. In the error test files the “salesamount” column contains error records like below:

Figure 01: Error records in text file

As you can see, in the text file the “salesamount” column has a dollar sign ($) at the end of the amount. As the column data type is INTEGER (not a currency type), the load generates an error while inserting these values (1000$, 1500$).
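When such a load fails, the rejected value and the reason are recorded in the stl_load_errors system table. A minimal sketch for inspecting the most recent failures (assuming the testsales table above):

```sql
-- Sketch: inspect recent load errors, e.g. the rejected '1000$' value.
SELECT starttime,
       filename,
       line_number,
       colname,
       raw_field_value,
       err_reason
FROM   stl_load_errors
ORDER  BY starttime DESC
LIMIT  10;
```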

Testing approaches to check data file validity:

Now you have an idea about the files and the Redshift table. In order to check the validity of the data files and to see the impact of the NOLOAD parameter, I executed the COPY command in the following scenarios:

1. Using Good Data file:

  • Triggered the COPY command without the NOLOAD option to take the time. The command succeeded like below:
Figure 02: COPY command using good data file (NOLOAD option not included).
  • Ran the COPY command with the NOLOAD option to take the time. This command also completed successfully against the 100-million-record file (with NOLOAD, no records are inserted into the target table).
Figure 03: COPY command using good data file with NOLOAD option.

2. Using the error data file that has error records in the middle of the file:

  • Tried to upload the error data file using the COPY command without the NOLOAD option. The COPY command generated an error when it attempted to load line # 49939851 (the error records are in the middle of the file) and rolled back the data loading process. The following image shows that:
Figure 04: Error data in middle – without NOLOAD.
  • Attempted to upload the data file using the COPY command with the NOLOAD option. This attempt also failed because the error records are in the middle of the file (line # 49939851). No records were loaded into the table because only the “COPY ANALYZE” command was executed. Have a look at the image below:
Figure 05: Error data in middle – with NOLOAD (COPY ANALYZE is executed).

3. Using Error Data File which has error records at the end of the file:

  • Tried to upload the error data file using the COPY command without the NOLOAD option. This time the COPY command generated an error at the end of the file (line number = 99996861). No records were loaded. The error message image is as follows:
Figure 06: Error data at the end – without NOLOAD.
  • Attempted to upload the same error data file using the COPY command with the NOLOAD option. The COPY command failed, as before, at line 99996861 (near the maximum of 100000000). The error image is like below:
Figure 07: Error data at the end – with NOLOAD (COPY ANALYZE is executed).

At this stage, the COPY command has been executed in the different scenarios to get an idea of how to check data file validity using the NOLOAD option. In the above images, you might have noticed the COPY “Run Time”. For example, in Figure 06 you find “Run Time” = 58.84 seconds. This “Run Time” belongs to the COPY command alone, and it should not be considered the exact duration.

I have stated in an earlier post (Redshift COPY command – how to get actual data loading time) that when a COPY command is triggered, the actual data loading process does not start immediately; Redshift first performs some analysis on the target table. In order to get the actual data loading duration we need to execute a SQL script, which is presented in that post. I executed that SQL statement, using the “Query Id” generated after the COPY command execution, for each scenario. For example, in “Figure 05” we got Query Id = 222494 when we tried to upload the error data file (error records in the middle of the source file) using the COPY command with the NOLOAD option.
So the SQL statement using Query Id = 222494 is as follows:

SELECT
SUM("analyze_compression")      AS "analyze_compression",
SUM("analyze_redshift_table")   AS "analyze_redshift_table",
SUM("actual_copy_command")      AS "actual_copy_command",
SUM("fetch_sample_data_of_target_table") AS "fetch_sample_data_of_target_table",
SUM("transaction_commit")       AS "transaction_commit"
FROM(
  SELECT
   CASE
     WHEN t.text LIKE 'analyze compression phase%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_compression",
   CASE
     WHEN t.text LIKE 'Analyze%' or lower(t.text) LIKE 'copy analyze%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_redshift_table",
   CASE
     WHEN lower(t.text) LIKE 'copy%from%'
     THEN date_diff('sec',starttime,endtime )
     END AS "actual_copy_command",
   CASE
     WHEN t.text LIKE 'padb_fetch_sample%'
     THEN date_diff('sec',starttime,endtime )
   END AS "fetch_sample_data_of_target_table",
   CASE
     WHEN t.text LIKE 'COMMIT%'
     THEN date_diff('sec',starttime,endtime )
   END AS "transaction_commit"
 FROM svl_statementtext t
 WHERE t.xid=(SELECT xid
              FROM stl_query
              WHERE query= 222494)
) AS t

After executing the SQL statement for the Query IDs of all scenarios, I prepared the following summary duration matrix.

Figure 08: Duration matrix of the different COPY attempts.

From the above information my observations are as follows:
For good file (no error records in file):

  1. For a good file, “COPY with NOLOAD” is slower than “COPY without NOLOAD”.

If the source data file contains no error records, the NOLOAD parameter performs slower than not using it at all. For example, in the above table, if you look at the total durations for “Good File (no error records in file)”, you will find:

  • “COPY Command WITHOUT NOLOAD”, total duration = 69 seconds
  • “COPY Command WITH NOLOAD”, total duration = 96 seconds, so “COPY Command WITH NOLOAD” takes 96-69=27 seconds more.

2. For a good file, “COPY using NOLOAD” consumes more time for table analysis.

When the “NOLOAD” option is used, the system spends considerable time (48 seconds) on table analysis (for example: COPY ANALYZE testsales). It also takes a similar time (48 seconds) to execute the actual COPY command, even though using NOLOAD means no records are inserted into the target table.

Error records in the middle or at the end of the file:

  1. In both cases, the COPY command with NOLOAD performs faster. It took 24 seconds with NOLOAD and 31 seconds without it for the data file that has error records in the middle. Similarly, for the file that contains error records at the end, 45 seconds were required with NOLOAD and 60 seconds without it.
  2. When the source data file contains error records and the NOLOAD option is used to check the file’s validity, Redshift only executes the “COPY ANALYZE” command. As a result only table analysis is performed and no actual COPY command is executed, so the “Actual Copy Command Duration” shows “NULL” in the duration matrix above.

Finally, from the above testing steps, the summarized findings are as follows:

  • For a data file that contains no error records, the COPY command with the NOLOAD option took more time than the plain COPY command without NOLOAD.
  • For a good data file with NOLOAD, both “COPY ANALYZE” and “COPY … FROM” get executed. Though “COPY … FROM” does not import any data, it still consumes time.
  • For a data file that contains error records, “COPY without NOLOAD” executes both “COPY ANALYZE” and “COPY … FROM”. In that case the “COPY ANALYZE” duration is low, but “COPY … FROM” continues executing until it reaches the error line.
  • For a file containing error records, “COPY with NOLOAD” is faster: it only performs analysis on the target table by issuing “COPY ANALYZE”, and “COPY … FROM” never gets executed.
  • The performance of a “NOLOAD” validity check depends on where the error records sit in the data file. If the error records exist at the beginning of the file, the validity check finishes earlier; in the worst case, if the error records reside near the end of the file, the check takes much longer.
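Putting the findings together, a NOLOAD validity check followed by an error inspection could be sketched as below (the bucket name, file name, and credential placeholders follow the examples above):

```sql
-- Sketch: validate the file without loading it...
COPY testsales
FROM 's3://standard-seeddata/testsales_error.gz'
credentials 'aws_access_key_id=<your access key>;aws_secret_access_key=<your secret key>'
DELIMITER '\t' GZIP NOLOAD;

-- ...then, if the COPY above fails, see which lines were rejected and why.
SELECT line_number, colname, raw_field_value, err_reason
FROM   stl_load_errors
ORDER  BY starttime DESC
LIMIT  5;
```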

Special note:

Please note that these test results, observations, and findings related to the “NOLOAD” option are specific to one particular scenario. The outcome may vary heavily for different data files, numbers of columns in the target table, and column data types. So, if you want to check data file validity using the “NOLOAD” option, please test it with your own domain-specific source data files in your own environment.




Amazon Redshift – How to get actual data loading time of COPY command

In a couple of former blog posts, I discussed some possible ways to solve Redshift COPY command errors and some steps to monitor an ongoing COPY command’s progress. In today’s post I will cover how to get the actual data loading time of a COPY command.

I have stated in my “Monitoring data loading progress” post that when a COPY command is triggered, the actual data loading into the target Redshift table does not start immediately (especially for a newly created table). In other words, after executing the COPY command, if you monitor the running queries using the Redshift AWS Management Console or custom SQL, you will find that instead of running the actual COPY command, Redshift does some analysis first. The following image shows some pre-data-loading analysis query examples.
Figure 01: COPY analysis

These analyses include target table analysis, compression analysis, etc. When these analysis steps come to an end, the actual COPY command commences. See below:
Figure 02: Actual data loading (COPY is running).

Now, as an ETL or BI developer, sometimes your manager or client may ask you how much time it takes to load a certain big data file into a table. Generally, you would log in to the Redshift cluster, go to the “Loads” tab, and get the details of the COPY command like below:
Figure 03: Data loading (only COPY) finish time.

So, from the above image, you will get some useful information about the data loading time of the COPY command, including the COPY start and end times and the duration of the query. Generally you would provide this information to your client or manager. But this is only the duration of the actual data loading, not the entire query execution time. To get the full COPY duration, you need to include the COPY ANALYZE timing along with the actual COPY completion time.

From the console (see Figure 01) it is quite cumbersome to determine the COPY ANALYZE time, because the pre-COPY analysis is not completed in a single step; the analysis process runs in multiple steps. So to get a more precise COPY ANALYZE timing, you need to aggregate the durations of all the steps. For that, you need to write a custom SQL statement against a Redshift system view. The following SQL statement returns the actual data loading time.

SELECT
SUM("analyze_compression")      AS "analyze_compression",
SUM("analyze_redshift_table")   AS "analyze_redshift_table",
SUM("actual_copy_command")      AS "actual_copy_command",
SUM("fetch_sample_data_of_target_table") AS "fetch_sample_data_of_target_table",
SUM("transaction_commit")       AS "transaction_commit"
FROM(
  SELECT
   CASE
     WHEN t.text LIKE 'analyze compression phase%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_compression",
   CASE
     WHEN t.text LIKE 'Analyze%'
     THEN date_diff('sec',starttime,endtime )
   END AS "analyze_redshift_table",
   CASE
     WHEN t.text LIKE 'copy%'
     THEN date_diff('sec',starttime,endtime )
     END AS "actual_copy_command",
   CASE
     WHEN t.text LIKE 'padb_fetch_sample%'
     THEN date_diff('sec',starttime,endtime )
   END AS "fetch_sample_data_of_target_table",
   CASE
     WHEN t.text LIKE 'COMMIT%'
     THEN date_diff('sec',starttime,endtime )
   END AS "transaction_commit"
 FROM svl_statementtext t
 WHERE t.xid=(SELECT xid
              FROM stl_query
              WHERE query= 160249) -- Query ID is taken from Figure 03
) AS t;

The above query uses the “svl_statementtext” system view, which contains a complete record of all the SQL commands executed during the data loading session. To get the metadata of the queries run within a single transaction, we filter on the xid column of the system view; “xid” is the transaction ID associated with the statement. Here, the xid is taken from another system table, “stl_query”, using the Query ID from the COPY completion log in the AWS Management Console (“Figure 03”).

Now it is time to execute the above query and see the result. After executing the query, it returns the following output.
Figure 04: Actual data loading time of COPY command.

From the above image we found:

Analyze compression time: 49 seconds
Target table analysis time: 111 seconds
Actual COPY (data loading) time: 597 seconds
Fetch sample data (post-copy verification): 111 seconds
Transaction commit time: 0 seconds
Total: 850 seconds

If you compare the COPY-only (data loading) execution time in “Figure 03” with the “Actual COPY (data loading) time” in the above table, you will see both values agree: the image shows 9m 56s, and the query output shows 597 seconds, which is the same duration allowing for rounding to whole seconds.

Finally, we got the actual data loading time of the COPY command, and the cumulative time is 14m 10s (850 seconds). Now you have a more precise and accurate data loading time. I hope this information will be helpful.




Monitoring Redshift COPY command progress

The COPY command is the recommended way to load data from a source file into a Redshift table. The Redshift COPY command offers fast data loading along with several facilities, and monitoring COPY command progress is one of them. This kind of load-monitoring facility is unusual compared to some other popular ETL tools. Such bulk load progress can be shown on an application dashboard to ensure a better user experience, especially when loading a large number of records into a Redshift table. COPY command progress can be monitored in various ways.

Monitoring Redshift COPY command progress using AWS Management Console

Redshift COPY command progress can easily be monitored using the AWS Management Console, through which a lot of information about an ongoing COPY command can be obtained. The following steps describe how to monitor COPY command progress using the Management Console.

1. Go to Redshift cluster listing page.

ClusterListing

2. Select “Loads” tab from the Management Console
ClusterTabs

3. Find out the “Running” COPY command
RunningCopyCommand

4. Click the “Load” column link, which takes you to the following COPY command details page.
RunningCopyDetail

This page shows the number of rows already loaded, the table and user name, and the overall percent completed. This information is really useful when loading big data files.

Monitoring Redshift COPY command progress using custom SQL

By this point, we have an idea of how to monitor running COPY command progress using the Redshift Management Console. But what will you do when you need to display this information in your company’s or client’s dashboard? No worries: Redshift has a system table that traces COPY command progress metadata. The table is “stv_load_state”.

“stv_load_state” stores the information of a running COPY command temporarily; information can be found in the table only as long as the COPY command is running. When the COPY command comes to an end, no information is available for that particular command in the table. The COPY command updates this table after loading every million records, so if your data file contains fewer than a million records, you are unlikely to get any information from this table.

Here, to test COPY command progress, I uploaded a big data file into a table named “teststaging”. The file has more than 15 million rows. In the S3 bucket I kept the file in compressed GZIP format, named “bigdatafile.gz”. To load the file into the staging table, I triggered the following COPY command.

copy teststaging from 's3://standard-seeddata/bigdatafile.gz' credentials 'aws_access_key_id=<your access key>;aws_secret_access_key=<your secret access key>' 
delimiter '\t' gzip;

In the client tool, I found the COPY command in progress. After a couple of seconds, to monitor what was going on, I executed the following query.

SELECT
 cpanalyze.query AS "queryid",
 cpanalyze.text AS querytext,
 cpanalyze.starttime,
 pct_complete,
 current_file,
 lines
FROM
 (
  SELECT
    i.query,
    i.text,
    i.starttime
  FROM
    STV_INFLIGHT i
  WHERE
    LOWER(i.text) LIKE 'copy%teststaging%'
  ) AS cpanalyze
LEFT OUTER JOIN
 (
  SELECT
   query,
   pct_complete,
   current_file,
   lines
 FROM
   stv_load_state
 WHERE
   bytes_loaded>0) AS cpload
 ON
   cpanalyze.query=cpload.query;

The query returns the following result:

copyanalyze

The query result shows that instead of running the actual COPY command, Redshift first analyzes the target table.

After a while (approximately 4-5 minutes later) I executed the above monitoring SQL query once again. This time I found the actual COPY command executing, and the pct_complete (% completed) column shows the progress value like below:

copyprogress

It shows that 8% is already completed and that the COPY command has uploaded 2000000 rows so far from the file.

After some time, I executed the query and found the upload percentage was 80%, with 13000000 rows uploaded. Finally, a couple of minutes later, I executed the query once again and did not find any output. As I said earlier, “stv_load_state” stores data only while the COPY command is running, so empty output indicates the COPY command has completed.

However, “stv_load_state” returning no rows does not by itself mean that the COPY command successfully committed the rows into the target Redshift table. To be sure the COPY command finished loading the data, we need to execute the following query:

SELECT
 l.query,
 RTRIM(l.filename),
 q.starttime,
 q.endtime
FROM
 stl_load_commits l,
 stl_query q
WHERE
 l.query=q.query
AND l.query=-- Put query Id here
AND EXISTS
 (
  SELECT
  *
  FROM
  stl_utilitytext
  WHERE
  xid=q.xid
  AND RTRIM("text")='COMMIT');

If the above query returns any records, it indicates that the COPY command completed successfully. For more information, have a look here.

Finally, from the above discussion we came to know that:

  1. After triggering a COPY command, the actual data load does not start immediately; Redshift first performs some analysis on the target table, after which the actual data loading commences.
  2. We need to query the “stv_load_state” table to get an ongoing COPY command’s progress.
  3. The COPY command writes to stv_load_state after loading every one million rows from the source file. So for a relatively small file whose total number of rows is less than a million, COPY progress may not be monitored using “stv_load_state”.
  4. After getting a couple of progress monitoring outputs, if “stv_load_state” stops returning any rows, that indicates the COPY command might have finished.
  5. To be sure whether the COPY command actually finished or not, we need to run another query on the STL_LOAD_COMMITS, STL_QUERY, and STL_UTILITYTEXT tables as described here.



Redshift COPY Command errors and how to solve them Part-2

Introduction

In my previous post I tried to demonstrate what type of error may happen, and how to solve it, when we COPY data from an S3 region that differs from the region of the Redshift cluster. In this post I will cover a couple more COPY command exceptions and some possible solutions. To get an idea about the sample source file and the Redshift target table structure, please have a look at the “Preparing the environment to generate the error” section of my previous blog post.

Redshift COPY command error description:

ERROR: Load into table ‘xxxx’ failed.  Check ‘stl_load_errors’ system table for details.
Invalid digit, Value ‘O’, Pos 0, Type: Integer

How these Redshift COPY command errors are produced:

Now, once again, to load data into the orders table, execute the following COPY command (assuming the S3 bucket and Redshift cluster reside in the same region).

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t';

This time we get another exception and the error message is:

ERROR: Load into table ‘xxxx’ failed.  Check ‘stl_load_errors’ system table for details.

This is a generic error; to get the details we need to query the “stl_load_errors” system table. You can query that table directly, but the query below is a bit more helpful. It picks up the most recent COPY error message generated when a data loading operation was performed on the ‘orders’ table.

SELECT TOP 1
  le.starttime,
  le.filename,
  le.line_number,
  le.colname,
  le.type AS column_datatype,
  le.raw_line AS error_record_line_data,
  le.raw_field_value AS attempt_to_insert_value,
  le.err_reason
FROM stl_load_errors le
 INNER JOIN
  (SELECT TOP 1 id
   FROM stv_tbl_perm
   WHERE name='orders') tp
 ON le.tbl=tp.id
ORDER BY le.starttime DESC;

Here is the output of the query:
Figure: COPY error details.

Reason for the Redshift COPY command errors:

First of all, look at the “err_reason” column. It displays “Invalid digit, Value ‘O’, Pos 0, Type: Integer”. This error message alone is not very helpful for understanding what actually went wrong, so let us examine the other column values of the query output.

Have a look at the “line_number” column. It indicates the error occurred when the COPY command attempted to insert the first line (value=1). In other words, the value of the “line_number” column indicates which line failed during the execution of the COPY command. As the “line_number” value is 1, the COPY command failed to insert the first line of the source file (s3://sourcedatainorig/order.txt).

Now we need to see what value the first line of the source data file contains. To get this row, you do not need to open the source text file; you can read it from the “error_record_line_data” column of the query output above. The “error_record_line_data” column value is as follows:

“CustomerName   OrderID   OrderDate   City   UnitPrice   Quantity   ShipRegion”

Looking at line #1 of the source file, we can see that it is the file header row. The COPY command failed to insert these header values into the pertinent columns of the orders table.

Let us do a bit more analysis to find out which specific column of the Redshift orders table refused to store a value during the data load. Look at the “colname”, “column_datatype” and “attempt_to_insert_value” columns of the query output once again. From these columns we can see colname=”orderid”, column_datatype=”int4” and attempt_to_insert_value=”OrderID”. It is now clear that the COPY command attempted to insert the character value “OrderID” into the integer-typed orderid column. As a result, the COPY command generated the “Invalid digit, Value ‘O’, Pos 0, Type: Integer” error and terminated the entire data loading operation.

So we came to know that the “header” row is responsible for this data loading failure. But this is not the only cause of this type of error. As with the header row, if any column value of the source data file has a data type that is not compatible with (or not auto-convertible to) the destination column of the target Redshift table, this error may occur.
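The same class of conversion failure can be reproduced directly, without running a COPY at all: casting a non-numeric string to INT fails, while a numeric string converts fine (the exact wording of the error message may vary by Redshift version):

```sql
-- Fails for the same reason the COPY did: 'OrderID' is not a number.
SELECT 'OrderID'::INT;

-- Succeeds: the string holds only digits, so it converts cleanly.
SELECT '10248'::INT;
```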

How to solve it:

Ignoring the first row (header row) of the source file during the COPY command.

If you get an error message like “Invalid digit, Value ‘O’, Pos 0, Type: Integer”, try executing your COPY command after eliminating the header row. Use the IGNOREHEADER parameter in your COPY command to ignore the first line of the data file. The COPY command will then look like below:

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' IGNOREHEADER 1;

When you use IGNOREHEADER 1, the COPY command skips the top 1 row. Instead of 1, any integer value N can be used to skip the top N rows.

Loading data using proper transformation logic that ignores unexpected, error-prone values.

You can use any of the popular third-party ETL tools to eliminate garbage values from your source data file.

Loading data into a temporary staging table whose columns are all VARCHAR type.

If your source data file contains too many unexpected, error-prone or garbage values and you do not want to use a third-party ETL tool, you can create a temporary staging table with all VARCHAR columns. In that case, you should not get such an error message during data load. After a successful load, use a custom SQL statement to clean the data and transfer it into the strongly typed main table.
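As a sketch of this staging approach (the staging table name “orders_staging” is my own choice; the columns mirror the orders table):

```sql
-- 1) Stage everything as VARCHAR so the COPY cannot hit a type error.
CREATE TABLE orders_staging
(
customername     VARCHAR(50),
orderid          VARCHAR(20),
orderdate        VARCHAR(30),
city             VARCHAR(20),
unitprice        VARCHAR(20),
quantity         VARCHAR(20),
shipregion       VARCHAR(20)
);

-- 2) After a successful COPY into orders_staging, move only the rows
--    that cast cleanly into the strongly typed main table.
INSERT INTO orders
SELECT
  customername,
  orderid::INT,
  orderdate::DATE,
  city,
  unitprice::FLOAT,
  quantity::INT,
  shipregion
FROM orders_staging
WHERE orderid ~ '^[0-9]+$';  -- example filter: keep rows with a numeric orderid only
```

The regex filter in the final WHERE clause is just one example of cleaning logic; it happens to discard the header row as well, since “OrderID” is not numeric.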

I hope the information above will be helpful.




Redshift COPY Command errors and how to solve them Part-1

Introduction:

Amazon Redshift is an optimized, highly scalable data warehouse system. It offers fast and robust performance even for very large numbers of records. In a typical data warehouse, loading data from various sources is always a challenging task. Redshift addresses this challenge with the COPY command, which is the recommended way to load huge volumes of data. Redshift is a comparatively newer technology, and people face various Redshift COPY command errors during bulk data loads. In this post, I have tried to address one error along with a probable solution.

Redshift COPY Command errors description:

[Error Code: 0, SQL State: XX000]  ERROR: S3ServiceException: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint

Reason of the Redshift COPY Command errors:

The source file’s S3 bucket region is not the same as the region of the Redshift database cluster.

Preparing the environment to generate the Redshift COPY Command errors:

In order to demonstrate the COPY command and become acquainted with the different error messages, we will create a sample table like below:

CREATE TABLE orders
(
CustomerName     VARCHAR(50) NOT NULL,
orderid          INT NOT NULL,
orderdate        DATE NOT NULL,
city             VARCHAR(20) NOT NULL,
unitprice        FLOAT NOT NULL,
quantity         INT NOT NULL,
shipregion       VARCHAR(20) NULL
);

I have a sample text file with 100 records; you can download it from here. The file is tab delimited. I will upload this test file into an S3 bucket in order to generate and fix all possible COPY related errors. The file looks like below:

Figure: Sample text file.

How these Redshift COPY command errors are produced:

Now let us assume I have a cluster in “us-east” region.
Figure 1: Redshift cluster region.

But my source file resides in another region (us-west).

Figure 2: S3 bucket region.

Now, I will execute the following COPY command to load the sample file into the Redshift table.

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t';

I got the following error message:
Figure 3: S3ServiceException error message.

How to solve these Redshift COPY command errors:

  1. By creating the S3 bucket in the same region as the Redshift cluster.
  2. By specifying the REGION parameter in the COPY command.

By creating the S3 bucket in the same region as the Redshift cluster.

  1. Go to the Redshift console and find your cluster’s region (see Figure 1). Assume my cluster is in the US-East region.
  2. Go to the S3 console and create a bucket in the same region.
  3. Put your file into the newly created bucket.
  4. Modify the COPY command’s source file path accordingly and execute it.
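Assuming the newly created same-region bucket is named, say, “sourcedatasameregion” (a hypothetical name), the reworked COPY command would look like this:

```sql
COPY orders
FROM 's3://sourcedatasameregion/order.txt'
credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>'
delimiter '\t';
```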

By specifying the REGION parameter in the COPY command.

If you really need to load data into a Redshift table from an S3 bucket that resides in another region, you need to specify the REGION parameter in your COPY command. To get the region name of the file in S3, follow the steps below:

  1. Go to S3 and open the bucket.
  2. Select the file (order.txt), right-click it and choose Properties.
  3. Have a look at the “Link” text in the properties pane, like below:

Figure 4: S3 file “Link” property.

  4. Get the region (us-west-2) from the Link property and compose the COPY command like below, adding the “REGION” parameter.

COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' REGION 'us-west-2';

The COPY command now executes without the “S3ServiceException” error. Still, the COPY command will not succeed. I will try to cover the error details and the possible resolution in a subsequent post.