Amazon Redshift – Identity column SEED-STEP behavior with COPY command.

In my previous post, I have demonstrated how Identity column behaves with different INSERT statements. In today’s post we will come to know about Identity column SEED-STEP behavior with COPY command. In other words, how identity values are generated when a file is loaded from S3 to Redshift a table that has an auto increment identity column.

As we already know that an identity column does not guarantee the values in the column will be consecutive in all cases. It also stated in earlier post that most often we need consecutive or strict sequential ID numbers. In that scenario, we need to implement some additional actions in order to maintain sequence among the ID values to comply the business need. In today’s post we will see in which cases identity column SEED-STEP with COPY command does not generate consecutive identity values. We will also see some techniques to maintain the orders or the Identity values later in this post.

Environment to test Identity column SEED-STEP behavior with COPY command.

A table like below:

CREATE TABLE custorders
(
linenumber 	 INT IDENTITY(1,1),
customername     VARCHAR(50) NOT NULL,
orderid          INT NOT NULL,
orderdate        TIMESTAMP NOT NULL,
city             VARCHAR(20) NOT NULL,
unitprice        FLOAT NOT NULL,
quantity         INT NOT NULL,
shipregion       VARCHAR(20) NULL
);

Two text files and a manifest file in the S3 bucket to populate the above table using COPY command.

Test data files in S3
Figure 01: Test data and manifest files in S3.

The “order_2001.txt” file contains sales records of 2001 year and order_2002 has sales data of year 2002. In order to facilitate multiple data files loading in parallel using a single COPY command, I have created a manifest file named “order.manifest”.

All testing are done in a single dw2.large node cluster environment. Our environment is ready, now let us jump to me main test.

Identity column SEED, STEP behavior with COPY command while uploading a single file.

In order to see the identity value generated in the above table, I have executed the below COPY command to load order_2002.txt file into an empty custorders table.

COPY custorders FROM 's3://standard-seeddata/order_2002.txt' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1; 

Let’s see the identity values generated by the COPY command by selecting top 100 records.

SELECT TOP 100 * 
FROM custorders
ORDER BY
 linenumber;

The above select statement produced the below results:

Identity value generated with copy command
Figure 02: Identity values generated with COPY command.

From the above image we found the following discrepancies in Identity column SEED,-STEP behavior with COPY command.

  • As we loaded the file into the empty custorders table, as per table identity definition declaration “linenumber INT IDENTITY(1,1)” the first value of linenumber column should be 1 then 2 then 3 and so on. But the first value is inserted surprisingly 4312.
  • Furthermore, according to the identity STEP, the next row’s value should be increased by one step. So the value should be 4312+1=4313. But values are increased by 2 steps. Hence, in first row identity column “linenumber” value is 4312, in second row it is 4314 and third one is 4316 etc.

Why identity starts from 4312 instead of 1?

In order to find out reason behind these issues of “Identity column SEED-STEP behavior with COPY command” I have had a look on the redshift developer guide but did not find any clue. After a while I found myself that there are total 2156 records in the source text file (order_2002.txt). The first generated identity value of the table is twice of the total records count, which is 2156*2=4312. So why does this happen? To find out the answer, I have tried COPY command using different parameters. Finally, I got another clue when I triggered COPY command using “COMPUPDATE OFF” option. The COPY command is now as follows:

COPY custorders FROM 's3://standard-seeddata/order_2002.txt' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1
COMPUPDATE OFF;

And the above command produces the blow identity values:COPY with COMPUPDATE OFFFigure 03: Identity values generated with COPY with “COMPUPDATE OFF” parameter.

Now it produces much better result right? In previous attempt, the identity value started from an arbitrary 4312, but this time, at least it starts from 1. But, still, the identity values are incremented by 2 steps which is supposed to be 1. We will discuss about this issue later, but now let me tell you the reason why “using ‘COMPUPDATE OFF’ parameters with COPY command” generates initial identity value = 4312  instead  of 1 during the first attempt (see Figure02) (4312 is twice of the total records count).

“COMPUPDATE” parameter in COPY command automatically decides the optimal compression encodings for each column of the target table based on ON|OFF settings. If COPY command is executed without “COMPUPDATE” option, COPY applies automatic compression on an empty target table. For this reason COPY command completes in two steps (passes). During first pass, identity values are incremented internally. In the second pass (step) it further increments the value while committing records to the target table. This could be reason why we did not get the initial identity value as per our expectation.

Why identity value increments by 2 instead of 1?

If you have a look on both “Figure 02” and “Figure 03”, you will find identity values are incremented by 2 instead of 1. I have single node in Redshift cluster. The node has two slices. As per Redsihft developer guide data are loaded in parallel in slices when multiple files are loaded using a single COPY command. COPY command loads data in sequential mode when a single file is pointed in COPY command. During data load, COPY command increments identity in such a manner so that the identity values of each slice are not overlapped with values of another slice. For this purpose COPY command would generate identity value in odd, even basis like below:

For slice one: 1, 3, 5…..n
For slice tow: 2, 4, 6…..n

Now let us do another test to validate this fact. As stated earlier in the “Environment” section (see Figure 01) that we have three files. The order_2011.txt contains sales data of year 2001, order_2002.txt file has sales data of year 2002 and a manifest file to upload both files using a single COPY command. The manifest file looks like below:

{
  "entries": [    
    {"url":"s3://standard-seeddata/order_2001.txt", "mandatory":true},
    {"url":"s3://standard-seeddata/order_2002.txt", "mandatory":true}
  ]
}

And the COPY command is as follows:

COPY custorders FROM 's3://standard-seeddata/order.manifest' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1 manifest COMPUPDATE OFF;

After executing the above COPY command, I got results like below:
Multiple file load usning single COPY
Figure 04: Identity values generated by COPY with manifest option.

Splendid!! Identity column SEED-STEP behavior with COPY command is now as per our expectation. It generates initial value with 1 and the subsequent values are incremented by 1 as well. There is no missing, no gaps amongst identity values. The per slice identity allocation also clear to us. Executing the following query will make more sense regarding per slice records and identity value distributions.

SELECT  slice,
         num_values  AS total_rows_per_slice,
         minvalue    AS min_identity_val,
         maxvalue    AS max_identity_val
 FROM   svv_diskusage
 WHERE  NAME = 'custorders'
         AND col = 0
 ORDER BY
         slice,col;

The above query generates the follow output:
Per slice identity column distribution
Figure 05: Per slice identity column distribution

From the above image, we get a clear idea about how redshift determines slice wise identity values. It seems COPY command loads the both files in parallel and allocates identity values as follows:

Loads “orders_2002.txt” files in slice one and allocates identity: 1, 3, and 5…..up to 4309
Loads orders_2001.txt in slice two and allocates identity: 2, 4, 6…..up to 4310

Finally, from the aforementioned test, we came to know loading files in parallel using COPY command with “COMPUPDATE OFF” option may generate expected identity values. It also comply with Identity column SEED-STEP behavior with COPY command. But in different scenario (multiple node cluster) Identity column SEED-STEP behavior with COPY command might be different. In that case, ROW_NUMBER window function can be used to generate sequential values. Another option could be to ask file provider (or client) to include line number in the file if this is a strict business need. Then the file can be loaded using COPY command with “EXPLICIT_IDS” parameter into an identity enabled redshift table.

References:
1. https://forums.aws.amazon.com/thread.jspa?messageID=564576
2. http://docs.aws.amazon.com/redshift/latest/dg/r_SVV_DISKUSAGE.html




This entry was posted in Amazon Redshift, Redshift COPY Command, Uncategorized by Khorshed. Bookmark the permalink.

About Khorshed

Khorshed has 12 years of cumulative professional experience in areas of database administration, development, Business Intelligent, system appreciation, requirement analysis and software development (using .net technologies). Achieved MCITP in SQL Server in both SQL Server development and administration. He has vast knowledge in SQL Server, SSIS, SSRS, SSAS. He is also proficient in Amazon cloud services like Redshift, Data Pipeline, SNS, DynamoDB, EMR and many more.

12 thoughts on “Amazon Redshift – Identity column SEED-STEP behavior with COPY command.

  1. Wow!
    Awesome explanation !!!!!!!!!!
    After read the whole post I have no confusion about “Amazon Redshift – Identity column SEED-STEP behavior with COPY command”
    Many many thanks for nice article.

  2. Nice article.
    But in the last case, how did the values start coming with a step value of 1 instead of 2 (as was in the second case when you did the compupdate off).?
    In the second case, the values were like 1,3,5 etc. then without any change, how did the values start coming as 1,2,3?

    • Thanks indeed for your comments. First COPY command (result in Figure 2) data are loaded without “COMPUPDATE OFF” parameter. There data are loaded sequentially (no manifest file). In second COPY command was executed with “COMPUPDATE OFF” parameter. In that COPY command data only one file was loaded in sequential manner as well. But in the third COPY command, 2 files are loaded using a manifest file. So files are loaded in parallel. So loading files in parallel using manifest files loaded data in proper sequence in the table. It also depends on how many node (slices) you are loading your data. If you load single file in single node cluster(2 slices) without using “COMUPDATE” parameter you are highly likely to get a start identity value twice of your total record count in the file. If you load single file with the parameter you will get odd..even. odd or even…odd..even like identity value starting from either 1 or 2. But if you use a manifest file and load the data in parallel you are like to get sequential identity values. The reason of such behavior is specified as follows
      ““COMPUTEDATE” parameter in COPY command automatically decides the optimal compression encodings for each column of the target table based on ON|OFF settings. If COPY command is executed without “COMPUPDATE” option, COPY applies automatic compression on an empty target table. For this reason COPY command completes in two steps (passes). During first pass, identity values are incremented internally. In the second pass (step) it further increments the value while committing records to the target table. This could be reason why we did not get the initial identity value as per our expectation.”

  3. Great explanation and help indeed.
    Though i had a question, when i wanted to copy one file from s3 into a table, and tried with the compupdate off it skipped indexes as mentioned above and appeared either in odd or even. Is there any way of generating it sequentially without incrementing it by 2?

    • Actually, due to nature of MPP(Massive Parallel Processing), you are highly unlikely to get your expected Identity behavior. I would suggest to include the the IDs in source file (most of the time it is not possible thought) or you can upload the records in staging table first,then load data into the table using ROW_NUMBER() window function.

  4. Amazing article.
    I have just finished moving and copying my data and the test server (1 node) behaved great, but my production (6 nodes) created wierd index numbers. now i know why. thanks so much. now i know how to solve this!

  5. Extremely Helpful!
    I do have a question regarding this. Here is what I’m trying to do. I have a redshift cluster with 3 nodes and I’m loading data into it through a spark job which would run on a schedule. When i try to load say a 1000 records, the ID generated by the identity column still skips some numbers. I did use .option(“extracopyoptions”, “COMPUPDATE OFF TIMEFORMAT ‘auto'”) with a dataset.write() in the spark job and i see that the copy command generated by this has the COMPUPDATE set to OFF:

    COPY “PUBLIC”.”lineitem” FROM ‘s3://bla/redshift/billing/temp/ fbd04f98-837a-48ca-9450-0725d6af44eb /manifest.json’ CREDENTIALS ” FORMAT AS CSV NULL AS ‘@NULL@’ manifest COMPUPDATE OFF TIMEFORMAT ‘auto’

    I did delete the table and recreated it before trying to load the data. Is there something I’m missing? Any help is appreciated.

  6. Great Info,
    Could you please let us know is it a good practice to have a IDENTITY(1,1) on PKEY of a fact table this is very huge table.
    it makes sense to have a IDENTITY(1,1) on dimension table ,but do you think its normal / best practice to generate sequence for fact table similarly

Leave a Reply

Your email address will not be published. Required fields are marked *

11 + two =