In my previous post, I have demonstrated how Identity column behaves with different INSERT statements. In today’s post we will come to know about Identity column SEED-STEP behavior with COPY command. In other words, how identity values are generated when a file is loaded from S3 to Redshift a table that has an auto increment identity column.
As we already know that an identity column does not guarantee the values in the column will be consecutive in all cases. It also stated in earlier post that most often we need consecutive or strict sequential ID numbers. In that scenario, we need to implement some additional actions in order to maintain sequence among the ID values to comply the business need. In today’s post we will see in which cases identity column SEED-STEP with COPY command does not generate consecutive identity values. We will also see some techniques to maintain the orders or the Identity values later in this post.
Environment to test Identity column SEED-STEP behavior with COPY command.
A table like below:
CREATE TABLE custorders ( linenumber INT IDENTITY(1,1), customername VARCHAR(50) NOT NULL, orderid INT NOT NULL, orderdate TIMESTAMP NOT NULL, city VARCHAR(20) NOT NULL, unitprice FLOAT NOT NULL, quantity INT NOT NULL, shipregion VARCHAR(20) NULL );
Two text files and a manifest file in the S3 bucket to populate the above table using COPY command.
Figure 01: Test data and manifest files in S3.
The “order_2001.txt” file contains sales records of 2001 year and order_2002 has sales data of year 2002. In order to facilitate multiple data files loading in parallel using a single COPY command, I have created a manifest file named “order.manifest”.
All testing are done in a single dw2.large node cluster environment. Our environment is ready, now let us jump to me main test.
Identity column SEED, STEP behavior with COPY command while uploading a single file.
In order to see the identity value generated in the above table, I have executed the below COPY command to load order_2002.txt file into an empty custorders table.
COPY custorders FROM 's3://standard-seeddata/order_2002.txt' credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1;
Let’s see the identity values generated by the COPY command by selecting top 100 records.
SELECT TOP 100 * FROM custorders ORDER BY linenumber;
The above select statement produced the below results:
Figure 02: Identity values generated with COPY command.
From the above image we found the following discrepancies in Identity column SEED,-STEP behavior with COPY command.
- As we loaded the file into the empty custorders table, as per table identity definition declaration “linenumber INT IDENTITY(1,1)” the first value of linenumber column should be 1 then 2 then 3 and so on. But the first value is inserted surprisingly 4312.
- Furthermore, according to the identity STEP, the next row’s value should be increased by one step. So the value should be 4312+1=4313. But values are increased by 2 steps. Hence, in first row identity column “linenumber” value is 4312, in second row it is 4314 and third one is 4316 etc.
Why identity starts from 4312 instead of 1?
In order to find out reason behind these issues of “Identity column SEED-STEP behavior with COPY command” I have had a look on the redshift developer guide but did not find any clue. After a while I found myself that there are total 2156 records in the source text file (order_2002.txt). The first generated identity value of the table is twice of the total records count, which is 2156*2=4312. So why does this happen? To find out the answer, I have tried COPY command using different parameters. Finally, I got another clue when I triggered COPY command using “COMPUPDATE OFF” option. The COPY command is now as follows:
COPY custorders FROM 's3://standard-seeddata/order_2002.txt' credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1 COMPUPDATE OFF;
And the above command produces the blow identity values:Figure 03: Identity values generated with COPY with “COMPUPDATE OFF” parameter.
Now it produces much better result right? In previous attempt, the identity value started from an arbitrary 4312, but this time, at least it starts from 1. But, still, the identity values are incremented by 2 steps which is supposed to be 1. We will discuss about this issue later, but now let me tell you the reason why “using ‘COMPUPDATE OFF’ parameters with COPY command” generates initial identity value = 4312 instead of 1 during the first attempt (see Figure02) (4312 is twice of the total records count).
“COMPUPDATE” parameter in COPY command automatically decides the optimal compression encodings for each column of the target table based on ON|OFF settings. If COPY command is executed without “COMPUPDATE” option, COPY applies automatic compression on an empty target table. For this reason COPY command completes in two steps (passes). During first pass, identity values are incremented internally. In the second pass (step) it further increments the value while committing records to the target table. This could be reason why we did not get the initial identity value as per our expectation.
Why identity value increments by 2 instead of 1?
If you have a look on both “Figure 02” and “Figure 03”, you will find identity values are incremented by 2 instead of 1. I have single node in Redshift cluster. The node has two slices. As per Redsihft developer guide data are loaded in parallel in slices when multiple files are loaded using a single COPY command. COPY command loads data in sequential mode when a single file is pointed in COPY command. During data load, COPY command increments identity in such a manner so that the identity values of each slice are not overlapped with values of another slice. For this purpose COPY command would generate identity value in odd, even basis like below:
For slice one: 1, 3, 5…..n
For slice tow: 2, 4, 6…..n
Now let us do another test to validate this fact. As stated earlier in the “Environment” section (see Figure 01) that we have three files. The order_2011.txt contains sales data of year 2001, order_2002.txt file has sales data of year 2002 and a manifest file to upload both files using a single COPY command. The manifest file looks like below:
{ "entries": [ {"url":"s3://standard-seeddata/order_2001.txt", "mandatory":true}, {"url":"s3://standard-seeddata/order_2002.txt", "mandatory":true} ] }
And the COPY command is as follows:
COPY custorders FROM 's3://standard-seeddata/order.manifest' credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1 manifest COMPUPDATE OFF;
After executing the above COPY command, I got results like below:
Figure 04: Identity values generated by COPY with manifest option.
Splendid!! Identity column SEED-STEP behavior with COPY command is now as per our expectation. It generates initial value with 1 and the subsequent values are incremented by 1 as well. There is no missing, no gaps amongst identity values. The per slice identity allocation also clear to us. Executing the following query will make more sense regarding per slice records and identity value distributions.
SELECT slice, num_values AS total_rows_per_slice, minvalue AS min_identity_val, maxvalue AS max_identity_val FROM svv_diskusage WHERE NAME = 'custorders' AND col = 0 ORDER BY slice,col;
The above query generates the follow output:
Figure 05: Per slice identity column distribution
From the above image, we get a clear idea about how redshift determines slice wise identity values. It seems COPY command loads the both files in parallel and allocates identity values as follows:
Loads “orders_2002.txt” files in slice one and allocates identity: 1, 3, and 5…..up to 4309
Loads orders_2001.txt in slice two and allocates identity: 2, 4, 6…..up to 4310
Finally, from the aforementioned test, we came to know loading files in parallel using COPY command with “COMPUPDATE OFF” option may generate expected identity values. It also comply with Identity column SEED-STEP behavior with COPY command. But in different scenario (multiple node cluster) Identity column SEED-STEP behavior with COPY command might be different. In that case, ROW_NUMBER window function can be used to generate sequential values. Another option could be to ask file provider (or client) to include line number in the file if this is a strict business need. Then the file can be loaded using COPY command with “EXPLICIT_IDS” parameter into an identity enabled redshift table.
References:
1. https://forums.aws.amazon.com/thread.jspa?messageID=564576
2. http://docs.aws.amazon.com/redshift/latest/dg/r_SVV_DISKUSAGE.html