During table design auto increment column is one of the best choices among database developers. Most of the cases an auto increment identity column is considered as the primary key of the table. It is also widely used as a surrogate key of dimension tables in a typical data warehouse system. Identity column SEED-STEP are used to generate the sequential values in the table. An identity column takes the value of current seed incremented by the step when a row is inserted into a table.
Identity column SEED, STEP can be used with CREATE TABLE statement in Amazon Redshift. A point to be noted that an identity column does not guarantee the values in the column will be consecutive in all cases. In real life, sometimes we need consecutive or strict sequential ID numbers. In that scenario, we need to adopt some additional measures in order to maintain sequence among the ID values to comply the business need. In today’s post we will see in which cases of INSERT statement identity column SEED-STEP does not generate consecutive identity values. We will also see some techniques to maintain the orders of the Identity values.
Environment to test Identity column SEED-STEP behavior with different INSERT statements.
I have used a single dw2.large node cluster to in my test environment and a table with an identity column like below:
CREATE TABLE custorders ( linenumber INT IDENTITY(1,1), customername VARCHAR(50) NOT NULL, orderid INT NOT NULL, orderdate TIMESTAMP NOT NULL, city VARCHAR(20) NOT NULL, unitprice FLOAT NOT NULL, quantity INT NOT NULL, shipregion VARCHAR(20) NULL );
Identity column SEED-STEP behavior in different INSERT statements.
In my environment, the following approaches are used to test the identity column SEED, STEP behavior.
- Inserting records using traditional INSERT INTO (column1, column2…) VALUES statement (PART-1)
- Inserting records using MULTY ROW INSERT approach (PART-2) like:
INSERT INTO … VALUES
- Inserting records using INSERT INTO … followed by multiple SELECT statement combined using UNION ALL (PART-3).
According the above points the following SQL scripts are written:
INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('A',1234,'01-01-2002','Dhaka',12,1,'DK'); INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('B',1235,'01-01-2002','Sylhet',12,2,'SL'); INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('C',1236,'01-01-2004','CTG',12,2,'CT');
INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('D',1234,'01-01-2002','Dhaka',12,1,'DK'), ('E',1235,'01-01-2002','Sylhet',12,2,'SL'), ('F',1236,'01-01-2004','CTG',12,2,'CT');
INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) SELECT 'G',1234,cast('01-01-2002' AS TIMESTAMP),'Dhaka',12,1,'DK' UNION ALL SELECT 'H',1235,'01-01-2002','Sylhet',12,2,'SL' UNION ALL SELECT 'I',1236,'01-01-2004','CTG',12,2,'CT';
These scripts are executed one part at a time, PART-1 first then PART-2 and finally PART-3. The result produces uninterrupted sequential identity values (line numbers). See below image:
Figure 02: Identity column generates uninterrupted unique values.
Now, let us execute all insert scripts at a time (PART-1 + 2 +3). Prior to do such testing I have dropped and recreated the “custorders” table. Executing all three SQL insert statement parts at a time produces the following result:
Figure 03:Identity column SEED-STEP behavior using multiple insert statements.
From the above image we see, SQL script PART-1 and PART-2 honored identity column SEED, STEP default behavior (linenumber 1 to 6). But Inserting records using INSERT INTO … followed by multiple SELECT statement combined using UNION ALL (PART-3) does not.
The possible reason why “Identity column SEED-STEP behavior” is different:
Possible reason of this issue could be the parallelism. Several months ago, I have combined records of more than 10 big tables using UNION ALL in order to make a PIVOT table in SQL SERVER. During query execution, I have noticed all of my CPU cores (1 socket, 4 cores, and 8 logical processors) were busy. To know more about that I conducted a Google search and found that optimizer might execute queries in parallel when multiple queries (branches) combined to a single SQL statement. We know Redshift is designed to facilitate huge parallel processing. Although it is not documented anywhere, but I presume, Redshift attempts to insert each SQL statement of UNION ALL in parallel into each node slice. In order to ensure that there is no single point of conflict in assigning identity values to rows during parallel insertion, Redshift might maintain gaps while allocating identity value for each rows. This is only my thoughts based on previous experience. Please inform me if anyone has more concrete idea about this behavior.
How to solve the issue:
There is not straight forward way that can resolve this identity missing issue. A new table can be created to copy all data from the existing table with CREATE TABLE AS statement using ROW_NUMBER window function like below:
CREATE TABLE custorders_new AS( SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber, CustomerName, orderid, orderdate, city, unitprice, quantity, shipregion FROM custorders); DROP TABLE custorders; ALTER TABLE custorders_new RENAME TO custorders
Please be noted the above approach may not suitable because CREATE TABLE AS does not inherits IDENTITY property from parent table. In that case create a VIEW over the table using the same ROW_NUMBER window function would be the perfect choice. See below:
CREATE VIEW custorder_vw as SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber, CustomerName, orderid, orderdate, city, unitprice, quantity, shipregion FROM custorders;
Finally, in this post we have had a look on Identity column SEED-STEP behavior in different insert statement and some possible ways to resolve this issue. In one of my subsequent posts I will try to demonstrate the Identity column SEED-STEP behavior when data files are uploaded using Redshift COPY command. Till then thanks lot to read this post with patient. Hope this post is informative to you.
References (UNION ALL in different databases):