Amazon Redshift – Identity column SEED-STEP behavior with different INSERT statements.

During table design auto increment column is one of the best choices among database developers. Most of the cases an auto increment identity column is considered as the primary key of the table. It is also widely used as a surrogate key of dimension tables in a typical data warehouse system. Identity column SEED-STEP are used to generate the sequential values in the table. An identity column takes the value of current seed incremented by the step when a row is inserted into a table.

Identity column SEED, STEP can be used with CREATE TABLE statement in Amazon Redshift. A point to be noted that an identity column does not guarantee the values in the column will be consecutive in all cases. In real life, sometimes we need consecutive or strict sequential ID numbers. In that scenario, we need to adopt some additional measures in order to maintain sequence among the ID values to comply the business need. In today’s post we will see in which cases of INSERT statement identity column SEED-STEP does not generate consecutive identity values. We will also see some techniques to maintain the orders of the Identity values.

Environment to test Identity column SEED-STEP behavior with different INSERT statements.

I have used a single dw2.large node cluster to in my test environment and a table with an identity column like below:

CREATE TABLE custorders
(
 linenumber       INT IDENTITY(1,1),
 customername     VARCHAR(50) NOT NULL,
 orderid          INT NOT NULL,
 orderdate        TIMESTAMP NOT NULL,
 city             VARCHAR(20) NOT NULL,
 unitprice        FLOAT NOT NULL,
 quantity         INT NOT NULL,
 shipregion       VARCHAR(20) NULL
);

Identity column SEED-STEP behavior in different INSERT statements.

In my environment, the following approaches are used to test the identity column SEED, STEP behavior.

  • Inserting records using traditional INSERT INTO (column1, column2…) VALUES statement (PART-1)
  • Inserting records using MULTY ROW INSERT approach (PART-2) like:
    INSERT INTO … VALUES
    (value1, value1),
    (value2, value4);
  • Inserting records using INSERT INTO … followed by multiple SELECT statement combined using UNION ALL (PART-3).

According the above points the following SQL scripts are written:

PART-1:

INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('A',1234,'01-01-2002','Dhaka',12,1,'DK');
INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('B',1235,'01-01-2002','Sylhet',12,2,'SL');
INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('C',1236,'01-01-2004','CTG',12,2,'CT');

PART-2:

INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES
('D',1234,'01-01-2002','Dhaka',12,1,'DK'),
('E',1235,'01-01-2002','Sylhet',12,2,'SL'),
('F',1236,'01-01-2004','CTG',12,2,'CT');

PART-3:

INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion)
SELECT 'G',1234,cast('01-01-2002' AS TIMESTAMP),'Dhaka',12,1,'DK'
UNION ALL
SELECT 'H',1235,'01-01-2002','Sylhet',12,2,'SL'
UNION ALL
SELECT 'I',1236,'01-01-2004','CTG',12,2,'CT';

These scripts are executed one part at a time, PART-1 first then PART-2 and finally PART-3. The result produces uninterrupted sequential identity values (line numbers). See below image: Identity column seed step begavior in single insert
Figure 02: Identity column generates uninterrupted unique values.

Now, let us execute all insert scripts at a time (PART-1 + 2 +3). Prior to do such testing I have dropped and recreated the “custorders” table. Executing all three SQL insert statement parts at a time produces the following result:

Insert UNION ALL not generates identity
Figure 03:Identity column SEED-STEP behavior using multiple insert statements.

From the above image we see, SQL script PART-1 and PART-2 honored identity column SEED, STEP default behavior (linenumber 1 to 6). But Inserting records using INSERT INTO … followed by multiple SELECT statement combined using UNION ALL (PART-3) does not.

The possible reason why “Identity column SEED-STEP behavior” is different:

Possible reason of this issue could be the parallelism. Several months ago, I have combined records of more than 10 big tables using UNION ALL in order to make a PIVOT table in SQL SERVER. During query execution, I have noticed all of my CPU cores (1 socket, 4 cores, and 8 logical processors) were busy. To know more about that I conducted a Google search and found that optimizer might execute queries in parallel when multiple queries (branches) combined to a single SQL statement. We know Redshift is designed to facilitate huge parallel processing. Although it is not documented anywhere, but I presume, Redshift attempts to insert each SQL statement of UNION ALL in parallel into each node slice. In order to ensure that there is no single point of conflict in assigning identity values to rows during parallel insertion, Redshift might maintain gaps while allocating identity value for each rows. This is only my thoughts based on previous experience. Please inform me if anyone has more concrete idea about this behavior.

How to solve the issue:

There is not straight forward way that can resolve this identity missing issue. A new table can be created to copy all data from the existing table with CREATE TABLE AS statement using ROW_NUMBER window function like below:

CREATE TABLE custorders_new AS(
 SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber,
  CustomerName,
  orderid,
  orderdate,
  city,
  unitprice,
  quantity,
  shipregion 
FROM custorders);
DROP TABLE custorders;
ALTER TABLE custorders_new RENAME TO custorders

Please be noted the above approach may not suitable because CREATE TABLE AS does not inherits IDENTITY property from parent table. In that case create a VIEW over the table using the same ROW_NUMBER window function would be the perfect choice. See below:

CREATE VIEW custorder_vw as 
SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber, 
 CustomerName,
 orderid,
 orderdate,
 city,
 unitprice,
 quantity,
 shipregion 
FROM custorders;

Finally, in this post we have had a look on Identity column SEED-STEP behavior in different insert statement and some possible ways to resolve this issue. In one of my subsequent posts I will try to demonstrate the Identity column SEED-STEP behavior when data files are uploaded using Redshift COPY command. Till then thanks lot to read this post with patient. Hope this post is informative to you.

References (UNION ALL in different databases):

1. http://docs.oracle.com/database/121/VLDBG/parallel003.htm
2. http://forums.teradata.com/forum/database/sql-statements-in-union-executes-serial-or-parallel




7 thoughts on “Amazon Redshift – Identity column SEED-STEP behavior with different INSERT statements.

  1. Thanks for the insight in generating the surrogate keys .Is there an alternative way to eradicate this identity missing issue during parallel processing ? .Row_number() option may not be the wise one for voluminous record processing (say for historic loads ,adhoc ETL loads ) ..

    Appreciate your efforts in sharing the thoughts and experience !!

    -Mohamed

  2. Hey,
    I implemented the identity logic in a table where i have to truncate and reload it every time.
    Here after truncating the table, the increment starts from the last increment value which was inserted in the table before truncating.However i want the count to start from the initial value(seed value) everytime.

    Kindly suggest.

Leave a Reply

Your email address will not be published. Required fields are marked *

fifteen − six =