Connect to a SQL Server Instance using MuleSoft (JDBC)

In recent years MuleSoft has become a proven data integration and API management platform. It connects to almost any database, including SQL Server. In one of my recent data integration projects, I had a development server hosting multiple instances with different SQL Server versions. While connecting to the SQL Server default instance through MuleSoft is easy – much like native ADO or OLEDB – talking to a named SQL Server instance through JDBC is, however, a bit tricky. Let me show you how to overcome this issue.

SQL Server Instances – to connect from MuleSoft

As specified earlier, I have two SQL Server instances, DevServer-01 and DevServer-01\SQL2014. The former is a legacy SQL Server 2012 installation, and the latter runs the slightly newer SQL Server 2014. Using MuleSoft, I need to integrate data from both servers.

So, what is the issue?
Connecting to the default SQL Server instance (DevServer-01) in MuleSoft was quite hassle-free,
Figure-02: Successful connection to Default instance

But when I attempted to connect to the SQL Server 2014 instance in MuleSoft just by changing the host name to "DevServer-01\SQL2014", I got the following error:

Cannot get connection for URL jdbc:sqlserver://DevServer-01\SQL2014:1433;databaseName=CallData : Cannot open database "CallData" requested by the login. The login failed.

Wait a minute! What does this error say? "Cannot open database "CallData" requested by the login". It means JDBC is trying to connect to my default instance, where the database CallData does not exist. But in the connection string I have mentioned the instance name, "DevServer-01\SQL2014"! So where is the problem?

The remedy
So, the issue was the default port, 1433, that I had used to connect to the SQL Server instance. Although the SQL Server native drivers (ADO, OLEDB) can handle this, JDBC (MuleSoft) needs the instance's TCP Dynamic Port to be specified in the connection string instead of the default one. Now we need that port, right? Let's find out the TCP Dynamic Port.

  1. Log in to the server and open SQL Server Configuration Manager
  2. Expand SQL Server Network Configuration and select "Protocols for {Your Instance Name}". In my case it is "Protocols for SQL2014"
  3. Select TCP/IP from the right panel and open its properties (IP Addresses tab)
  4. Pick up the port number from "TCP Dynamic Ports". For me it shows 64203. For you, it could be a different one.

Now you have what you need; just use this port number instead of the default one (1433) in the Mule connection string.
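For reference, the resulting JDBC URL should look roughly like the sketch below, based on the URL from the error message above; your host, instance, port, and database name will of course differ:

jdbc:sqlserver://DevServer-01\SQL2014:64203;databaseName=CallData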


You are done!

Finally, one thing to remember: since the "TCP Dynamic Ports" value can change when the server restarts, do not forget to look it up again and update your MuleSoft connection string with the newly assigned port. Hope this helps!

Amazon Redshift – Getting number of records affected in last query executed in current session

Most of the time we need to know the number of records affected by an insert or delete query in the current session. Popular DBMSs like SQL Server and MySQL offer built-in mechanisms for this. For example, SQL Server has the @@ROWCOUNT global variable and MySQL has the ROW_COUNT() function, both of which return the number of records affected by the last query executed in the current session. Redshift does not offer such a built-in function. But by querying the system tables, we can get this information easily.

Getting number of records affected in last query executed in current session using system table

To get this number we need to use the following Redshift system tables:

STV_SESSIONS: Contains the active user sessions for Amazon Redshift.
STL_QUERY: Contains execution information about database queries.
STL_INSERT: Contains insert execution steps for INSERT queries.

To get the number of records affected by the last delete statement executed in the current session, we need another system table named STL_DELETE. Alright, now we are familiar with the required system tables. It is time to build the query:

WITH inserted_result AS (
         SELECT i.tbl        AS tableid,
                SUM(i.rows)  AS total_affected_rows
         FROM   STV_SESSIONS s
                INNER JOIN stl_query q
                     ON  s.process = q.pid
                INNER JOIN stl_insert i
                     ON  i.query = q.query
         WHERE  i.rows > 0
         GROUP BY
                i.tbl
     )

SELECT DISTINCT t.name AS tablename,
       total_affected_rows
FROM   inserted_result ir
       INNER JOIN STV_TBL_PERM t
            ON  t.id = ir.tableid;

Testing the query

To test the query we will create an empty table shaped like the STV_TBL_PERM system table of Redshift using the command below:

CREATE TABLE tmp_redshift_tableinfo (LIKE STV_TBL_PERM);

Open a new query editor and execute the scripts together, as shown in the image below.


Figure: Redshift records affected by the last query
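The script shown in the image is roughly equivalent to the following sketch (assuming 100 rows are loaded from STV_TBL_PERM, as described below, followed by the counting query from the previous section):

INSERT INTO tmp_redshift_tableinfo
SELECT *
FROM   STV_TBL_PERM LIMIT 100;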

The above scripts insert 100 rows from the STV_TBL_PERM system table into the newly created tmp_redshift_tableinfo table. Afterwards, our query returns the number of records affected by the last query executed in the current session. We can get the same information for a delete query by simply using the STL_DELETE table instead of STL_INSERT, as sketched below.
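Here is that delete variant – a sketch that simply swaps STL_INSERT for STL_DELETE in the query above:

WITH deleted_result AS (
         SELECT d.tbl        AS tableid,
                SUM(d.rows)  AS total_affected_rows
         FROM   STV_SESSIONS s
                INNER JOIN stl_query q
                     ON  s.process = q.pid
                INNER JOIN stl_delete d
                     ON  d.query = q.query
         WHERE  d.rows > 0
         GROUP BY
                d.tbl
     )

SELECT DISTINCT t.name AS tablename,
       total_affected_rows
FROM   deleted_result dr
       INNER JOIN STV_TBL_PERM t
            ON  t.id = dr.tableid;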

Hope this helps. Enjoy!!!

Amazon Redshift – What you need to think before defining primary key

Amazon Redshift suggests defining primary key or foreign key constraints wherever applicable. But they are informational only; Redshift does not enforce these constraints. In other words, a column with a primary key accepts duplicate values, and a foreign key column allows values that do not exist in the referenced table. So why does Redshift still include defining primary and foreign keys in its best practice list? Because the query optimizer uses them to choose the most suitable execution plan. But you need to be very careful while defining these constraints in your tables. Let us see why with a real-life example.

Need to think before defining primary key:

As specified earlier, it is fine to have a primary key in a Redshift table. But keep in mind it is your responsibility to make sure the column's values are unique. Otherwise, you are likely to get wrong query results. An example gives you more insight.

I have created a table named employee using the script below:

CREATE TABLE employee
(
 emp_id       INT NOT NULL PRIMARY KEY,
 emp_name     VARCHAR(100) NOT NULL,
 dept_id      INT
);

Then I executed the following script to insert some records into it.

INSERT INTO employee VALUES(1, 'Foo',1);
INSERT INTO employee VALUES(2, 'Bar',1);
INSERT INTO employee VALUES(2, 'Bar',1);

You might have noticed that emp_id is declared as the primary key, yet in the insert scripts 'Bar' is inserted twice with emp_id 2. In Redshift this is not a problem, as the primary key constraint is not enforced. Assume we have another table called "department" with the following rows:
Figure-1: Department table.
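For reference, here is a minimal sketch of a department table that fits the query below; the actual rows are those shown in Figure-1, and the dept_name values here are hypothetical:

CREATE TABLE department
(
 dept_id      INT NOT NULL,
 dept_name    VARCHAR(100) NOT NULL
);

INSERT INTO department VALUES(1, 'Sales');
INSERT INTO department VALUES(2, 'Finance');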

Now, let’s execute the following query to see, employee name and department of employee Id = 2. The query looks like below:

SELECT DISTINCT e.emp_id,
       d.dept_name
FROM   employee e
       INNER JOIN department d
            ON  e.dept_id = d.dept_id
WHERE  e.emp_id = 2;

We know there are duplicate values for employee id 2, so the DISTINCT keyword is used to eliminate duplicates from the query results. Here is the output of the query:
Figure-2: Wrong query result.

See, it produces the wrong result! The DISTINCT keyword is not honored at all! The reason is that Redshift assumes all values in a primary key column were inserted with uniqueness already ensured, so the database engine does not bother to make them distinct during query execution. Let's remove the primary key constraint from the table:

ALTER TABLE employee
DROP CONSTRAINT employee_pkey;

And execute the select statement once again. We are getting the expected result now!
Figure-3: Correct Result

So now you have an idea of what you need to think about before defining a primary key. And of course, always remember the following note from the Redshift developer guide.

Do not define the primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.




Consider this after deleting a large number of rows from a Redshift Table

Who wants to delete rows after inserting them into tables? But to comply with business needs, we sometimes delete unnecessary records from tables. After deleting a large number of rows from a table, we usually expect the disk space consumed by those rows to be released automatically. A popular RDBMS like SQL Server marks the rows affected by a DELETE command as ghost records rather than physically deleting them from disk at once; afterward, an asynchronous internal process removes these ghost records from disk. Amazon Redshift does not offer such a facility – you need to do it yourself. In this post, you will learn what you need to do after deleting a large number of rows from a Redshift table.

The issue you may face after deleting a large number of rows from a Redshift Table.

Database developers sometimes query the system catalog tables to get the total row count of a table with a huge number of records, because the catalog responds faster. In Redshift, this approach can be dangerous, because after a delete operation Redshift removes the records from the table but does not update the catalog table metadata. Let us see an example:

I have populated a table named "lineorder" with AWS sample records. The table contains around 600 million rows. Now let us use the following catalog table query to get the total record count of the table.

SELECT datname,
        nspname,
        relname,
        SUM(ROWS) AS ROWS
FROM   pg_class,
       pg_namespace,
       pg_database,
       stv_tbl_perm
WHERE  pg_namespace.oid = relnamespace
   AND pg_class.oid = stv_tbl_perm.id
   AND pg_database.oid = stv_tbl_perm.db_id
   AND trim(datname) = 'database name here'
   AND trim(relname) = 'lineorder'
GROUP BY
 datname, nspname, relname
ORDER BY
 datname, nspname, relname;

After executing the above query we get the following result:
Figure 01: Table record count using a catalog query.

Now let us verify the record count by issuing the query below:

SELECT COUNT(*)
FROM  lineorder;

Figure 02: Getting row count by querying on the table itself.

Alright, we are getting identical results from both queries.

In, “lineorder” table there is a total of 248900 rows for order date = ‘19920712’. We will delete all these rows from “lineorder” table. For this the following delete statement is executed:

DELETE
FROM   lineorder
WHERE  lo_orderdate = '19920712'

And here is the result.
Figure 03: 248900 rows are deleted.

At this moment the total record count of the table should be 600037902 - 248900 = 599789002.
Let us verify this with the below SQL Statement once again:

SELECT COUNT(*)
FROM lineorder;

Here is the result:
Figure 04: Record count after deleting records.

Alright, the result confirms that after the delete query we have 599789002 rows left in the table. Now we are going to run the catalog query to get the record count of the table. Let's execute the following query once again:

SELECT datname,
       nspname,
       relname,
       SUM(ROWS) AS ROWS
FROM   pg_class,
       pg_namespace,
       pg_database,
       stv_tbl_perm
WHERE  pg_namespace.oid = relnamespace
   AND pg_class.oid = stv_tbl_perm.id
   AND pg_database.oid = stv_tbl_perm.db_id
   AND trim(datname) = 'your database name here'
   AND trim(relname) = 'lineorder'
GROUP BY
 datname, nspname, relname
ORDER BY
 datname, nspname, relname;

The query returns the below result:

Figure 05: Catalog tables are not refreshed after deleting a large number of rows.

Oh!! The record count returned by the catalog query is 600037902, which was the initial record count of the table (before deleting records). But the total should be 599789002, as we have already deleted 248900 rows. That means the catalog table's metadata is not refreshed after records are deleted from the table.

This could be a huge problem if your application uses catalog queries to get the total row count of a table. It is common practice for database developers to query the system (catalog) tables for the record count of relatively big tables instead of issuing a SELECT COUNT(*) statement, because querying a system table is much faster than querying the actual table. But in Redshift this approach is dangerous, as shown earlier in this post!

So, what do you need to do after deleting a large number of rows from a Redshift table? The answer is to execute the VACUUM command. I executed it like this:

VACUUM lineorder;

Once VACUUM finished, I executed the catalog query again to get the total record count of the table. The image below depicts the result:

Figure 06: Accurate record count after running VACUUM command.

Now we are getting an accurate result by executing catalog query as well.

Finally, the VACUUM command also reclaims storage space and re-sorts rows according to the sort key defined on the table. It is also a best practice to ANALYZE a Redshift table after deleting a large number of rows, to keep the table statistics up to date. Updated statistics ensure faster query execution.
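As a minimal follow-up sketch, the two maintenance steps together look like this for the table used in this post:

VACUUM lineorder;
ANALYZE lineorder;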

Hope this information will help you in your real-life Redshift development.

Note: VACUUM is a slow and resource-intensive operation. The best practice is to execute it in an off-peak time window. So if you really need an immediate record count after a delete operation, it is better to use SELECT COUNT(*) instead of a catalog query.




Interesting aggregate query result using WITH clause in Redshift

UPDATE: This issue has been fixed in Redshift now.

Amazon Redshift does not support table variables. For a quick test, we sometimes do not want to store data in a real table; instead we declare a table variable and complete our testing. This approach also saves us from cleaning up unused test tables and records. Redshift supports Common Table Expressions, which we can use like a temp table (still with limitations). Using a WITH block (CTE), multiple logical tables can be generated by combining arbitrary select statements, one per logical expression. For more information, see the WITH Clause section in the Redshift Developer Guide. In a recent project, while doing a quick test by generating multiple logical tables with some records, I noticed an interesting aggregate query result using the WITH clause in Redshift. Let me show you the details.

Generating tables to see Interesting aggregate query result using WITH clause in Redshift:

I have written a script like below:

WITH t1 AS
(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
),
t2 AS

(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
)

I have generated the table expressions "t1" and "t2" just to run an aggregate query that takes the SUM of all "col1" values. So my final query is as follows:

WITH t1 AS
(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
),
t2 AS

(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
)

SELECT SUM(sumofcol1) AS Total
FROM(
  SELECT SUM(col1) AS sumofcol1
  FROM   t1
  UNION ALL
  SELECT SUM(col1)
  FROM   t2
) t

After executing the above query, I surprisingly got the result "10", although it should be 120. See the image below:

Figure 01: Incorrect query result.

To confirm the result, I executed the query in a SQL Server environment and got the expected output of 120. Have a look:

Figure 02: Correct Query result in SQL Server Management Studio.

So, a really interesting aggregate query result using the WITH clause in Redshift! I don't know why it happens; I would be grateful if anyone could tell me the reason for such query behavior.

How to solve it.

In order to resolve the issue, I tried executing the query with the SUM removed from the innermost subquery, and it gave the correct result! Have a look at the corrected version below:

WITH t1 AS
(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
),
t2 AS

(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
)

SELECT SUM(sumofcol1) AS Total
FROM(
  SELECT col1 AS sumofcol1
  FROM   t1
  UNION ALL
  SELECT col1
  FROM   t2
) t

Out of curiosity, I wanted to see how the query behaves when temp tables (# tables) are used instead of the WITH clause in Redshift. For this I rewrote the queries as follows:

CREATE TABLE #t (col1 INT);
CREATE TABLE #t1(col1 INT);

INSERT INTO #t
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1 
UNION ALL
SELECT 30 AS col1;
 
INSERT INTO #t1
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1 
UNION ALL
SELECT 30 AS col1;
 
SELECT SUM(sumofcol1) AS Total
FROM   (
           SELECT SUM(col1) AS sumofcol1
           FROM   #t
           UNION ALL
           SELECT SUM(col1)
           FROM   #t1
       ) t

DROP TABLE #t;
DROP TABLE #t1;

After executing the query I got the expected result, which is "120".

Finally, Redshift generates this interesting aggregate query result when the WITH clause supplies logical, non-persisted rows, but it produces the correct result when the rows reside in an actual table on disk.




Amazon Redshift create drop table if exists

As part of maintenance and releasing new features, the "Redshift create drop table if exists" option became available in the database after Oct 16, 2014. Previously, it was a bit cumbersome to check the existence of a table prior to creating or dropping it.

The previous approach of Redshift create-drop table if exists

Querying the Redshift catalog or information schema tables was the only way to know whether a table exists. For example, in the past, the following extra query needed to be issued in Redshift to check the existence of a table:

SELECT TOP 1 tablename
FROM   pg_table_def t
WHERE  t.tablename = 'test';

Based on the result of the above query, new tables are created or existing tables are dropped as per the developer's requirement.

The current approach of Redshift create-drop table if exists

Amazon has recognized this essential user need and addressed it in the latest update. Table existence can now be checked as part of CREATE TABLE or DROP TABLE statements using the new IF [NOT] EXISTS keywords. The following statement creates a table only if it does not already exist.

CREATE TABLE IF NOT EXISTS Test (ID INT);

Similarly, IF EXISTS can be used with the DROP TABLE, DROP VIEW, DROP SCHEMA, and DROP USER DDL statements as well. Have a look at the query below:

DROP TABLE IF EXISTS test;

The query above will delete the table named "test" if it is present in the database. I hope this information will be helpful to you.
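For completeness, the same IF EXISTS pattern applies to the other object types mentioned above; a brief sketch (the object names here are hypothetical):

DROP VIEW IF EXISTS test_vw;
DROP SCHEMA IF EXISTS test_schema;
DROP USER IF EXISTS test_user;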

References:
http://docs.aws.amazon.com/redshift/latest/dg/r_DROP_TABLE.html
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
https://forums.aws.amazon.com/ann.jspa?annID=2668




Amazon Redshift – Identity column SEED-STEP behavior with COPY command.

In my previous post, I demonstrated how an identity column behaves with different INSERT statements. In today's post we will look at identity column SEED-STEP behavior with the COPY command – in other words, how identity values are generated when a file is loaded from S3 into a Redshift table that has an auto-increment identity column.

As we already know, an identity column does not guarantee that the values in the column will be consecutive in all cases. As stated in the earlier post, we often need consecutive or strictly sequential ID numbers, and in that scenario we need to take some additional actions to maintain the sequence of ID values and comply with the business need. In today's post we will see in which cases the identity column SEED-STEP with the COPY command does not generate consecutive identity values. We will also see some techniques to maintain the order of the identity values later in this post.

Environment to test Identity column SEED-STEP behavior with COPY command.

A table like below:

CREATE TABLE custorders
(
linenumber       INT IDENTITY(1,1),
customername     VARCHAR(50) NOT NULL,
orderid          INT NOT NULL,
orderdate        TIMESTAMP NOT NULL,
city             VARCHAR(20) NOT NULL,
unitprice        FLOAT NOT NULL,
quantity         INT NOT NULL,
shipregion       VARCHAR(20) NULL
);

Two text files and a manifest file in the S3 bucket to populate the above table using COPY command.

Figure 01: Test data and manifest files in S3.

The "order_2001.txt" file contains sales records of year 2001 and "order_2002.txt" has sales data of year 2002. In order to facilitate loading multiple data files in parallel with a single COPY command, I have created a manifest file named "order.manifest".

All testing is done on a single dw2.large node cluster. Our environment is ready; now let us jump to the main test.

Identity column SEED, STEP behavior with COPY command while uploading a single file.

In order to see the identity values generated in the above table, I executed the COPY command below to load the order_2002.txt file into an empty custorders table.

COPY custorders FROM 's3://standard-seeddata/order_2002.txt' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1; 

Let's see the identity values generated by the COPY command by selecting the top 100 records.

SELECT TOP 100 * 
FROM custorders
ORDER BY
 linenumber;

The above select statement produced the below results:

Figure 02: Identity values generated with COPY command.

From the above image we found the following discrepancies in identity column SEED-STEP behavior with the COPY command:

  • As we loaded the file into the empty custorders table, per the identity definition "linenumber INT IDENTITY(1,1)" the first value of the linenumber column should be 1, then 2, then 3, and so on. But the first value inserted is, surprisingly, 4312.
  • Furthermore, according to the identity STEP, each subsequent row's value should increase by one step, so the next value should be 4312 + 1 = 4313. But the values increase by 2 steps: the first row's linenumber is 4312, the second row's is 4314, the third's is 4316, and so on.

Why does the identity start from 4312 instead of 1?

In order to find the reason behind these issues of "identity column SEED-STEP behavior with COPY command", I had a look at the Redshift developer guide but did not find any clue. After a while I noticed that there are a total of 2156 records in the source text file (order_2002.txt), and the first generated identity value in the table is exactly twice the total record count: 2156*2 = 4312. So why does this happen? To find the answer, I tried the COPY command with different parameters. Finally, I got another clue when I ran the COPY command with the "COMPUPDATE OFF" option. The COPY command is now as follows:

COPY custorders FROM 's3://standard-seeddata/order_2002.txt' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1
COMPUPDATE OFF;

And the above command produces the below identity values:
Figure 03: Identity values generated with COPY with "COMPUPDATE OFF" parameter.

It produces a much better result now, right? In the previous attempt the identity value started from an arbitrary 4312, but this time at least it starts from 1. However, the identity values are still incremented by 2 steps instead of 1. We will discuss that issue later; for now, let me tell you why the first attempt – the COPY command without the "COMPUPDATE OFF" parameter – generated an initial identity value of 4312 instead of 1 (see Figure 02). Remember, 4312 is twice the total record count.

The "COMPUPDATE" parameter of the COPY command controls whether COPY automatically decides the optimal compression encodings for each column of the target table. If the COPY command is executed without the "COMPUPDATE" option, COPY applies automatic compression when loading into an empty target table. For this reason the COPY command completes in two steps (passes): during the first pass, identity values are incremented internally, and in the second pass they are incremented further while the records are committed to the target table. This could be the reason why we did not get the initial identity value we expected.

Why does the identity value increment by 2 instead of 1?

If you have a look at both "Figure 02" and "Figure 03", you will find the identity values are incremented by 2 instead of 1. I have a single node in my Redshift cluster, and the node has two slices. As per the Redshift developer guide, data is loaded into the slices in parallel when multiple files are loaded using a single COPY command, whereas COPY loads data sequentially when a single file is specified. During data load, COPY increments the identity in such a manner that the identity values of one slice do not overlap with the values of another slice. For this purpose the COPY command generates identity values on an odd/even basis, like below:

For slice one: 1, 3, 5…..n
For slice two: 2, 4, 6…..n

Now let us do another test to validate this. As stated earlier in the "Environment" section (see Figure 01), we have three files: order_2001.txt contains sales data of year 2001, order_2002.txt has sales data of year 2002, and a manifest file is used to upload both files with a single COPY command. The manifest file looks like this:

{
  "entries": [    
    {"url":"s3://standard-seeddata/order_2001.txt", "mandatory":true},
    {"url":"s3://standard-seeddata/order_2002.txt", "mandatory":true}
  ]
}

And the COPY command is as follows:

COPY custorders FROM 's3://standard-seeddata/order.manifest' 
credentials 'aws_access_key_id=;aws_secret_access_key=' delimiter '\t' IGNOREHEADER 1 manifest COMPUPDATE OFF;

After executing the above COPY command, I got the results below:
Figure 04: Identity values generated by COPY with manifest option.

Splendid!! The identity column SEED-STEP behavior with the COPY command is now as per our expectation: it generates an initial value of 1 and the subsequent values are incremented by 1 as well. There are no missing values and no gaps among the identity values. The per-slice identity allocation is also clear to us. Executing the following query gives more insight into the per-slice record and identity value distribution.

SELECT  slice,
         num_values  AS total_rows_per_slice,
         minvalue    AS min_identity_val,
         maxvalue    AS max_identity_val
 FROM   svv_diskusage
 WHERE  NAME = 'custorders'
         AND col = 0
 ORDER BY
         slice,col;

The above query generates the following output:
Figure 05: Per slice identity column distribution

From the above image, we get a clear idea of how Redshift determines slice-wise identity values. It seems the COPY command loads both files in parallel and allocates identity values as follows:

Loads "order_2002.txt" into slice one and allocates identities 1, 3, 5….. up to 4309
Loads "order_2001.txt" into slice two and allocates identities 2, 4, 6….. up to 4310

Finally, from the above test we learned that loading files in parallel using the COPY command with the "COMPUPDATE OFF" option may generate the expected identity values, in line with identity column SEED-STEP behavior. But in a different scenario (a multiple-node cluster) the identity column SEED-STEP behavior with the COPY command might be different. In that case, the ROW_NUMBER window function can be used to generate sequential values. Another option is to ask the file provider (or client) to include a line number in the file if this is a strict business need; the file can then be loaded using the COPY command with the "EXPLICIT_IDS" parameter into an identity-enabled Redshift table.
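As a sketch of the ROW_NUMBER approach mentioned above (reusing the custorders table from this post; the view name is hypothetical), a view can expose gap-free line numbers without touching the stored identity values:

CREATE VIEW custorders_vw AS
SELECT ROW_NUMBER() OVER (ORDER BY linenumber) AS rownum,
       customername,
       orderid,
       orderdate,
       city,
       unitprice,
       quantity,
       shipregion
FROM   custorders;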

References:
1. https://forums.aws.amazon.com/thread.jspa?messageID=564576
2. http://docs.aws.amazon.com/redshift/latest/dg/r_SVV_DISKUSAGE.html




Amazon Redshift – Identity column SEED-STEP behavior with different INSERT statements.

During table design, an auto-increment column is one of the most popular choices among database developers. In most cases an auto-increment identity column is used as the primary key of the table; it is also widely used as a surrogate key in dimension tables of a typical data warehouse system. The identity column's SEED and STEP are used to generate sequential values in the table: an identity column takes the value of the current seed incremented by the step when a row is inserted into the table.

Identity column SEED and STEP can be used with the CREATE TABLE statement in Amazon Redshift. Note that an identity column does not guarantee the values in the column will be consecutive in all cases. In real life we sometimes need consecutive or strictly sequential ID numbers, and in that scenario we need to adopt additional measures to maintain the sequence of ID values and comply with the business need. In today's post we will see with which kinds of INSERT statements the identity column SEED-STEP does not generate consecutive identity values. We will also see some techniques to maintain the order of the identity values.

Environment to test Identity column SEED-STEP behavior with different INSERT statements.

I have used a single dw2.large node cluster in my test environment and a table with an identity column like below:

CREATE TABLE custorders
(
 linenumber       INT IDENTITY(1,1),
 customername     VARCHAR(50) NOT NULL,
 orderid          INT NOT NULL,
 orderdate        TIMESTAMP NOT NULL,
 city             VARCHAR(20) NOT NULL,
 unitprice        FLOAT NOT NULL,
 quantity         INT NOT NULL,
 shipregion       VARCHAR(20) NULL
);

Identity column SEED-STEP behavior in different INSERT statements.

In my environment, the following approaches are used to test the identity column SEED, STEP behavior:

  • Inserting records using the traditional INSERT INTO (column1, column2…) VALUES statement (PART-1)
  • Inserting records using the multi-row INSERT approach (PART-2), like:
    INSERT INTO … VALUES
    (value1, value2),
    (value3, value4);
  • Inserting records using INSERT INTO … followed by multiple SELECT statements combined using UNION ALL (PART-3).

According to the above points, the following SQL scripts are written:

PART-1:

INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('A',1234,'01-01-2002','Dhaka',12,1,'DK');
INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('B',1235,'01-01-2002','Sylhet',12,2,'SL');
INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('C',1236,'01-01-2004','CTG',12,2,'CT');

PART-2:

INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES
('D',1234,'01-01-2002','Dhaka',12,1,'DK'),
('E',1235,'01-01-2002','Sylhet',12,2,'SL'),
('F',1236,'01-01-2004','CTG',12,2,'CT');

PART-3:

INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion)
SELECT 'G',1234,cast('01-01-2002' AS TIMESTAMP),'Dhaka',12,1,'DK'
UNION ALL
SELECT 'H',1235,'01-01-2002','Sylhet',12,2,'SL'
UNION ALL
SELECT 'I',1236,'01-01-2004','CTG',12,2,'CT';

These scripts are executed one part at a time: PART-1 first, then PART-2, and finally PART-3. The result is uninterrupted sequential identity values (line numbers). See the image below:
Figure 02: Identity column generates uninterrupted unique values.

Now, let us execute all insert scripts at once (PART-1 + 2 + 3). Prior to this test I dropped and recreated the "custorders" table. Executing all three SQL insert statement parts at once produces the following result:

Figure 03: Identity column SEED-STEP behavior using multiple insert statements.

From the above image we see that SQL script PART-1 and PART-2 honored the identity column's default SEED, STEP behavior (linenumber 1 to 6). But inserting records using INSERT INTO … followed by multiple SELECT statements combined with UNION ALL (PART-3) does not.

The possible reason why “Identity column SEED-STEP behavior” is different:

The possible reason for this issue could be parallelism. Several months ago, I combined records of more than 10 big tables using UNION ALL in order to build a PIVOT table in SQL Server. During query execution, I noticed all of my CPU cores (1 socket, 4 cores, 8 logical processors) were busy. To learn more I did a Google search and found that the optimizer might execute queries in parallel when multiple queries (branches) are combined into a single SQL statement. We know Redshift is designed for massively parallel processing. Although it is not documented anywhere, I presume Redshift attempts to insert each SELECT of the UNION ALL in parallel into each node slice. In order to ensure that there is no single point of conflict in assigning identity values to rows during parallel insertion, Redshift might leave gaps while allocating identity values for each row. This is only my thinking based on previous experience; please inform me if anyone has a more concrete idea about this behavior.

How to solve the issue:

There is no straightforward way to resolve this identity-gap issue. A new table can be created that copies all data from the existing table with a CREATE TABLE AS statement using the ROW_NUMBER window function, like below:

CREATE TABLE custorders_new AS(
 SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber,
  CustomerName,
  orderid,
  orderdate,
  city,
  unitprice,
  quantity,
  shipregion 
FROM custorders);
DROP TABLE custorders;
ALTER TABLE custorders_new RENAME TO custorders

Please note that the above approach may not be suitable because CREATE TABLE AS does not inherit the IDENTITY property from the parent table. In that case, creating a VIEW over the table using the same ROW_NUMBER window function would be the better choice. See below:

CREATE VIEW custorder_vw as 
SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber, 
 CustomerName,
 orderid,
 orderdate,
 city,
 unitprice,
 quantity,
 shipregion 
FROM custorders;

Finally, in this post we have had a look at identity column SEED-STEP behavior with different insert statements and some possible ways to resolve the issue. In one of my subsequent posts I will try to demonstrate the identity column SEED-STEP behavior when data files are uploaded using the Redshift COPY command. Until then, thanks a lot for reading this post patiently. I hope this post is informative to you.

References (UNION ALL in different databases):

1. http://docs.oracle.com/database/121/VLDBG/parallel003.htm
2. http://forums.teradata.com/forum/database/sql-statements-in-union-executes-serial-or-parallel




Amazon Redshift- CREATE TABLE AS vs CREATE TABLE LIKE

In one of my earlier posts, I discussed different approaches to creating tables in the Amazon Redshift database. Among these approaches, CREATE TABLE AS (CTAS) and CREATE TABLE LIKE are two widely used create table commands, and each has its own significance. In this post, the differences, usage scenarios, and similarities of both commands will be discussed.

Neither CREATE TABLE AS (CTAS) nor CREATE TABLE LIKE can create a table independently. In other words, both commands create a table by copying column settings – and, for CTAS only, records – from an existing table. Both commands can be used in the following scenario.

Sort key, distribution key and column null/not null behavior during table creation using CREATE TABLE AS and CREATE TABLE LIKE.

In Redshift, there is no way to add a sort key, distribution key, or certain other table properties to an existing table. The only way is to create a new table with the required sort key and distribution key and copy the data into that table. Using either the CREATE TABLE AS or the CREATE TABLE LIKE command, a table can be created with these table properties. For example:

The following command creates a new table with Sort Key, Distribution Key and inserts three rows into the table.

CREATE TABLE product(
 product_id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
 product_name VARCHAR(100) NOT NULL,
 category VARCHAR(20) NOT NULL DEFAULT 'Byke',
 store_location_id INT NOT NULL,
 reorder_level INT NULL 
)
DISTKEY(store_location_id)
SORTKEY(product_name);

INSERT INTO product(product_name,category,store_location_id,reorder_level) VALUES
('Phonix',default,1,10),
('Terix','Easy Byke',1,10),
('Hon','Motor Byke',2,20);

Now, to serve the business, we need to include "category" alongside the existing sort key product_name and also change the distribution key to product_id. Hence the relevant portion of the statement will be as follows:

DISTKEY (product_id),
SORTKEY(product_name,category)

As Redshift does not offer an ALTER TABLE option to modify these properties of an existing table, the only way to achieve this goal is by using a CREATE TABLE AS or CREATE TABLE LIKE statement. The CREATE TABLE AS statement below creates a new table named product_new_cats.

CREATE TABLE product_new_cats 
SORTKEY(product_name,category)
DISTKEY(product_id) AS
(SELECT * FROM product)

Now the following command is used to get the records of the new "product_new_cats" table.

SELECT * FROM product_new_cats

The result is as follows:
Figure 01: All records in product_new_cats

Now we are sure the CTAS statement copied all records from the product table into the product_new_cats table. But what about the sort key, distribution key, and other settings? Let's execute the following two commands:

SELECT tablename,
 "column",
 TYPE,
 distkey,
 sortkey,
 "notnull"
FROM pg_table_def
WHERE tablename = 'product';

SELECT tablename,
 "column",
 TYPE,
 distkey,
 sortkey,
 "notnull"
FROM pg_table_def
WHERE tablename = 'product_new_cats';

The above two commands return the two results below:
Figure 02: product table settings

Figure 03: product_new_cats table settings.

From the above two images, we can see CREATE TABLE AS successfully created the new sort and distribution keys while inheriting the column names and data types from the parent table. But one thing needs to be pointed out here: the CREATE TABLE AS command does not inherit the NOT NULL setting from the parent table. All columns of the product_new_cats table are created as nullable (see Figure 03), although all columns of the parent "product" table were declared NOT NULL (Figure 02).

Alright, so far we have an idea of how the CREATE TABLE AS command behaves. Now we will see what happens when we create a table using the CREATE TABLE LIKE statement. Let's execute the following scripts:

CREATE TABLE product_new_like (LIKE product);
SELECT * FROM product_new_like;

The first statement creates a table named "product_new_like" using CREATE TABLE LIKE, and the second selects all records from the newly created table. CREATE TABLE LIKE does not copy data from the source table, so the SELECT * command will not return any rows – but it does inherit the column settings. Let's execute the SQL statement below and have a look at the result:

SELECT tablename,
 "column",
 TYPE,
 distkey,
 sortkey,
 "notnull"
FROM pg_table_def
WHERE tablename = 'product_new_like';

Result:
Figure 04: Create table like settings

By comparing the output of "Figure 02" and "Figure 04" we see that the CREATE TABLE LIKE statement also inherits the sort key and distribution key. But the main point to note here is that CREATE TABLE LIKE additionally inherits the NOT NULL settings from the source table, which CREATE TABLE AS does not.

Primary key, identity and default constraint behavior when a table is created using CREATE TABLE AS and CREATE TABLE LIKE:

In order to check whether the CREATE TABLE AS and CREATE TABLE LIKE statements inherit primary key, default constraint, and identity settings from the source table, the following scripts can be executed.

 To check Primary key:

SELECT DISTINCT
     NAME AS tablename,
     pc.conname AS constraint_name
FROM stv_tbl_perm tp
     LEFT OUTER JOIN pg_constraint pc
          ON pc.conrelid = tp.id
WHERE name IN ('product', 'product_new_cats', 'product_new_like')

Figure 05: CTAS and LIKE do not inherit the primary key.

From the above image, we can see that neither CREATE TABLE AS nor CREATE TABLE LIKE inherits the primary key constraint from the source table.

To check Default and Identity:

SELECT DISTINCT p.name AS tablename,
       attname AS columnname,
       a.atthasdef t
FROM pg_attribute a
     INNER JOIN stv_tbl_perm p
          ON p.id = attrelid
WHERE name IN ('product','product_new_cats','product_new_like')
      AND a.atthasdef = 1

Figure 06: CTAS and LIKE do not inherit default constraints and identity.

The above query selects the default constraint and identity columns from all three tables (product, product_new_cats, product_new_like), but only the source table, product, is returned. CREATE TABLE AS and CREATE TABLE LIKE do not inherit default values or identity settings. CREATE TABLE LIKE does, however, have an option to copy the DEFAULT expressions from the source table by using INCLUDING DEFAULTS.
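A minimal sketch of that option, reusing the product table from this post (the new table name is hypothetical):

CREATE TABLE product_new_like_def (LIKE product INCLUDING DEFAULTS);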




Amazon Redshift- Create table in different ways

The Amazon Redshift database is getting more popular day by day, especially for its super fast performance. During database development or administration everyone needs to create tables, either permanent or temporary. Developers usually create tables using DDL, like the "CREATE TABLE" statement. But sometimes they need to create a table quickly, without issuing a long, restrictive DDL statement. In this post I will show some techniques to create new tables in a Redshift database.

Using DDL:

DDL is the most common approach to create tables in a Redshift database. A typical DDL command is as follows:

CREATE TABLE testtable
 (
 id INT NOT NULL identity(1,1),
 name VARCHAR(50)
 );

Using SELECT INTO:

SELECT INTO is another way to create new tables. SELECT INTO facilitates fast bulk data copying into a new table from an existing table in the easiest way, and it keeps the column names and data types the same as the source table. Let us see an example:

SELECT
 patientId,
 patientName,
 dischargedate
INTO adultpatients
FROM patients
WHERE age > 18;

The above statement creates a new table named "adultpatients" by selecting records from the patients table according to the WHERE clause.

Using CREATE TABLE LIKE:

When you want to create a table by copying the column names, data types, default constraints, and NOT NULL attributes from an existing table, CREATE TABLE LIKE is your best choice. The following statement creates a "product_backup" table by copying the settings from the existing "product" table.

CREATE TABLE product_backup (LIKE product)

Please note that CREATE TABLE LIKE does not copy data into the newly created table as SELECT INTO and CREATE TABLE AS do.

CREATE TABLE AS (CTAS):

Like SELECT INTO, a CREATE TABLE AS (CTAS) statement also creates a new table along with its data. In order to create a table using CTAS, you need to provide a query; the newly created table is populated with the data returned by the query defined in the CREATE TABLE AS statement. The statement below creates a table using CTAS syntax.

CREATE TABLE product_backup AS
 (
  SELECT
   product_id,
   product_name
  FROM product
 )

The above query creates the "product_backup" table based on the query provided inside the "AS" clause.

So, these are the commonly used approaches to create tables in Redshift for different scenarios. Hope this information is helpful. In my future posts, I plan to describe the in-depth functionality, similarities, and unique behaviors of each CREATE TABLE approach.