Amazon Redshift – Getting the number of records affected by the last query executed in the current session

We often need to know the number of records affected by an INSERT or DELETE query in the current session. Popular DBMSs such as SQL Server and MySQL offer built-in mechanisms for this: SQL Server has the @@ROWCOUNT global variable, and MySQL has the ROW_COUNT() function. Redshift does not offer such a built-in function, but by querying the system tables we can get this information easily.
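For comparison, here is a sketch of how those two engines expose the count (the table name and filter are placeholders for illustration):

```sql
-- SQL Server: @@ROWCOUNT holds the rows affected by the last statement.
DELETE FROM orders WHERE status = 'cancelled';
SELECT @@ROWCOUNT AS affected_rows;

-- MySQL: ROW_COUNT() returns the same information.
DELETE FROM orders WHERE status = 'cancelled';
SELECT ROW_COUNT() AS affected_rows;
```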

Getting the number of records affected by the last query executed in the current session using system tables

To get this number we need the following Redshift system tables:

STV_SESSIONS: contains the active user sessions for Amazon Redshift.
STL_QUERY: contains execution information about database queries.
STL_INSERT: contains the insert execution steps for INSERT queries.

To get the number of records affected by the last DELETE executed in the current session, we need one more system table, STL_DELETE. Now that we are familiar with the required system tables, it is time to build the query:

WITH inserted_result AS (
         SELECT i.tbl        AS tableid,
                SUM(i.rows)  AS total_affected_rows
         FROM   STV_SESSIONS s
                INNER JOIN stl_query q
                     ON  s.process = q.pid
                INNER JOIN stl_insert i
                     ON  i.query = q.query
         WHERE  i.rows > 0
         GROUP BY
                i.tbl
     )

SELECT DISTINCT t.name AS tablename,
       total_affected_rows
FROM   inserted_result ir
       INNER JOIN STV_TBL_PERM t
            ON  t.id = ir.tableid;

Testing the query

To test the query, we will create an empty table with the same structure as Redshift's STV_TBL_PERM system table, using the command below:

CREATE TABLE tmp_redshift_tableinfo (LIKE STV_TBL_PERM);

Open a new query editor and execute the scripts together, as in the image below.


Figure: Redshift Record affect by last query

The above scripts insert 100 rows from the STV_TBL_PERM system table into the newly created tmp_redshift_tableinfo table. Afterwards, our query returns the number of records affected by the last query executed in the current session. We can get the same information for a DELETE query simply by using the STL_DELETE table in place of STL_INSERT.
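For DELETE statements, a sketch of the same query with STL_DELETE swapped in, following the same session and query joins as above:

```sql
-- Sketch: rows deleted by queries in the current sessions, per table.
WITH deleted_result AS (
         SELECT d.tbl        AS tableid,
                SUM(d.rows)  AS total_affected_rows
         FROM   STV_SESSIONS s
                INNER JOIN stl_query q
                     ON  s.process = q.pid
                INNER JOIN stl_delete d
                     ON  d.query = q.query
         WHERE  d.rows > 0
         GROUP BY
                d.tbl
     )

SELECT DISTINCT t.name AS tablename,
       total_affected_rows
FROM   deleted_result dr
       INNER JOIN STV_TBL_PERM t
            ON  t.id = dr.tableid;
```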

Hope this helps. Enjoy!!!

Amazon Redshift – What you need to think about before defining a primary key

Amazon Redshift suggests defining primary key or foreign key constraints wherever applicable. But they are informational only; Redshift does not enforce these constraints. In other words, a primary key column accepts duplicate values, and a foreign key column allows values that do not exist in the referenced table. So why does Redshift include defining primary or foreign keys in its best-practice list? Because the query optimizer uses them to choose the most suitable execution plan. But you need to be very careful when defining these constraints on your tables. Let us see why with a real-life example.

What to think about before defining a primary key:

As mentioned earlier, it is fine to have a primary key in a Redshift table. But keep in mind that it is your responsibility to make sure the column's values are unique. Otherwise, you are likely to get wrong query results. An example gives more insight.

I have created a table named employee using the script below:

CREATE TABLE employee
(
 emp_id       INT NOT NULL PRIMARY KEY,
 emp_name     VARCHAR(100) NOT NULL,
 dept_id      INT
);

Then I executed the following script to insert some records into it.

INSERT INTO employee VALUES(1, 'Foo',1);
INSERT INTO employee VALUES(2, 'Bar',1);
INSERT INTO employee VALUES(2, 'Bar',1);

You might have noticed that emp_id is declared as the primary key, and in the insert scripts ‘Bar’ is inserted twice with emp_id 2. In Redshift this is not a problem, as the primary key constraint is not enforced. Assume we have another table called “department” with the following rows:
Figure-1: Department table.

Now, let’s execute the following query to see the name and department of the employee with emp_id = 2:

SELECT DISTINCT e.emp_id,
       d.dept_name
FROM   employee e
       INNER JOIN department d
            ON  e.dept_id = d.dept_id
WHERE  e.emp_id = 2;

We know there are duplicate values for employee id 2, so the “DISTINCT” keyword is used to eliminate duplicates from the query result. Here is the output of the query:
Figure-2: Wrong query result.

See, it produces the wrong result! The “DISTINCT” keyword is not honored at all! The reason is that Redshift assumes all values in a primary key column were inserted with uniqueness already ensured, so the database engine does not bother to de-duplicate them during query execution. Let’s remove the primary key constraint from the table:

ALTER TABLE employee
DROP CONSTRAINT employee_pkey;

And execute the select statement once again. We get the expected result now!
Figure-3: Correct Result
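Because Redshift leaves uniqueness up to you, it is worth auditing a supposed primary key column from time to time. A minimal sketch against the employee table above:

```sql
-- Any row returned here means the "primary key" column contains duplicates.
SELECT emp_id,
       COUNT(*) AS occurrences
FROM   employee
GROUP BY emp_id
HAVING COUNT(*) > 1;
```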

So now you have an idea of what to think about before defining a primary key. And of course, always remember the following comment from the Redshift developer guide.

Do not define primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.

In my next post I will explore the same thing for foreign keys. Until then keep playing!!!




What you must do after deleting a large number of rows from a Redshift table

Who wants to delete rows after inserting them into tables? But to comply with business needs, we delete unnecessary records from tables. After deleting a large number of rows, we usually expect the disk space consumed by these rows to be released automatically. A popular RDBMS like SQL Server marks the rows affected by a DELETE command as ghost records rather than physically deleting them from disk at once; afterwards, an asynchronous internal process removes these ghost records from disk. Amazon Redshift does not offer such a facility. You need to do it yourself. In this post you will learn what you must do after deleting a large number of rows from a Redshift table.

The issue you may face after deleting a large number of rows from a Redshift table

Database developers sometimes query the system catalog tables to get the total row count of a large table, because the catalog responds faster. In Redshift this approach can be dangerous, because after a delete operation Redshift removes records from the table but does not update the catalog table metadata. Let us see an example:

I have populated a table named “lineorder” with AWS sample records. The table contains around 600 million rows. Now let us use the following catalog table query to get the total record count of the table.

SELECT datname,
        nspname,
        relname,
        SUM(ROWS) AS ROWS
FROM   pg_class,
       pg_namespace,
       pg_database,
       stv_tbl_perm
WHERE  pg_namespace.oid = relnamespace
   AND pg_class.oid = stv_tbl_perm.id
   AND pg_database.oid = stv_tbl_perm.db_id
   AND trim(datname) = 'database name here'
   AND trim(relname) = 'lineorder'
GROUP BY
 datname, nspname, relname
ORDER BY
 datname, nspname, relname;

After executing the above query, we get the following result:
Figure 01: Table record count using catalog query.

Now let us verify the record count by issuing the query below:

SELECT COUNT(*)
FROM  lineorder;

Figure 02: Getting row count by querying on the table itself.

Alright, we get identical results from both queries.

The “lineorder” table has a total of 248900 rows with order date ‘19920712’. We will delete all of these rows from the table. For this, the following DELETE statement is executed:

DELETE
FROM   lineorder
WHERE  lo_orderdate = '19920712';

And here is the result.
Figure 03: 248900 rows are deleted.

At this moment the total record count of the table will be 600037902 - 248900 = 599789002.
Let us verify this with the SQL statement below once again:

SELECT COUNT(*)
FROM lineorder;

Here is the result:
Figure 04: Record count after deleting records.

Alright, the result confirms that after the delete query we have 599789002 rows left in the table. Now let’s execute the catalog query once again to get the record count of the table:

SELECT datname,
       nspname,
       relname,
       SUM(ROWS) AS ROWS
FROM   pg_class,
       pg_namespace,
       pg_database,
       stv_tbl_perm
WHERE  pg_namespace.oid = relnamespace
   AND pg_class.oid = stv_tbl_perm.id
   AND pg_database.oid = stv_tbl_perm.db_id
   AND trim(datname) = 'your database name here'
   AND trim(relname) = 'lineorder'
GROUP BY
 datname, nspname, relname
ORDER BY
 datname, nspname, relname;

Query returns the below result:

Figure 05: Catalog tables are not refreshed after deleting large number of rows.

Oh!! The record count returned by the catalog query is 600037902, which was the initial record count of the table (before the delete)! But the total should be 599789002, as we have already deleted 248900 rows. That means the catalog table’s metadata is not refreshed after records are deleted from the table.

This could be a huge problem if you use a catalog query in your application to get the total row count of a table. It is a common practice for database developers to query the system (catalog) tables for the record count of relatively large tables instead of issuing a SELECT COUNT(*) statement, because querying a catalog table is much faster than querying the actual table. But in Redshift this approach can be dangerous, as shown above!

So, what must you do after deleting a large number of rows from a Redshift table? The answer is to run the VACUUM command. I executed it like below:

VACUUM lineorder;

When VACUUM finished, I executed the catalog query again to get the total record count of the table. The image below depicts the result:

Figure 06: Accurate record count after running VACUUM command.

Now the catalog query returns an accurate result as well.

Finally, the VACUUM command also reclaims storage space and re-sorts rows according to the sort key defined on the table. It is also a best practice to ANALYZE a Redshift table after deleting a large number of rows, to keep the table statistics up to date. Updated statistics ensure faster query execution.
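The maintenance sequence described above can be sketched as:

```sql
VACUUM lineorder;   -- reclaim space, re-sort rows, refresh catalog row counts
ANALYZE lineorder;  -- update planner statistics after the large delete
```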

Hope this information will help you in your real life Redshift development.

Note: VACUUM is a slow and resource-intensive operation. Best practice is to execute it in an off-peak time window. So if you really need an immediate record count after a delete operation, it is better to use SELECT COUNT(*) instead of the catalog query.
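If you only need to reclaim the space of deleted rows and can skip the re-sort phase, Redshift also offers a lighter-weight variant:

```sql
-- DELETE ONLY reclaims space from deleted rows without re-sorting,
-- typically finishing faster than a full vacuum.
VACUUM DELETE ONLY lineorder;
```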




Interesting aggregate query result using WITH clause in Redshift

UPDATE: This issue has been fixed in Redshift now.

Amazon Redshift does not support table variables. For quick testing, we sometimes do not want to store data in a real table; instead we declare a table variable and complete our testing. This approach also saves us from cleaning up unused test tables and records. Redshift supports Common Table Expressions, which we can use like a temp table (with some limitations). Using a WITH block (CTE), multiple logical tables can be generated by combining arbitrary SELECT statements, one per logical expression. For more information, see the WITH Clause in the Redshift Developer Guide. In a recent project, while doing a quick test generating multiple logical tables with some records, I noticed an interesting aggregate query result using the WITH clause in Redshift. Let me show you the details.

Generating tables to see the interesting aggregate query result using the WITH clause in Redshift:

I have written a script like below:

WITH t1 AS
(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
),
t2 AS

(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
)

I have generated table expressions “t1” and “t2” just to run an aggregate query that takes the SUM of all “col1” values. So my final query is as follows:

WITH t1 AS
(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
),
t2 AS

(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
)

SELECT SUM(sumofcol1) AS Total
FROM(
  SELECT SUM(col1) AS sumofcol1
  FROM   t1
  UNION ALL
  SELECT SUM(col1)
  FROM   t2
) t

After executing the above query, surprisingly I got the result “10”, but the result is supposed to be 120. See the image below:

Figure 01: Incorrect query result.

To confirm the result, I executed the query in a SQL Server environment and got the expected output of 120. Have a look:

Figure 02: Correct Query result in SQL Server Management Studio.

So, a really interesting aggregate query result using the WITH clause in Redshift! I don’t know why it happens; I will be grateful if anyone can tell me the reason for this query behavior.

How to solve it.

In order to resolve the issue, I tried executing the query with “SUM” removed from the innermost subquery, and it gave the correct result! Have a look at the corrected version below:

WITH t1 AS
(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
),
t2 AS

(
 SELECT 10 AS col1
 UNION ALL
 SELECT 20 AS col1
 UNION ALL
 SELECT 30 AS col1
)

SELECT SUM(sumofcol1) AS Total
FROM(
  SELECT col1 AS sumofcol1
  FROM   t1
  UNION ALL
  SELECT col1
  FROM   t2
) t

Out of curiosity, I wanted to see how the query behaves when temp tables (# tables) are used instead of the WITH clause in Redshift. For this I rewrote the queries as follows:

CREATE TABLE #t (col1 INT);
CREATE TABLE #t1(col1 INT);

INSERT INTO #t
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1 
UNION ALL
SELECT 30 AS col1;
 
INSERT INTO #t1
SELECT 10 AS col1
UNION ALL
SELECT 20 AS col1 
UNION ALL
SELECT 30 AS col1;
 
SELECT SUM(sumofcol1) AS Total
FROM   (
           SELECT SUM(col1) AS sumofcol1
           FROM   #t
           UNION ALL
           SELECT SUM(col1)
           FROM   #t1
       ) t

DROP TABLE #t;
DROP TABLE #t1;

After executing the query I got the expected result, which is “120”.

Finally, Redshift generated this interesting aggregate query result using the WITH clause with logical, non-persisted rows, but it produces the correct result using rows that reside in an actual table on disk.




Amazon Redshift create drop table if exists

As part of maintenance and releasing new features, the “create/drop table if exists” option became available in Redshift after Oct 16, 2014. Previously, it was a bit cumbersome to check the existence of a table before creating or dropping it.

Previous approach of Redshift create drop table if exists

Querying the Redshift catalog or information schema tables was the only way to know whether a table existed. For example, in the past, the following extra query needed to be issued to check the existence of a table in Redshift:

SELECT TOP 1 tablename
FROM   pg_table_def t
WHERE  t.tablename = 'test';

Based on the result of the above query, new tables are created or existing tables are dropped per the developer’s requirement.

Current approach of Redshift create drop table if exists

Amazon recognized this essential user need and addressed it in their latest update. Table existence can now be checked within CREATE TABLE or DROP TABLE statements using the new IF [NOT] EXISTS keywords. The following statement creates a table only if it does not already exist.

CREATE TABLE IF NOT EXISTS Test (ID INT);

Similarly, IF EXISTS can be used with the DROP VIEW, DROP SCHEMA and DROP USER DDL statements as well. Have a look at the query below:

DROP TABLE IF EXISTS test;

The query above deletes the table named “test” if it is present in the database. I hope this information will be helpful to you.
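The same pattern applies to the other object types mentioned above (the object names here are placeholders):

```sql
DROP VIEW   IF EXISTS test_view;
DROP SCHEMA IF EXISTS test_schema;
DROP USER   IF EXISTS test_user;
```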

References:
http://docs.aws.amazon.com/redshift/latest/dg/r_DROP_TABLE.html
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
https://forums.aws.amazon.com/ann.jspa?annID=2668




Amazon Redshift – Identity column SEED-STEP behavior with different INSERT statements.

During table design, an auto-increment column is one of the most common choices among database developers. In most cases, an auto-increment identity column is considered the primary key of the table. It is also widely used as a surrogate key in the dimension tables of a typical data warehouse system. The identity column’s SEED and STEP are used to generate sequential values: an identity column takes the value of the current seed incremented by the step when a row is inserted into the table.

Identity column SEED, STEP can be used in a CREATE TABLE statement in Amazon Redshift. Note that an identity column does not guarantee the values in the column will be consecutive in all cases. In real life, we sometimes need consecutive or strictly sequential ID numbers; in that scenario, we need to adopt additional measures to maintain the sequence of ID values to comply with the business need. In today’s post we will see in which cases of INSERT statements the identity column SEED-STEP does not generate consecutive identity values. We will also see some techniques to maintain the order of the identity values.

Environment to test Identity column SEED-STEP behavior with different INSERT statements.

I have used a single dw2.large node cluster in my test environment, and a table with an identity column like below:

CREATE TABLE custorders
(
 linenumber       INT IDENTITY(1,1),
 customername     VARCHAR(50) NOT NULL,
 orderid          INT NOT NULL,
 orderdate        TIMESTAMP NOT NULL,
 city             VARCHAR(20) NOT NULL,
 unitprice        FLOAT NOT NULL,
 quantity         INT NOT NULL,
 shipregion       VARCHAR(20) NULL
);

Identity column SEED-STEP behavior in different INSERT statements.

In my environment, the following approaches were used to test the identity column SEED, STEP behavior:

  • Inserting records using the traditional INSERT INTO (column1, column2…) VALUES statement (PART-1)
  • Inserting records using the multi-row INSERT approach (PART-2), like:
    INSERT INTO … VALUES
    (value1, value2),
    (value3, value4);
  • Inserting records using INSERT INTO … followed by multiple SELECT statements combined with UNION ALL (PART-3)

According to the above points, the following SQL scripts were written:

PART-1:

INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('A',1234,'01-01-2002','Dhaka',12,1,'DK');
INSERT INTO custorders(CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('B',1235,'01-01-2002','Sylhet',12,2,'SL');
INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES ('C',1236,'01-01-2004','CTG',12,2,'CT');

PART-2:

INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion) VALUES
('D',1234,'01-01-2002','Dhaka',12,1,'DK'),
('E',1235,'01-01-2002','Sylhet',12,2,'SL'),
('F',1236,'01-01-2004','CTG',12,2,'CT');

PART-3:

INSERT INTO custorders (CustomerName,orderid,orderdate,city,unitprice,quantity,shipregion)
SELECT 'G',1234,cast('01-01-2002' AS TIMESTAMP),'Dhaka',12,1,'DK'
UNION ALL
SELECT 'H',1235,'01-01-2002','Sylhet',12,2,'SL'
UNION ALL
SELECT 'I',1236,'01-01-2004','CTG',12,2,'CT';

These scripts were executed one part at a time: PART-1 first, then PART-2 and finally PART-3. The result produces uninterrupted sequential identity values (line numbers). See the image below:
Figure 02: Identity column generates uninterrupted unique values.

Now, let us execute all the insert scripts at once (PART-1 + 2 + 3). Prior to this test, I dropped and recreated the “custorders” table. Executing all three SQL insert statement parts at once produces the following result:

Figure 03: Identity column SEED-STEP behavior using multiple insert statements.

From the above image we see that the SQL scripts in PART-1 and PART-2 honored the identity column’s default SEED, STEP behavior (linenumber 1 to 6). But inserting records using INSERT INTO … followed by multiple SELECT statements combined with UNION ALL (PART-3) did not.

The possible reason why the identity column SEED-STEP behavior differs:

The possible reason for this behavior could be parallelism. Several months ago, I combined records of more than 10 big tables using UNION ALL in order to build a PIVOT table in SQL Server. During query execution, I noticed all of my CPU cores (1 socket, 4 cores, 8 logical processors) were busy. Searching further, I found that an optimizer may execute queries in parallel when multiple queries (branches) are combined into a single SQL statement. We know Redshift is designed for massively parallel processing. Although it is not documented anywhere, I presume Redshift attempts to insert each SELECT of the UNION ALL in parallel into each node slice. To ensure there is no single point of conflict in assigning identity values to rows during parallel insertion, Redshift might leave gaps while allocating an identity value for each row. This is only my thought based on previous experience; please let me know if anyone has a more concrete idea about this behavior.

How to solve the issue:

There is no straightforward way to resolve this identity gap issue. A new table can be created, copying all data from the existing table, with a CREATE TABLE AS statement using the ROW_NUMBER window function, like below:

CREATE TABLE custorders_new AS(
 SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber,
  CustomerName,
  orderid,
  orderdate,
  city,
  unitprice,
  quantity,
  shipregion 
FROM custorders);
DROP TABLE custorders;
ALTER TABLE custorders_new RENAME TO custorders;

Please note that the above approach may not be suitable, because CREATE TABLE AS does not inherit the IDENTITY property from the parent table. In that case, creating a VIEW over the table using the same ROW_NUMBER window function would be the better choice. See below:

CREATE VIEW custorder_vw as 
SELECT ROW_NUMBER() OVER(ORDER BY linenumber) as linenumber, 
 CustomerName,
 orderid,
 orderdate,
 city,
 unitprice,
 quantity,
 shipregion 
FROM custorders;

Finally, in this post we have had a look at identity column SEED-STEP behavior in different INSERT statements and some possible ways to work around it. In one of my subsequent posts I will try to demonstrate the identity column SEED-STEP behavior when data files are loaded using the Redshift COPY command. Until then, thanks a lot for reading this post. Hope it is informative to you.

References (UNION ALL in different databases):

1. http://docs.oracle.com/database/121/VLDBG/parallel003.htm
2. http://forums.teradata.com/forum/database/sql-statements-in-union-executes-serial-or-parallel




Amazon Redshift – CREATE TABLE AS vs CREATE TABLE LIKE

In one of my earlier posts, I discussed different approaches to creating tables in the Amazon Redshift database. Among these approaches, CREATE TABLE AS (CTAS) and CREATE TABLE LIKE are two widely used create table commands. Each command has its own significance. In this post, the differences, usage scenarios and similarities of both commands will be discussed.

Neither CREATE TABLE AS (CTAS) nor CREATE TABLE LIKE can create a table independently. In other words, both commands create a table by copying column settings, and records (CTAS only), from an existing table. Both commands can be used in the following scenario.

Sort key, distribution key and column NULL/NOT NULL behavior during table creation using CREATE TABLE AS and CREATE TABLE LIKE

In Redshift, there is no way to add a sort key, distribution key or certain other table properties to an existing table. The only way is to create a new table with the required sort key and distribution key and copy the data into that table. Using either the CREATE TABLE AS or CREATE TABLE LIKE command, a table can be created with these table properties. For example:

The following commands create a new table with a sort key and distribution key and insert three rows into it.

CREATE TABLE product(
 product_id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
 product_name VARCHAR(100) NOT NULL,
 category VARCHAR(20) NOT NULL DEFAULT 'Byke',
 store_location_id INT NOT NULL,
 reorder_level INT NULL 
)
DISTKEY(store_location_id)
SORTKEY(product_name);

INSERT INTO product(product_name,category,store_location_id,reorder_level) VALUES
('Phonix',default,1,10),
('Terix','Easy Byke',1,10),
('Hon','Motor Byke',2,20);

Now, to serve the business, we need to include “category” in the sort key along with the existing product_name, and we also want to change the distribution key to product_id. Hence the statement portion will be as follows:

DISTKEY (product_id),
SORTKEY(product_name,category)

As Redshift does not offer an ALTER TABLE statement to modify these properties of an existing table, the only way to achieve this is with a CREATE TABLE AS or LIKE statement. The CREATE TABLE AS statement below creates a new table named product_new_cats.

CREATE TABLE product_new_cats 
SORTKEY(product_name,category)
DISTKEY(product_id) AS
(SELECT * FROM product);

Now the following command is used to get the records of the new “product_new_cats” table.

SELECT * FROM product_new_cats;

The result is as follows:

Figure 01: All records in product_new_cats.

Now we can be sure the CTAS statement copied all records from the product table into the product_new_cats table. But what about the sort key, distribution key and other settings? Let’s execute the following two commands:

SELECT tablename,
 "column",
 TYPE,
 distkey,
 sortkey,
 "notnull"
FROM pg_table_def
WHERE tablename = 'product';

SELECT tablename,
 "column",
 TYPE,
 distkey,
 sortkey,
 "notnull"
FROM pg_table_def
WHERE tablename = 'product_new_cats';

The above two commands return the two result sets below:

Figure 02: product table settings.

Figure 03: product_new_cats table settings.

From the above two images, we found that CREATE TABLE AS successfully applied the new sort and distribution keys specified in the statement. But one thing needs pointing out here: the CREATE TABLE AS command does not inherit the “NOT NULL” setting from the parent table. All columns of the product_new_cats table are created as nullable (see Figure 03), while all columns of the parent “product” table were declared “NOT NULL” (Figure 02).

Alright, so far we have an idea of how the “CREATE TABLE AS” command behaves. Now let’s see what happens when we create a table using the “CREATE TABLE LIKE” statement. Let’s execute the following scripts:

CREATE TABLE product_new_like (LIKE product);
SELECT * FROM product_new_like;

The above statements create a table named “product_new_like” using the CREATE TABLE LIKE statement, then select all records from the newly created table. CREATE TABLE LIKE does not copy data from the source table, so the SELECT * command will not return any rows. But it does inherit column settings. Let’s execute the SQL statement below and have a look at the result:

SELECT tablename,
 "column",
 TYPE,
 distkey,
 sortkey,
 "notnull"
FROM pg_table_def
WHERE tablename = 'product_new_like';

Result:

Figure 04: CREATE TABLE LIKE settings.

By comparing the output of “Figure 02” and “Figure 04”, we see the CREATE TABLE LIKE statement also inherits the sort key and distribution key. The main point to note here is that CREATE TABLE LIKE additionally inherits the “NOT NULL” settings from the source table, which CREATE TABLE AS does not.

Primary key, identity and default constraint behavior when a table is created using CREATE TABLE AS and CREATE TABLE LIKE:

To check whether the CREATE TABLE AS and CREATE TABLE LIKE statements inherit the primary key, default constraint and identity settings from the source table, the following scripts can be executed.

To check the primary key:

SELECT DISTINCT
     NAME AS tablename,
     pc.conname AS constraint_name
FROM stv_tbl_perm tp
     LEFT OUTER JOIN pg_constraint pc
          ON pc.conrelid = tp.id
WHERE name IN ('product', 'product_new_cats', 'product_new_like');

Figure 05: CTAS and LIKE do not inherit the primary key.

From the above image, we can see that neither CREATE TABLE AS nor CREATE TABLE LIKE inherits the primary key constraint from the source table.

To check default and identity:

SELECT DISTINCT p.name AS tablename,
       attname AS columnname,
       a.atthasdef AS has_default
FROM pg_attribute a
     INNER JOIN stv_tbl_perm p
          ON p.id = attrelid
WHERE name IN ('product','product_new_cats','product_new_like')
      AND a.atthasdef = 1

Figure 06: CTAS and LIKE do not inherit default constraints or identity.

The above query selects the default constraint and identity columns from all three tables (product, product_new_cats, product_new_like). But only the source table, product, is returned. CREATE TABLE AS and CREATE TABLE LIKE do not inherit default values or identity settings. CREATE TABLE LIKE does, however, have an option to copy the “DEFAULT” expression from the source table by using “INCLUDING DEFAULTS”.
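A sketch of that option, reusing the product table from this post (the new table name is a placeholder):

```sql
-- INCLUDING DEFAULTS makes the new table copy column DEFAULT expressions
-- (e.g. the 'Byke' default on category) from the source table.
CREATE TABLE product_with_defaults (LIKE product INCLUDING DEFAULTS);
```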




Amazon Redshift – Create tables in different ways

The Amazon Redshift database is getting more popular day by day, especially for its super-fast performance. During database development or administration, everyone needs to create tables, either permanent or temporary. Developers usually create tables using DDL such as the “CREATE TABLE” statement. But sometimes they need to create a table quickly, without issuing the long, restrictive DDL statements. In this post I will show some techniques to create new tables in a Redshift database.

Using DDL:

DDL is the most common approach to create tables in a Redshift database. A typical DDL command is as follows:

CREATE TABLE testtable
 (
 id INT NOT NULL identity(1,1),
 name VARCHAR(50)
 );

Using SELECT INTO:

SELECT INTO is another way to create new tables. It facilitates fast bulk data copying into a new table from an existing table in the easiest way, and keeps the column names and data types the same as the source table. Let us see an example:

SELECT
 patientId,
 patientName,
 dischargedate
INTO adultpatients
FROM patients
WHERE age > 18;

The above statement creates a new table named “adultpatients” by selecting records from the patients table, applying the WHERE clause.

Using CREATE TABLE LIKE:

When you want to create a table by copying the column names, data types, default constraints and NOT NULL attributes from an existing table, CREATE TABLE LIKE is your best choice. The following statement creates a “product_backup” table by copying settings from the existing “product” table.

CREATE TABLE product_backup (LIKE product);

Please note that CREATE TABLE LIKE does not copy data to the newly created table, as the SELECT INTO and CREATE TABLE AS commands do.

CREATE TABLE AS (CTAS):

Like SELECT INTO, CREATE TABLE AS (CTAS) statements also create a new table along with its data. In order to create a table using the CTAS command, you need to provide a query; the newly created table is populated with data as per that query. The statement below creates a table using the CTAS syntax.

CREATE TABLE product_backup AS
 (
  SELECT
   product_id,
   product_name
  FROM product
 )

The above query creates the “product_backup” table based on the query provided inside the “AS” clause.

So, these are the commonly used approaches to create tables in Redshift for different scenarios. Hope this information is helpful. In future posts, I plan to describe the in-depth functionality, similarities and unique behaviors of each CREATE TABLE approach.