Amazon Redshift – What you need to think before defining primary key

Amazon Redshift suggests to define primary key or foreign key constraint wherever applicable. But they are information only. Redshift does not enforce these constraints. In other words, a column with primary key accepts duplicate values as well as a foreign key column also allows such a value that does not exists in the referenced table. So why does Redshift consider defining primary or foreign key as their best practice list? Because query optimizer uses them to choose the most suitable execution plan. But you need to be very careful while defining these constraint in your table. Let us see why with some real-life example.

Need to think before defining primary key:

As spoke earlier, it is fine to have a primary key in a Redshift table. But keep in mind it is your responsibility to make sure the column’s values are unique. Otherwise, you are likely to get wrong query result. An example gives you more insight.

I have created a tabled named employee using the script below:

CREATE TABLE employee
(
 emp_id       INT NOT NULL PRIMARY KEY,
 emp_name     VARCHAR(100) NOT NULL,
 dept_id      INT
);

Then executed the following script to insert some records in to it.

INSERT INTO employee VALUES(1, 'Foo',1);
INSERT INTO employee VALUES(2, 'Bar',1);
INSERT INTO employee VALUES(2, 'Bar',1)

You might have noticed, the emp_id is declared as primary key and in the insert scripts, ‘Bar’ is inserted twice having emp_id 2. In Redshift this is not a problem as Primary Key constraint is not enforced. Assume we have another table called “department” with following rows:
dept_rows
Figure-1: Department table.

Now, let’s execute the following query to see, employee name and department of employee Id = 2. The query looks like below:

SELECT DISTINCT e.emp_id,
       d.dept_name
FROM   employee e
       INNER JOIN dbo.department d
            ON  e.dept_id = d.dept_id
WHERE  e.emp_id = 2;

We know there are duplicate values for employee id 2. So “DISTINCT”  keyword is used to eliminate duplicate values from query result. Here is the output of the query:
wrong_result
Figure-2: Wrong query result.

See, it produces wrong result! The “DISTINCT” keyword is not honored at all! The reason behind that is Redshift assumes all values of a primary key column are inserted by ensuring the uniqueness. So database engine does not care to make them distinct during query execution. Let’s remove the primary key constraint from the table:

ALTER TABLE employee
DROP CONSTRAINT employee_pkey;

And execute the select statement once again. We are getting the expected result now!
corrent_result
Figure-3: Correct Result

So, you got an idea of what you need to think before defining primary key. And of course, always remember Redhsift developer guide’s following comments.

Do not define primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.

I my next post I will figure out the same thing for foreign key. Until then keep playing!!!




This entry was posted in Amazon Redshift, Redshift Database Development by Khorshed. Bookmark the permalink.

About Khorshed

Khorshed has 12 years of cumulative professional experience in areas of database administration, development, Business Intelligent, system appreciation, requirement analysis and software development (using .net technologies). Achieved MCITP in SQL Server in both SQL Server development and administration. He has vast knowledge in SQL Server, SSIS, SSRS, SSAS. He is also proficient in Amazon cloud services like Redshift, Data Pipeline, SNS, DynamoDB, EMR and many more.

5 thoughts on “Amazon Redshift – What you need to think before defining primary key

Leave a Reply

Your email address will not be published. Required fields are marked *

one + 7 =