Amazon Redshift – What you need to think before defining primary key

Amazon Redshift suggests defining primary key or foreign key constraint wherever applicable. But they are information only. Redshift does not enforce these constraints. In other words, a column with a primary key accepts duplicate values as well as a foreign key column also allows such a value that does not exist in the referenced table. So why does Redshift consider defining primary or foreign key as their best practice list? Because query optimizer uses them to choose the most suitable execution plan. But you need to be very careful while defining these constraints in your table. Let us see why with some real-life examples.

Need to think before defining primary key:

As specified earlier, it is fine to have a primary key in a Redshift table. But keep in mind it is your responsibility to make sure the column’s values are unique. Otherwise, you are likely to get the wrong query result. An example gives you more insight.

I have created a tabled named employee using the script below:

CREATE TABLE employee
(
 emp_id       INT NOT NULL PRIMARY KEY,
 emp_name     VARCHAR(100) NOT NULL,
 dept_id      INT
);

Then executed the following script to insert some records in to it.

INSERT INTO employee VALUES(1, 'Foo',1);
INSERT INTO employee VALUES(2, 'Bar',1);
INSERT INTO employee VALUES(2, 'Bar',1)

You might have noticed, the emp_id is declared as the primary key and in the insert scripts, ‘Bar’ is inserted twice having emp_id 2. In Redshift, this is not a problem as the Primary Key constraint is not enforced. Assume we have another table called “department” with the following rows:
dept_rows
Figure-1: Department table.

Now, let’s execute the following query to see, employee name and department of employee Id = 2. The query looks like below:

SELECT DISTINCT e.emp_id,
       d.dept_name
FROM   employee e
       INNER JOIN dbo.department d
            ON  e.dept_id = d.dept_id
WHERE  e.emp_id = 2;

We know there are duplicate values for employee id 2. So “DISTINCT”  keyword is used to eliminate duplicate values from the query results. Here is the output of the query:
wrong_result
Figure-2: Wrong query result.

See, it produces the wrong result! The “DISTINCT” keyword is not honored at all! The reason behind that is Redshift assumes all values of a primary key column are inserted by ensuring uniqueness. So the database engine does not care to make them distinct during query execution. Let’s remove the primary key constraint from the table:

ALTER TABLE employee
DROP CONSTRAINT employee_pkey;

And execute the select statement once again. We are getting the expected result now!
corrent_result
Figure-3: Correct Result

So, you got an idea of what you need to think before defining the primary key. And of course, always remember the Redshift developer guide’s following comments.

Do not define the primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.




This entry was posted in Amazon Redshift, Redshift Database Development by Khorshed. Bookmark the permalink.

About Khorshed

Khorshed has 12 years of cumulative professional experience in areas of database administration, development, Business Intelligent, system appreciation, requirement analysis and software development (using .net technologies). Achieved MCITP in SQL Server in both SQL Server development and administration. He has vast knowledge in SQL Server, SSIS, SSRS, SSAS. He is also proficient in Amazon cloud services like Redshift, Data Pipeline, SNS, DynamoDB, EMR and many more.

9 thoughts on “Amazon Redshift – What you need to think before defining primary key

  1. Good article.
    I think it is not a good idea to use DISTINCT in Redshift. Instead of that GROUP BY should be preferred. In your case, instead of DISTINCT, if you use GROUP BY to get the result, there will be no issues. You don’t even need to drop the constraint.
    Below is the query I used:
    SELECT e.emp_id,
    d.dept_name
    FROM scratchdb.employee e
    INNER JOIN scratchdb.department d
    ON e.dept_id = d.dept_id
    WHERE e.emp_id = 2
    group by e.emp_id,d.dept_name ;
    Output was:
    emp_id dept_name
    2 Dept-1

Leave a Reply

Your email address will not be published. Required fields are marked *