Amazon Redshift recommends defining primary key and foreign key constraints wherever applicable, but they are informational only; Redshift does not enforce them. In other words, a primary key column accepts duplicate values, and a foreign key column accepts values that do not exist in the referenced table. So why does Redshift still list defining primary and foreign keys among its best practices? Because the query optimizer uses them to choose the most suitable execution plan. You need to be very careful when defining these constraints on your tables, though. Let us see why with a real-life example.
Need to think before defining a primary key:
As mentioned earlier, it is fine to have a primary key on a Redshift table, but keep in mind that it is your responsibility to make sure the column's values are unique. Otherwise, you are likely to get wrong query results. An example will give you more insight.
I created a table named employee using the script below:
CREATE TABLE employee (
    emp_id   INT NOT NULL PRIMARY KEY,
    emp_name VARCHAR(100) NOT NULL,
    dept_id  INT
);
Then I executed the following script to insert some records into it.
INSERT INTO employee VALUES(1, 'Foo', 1);
INSERT INTO employee VALUES(2, 'Bar', 1);
INSERT INTO employee VALUES(2, 'Bar', 1);
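Note that the last statement repeats emp_id 2. A quick aggregate, shown here as a minimal sketch against the table just created, confirms that Redshift happily stored the duplicate:

SELECT emp_id, COUNT(*) AS row_count
FROM employee
GROUP BY emp_id
HAVING COUNT(*) > 1;
-- returns emp_id = 2 with row_count = 2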
So even though emp_id is declared as the primary key, 'Bar' is inserted twice with emp_id 2, and Redshift does not complain because the primary key constraint is not enforced. Assume we also have a table called "department" with the following rows:
Figure-1: Department table.
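The department rows appear only as a screenshot in the original post. A sketch that would reproduce the data used here looks like this (the column names are taken from the query below, and the single 'Dept-1' row is inferred from the final output):

CREATE TABLE department (
    dept_id   INT NOT NULL,
    dept_name VARCHAR(100) NOT NULL
);

INSERT INTO department VALUES(1, 'Dept-1');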
Now let's execute the following query to see the employee name and department for employee id 2:
SELECT DISTINCT e.emp_id, d.dept_name
FROM employee e
INNER JOIN department d ON e.dept_id = d.dept_id
WHERE e.emp_id = 2;
We know there are duplicate rows for employee id 2, so the DISTINCT keyword is used to eliminate duplicates from the query result. Here is the output of the query:
Figure-2: Wrong query result.
See, it produces the wrong result! The DISTINCT keyword is not honored at all. The reason is that Redshift assumes all values in a primary key column were inserted uniquely, so the planner treats the duplicate-elimination step as unnecessary and skips it during query execution. Let's remove the primary key constraint from the table:
ALTER TABLE employee DROP CONSTRAINT employee_pkey;
And execute the select statement once again. We are getting the expected result now!
Figure-3: Correct result.
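If you want to see what changed, you can also compare the query plans with EXPLAIN before and after dropping the constraint (a suggestion on my part, not something shown in the original post). Based on the behaviour described above, the duplicate-elimination step for DISTINCT should be missing from the plan while the primary key is declared:

EXPLAIN
SELECT DISTINCT e.emp_id, d.dept_name
FROM employee e
INNER JOIN department d ON e.dept_id = d.dept_id
WHERE e.emp_id = 2;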
So now you have an idea of what to think about before defining a primary key. And of course, always remember the following note from the Amazon Redshift developer guide:
“Do not define the primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.”
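If you still want the optimizer benefit of a declared primary key, the uniqueness guarantee has to come from your load process. One common pattern, shown here only as a sketch (the employee_staging table is hypothetical and not part of the original post), is to deduplicate with ROW_NUMBER() when copying rows out of a staging table:

-- Keep exactly one row per emp_id while moving data from a staging table
-- into the table that declares the (unenforced) primary key.
INSERT INTO employee (emp_id, emp_name, dept_id)
SELECT emp_id, emp_name, dept_id
FROM (
    SELECT emp_id,
           emp_name,
           dept_id,
           ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY emp_name) AS rn
    FROM employee_staging
) dedup
WHERE rn = 1;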
I think it is not a good idea to use DISTINCT in Redshift; GROUP BY should be preferred instead. In your case, if you use GROUP BY instead of DISTINCT to get the result, there is no issue. You don't even need to drop the constraint.
Below is the query I used:
SELECT e.emp_id,
       d.dept_name
FROM scratchdb.employee e
INNER JOIN scratchdb.department d
    ON e.dept_id = d.dept_id
WHERE e.emp_id = 2
GROUP BY e.emp_id, d.dept_name;
Output was:
emp_id dept_name
2 Dept-1