Amazon Redshift suggests defining primary key or foreign key constraint wherever applicable. But they are information only. Redshift does not enforce these constraints. In other words, a column with a primary key accepts duplicate values as well as a foreign key column also allows such a value that does not exist in the referenced table. So why does Redshift consider defining primary or foreign key as their best practice list? Because query optimizer uses them to choose the most suitable execution plan. But you need to be very careful while defining these constraints in your table. Let us see why with some real-life examples.
Need to think before defining primary key:
As specified earlier, it is fine to have a primary key in a Redshift table. But keep in mind it is your responsibility to make sure the column’s values are unique. Otherwise, you are likely to get the wrong query result. An example gives you more insight.
I have created a tabled named employee using the script below:
CREATE TABLE employee ( emp_id INT NOT NULL PRIMARY KEY, emp_name VARCHAR(100) NOT NULL, dept_id INT );
Then executed the following script to insert some records in to it.
INSERT INTO employee VALUES(1, 'Foo',1); INSERT INTO employee VALUES(2, 'Bar',1); INSERT INTO employee VALUES(2, 'Bar',1)
You might have noticed, the emp_id is declared as the primary key and in the insert scripts, ‘Bar’ is inserted twice having emp_id 2. In Redshift, this is not a problem as the Primary Key constraint is not enforced. Assume we have another table called “department” with the following rows:
Figure-1: Department table.
Now, let’s execute the following query to see, employee name and department of employee Id = 2. The query looks like below:
SELECT DISTINCT e.emp_id, d.dept_name FROM employee e INNER JOIN dbo.department d ON e.dept_id = d.dept_id WHERE e.emp_id = 2;
We know there are duplicate values for employee id 2. So “DISTINCT” keyword is used to eliminate duplicate values from the query results. Here is the output of the query:
Figure-2: Wrong query result.
See, it produces the wrong result! The “DISTINCT” keyword is not honored at all! The reason behind that is Redshift assumes all values of a primary key column are inserted by ensuring uniqueness. So the database engine does not care to make them distinct during query execution. Let’s remove the primary key constraint from the table:
ALTER TABLE employee DROP CONSTRAINT employee_pkey;
So, you got an idea of what you need to think before defining the primary key. And of course, always remember the Redshift developer guide’s following comments.
“Do not define the primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.”