Who wants to delete rows after inserting them into tables? Yet to comply with business needs, we sometimes have to remove unnecessary records. After deleting a large number of rows from a table, we usually expect the disk space those rows consumed to be released automatically. A popular RDBMS like SQL Server typically marks the rows affected by a DELETE command as ghost records rather than physically removing them from disk at once; an asynchronous internal process later removes these ghost records physically. Amazon Redshift does not offer such a facility — you need to reclaim the space yourself. In this post, you will learn what you need to do after deleting a large number of rows from a Redshift table.
The issue you may face after deleting a large number of rows from a Redshift table
For a faster response, database developers sometimes query the system catalog tables to get the total row count of a table that holds a huge number of records. In Redshift, this approach can be dangerous: after a delete operation, Redshift removes the records from the table but does not update the catalog table metadata. Let us see an example.
I have populated a table named “lineorder” with AWS sample records. The table contains around 600 million rows. Now let us use the following catalog table query to get the total record count of the table:
SELECT datname, nspname, relname, SUM(ROWS) AS ROWS
FROM pg_class, pg_namespace, pg_database, stv_tbl_perm
WHERE pg_namespace.oid = relnamespace
  AND pg_class.oid = stv_tbl_perm.id
  AND pg_database.oid = stv_tbl_perm.db_id
  AND TRIM(datname) = 'database name here'
  AND TRIM(relname) = 'lineorder'
GROUP BY datname, nspname, relname
ORDER BY datname, nspname, relname;
Now let us verify the record count by issuing the query below:
SELECT COUNT(*) FROM lineorder;
Alright, both queries return an identical result.
The “lineorder” table contains a total of 248900 rows for order date ‘19920712’. We will delete all of these rows from the table. For this, the following delete statement is executed:
DELETE FROM lineorder WHERE lo_orderdate = '19920712';
At this moment the total record count of the table should be 600037902 - 248900 = 599789002.
Let us verify this with the below SQL Statement once again:
SELECT COUNT(*) FROM lineorder;
Alright, we got the expected result: after the delete, 599789002 rows are left in the table. Now let us execute the catalog query once again to get the record count of the table:
SELECT datname, nspname, relname, SUM(ROWS) AS ROWS
FROM pg_class, pg_namespace, pg_database, stv_tbl_perm
WHERE pg_namespace.oid = relnamespace
  AND pg_class.oid = stv_tbl_perm.id
  AND pg_database.oid = stv_tbl_perm.db_id
  AND TRIM(datname) = 'your database name here'
  AND TRIM(relname) = 'lineorder'
GROUP BY datname, nspname, relname
ORDER BY datname, nspname, relname;
The query returns the below result:
Oh! The record count returned by the catalog query is 600037902 — the initial row count of the table, before the delete. But the total should be 599789002, since we have already deleted 248900 rows. That means the catalog table’s metadata is not refreshed after records are deleted from the table.
This can be a real problem if your application uses catalog queries to get the total row count of a table. It is a common practice for database developers to query the system (catalog) tables for the record count of relatively big tables instead of running a SELECT COUNT(*) statement, because querying a catalog table is much faster than scanning the actual table. But in Redshift this approach is dangerous, as shown above.
So, what must you do after deleting a large number of rows from a Redshift table? To solve this issue, execute the VACUUM command after deleting a large number of rows. I executed the command like below:

VACUUM lineorder;
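Note that VACUUM supports several modes, so you do not always have to pay for the full operation. The statements below are a sketch against the sample “lineorder” table from this post; pick the variant that matches what you need:

```sql
-- Default behaviour: reclaim space from deleted rows AND re-sort remaining rows
VACUUM FULL lineorder;

-- Only reclaim space from deleted rows; skip the (expensive) sort phase
VACUUM DELETE ONLY lineorder;

-- Only re-sort rows according to the sort key; do not reclaim deleted space
VACUUM SORT ONLY lineorder;
```

After a large delete, DELETE ONLY is often enough to fix the catalog counts and free disk space, and it finishes faster than a full vacuum.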
Once VACUUM finished, I executed the catalog query again to get the total record count of the table.
Now the catalog query returns an accurate result as well.
Finally, the VACUUM command also reclaims storage space and re-sorts rows according to the sort key defined on the table. It is also a best practice to ANALYZE a Redshift table after deleting a large number of rows, to keep the table statistics up to date. Updated statistics ensure faster query execution plans.
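As a sketch, the post-delete statistics refresh for the sample table can look like this (the PREDICATE COLUMNS option restricts the work to columns that have actually appeared in query predicates):

```sql
-- Refresh planner statistics for all columns of the table
ANALYZE lineorder;

-- Or, on wide tables, analyze only the columns used in predicates
ANALYZE lineorder PREDICATE COLUMNS;
```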
Hope this information will help you in your real-life Redshift development.
Note: VACUUM is a slow and resource-intensive operation, so the best practice is to run it in an off-peak time window. If you really need an accurate record count immediately after a delete operation, use SELECT COUNT(*) instead of a catalog query.
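To decide whether a table is worth vacuuming at all, you can inspect the SVV_TABLE_INFO system view before scheduling the operation. A sketch for the sample table (note that the column literally named "table" must be quoted):

```sql
-- tbl_rows includes rows marked for deletion but not yet vacuumed;
-- unsorted is the percentage of unsorted rows;
-- stats_off indicates how stale the table statistics are (0 = current)
SELECT "table", tbl_rows, unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'lineorder';
```

A large gap between tbl_rows and the SELECT COUNT(*) result, or a high unsorted percentage, is a good signal that the table needs a vacuum.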