Who wants to delete rows after inserting them into tables? But to comply with business needs, we sometimes have to delete unnecessary records from tables. After deleting a large number of rows from a table, we usually expect the disk space consumed by those rows to be released automatically. A popular RDBMS like SQL Server marks the rows affected by a DELETE command as ghost records rather than physically deleting them from disk at once; afterwards, an asynchronous internal process of the database engine removes the ghost records from disk. Amazon Redshift does not offer such a facility: you need to do it yourself. In this post you will learn what you must do after deleting a large number of rows from a Redshift table.
The issue you may face after deleting a large number of rows from a Redshift Table.
Database developers sometimes query the system catalog tables, instead of the actual table, to get the total row count of a table with a huge number of records, because the catalog responds faster. In Redshift this approach is dangerous, because after a delete operation Redshift removes the records from the table but does not update the catalog table metadata. Let us see an example:
I have populated a table named “lineorder” with AWS sample records. The table contains around 600 million rows. Now let us use the following catalog table query to get the total record count of the table.
SELECT datname, nspname, relname, SUM(ROWS) AS ROWS
FROM pg_class, pg_namespace, pg_database, stv_tbl_perm
WHERE pg_namespace.oid = relnamespace
  AND pg_class.oid = stv_tbl_perm.id
  AND pg_database.oid = stv_tbl_perm.db_id
  AND trim(datname) = 'database name here'
  AND trim(relname) = 'lineorder'
GROUP BY datname, nspname, relname
ORDER BY datname, nspname, relname;
Now let us verify the record count by issuing the query below:
SELECT COUNT(*) FROM lineorder;
Alright, both queries return the identical result.
The “lineorder” table contains a total of 248900 rows with order date = ‘19920712’. We will delete all of these rows from the “lineorder” table. For this, the following delete statement is executed:
DELETE FROM lineorder WHERE lo_orderdate = '19920712';
At this point the table should contain 600037902 - 248900 = 599789002 rows.
Let us verify this once again with the SQL statement below:
SELECT COUNT(*) FROM lineorder;
Alright, we got our expected result: after the delete query, 599789002 rows are left in the table. Now we are going to execute the catalog query to get the record count of the table. Let’s execute the following query once again:
SELECT datname, nspname, relname, SUM(ROWS) AS ROWS
FROM pg_class, pg_namespace, pg_database, stv_tbl_perm
WHERE pg_namespace.oid = relnamespace
  AND pg_class.oid = stv_tbl_perm.id
  AND pg_database.oid = stv_tbl_perm.db_id
  AND trim(datname) = 'your database name here'
  AND trim(relname) = 'lineorder'
GROUP BY datname, nspname, relname
ORDER BY datname, nspname, relname;
Query returns the below result:
Oh!! The record count returned by the catalog query is 600037902!!! That was the initial record count of the table, before the delete. But the total should be 599789002, as we have already deleted 248900 rows. That means the catalog table’s metadata is not refreshed after records are deleted from the table.
This could be a huge problem if your application uses a catalog query to get the total row count of a table. It is a common practice for database developers to query the system (catalog) tables to get the record count of relatively big tables instead of running a SELECT COUNT(*) statement, because querying a system table is much faster than querying the actual table. But in Redshift this approach is dangerous, as specified earlier in this post!!
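As a side note, Redshift also exposes an estimated row count through the SVV_TABLE_INFO system view; the query below assumes the same “lineorder” example table. Its tbl_rows column carries the same caveat as the catalog query: it keeps counting rows that are deleted but not yet vacuumed.

```sql
-- Estimated table size from Redshift's SVV_TABLE_INFO system view.
-- Caveat: tbl_rows includes rows that were deleted but not yet vacuumed,
-- so it is stale in the same way as the catalog query above.
SELECT "table", tbl_rows
FROM svv_table_info
WHERE "table" = 'lineorder';
```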
So, what must you do after deleting a large number of rows from a Redshift table? To solve this issue, after deleting a large number of rows from a table you need to execute the VACUUM command. I executed the command like below:
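The statement itself is a one-liner. The form below assumes the default FULL vacuum, which both reclaims the space of deleted rows and re-sorts the table:

```sql
-- Reclaim space from deleted rows and re-sort the lineorder table.
-- FULL is the default mode; VACUUM DELETE ONLY lineorder; would reclaim
-- space without re-sorting, which is faster when sort order is not a concern.
VACUUM FULL lineorder;
```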
When VACUUM finished, I executed the catalog query again to get the total record count of the table. The image below depicts the result:
Now we get the accurate result from the catalog query as well.
Finally, the VACUUM command also reclaims storage space and re-sorts rows according to the sort key defined on the table. It is also a best practice to ANALYZE a Redshift table after deleting a large number of rows, to keep the table statistics up to date. Updated statistics ensure faster query execution.
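Assuming the same “lineorder” table, the statistics refresh is equally simple:

```sql
-- Refresh planner statistics after the large delete so the query
-- optimizer works with up-to-date row counts and column distributions.
ANALYZE lineorder;
```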
Hope this information will help you in your real life Redshift development.
Note: VACUUM is a slow and resource-intensive operation, so the best practice is to execute it in an off-peak time window. If you really need an immediate record count after a delete operation, it is better to use SELECT COUNT(*) instead of the catalog query.