What you need to do after deleting a large number of rows from a Redshift Table

Nobody wants to delete rows right after inserting them, but to comply with business needs we often have to remove unnecessary records from tables. After deleting a large number of rows from a table, we usually expect the disk space consumed by those rows to be released automatically. A popular RDBMS like SQL Server typically marks the rows affected by a DELETE command as ghost records rather than physically removing them from disk at once; afterwards, an asynchronous internal process in the database engine physically removes these ghost records from disk. Amazon Redshift does not offer such a facility. You need to do it yourself. In this post you will learn what you need to do after deleting a large number of rows from a Redshift table.

The issue you may face after deleting a large number of rows from a Redshift Table

Database developers sometimes query the system catalog tables to get the total row count of a table that contains a huge number of records, because catalog queries respond faster. In Redshift this approach can be dangerous, because after a delete operation Redshift removes the records from the table but does not update the catalog table metadata. Let us see an example:

I have populated a table named “lineorder” with AWS sample records. The table contains around 600 million rows. Now let us use the following catalog table query to get the total record count of the table.

SELECT datname,
       nspname,
        relname,
        SUM(ROWS) AS ROWS
FROM   pg_class,
       pg_namespace,
       pg_database,
       stv_tbl_perm
WHERE  pg_namespace.oid = relnamespace
   AND pg_class.oid = stv_tbl_perm.id
   AND pg_database.oid = stv_tbl_perm.db_id
   AND trim(datname) = 'database name here'
   AND trim(relname) = 'lineorder'
GROUP BY
 datname, nspname, relname
ORDER BY
 datname, nspname, relname;

After executing the above query, we get the following result:
Figure 01: Table record count using catalog query.

Now let us verify the record count by issuing the query below:

SELECT COUNT(*)
FROM  lineorder;

Figure 02: Getting row count by querying on the table itself.

Alright, we get identical results from both queries.

The “lineorder” table contains a total of 248900 rows with order date = ‘19920712’. We will delete all of these rows from the “lineorder” table by executing the following DELETE statement:

DELETE
FROM   lineorder
WHERE  lo_orderdate = '19920712';

And here is the result.
Figure 03: 248900 rows are deleted.

At this moment the total record count of the table should be 600037902 - 248900 = 599789002.
Let us verify this with the SQL statement below once again:

SELECT COUNT(*)
FROM lineorder;

Here is the result:
Figure 04: Record count after deleting records.

Alright, the result confirms that after the delete query we have 599789002 rows left in the table. Now let us execute the catalog query once again to get the record count of the table:

SELECT datname,
       nspname,
       relname,
       SUM(ROWS) AS ROWS
FROM   pg_class,
       pg_namespace,
       pg_database,
       stv_tbl_perm
WHERE  pg_namespace.oid = relnamespace
   AND pg_class.oid = stv_tbl_perm.id
   AND pg_database.oid = stv_tbl_perm.db_id
   AND trim(datname) = 'your database name here'
   AND trim(relname) = 'lineorder'
GROUP BY
 datname, nspname, relname
ORDER BY
 datname, nspname, relname;

The query returns the following result:

Figure 05: Catalog tables are not refreshed after deleting large number of rows.

Oh! The record count returned by the catalog query is still 600037902, which was the initial record count of the table (before deleting records). But the total should be 599789002, as we have already deleted 248900 rows. That means the catalog table metadata is not refreshed after records are deleted from the table.

This can be a huge problem if your application uses a catalog query to get the total row count of a table. It is a common practice for database developers to query the system (catalog) tables to get the record count of relatively big tables instead of issuing a SELECT COUNT(*) statement, because querying a system table is much faster than querying the actual table. But in Redshift this approach can be dangerous, as shown above.
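As a side note, Redshift also exposes row counts through the SVV_TABLE_INFO system view. Be aware that its tbl_rows column includes rows that are marked for deletion but not yet vacuumed, so it suffers from the same staleness problem as the catalog query above. A minimal sketch, assuming the table lives in the public schema:

```sql
-- Row count from SVV_TABLE_INFO; tbl_rows includes rows marked
-- for deletion but not yet vacuumed, so it can be stale too.
SELECT "schema",
       "table",
       tbl_rows
FROM   svv_table_info
WHERE  "schema" = 'public'
  AND  "table"  = 'lineorder';
```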

So, what do you need to do after deleting a large number of rows from a Redshift table? To solve this issue, you must execute the VACUUM command after deleting a large number of rows from a table. I executed the command like below:

VACUUM lineorder;
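A plain VACUUM on a table runs a full vacuum, which both reclaims space and re-sorts rows. If you only need one of those effects, Redshift's VACUUM syntax also supports narrower variants; the statements below are a sketch of the standard options:

```sql
-- Full vacuum: reclaims space and re-sorts rows (the default behavior).
VACUUM FULL lineorder;

-- Reclaim space from deleted rows only, without re-sorting.
VACUUM DELETE ONLY lineorder;

-- Re-sort rows only, without reclaiming space.
VACUUM SORT ONLY lineorder;
```

DELETE ONLY is typically the cheapest option when your goal is just to release the space held by deleted rows.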

When VACUUM finished, I executed the catalog query again to get the total record count of the table. The image below depicts the result:

Figure 06: Accurate record count after running VACUUM command.

Now we get the accurate result from the catalog query as well.

Finally, the VACUUM command also reclaims storage space and re-sorts rows according to the sort key defined on the table. It is also a best practice to ANALYZE a Redshift table after deleting a large number of rows, to keep the table statistics up to date. Updated statistics ensure faster query execution.
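Running ANALYZE is straightforward. For example (the PREDICATE COLUMNS clause, which restricts analysis to columns Redshift has seen used in filters, joins, and similar clauses, is optional):

```sql
-- Refresh optimizer statistics for the whole table.
ANALYZE lineorder;

-- Or analyze only the columns that have been used as predicates.
ANALYZE lineorder PREDICATE COLUMNS;
```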

Hope this information will help you in your real life Redshift development.

Note: VACUUM is a slow and resource-intensive operation. The best practice is to execute it in an off-peak time window. So if you really need an immediate record count after a delete operation, it is better to use SELECT COUNT(*) instead of a catalog query.
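To decide whether a table is worth vacuuming or analyzing at all, you can check the unsorted and stats_off columns of the SVV_TABLE_INFO view mentioned earlier (the percentage of unsorted rows, and how stale the statistics are, respectively). A sketch, with an arbitrary threshold of 10 percent:

```sql
-- Tables with many unsorted rows are candidates for VACUUM;
-- tables with stale statistics are candidates for ANALYZE.
SELECT "table",
       unsorted,    -- percent of rows in the unsorted region
       stats_off    -- staleness of statistics (0 = current)
FROM   svv_table_info
WHERE  unsorted  > 10
   OR  stats_off > 10;
```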




This entry was posted in Amazon Redshift, Redshift Database Development by Khorshed.

About Khorshed

Khorshed has 12 years of cumulative professional experience in database administration, development, Business Intelligence, system appreciation, requirement analysis and software development (using .NET technologies). He achieved MCITP certification in both SQL Server development and administration. He has vast knowledge of SQL Server, SSIS, SSRS and SSAS, and is also proficient in Amazon cloud services like Redshift, Data Pipeline, SNS, DynamoDB, EMR and many more.
