There are several use cases to split up tables to smaller chunks in a relational database. Our choice of SQL server is PostgreSQL the most advanced open source and free database out there for regular SQL workloads. It has decent support for partitioning data in tables but it is not automatically done. While I was working with a client it came up as a potential optimization to reduce the time it takes to run a query against a smaller portion of the data.
There is great coverage on the Postgres website about what benefits partitioning has. Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory. When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table. Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE. Seldom-used data can be migrated to cheaper and slower storage media.
First we are going to create a table with only two fields. In production there obvisously more fields but for the sake of simplicity I have trimmed down the rest.
CREATE TABLE testing_partition(patent_id BIGINT, date DATE) WITH ( OIDS=FALSE);
There is only one thing to note here, OIDS=FALSE, that basically tells to Postgres not to assign any OIDS (object identifiers) for the rows in the newly created table. This is the default behaviour of Postgres after the 8.0 release. More about it here: link. After creating the table we need to create a function that will be used as a trigger to create a partition if it does not exist when inserting to the table. Postgres functions are fun, you should check out what other useful things can be done with them.
CREATE OR REPLACE FUNCTION create_partition_and_insert() RETURNS trigger AS $BODY$ DECLARE partition_date TEXT; partition TEXT; BEGIN partition_date := to_char(NEW.date,'YYYY_MM_DD'); partition := TG_RELNAME || '_' || partition_date; IF NOT EXISTS(SELECT relname FROM pg_class WHERE relname=partition) THEN RAISE NOTICE 'A partition has been created %',partition; EXECUTE 'CREATE TABLE ' || partition || ' (check (date = ''' || NEW.date || ''')) INHERITS (' || TG_RELNAME || ');'; END IF; EXECUTE 'INSERT INTO ' || partition || ' SELECT(' || TG_RELNAME || ' ' || quote_literal(NEW) || ').*;'; RETURN NULL; END; $BODY$ LANGUAGE plpgsql VOLATILE COST 100;
One thing to note that this relies on the “date” field being present in the table, and that controls the name of the partition. The “date” field is date type (surprise) and we need to convert it to text so it can be used as a field name in Postgres. Luckily the tochar function does exactly that, we can give a mask how we would like to receive the string. I was choosing YYYYMM_DD as the mask that gives us nice tables names. There is only one more thing left before we can try to insert into our new system. We need to create a trigger that runs before the actual insert happens. Creating the trigger is simple. The only important thing to note here is that it has to be before insert.
CREATE TRIGGER testing_partition_insert_trigger BEFORE INSERT ON testing_partition FOR EACH ROW EXECUTE PROCEDURE create_partition_and_insert();
Now we have everything in place for testing partitioning. Lets execute few INSERT statements to see it works as expected.
=> insert into testing_partition values (12312, '2011-01-11'); NOTICE: A partition has been created testing_partition_2011_01_11 INSERT 0 0 => => insert into testing_partition values (1, '2011-01-11'); INSERT 0 0
One minor problem you might notice is that the function does not return how many rows were inserted into the table. Other than that it seems everything is working.
We have few partitions in our setup but there is no good way to check how many exactly there. For checking on our partitions we can craft a simple query and roll it into a view for easier execution.
CREATE VIEW show_partitions AS SELECT nmsp_parent.nspname AS parent_schema, parent.relname AS parent, nmsp_child.nspname AS child_schema, child.relname AS child FROM pg_inherits JOIN pg_class parent ON pg_inherits.inhparent = parent.oid JOIN pg_class child ON pg_inherits.inhrelid = child.oid JOIN pg_namespace nmsp_parent ON nmsp_parent.oid = parent.relnamespace JOIN pg_namespace nmsp_child ON nmsp_child.oid = child.relnamespace WHERE parent.relname='testing_partition' ;
Lets select all of the partitions we got for the table so far:
=> select * from show_partitions; parent_schema | parent | child_schema | child ---------------+-------------------+--------------+------------------------------ public | testing_partition | public | testing_partition_2019_01_11 public | testing_partition | public | testing_partition_2018_01_11 public | testing_partition | public | testing_partition_2011_01_11 (3 rows)
Perfect, now we have a good start to use our new setup with automatic partition creation. Few open questions left on the table:
- h̶o̶w̶ ̶t̶o̶ ̶r̶e̶t̶u̶r̶n̶ ̶t̶h̶e̶ ̶c̶o̶r̶r̶e̶c̶t̶ ̶n̶u̶m̶b̶e̶r̶ ̶i̶n̶s̶e̶r̶t̶e̶d̶ ̶t̶o̶ ̶t̶h̶e̶ ̶t̶a̶b̶l̶e̶to
- h̶o̶w̶ ̶t̶o̶ ̶r̶e̶t̶u̶r̶n̶ ̶t̶h̶e̶ ̶n̶e̶w̶l̶y̶ ̶c̶r̶e̶a̶t̶e̶d̶ ̶i̶d̶ ̶f̶r̶o̶m̶ ̶t̶h̶e̶ ̶f̶u̶n̶c̶t̶i̶o̶n̶
Update I: Changing the INSERT statement to include patent_id when returns:
CREATE OR REPLACE FUNCTION create_partition_and_insert() RETURNS trigger AS $BODY$ DECLARE partition_date TEXT; partition TEXT; BEGIN partition_date := to_char(NEW.date,'YYYY_MM_DD'); partition := TG_RELNAME || '_' || partition_date; IF NOT EXISTS(SELECT relname FROM pg_class WHERE relname=partition) THEN RAISE NOTICE 'A partition has been created %',partition; EXECUTE 'CREATE TABLE ' || partition || ' (check (date = ''' || NEW.date || ''')) INHERITS (' || TG_RELNAME || ');'; END IF; EXECUTE 'INSERT INTO ' || partition || ' SELECT(' || TG_RELNAME || ' ' || quote_literal(NEW) || ').* RETURNING patent_id;'; RETURN NULL; END; $BODY$ LANGUAGE plpgsql VOLATILE COST 100;
We can add the same to the actual insert we are issuing.
# insert into testing_partition values (1, '2011-01-11') returning patent_id ; patent_id ----------- 1 (1 row)
I am going to update this post when I figure out these things. Thanks for reading!
UPDATE I: Figured out finally how to return the ids, updates above.
This article was cross posted to: https://www.streambrightdata.com/