Before starting to talk about partitions, let me explain the situation that led to me researching them. I have been given a system whereby all data is treated at a "dataset" level, where each dataset is put into its own table. The reason the system was developed like this in the first place is because it allowed the tables to be small enough to be held in memory when performing queries. This meant that the database performance was consistently fast, even as the amount of data grew rapidly.
The problem with such a system is that all datasets, and thus tables, are expected to have the exact same schema, which is hard to edit. Imagine how hard the simple act of adding or removing a column is in such a system. This has resulted in some of the table schemas diverging with different indexes, and in some extreme cases, some tables have extra columns. The other slight problem is that
SHOW TABLES is now pretty unusable.
How Will Partitions Help?
Partitions are essentially tables in the underlying system. Using partitions will allow us to consolidate the thousands of tables into a single table schema with a new "dataset_id" column telling us which dataset each row belongs to. The partitions will keep queries fast by allowing us to load only a section of the entire table into memory. Loading the entire consolidated table into memory is impossible as it has over 7 million rows and growing fast.
The reason I mention the use case above is because partitions will not help you unless your queries operate only on the data within a partition, in this case the
dataset_id. Having queries that operate over multiple partitions will be slow. This means my queries will need to use the
dataset_id in the conditional clauses. Partitioning is also unlikely to be of much help unless you have large tables. I wouldn't even consider using partitions if your tables only contain a few thousand rows.
Partitioning by a column requires that column to be within the primary key. Lots of users create tables that have an auto incrementing primary key called
id. If this is the situation you are in, then the solution is to add the column within the primary key,
dataset_id in my case. If you want to partition an existing table, then you would need to execute:
ALTER TABLE `[table name]` DROP PRIMARY KEY, ADD PRIMARY KEY (`id` ,`dataset_id`); ALTER TABLE `[table name] ` partition by KEY(`dataset_id`) partitions [number of partitions];
Alternatively, if you are creating the table for the first time, then an example is shown below
CREATE TABLE test123 ( id INT UNSIGNED AUTO_INCREMENT, dataset_id INT UNSIGNED NOT NULL, field1 INT UNSIGNED NOT NULL, field2 varchar(255) NOT NULL, PRIMARY KEY (id, dataset_id) ) PARTITION BY KEY(dataset_id) partitions [number of partitions];
Auto Increment Still Behaves The Same
id field will still increment by 1 even though you are using a combined primary key.
Number of Partitions, Key() and Hash() Partitioning
In the examples above, I have used the
KEY keyword for creating my partitions. This will result in the partition lookup being an MD5 of the values of the columns specified within the braces. The only difference between
HASH is that
HASH uses a user defined expression.
There are other partitioning types, but this was most appropriate for my situation. If you are dealing with a massive log table, then you may wish to partition by the timestamp of the log, in which case a range partition may be appropriate so that queries for the most recent logs will result in going to the same partition, which is likely to be kept in memory.
When creating a partitioned table, you will specify the number of partitions which will be immediately created. Data will be put into the partition based on the partitioning type/algorithm. By using a
dataset_id values that are very close in value, e.g.
3 are likely to be in different partitions, unlike a
RANGE partition type. This is actually a good thing because it ensures that my data will spread fairly evenly across the partitions. Consider the alternative where data is stored linearly across partitions so that
2 are likely to be on the same partition and
50001 are likely to be stored together on a different one. In such a scenario, fragmentation of the indexes (due to users deleting datasets) would result in an uneven distribution with some partitions containing a lot more data than others. Also since my system just auto increments a unique dataset id for each dataset, my first partition would have to become completely full and become a massive table before I started using the next partition, resulting in nullifying the entire point of partitioning in the first place (smaller tables that can be loaded into memory). In a situation where I have 100 tables for an unsigned integer, the first partition would hold
dataset_id values between 0 and 42,949,672 (the maximum value of an unsigned INT is 4294967295). All of my existing dataset_ids would reside within that one partition.
5.6.7 was 1024.
5.6.7+ can have up to 8192.