Sometimes you have two tables with similar structures and want to compare them. One could be the table in an older schema on an old server and the other one the newer schema on a new server. Let’s say you move data from one server to the other and want to check the data.
If you just want to compare two tables only once, you can go for a non-generic approach. Let’s assume that you have two tables (table A and table B) with a primary key called primary_key and two other columns (column1 and column2). What you want to get are:
- keys present in A but not in B
- keys present in B but not in A
- keys present in both A and B but where the other columns differ
The first part can be fetched like this:
SELECT A.primary_key FROM A LEFT JOIN B ON A.primary_key=B.primary_key WHERE B.primary_key IS NULL
The second part is basically the same but switching A and B:
SELECT B.primary_key FROM B LEFT JOIN A ON B.primary_key=A.primary_key WHERE A.primary_key IS NULL
The third part just involves an inner join and checking the other columns:
SELECT A.primary_key FROM A INNER JOIN B ON A.primary_key=B.primary_key WHERE A.column1 <> B.column1 OR A.column2 <> B.column2
An alternative is to get the following:
- entries in A which have no entry in B exactly matching
- keys present in B but not in A
The second part uses the same statement as the second part above. The first part can be fetched like this:
SELECT A.primary_key FROM A WHERE A.primary_key NOT IN (SELECT B.primary_key FROM B WHERE A.primary_key=B.primary_key AND A.column1=B.column1 and A.column2=B.column2)
This step basically combines the steps 1 and 3 in the first list.
Note that you can always replace a left join with a not in e.g. the following:
SELECT A.primary_key FROM A LEFT JOIN B ON A.primary_key=B.primary_key WHERE B.primary_key IS NULL
By:
SELECT A.primary_key FROM A WHERE A.primary_key NOT IN (SELECT B.primary_key FROM B WHERE A.primary_key=B.primary_key)
You just have to check what’s best from a performance point of view in you particular case.
If you do not have a primary key, you’ll have to do it using the algorithm described further down in the generic solution.
But if you often need to compare tables or need to compare many tables at once, then the following generic approach is probably better:
First since the two tables do not have the exact same columns, we’ll need to figure out what are the columns they have in common. To do this, we’ll simply read from INFORMATION_SCHEMA.COLUMNS.
A function to get the list of columns would look like this:
DELIMITER $$ DROP FUNCTION IF EXISTS getColumnList$$ CREATE FUNCTION getColumnList(schema1 VARCHAR(64), table1 VARCHAR(64)) RETURNS text BEGIN DECLARE column_name VARCHAR(64); DECLARE column_list TEXT DEFAULT ''; DECLARE done INT DEFAULT 0; DECLARE cur CURSOR FOR SELECT c1.COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS c1 WHERE c1.TABLE_SCHEMA=schema1 AND c1.TABLE_NAME=table1; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1; OPEN cur; read_loop: LOOP FETCH cur INTO column_name; IF done THEN LEAVE read_loop; END IF; IF column_list <> '' THEN SET column_list = CONCAT(column_list, ', '); END IF; SET column_list = CONCAT(column_list, '`', column_name, '`'); END LOOP; CLOSE cur; RETURN column_list; END $$ DELIMITER ;
You can all the function like this:
SELECT getColumnList('phpBugTracker', 'comment');
Now we do not only want the columns of one table but of the two tables. It’s just a matter of joining INFORMATION_SCHEMA.COLUMNS with itself:
DELIMITER $$ DROP FUNCTION IF EXISTS getCommonColumnList$$ CREATE FUNCTION getCommonColumnList(schema1 VARCHAR(64), table1 VARCHAR(64), schema2 VARCHAR(64), table2 VARCHAR(64)) RETURNS text BEGIN DECLARE column_name VARCHAR(64); DECLARE column_list TEXT DEFAULT ''; DECLARE done INT DEFAULT 0; DECLARE cur CURSOR FOR SELECT c1.COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS c1 INNER JOIN INFORMATION_SCHEMA.COLUMNS c2 ON c1.COLUMN_NAME=c2.COLUMN_NAME WHERE c1.TABLE_SCHEMA=schema1 AND c1.TABLE_NAME=table1 AND c2.TABLE_SCHEMA=schema1 AND c2.TABLE_NAME=table2; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1; OPEN cur; read_loop: LOOP FETCH cur INTO column_name; IF done THEN LEAVE read_loop; END IF; IF column_list <> '' THEN SET column_list = CONCAT(column_list, ', '); END IF; SET column_list = CONCAT(column_list, '`', column_name, '`'); END LOOP; CLOSE cur; RETURN column_list; END $$ DELIMITER ;
This one takes 4 parameters instead of 2:
SELECT getCommonColumnList('phpBugTracker', 'comment', 'phpBugTracker', 'comment_copy');
Now we know the columns, we just need an algorithm to compare these columns in the two tables. We’ll use the following:
SELECT MIN (TableName) AS TableName, column_list FROM ( SELECT 'source_table ' as TableName, column_list FROM source_table UNION ALL SELECT 'destination_table' as TableName, column_list FROM destination_table ) AS tmp GROUP BY column_list HAVING COUNT (*) = 1
Now we need to create a function with parameters, calling our previous function to get the relevant column names, creating this statement and executing it:
DELIMITER $$ DROP PROCEDURE IF EXISTS compareTables$$ CREATE PROCEDURE compareTables(schema1 VARCHAR(64), table1 VARCHAR(64), schema2 VARCHAR(64), table2 VARCHAR(64)) BEGIN SET @columns=getCommonColumnList(schema1, table1, schema2, table2); SET @st = CONCAT("SELECT MIN(TableName) as TableName, ",@columns," FROM ( SELECT '", schema1,".", table1,"' as TableName, ",@columns," FROM ", schema1,".", table1," UNION ALL SELECT '", schema2,".", table2,"' as TableName, ",@columns," FROM ", schema2,".", table2," ) tmp GROUP BY ",@columns," HAVING COUNT(*) = 1"); PREPARE stmt FROM @st; EXECUTE stmt; END $$ DELIMITER ;
And you call it like this:
CALL compareTables('phpBugTracker', 'comment', 'phpBugTracker', 'comment_copy');
Of course you could also adapt it to store the results in a temporary table…
Nice post.
Of course if you have a more frequent use case a tool might prove useful. For example the company I work for, Red Gate, does schema and data comparison tools for MySQL (Windows only): http://www.red-gate.com/products/mysql/
Hi good afternoon, its a great post, I have a question, what can I do if I want to compare 2 tables from diefernt servers? thanks in advance!
In order to compare tables in databases across servers, you’ll need to use federated tables. Those are similar to Oracle’s link tables or Sybase’s proxy tables. They are local representation of a remote table. I’ll try to soon write an article with details on how this is done.
Hi, Thanks for the post
I want to know what is the expected time taken when the two tables to be compared each have roughly 500,000 entries and there are 70 columns in each of these 2 tables which need to be verified.
I need to know this to estimate the time taken for such testing.