MySQL: Compare two similar tables to find unmatched records

Sometimes you have two tables with similar structures and want to compare them. One could be the table in an older schema on an old server and the other one the newer schema on a new server. Let’s say you move data from one server to the other and want to check the data.

If you just want to compare two tables only once, you can go for a non-generic approach. Let’s assume that you have two tables (table A and table B) with a primary key called primary_key and two other columns (column1 and column2). What you want to get are:

  1. keys present in A but not in B
  2. keys present in B but not in A
  3. keys present in both A and B but where the other columns differ

The first part can be fetched like this:

SELECT A.primary_key FROM A LEFT JOIN B ON A.primary_key=B.primary_key WHERE B.primary_key IS NULL

The second part is basically the same but switching A and B:

SELECT B.primary_key FROM B LEFT JOIN A ON B.primary_key=A.primary_key WHERE A.primary_key IS NULL

The third part just involves an inner join and checking the other columns:

SELECT A.primary_key FROM A INNER JOIN B ON A.primary_key=B.primary_key WHERE A.column1 <> B.column1 OR A.column2 <> B.column2

An alternative is to get the following:

  1. entries in A which have no entry in B exactly matching
  2. keys present in B but not in A

The second part uses the same statement as the second part above. The first part can be fetched like this:

SELECT A.primary_key FROM A WHERE A.primary_key NOT IN (SELECT B.primary_key FROM B WHERE A.primary_key=B.primary_key AND A.column1=B.column1 and A.column2=B.column2)

This step basically combines the steps 1 and 3 in the first list.

Note that you can always replace a left join with a not in e.g. the following:

SELECT A.primary_key FROM A LEFT JOIN B ON A.primary_key=B.primary_key WHERE B.primary_key IS NULL

By:

SELECT A.primary_key FROM A WHERE A.primary_key NOT IN (SELECT B.primary_key FROM B WHERE A.primary_key=B.primary_key)

You just have to check what’s best from a performance point of view in you particular case.

If you do not have a primary key, you’ll have to do it using the algorithm described further down in the generic solution.

But if you often need to compare tables or need to compare many tables at once, then the following generic approach is probably better:

First since the two tables do not have the exact same columns, we’ll need to figure out what are the columns they have in common. To do this, we’ll simply read from INFORMATION_SCHEMA.COLUMNS.

A function to get the list of columns would look like this:

DELIMITER $$
DROP FUNCTION IF EXISTS getColumnList$$
CREATE FUNCTION getColumnList(schema1 VARCHAR(64), table1 VARCHAR(64)) RETURNS text
BEGIN 
	DECLARE column_name VARCHAR(64);
	DECLARE column_list TEXT DEFAULT '';
	DECLARE done INT DEFAULT 0;
	DECLARE cur CURSOR FOR SELECT c1.COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS c1 WHERE c1.TABLE_SCHEMA=schema1 AND c1.TABLE_NAME=table1;
	DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
	OPEN cur;
	read_loop: LOOP
    	FETCH cur INTO column_name;
    	IF done THEN
			LEAVE read_loop;
		END IF;
		IF column_list <> '' THEN
			SET column_list = CONCAT(column_list, ', ');
		END IF;
		SET column_list = CONCAT(column_list, '`', column_name, '`');
	END LOOP;
	CLOSE cur;
	RETURN column_list; 
END $$
DELIMITER ;

You can all the function like this:

SELECT getColumnList('phpBugTracker', 'comment');

Now we do not only want the columns of one table but of the two tables. It’s just a matter of joining INFORMATION_SCHEMA.COLUMNS with itself:

DELIMITER $$
DROP FUNCTION IF EXISTS getCommonColumnList$$
CREATE FUNCTION getCommonColumnList(schema1 VARCHAR(64), table1 VARCHAR(64), schema2 VARCHAR(64), table2 VARCHAR(64)) RETURNS text
BEGIN 
	DECLARE column_name VARCHAR(64);
	DECLARE column_list TEXT DEFAULT '';
	DECLARE done INT DEFAULT 0;
	DECLARE cur CURSOR FOR SELECT c1.COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS c1 INNER JOIN INFORMATION_SCHEMA.COLUMNS c2 ON c1.COLUMN_NAME=c2.COLUMN_NAME WHERE c1.TABLE_SCHEMA=schema1 AND c1.TABLE_NAME=table1 AND c2.TABLE_SCHEMA=schema1 AND c2.TABLE_NAME=table2;
	DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
	OPEN cur;
	read_loop: LOOP
    	FETCH cur INTO column_name;
    	IF done THEN
			LEAVE read_loop;
		END IF;
		IF column_list <> '' THEN
			SET column_list = CONCAT(column_list, ', ');
		END IF;
		SET column_list = CONCAT(column_list, '`', column_name, '`');
	END LOOP;
	CLOSE cur;
	RETURN column_list; 
END $$
DELIMITER ;

This one takes 4 parameters instead of 2:

SELECT getCommonColumnList('phpBugTracker', 'comment', 'phpBugTracker', 'comment_copy');

Now we know the columns, we just need an algorithm to compare these columns in the two tables. We’ll use the following:

SELECT MIN (TableName) AS TableName, column_list
FROM
 (
  SELECT 'source_table ' as TableName, column_list
  FROM source_table
  UNION ALL
  SELECT 'destination_table' as TableName, column_list
  FROM destination_table
)  AS tmp
GROUP BY column_list
HAVING COUNT (*) = 1

Now we need to create a function with parameters, calling our previous function to get the relevant column names, creating this statement and executing it:

DELIMITER $$
DROP PROCEDURE IF EXISTS compareTables$$
CREATE PROCEDURE compareTables(schema1 VARCHAR(64), table1 VARCHAR(64), schema2 VARCHAR(64), table2 VARCHAR(64))
BEGIN 
	SET @columns=getCommonColumnList(schema1, table1, schema2, table2);
	SET @st = CONCAT("SELECT MIN(TableName) as TableName, ",@columns," FROM ( SELECT '", schema1,".", table1,"' as TableName, ",@columns," FROM ", schema1,".", table1," UNION ALL SELECT '", schema2,".", table2,"' as TableName, ",@columns," FROM ", schema2,".", table2," ) tmp GROUP BY ",@columns," HAVING COUNT(*) = 1");
	PREPARE stmt FROM @st;
	EXECUTE stmt;
END $$
DELIMITER ;

And you call it like this:

CALL compareTables('phpBugTracker', 'comment', 'phpBugTracker', 'comment_copy');

Of course you could also adapt it to store the results in a temporary table…

4 thoughts on “MySQL: Compare two similar tables to find unmatched records

  1. Hi good afternoon, its a great post, I have a question, what can I do if I want to compare 2 tables from diefernt servers? thanks in advance!

    1. In order to compare tables in databases across servers, you’ll need to use federated tables. Those are similar to Oracle’s link tables or Sybase’s proxy tables. They are local representation of a remote table. I’ll try to soon write an article with details on how this is done.

  2. Hi, Thanks for the post
    I want to know what is the expected time taken when the two tables to be compared each have roughly 500,000 entries and there are 70 columns in each of these 2 tables which need to be verified.
    I need to know this to estimate the time taken for such testing.

Leave a Reply

Your email address will not be published. Required fields are marked *