Slave rebuild — out of sync MySQL replication

Sometimes you may end up in a situation where the slave server is out
of sync and replication has halted due to a table crash on the slave or
a similar problem. The best way to recover from an out-of-sync slave
is to rebuild the slave. There are ways to try and fix the table but
let's just say YMMV and for all you know a rebuild would work out
cheaper (in terms of time).

To rebuild, follow these steps.

On the master database, run these from the mysql CLI:

RESET MASTER;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;

This resets the replication counter on the master server. We do this
because we will be taking a snapshot of the master database and then
restoring it to the slave, so technically this becomes the initial
state for the slave server, and it is the point from which all
subsequent updates on the master need to be applied to the slave.
All tables are also locked for writing so that no inserts, deletes or
updates are carried out on the database while the database snapshot is
made. The last command will output something like this:

mysql> show master status;
+------------------+----------+--------------+---------------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB                                  |
+------------------+----------+--------------+---------------------------------------------------+
| mysql-bin.000001 |      107 |              | information_schema,performance_schema,mysql,test  |
+------------------+----------+--------------+---------------------------------------------------+
1 row in set (0.00 sec)

The Position value and the file name of the binary log are what we need
to note down and keep safe, as they will be used later on in this
process. Basically, right now the database is locked and no writes are
happening, so nothing is being written to the bin log, which means that
the log position, in this case 107, is where we will be taking our
database dump from. Ergo, once the write lock on the tables is removed
the position counter will start moving again; the slave, however, needs
to be told that this is the position from which to start reading the
bin log file (specified by the File field in the above output) for the
purpose of mirroring all writes made to the master db.
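If you would rather capture these coordinates in a file than scribble them down, you can read them from a second shell while the lock is still held; a minimal sketch (assuming you can connect as root from the shell, adjust credentials as needed):

# run from a second shell; the read lock taken in the first session stays in place
mysql -u root -p -e "SHOW MASTER STATUS\G" | tee master_status.txt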

So yes, we have our database write-protected and sitting there. Savor
the moment as all the associated applications sweat it out wondering
why they cannot get write locks on the tables they wish to write to.
Done; now start a backup of the databases from the master.

mysqldump -u username -p --opt --single-transaction --comments --hex-blob --dump-date --no-autocommit --all-databases > dbdump.sql

(NOTE: Run the mysqldump from a second SSH terminal, as closing out of
the mysql CLI will release the lock on the tables.)
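Since the dump can run for hours, it is also worth protecting it against that second SSH session dropping; a rough sketch using nohup (this assumes your credentials live in ~/.my.cnf, since a backgrounded job cannot prompt for a password):

# survive a dropped SSH session; errors go to dump.log
nohup mysqldump --opt --single-transaction --comments --hex-blob --dump-date --no-autocommit --all-databases > dbdump.sql 2> dump.log &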

This will dump all the databases to a single SQL file (in my case,
last night, that turned out to be a 40 GB file). The other options tell
mysqldump to take a consistent snapshot of InnoDB tables so that
applications can still read data, to add comments to the dump file for
better readability (your choice really), to dump binary columns using
hexadecimal notation, and to wrap the inserts for each table so they
are committed in a batch for faster execution.

You can unlock the tables as soon as the mysqldump starts; to do so,
run the following from the mysql CLI (where you ran the reset):

UNLOCK TABLES;

Once the mysqldump completes, copy your SQL file to the slave server
using rsync, scp or whatever you prefer.
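For a file this size it is worth compressing before copying; a sketch, with slave.example.com as a stand-in hostname:

# compress, copy with resumable progress, then decompress on the slave
gzip dbdump.sql
rsync -avP dbdump.sql.gz slave.example.com:/var/tmp/
ssh slave.example.com 'gunzip /var/tmp/dbdump.sql.gz'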

IMPORTANT: Now we are on the slave server.

First of all on the slave we need to stop replication by running:

STOP SLAVE;

A nifty trick at this point is to turn off binary logging on the slave
if it is on. This can speed up your import by up to 3 times (again,
that's an unmeasured metric, YMMV, but suffice to say you will see a
huge speed improvement if you turn off the bin log). Note that the
setting below is per-session, so run it in the same mysql session you
will use for the import:

SET sql_log_bin=0;

Now start the import and go get a shower, walk the kids, go grocery
shopping with the wife, watch a movie, catch a nap, or all of the above
(a 40 GB restore can easily take 3 hours to complete). You can use
one of many ways to import a mysqldump; I prefer doing it from the
mysql CLI using:

SOURCE dbdump.sql;
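If you prefer to kick the import off from the shell instead, a rough equivalent is below. Note that the SET sql_log_bin=0 above only applies to the session it was run in, so for a shell import it has to be re-applied on the import connection, here via the mysql client's --init-command option (pv is optional and just gives you a progress bar, assuming it is installed):

# disable bin logging on this connection only, then stream the dump in
pv dbdump.sql | mysql -u root -p --init-command="SET SESSION sql_log_bin=0"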

Once the restore to the slave is complete you might want to, just to be
sure, run a CHECK (and REPAIR if needed) on your tables; a mysqlcheck
sketch for that follows below. If it throws up any errors, this would
be a good time to fix them. In my case three large tables failed to
restore, so basically I had two options: figure out how to fix the
problem, or redo the whole reset/lock/dump/unlock/scp/restore routine.
I chose to figure out how to fix it.
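Here is that mysqlcheck sketch (assuming root credentials; --auto-repair will attempt to fix any MyISAM tables that come back marked as crashed):

# check every table in every database and attempt repairs where needed
mysqlcheck -u root -p --all-databases --check --auto-repair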

With mysqldump files, especially one covering all the databases on a
server, it is practically very difficult to restore just one table. You
can do it, but you would never be sure it happened correctly until
replication kicks in and proceeds without a hitch, and even then you'd
have slogged and sweated a lot. How I did it was pretty simple in the
end, actually. In my case the tables were crashed and, well,
practically not there (stuff like that happens when you have slave
replication running for 1.5 years... one of the reasons why I think a
test restore from the slave once in a while is a very good idea, and
why a server reboot once in a while would not hurt anything except
arrogant server admin egos tied to uptimes of 500+ days... in my case
the server had not been rebooted in 559 days, and then suddenly it was,
and the mandatory fsck on reboot muffed up these three tables on the
slave). So I copied the structure from the master (not a big deal, but
if you must):

SHOW CREATE TABLE tablename;   -- run this on the master

The resultant output will give you the statement to create the table,
copy that and run it on the slave.

CREATE TABLE tablename (.....);

Next comes the genius bit. Take the 40 GB mysqldump file and look for
the INSERT statements that are putting the data into the crashed table.
Write those out to a separate file, then open this file and carefully
eyeball it to make sure that no other tables' inserts have been
inadvertently included (i.e. that your regular expression is good).
Basically:

grep INSERT dbdump.sql | grep tablename > tablename.sql

NOTE: Be very careful about the eyeballing of the above file. Maybe
run something like this as well:

grep INSERT tablename.sql | cut -d'(' -f1

The above will throw something like this:

INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum` VALUES
INSERT INTO `cromagnum_featured` VALUES
INSERT INTO `croquet_messages` VALUES
INSERT INTO `croquet_messages` VALUES

The last three inserts are what I was talking about (this could be
avoided with a better regex, but when you're in a hurry it's easier to
keep things simple; see the sketch after the restore commands below).
At this point edit tablename.sql and remove the inserts into the other
tables. Why are we doing this instead of letting replication take care
of it? Well, for one, replication would not start if it finds these
tables missing, and then there is the reset we did: all data prior to
that point in the bin log would be lost because we are not replaying it
on the slave. So yes, these acrobatics are necessary. Copy
tablename.sql to the slave and restore it using:

USE databasename;
SOURCE tablename.sql;
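As mentioned above, a tighter pattern would have avoided picking up the inserts for the similarly named tables in the first place. Anchoring on the backquoted table name that mysqldump writes at the start of each INSERT line does the trick; a sketch, with tablename standing in for the real table:

# mysqldump emits lines like: INSERT INTO `tablename` VALUES (...),(...);
# matching the backquoted name exactly skips cromagnum_featured and friends
grep '^INSERT INTO `tablename` VALUES' dbdump.sql > tablename.sql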

So now your database is restored to the point where we took the
mysqldump. However, we now need to start replication so that the slave
can catch up, and then keep up, with the master. At this point I ran a
mysqlcheck -c just to once again make sure that all databases and
tables were in ship shape before I put them back in business. Once
that completed (in about two hours) I started the slave by telling it
which binary log from the master to look at and from what position
replication needed to start. This is where the "SHOW MASTER STATUS"
output comes in handy:

RESET SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=107;
START SLAVE;

That's it; the slave should catch up on the past 8-10 hours during
which the master was happily purring away powering the
websites/applications.

You can look at the slave status to confirm all is well by running:

SHOW SLAVE STATUS\G

The things to look for are:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

and

Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
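While the slave is catching up, Seconds_Behind_Master in the same output tells you how far it still has to go. A quick one-liner you can rerun from the shell (assuming root credentials; adjust to taste):

mysql -u root -p -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'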

Another quick and dirty way to verify that your data on the slave is
synchronized is to compare the row counts of all tables in a particular
db.

SELECT table_name, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'databasename';

Run this on both slave and master and compare the numbers.
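To make that comparison less error-prone you can dump the counts to a file on each server and diff them. Bear in mind that TABLE_ROWS is only an estimate for InnoDB tables, so small differences on busy tables are normal. A sketch, with databasename standing in for the real schema:

# on the master (run the same on the slave, writing counts_slave.txt)
mysql -u root -p -N -e "SELECT table_name, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='databasename' ORDER BY table_name" > counts_master.txt
# copy one file across, then:
diff counts_master.txt counts_slave.txt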

If you run into any problems with this process or if you have any
queries, please feel free to hit me up at imtiaz at netbrix dot net OR
leave a comment and I'll try to help. Try to do periodic tests of
your slave server's integrity and also the integrity of your drives.
You don't want to be caught in the middle of an fsck ruining your
happiness, especially if it's business-critical data.
