MongoDB Initial sync failed

After upgrading to MongoDB 3.x and switching the storage engine from MMAPv1 to WiredTiger, we had to resync all replica sets from scratch. During that resync I found that on just one shard there is a small collection with a very large number of documents that are constantly created, updated and deleted. Shortly before the resync should have finished, the log file exploded and the sync crashed. I saw thousands of lines like the following:

I REPL     [repl writer worker 12] replication failed to apply update: { ts: Timestamp 1432762108000|244, h: -8874966953786363575, v: 2, op: "u",...
I REPL     [repl writer worker 4] replication failed to apply update: { ts: Timestamp 1432762108000|232, h: -2423412449985218813, v: 2, op: "u", ...
I REPL     [repl writer worker 10] replication failed to apply update: { ts: Timestamp 1432762108000|259, h: 6507151402806398582, v: 2, op: "u", ...
I REPL     [repl writer worker 13] replication failed to apply update: { ts: Timestamp 1432762108000|251, h: 8927018345590464930, v: 2, op: "u", ...
I REPL     [repl writer worker 8] replication failed to apply update: { ts: Timestamp 1432762108000|242, h: 7518410875297535456, v: 2, op: "u", ...
I REPL     [repl writer worker 12] replication info adding missing object
I REPL     [repl writer worker 12] replication missing object not found on source. presumably deleted later in oplog
I REPL     [repl writer worker 12] replication o2: { _id: ObjectId('5566368006c56f080dfc5033') }
I REPL     [repl writer worker 12] replication o firstfield: $set 
I REPL     [repl writer worker 4] replication info adding missing object
...

While this was happening, I also saw a lot of network connections in the TIME_WAIT state:

netstat -an | grep TIME_WAIT | wc -l
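
The plain count does not show where the connections go. A minimal sketch that groups TIME_WAIT sockets by remote port, so you can check whether they really belong to MongoDB. It assumes the Linux `netstat -an` column layout (state in column 6, remote address in column 5); the function name `count_time_wait_by_port` is made up for this example:

```shell
# Read `netstat -an` style lines on stdin and print "<count> <remote port>"
# for every remote port that has sockets in TIME_WAIT, highest count first.
# Assumes the Linux column layout: Proto Recv-Q Send-Q Local Foreign State.
count_time_wait_by_port() {
  awk '$6 == "TIME_WAIT" {
         n = split($5, parts, ":")   # the port is the last ":"-separated field
         count[parts[n]]++
       }
       END { for (port in count) print count[port], port }' | sort -rn
}

# Usage:
#   netstat -an | count_time_wait_by_port
```

If the top entry is the port your mongod instances listen on, the TIME_WAIT sockets come from the replication traffic.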

After some time searching Google, I found a bug report on the MongoDB side that described my situation exactly. There, kernel parameters were tuned to avoid these errors.

The settings they used, which also worked for me:

sysctl -w net.ipv4.ip_local_port_range="10000 61000"
sysctl -w net.ipv4.tcp_tw_reuse=1
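
Values set with `sysctl -w` are lost on reboot. A minimal sketch for keeping them, assuming a Linux system that reads `/etc/sysctl.d/*.conf`; the helper `print_sysctl_fragment` and the file name `90-mongodb-resync.conf` are hypothetical choices for this example:

```shell
# Print the two settings in sysctl.conf syntax.
print_sysctl_fragment() {
  cat <<'EOF'
net.ipv4.ip_local_port_range = 10000 61000
net.ipv4.tcp_tw_reuse = 1
EOF
}

# Show the current values (read-only, no root needed):
#   sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse
# Persist and reload (as root; the file name is a hypothetical choice):
#   print_sysctl_fragment > /etc/sysctl.d/90-mongodb-resync.conf
#   sysctl --system
```

If you only need the settings for the duration of the resync, the two `sysctl -w` calls are enough.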

tcp_tw_reuse allows sockets in the TIME_WAIT state to be reused for new connections. It is a good option when a process has to handle many short-lived TCP connections that are left in the TIME_WAIT state. Here that process is MongoDB itself.
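
To get a feeling for why the wider port range helps as well, here is a rough back-of-the-envelope sketch. It assumes that every closed connection blocks its local port for 60 seconds in TIME_WAIT (the usual Linux value) and that all connections go to the same destination address and port; the function name `max_conn_rate` and the "older default" range are assumptions, so check your own system:

```shell
# Rough upper bound on new outgoing connections per second to one
# destination ip:port, if each closed connection keeps its local port
# in TIME_WAIT for tw_seconds (assumed 60 s) and ports are not reused.
max_conn_rate() {
  low=$1
  high=$2
  tw_seconds=${3:-60}
  echo $(( (high - low + 1) / tw_seconds ))
}

max_conn_rate 32768 61000   # an assumed older default range: prints 470
max_conn_rate 10000 61000   # the widened range used here:    prints 850
```

With tcp_tw_reuse enabled the kernel may hand out ports that are still in TIME_WAIT, so this limit matters much less; the numbers only illustrate how quickly the ports can run out without it.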
