Engineering

Sleepless nights with MongoDB WiredTiger and our return to MMAPv1

We have been using MongoDB 2.6 with MMAPv1 as the storage engine for the past two years. It had been a stable component in our system until we upgraded to 3.0 and promoted secondaries configured with WiredTiger as the storage engine to primary. To put things in context, we do approximately 18.07K operations/minute on one primary server. We have two shards in the cluster, so two primary servers, making it ~36.14K operations/minute. This represents a small fraction of our incoming traffic, since we offload most of the storage to a custom-built in-memory storage engine.

The lure of 7x to 10x faster throughput

WiredTiger promises faster throughput powered by document-level concurrency, as opposed to collection-level concurrency in MMAPv1. In our quick tests before upgrading in production, we saw a 7x performance improvement. Jaws dropped, we decided to upgrade the following weekend. We were going to do this in phases:
1. Upgrade cluster metadata and Mongo binaries from 2.6 to 3.0. Sleep for 3 days
2. Re-sync a secondary with WiredTiger as the storage engine and promote it as the primary. Sleep
3. Change config servers to WiredTiger. Sleep
4. Upgrade existing MONGODB-CR Users to use SCRAM-SHA-1. Sleep

With so much sleep factored in, we were hoping to wake up sharp 🙂
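For reference, a rough sketch of the commands behind phases 2 and 4 (the replica set name, dbpath and flags below are placeholders, not our exact invocations):

Phase #2 – start the node with an empty dbpath and WiredTiger so it performs a fresh initial sync; once it has caught up, ask the current primary to step aside from the mongo shell:

    mongod --replSet rs0 --storageEngine wiredTiger --dbpath /data/db-wt
    rs.stepDown()

Phase #4 – upgrade the existing MONGODB-CR credentials to SCRAM-SHA-1 (run against a mongos in a sharded cluster):

    db.adminCommand({ authSchemaUpgrade: 1 })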

The Upgrade

Phase #1 – The binary upgrade was executed like clockwork on Saturday morning, within an hour. In the meantime, all incoming traffic was queued and processed after the upgrade. Monday night, I began re-syncing the secondary server with WiredTiger as the storage engine.

So far so good

Phase #2 – On Tuesday, we stepped down our primary nodes to let the WiredTiger-powered secondary nodes begin serving production traffic. Within minutes we had profiling data showing that our throughput had indeed increased.

At this point we could possibly throw 7x more traffic at MongoDB without affecting throughput. WiredTiger stayed true to its promise. After a couple of hours in production, we decided to re-sync the old MMAPv1 primary nodes, now secondaries, with the storage engine set to WiredTiger. We were now running with a single functional data node (the primary) in the replica set.

An hour later, all hell broke loose

Within a few minutes, MongoDB’s throughput plunged to about 1K operations/minute. It felt like MongoDB had come to a halt. This was the beginning of the end of our short, wild ride with WiredTiger. Scrambling to figure out what had happened, we saw our monitoring system report that the Mongos nodes were down. Starting them manually brought a few minutes of relief. The Mongos logs seemed to say:


2016-MM-DDT14:29:55.497+0530 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2016-MM-DDT14:29:55.497+0530 I SHARDING [signalProcessingThread] dbexit:  rc:0

Who or what sent the SIGTERM is still unknown; the system logs had no details on this. A few minutes later, Mongos decided to exit again.
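A quick way to rule the kernel OOM killer in or out (a possibility raised in the comments below) is to grep the kernel and system logs on the affected hosts, for example:

    dmesg | grep -i "out of memory"
    grep -i "killed process" /var/log/messages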

By this time, we had all jumped into the war room — the Mongod nodes along with Mongos were restarted. Things looked stable and we had some time to regroup and think about what had just happened. A few more Mongo lockups later, we identified, based on telemetry data, that MongoDB would lock up every time WiredTiger’s cache hit 100%. Our data nodes were running on r3.4xlarge (120GB RAM) instances. By default, MongoDB allocates 50% of RAM to WiredTiger’s cache, and over time, as the cache filled to 100%, MongoDB would come to a halt.

With the uneasy knowledge of imminent lockups showing up every few hours, and a few more lockups in the middle of the night, we moved to r3.8xlarge (240GB RAM, thank you god for AWS). With 120G of cache, we learnt that WiredTiger stabilised at 98G of cache for our workload and working set. We still didn’t have a secondary node in our replica set, because initiating the re-sync would push the cache to 100% and bring MongoDB to a halt. Another sleepless night later, we got the MongoDB data nodes to run on an x1.32xlarge (2TB RAM). Isn’t AWS awesome? With 1TB of cache, we were able to get our secondary nodes to fully re-sync with the storage engine set to MMAPv1, so that we could revert and get away from WiredTiger and its cache requirements. MMAPv1 had lower throughput, but it was stable, and we had to get back to it ASAP.
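If you do stay on WiredTiger, the cache ceiling itself is tunable rather than fixed at the default; a minimal sketch using the standard mongod option (the value shown is illustrative, not a recommendation):

    mongod --storageEngine wiredTiger --wiredTigerCacheSizeGB 100 --dbpath /data/db-wt

The same setting lives at storage.wiredTiger.engineConfig.cacheSizeGB in mongod.conf. Of course, a bigger cache only postpones the problem if the working set keeps growing past it.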

Lessons learnt (the hard way)

Keep an MMAPv1 replica node around for at least 10 days when upgrading

We were too quick to start re-syncing with the storage engine set to WiredTiger. If an MMAPv1 secondary had still been fully in sync, we could have promoted it to primary and avoided all the sleepless nights.
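Promoting a specific, healthy secondary is cheap compared to a re-sync. A sketch from the mongo shell, assuming the MMAPv1 node is members[0] in the replica set config:

    cfg = rs.conf()
    cfg.members[0].priority = 2   // favour the MMAPv1 node in elections
    rs.reconfig(cfg)              // the current primary steps aside once the higher-priority member has caught up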

Size your Oplog correctly based on your workload

Because of the frequent updates and inserts, our oplog could hold only a few hours’ worth of data during normal production traffic. Only at night would it cover enough hours for a node to re-sync and then catch up using the oplog, which meant we could re-sync only during the night.
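You can see exactly how many hours that window covers before kicking off a re-sync, straight from the mongo shell:

    rs.printReplicationInfo()        // on the primary: configured oplog size and the "log length start to end" window
    rs.printSlaveReplicationInfo()   // how far behind each secondary currently is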

WiredTiger is very promising, but currently untamed and wild

Based on our experience, WiredTiger is true to its promise of 7x to 10x throughput improvement, but it locks up when its cache hits 100%. We have been able to reproduce this with the latest version (3.2.9 at the time of writing) and are in the process of filing a bug report. We are committed to providing more information to help solve this issue. As it stands — the WiredTiger storage engine, which is the default in the latest version of MongoDB, is unstable once the working set exceeds the configured cache size. It will bring production to a halt.

Config servers know too much about the state of data

As a closing thought, if you are evaluating MongoDB or already using it, think about disaster recovery. While firefighting, we realized just how difficult it is to restore from a filesystem snapshot backup or to set up a new cluster when your data lives in a sharded cluster. Exporting all the data from each node and importing it into a new cluster is extremely slow for a dataset of ~200G per node. In our case, it would take ~36 hours.
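For a sense of what that export/import involves, it is essentially the stock dump-and-restore path, node by node (hosts and paths below are placeholders, and your tooling may differ):

    mongodump --host shard1-primary:27017 --out /backup/shard1
    mongorestore --host new-cluster-mongos:27017 /backup/shard1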

UPDATE #1:
We made a boo-boo in representing the numbers we put out. It’s operations/minute (literally the worst mistake you could make), not operations/second as originally stated in the opening paragraph. I have updated this post to reflect the change.

UPDATE #2:
– All things being equal, WiredTiger should have at least matched the stability of MMAPv1, considering that it’s the default storage engine from Mongo 3.2 onwards.
– Each time we restarted the data nodes, WiredTiger would work fine for a couple of hours and then freeze. On digging deeper, we noticed that it would freeze when the cache reached 95% capacity, so clearly there is some cache eviction issue at play here.
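For anyone wanting to watch for the same pattern, the cache fill level is exposed through serverStatus; a quick mongo shell sketch (the field names are the standard WiredTiger cache statistics):

    var cache = db.serverStatus().wiredTiger.cache
    var pct = 100 * cache["bytes currently in the cache"] / cache["maximum bytes configured"]
    print("WiredTiger cache fill: " + pct.toFixed(1) + "%")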

  • “Who or what sent the SIGTERM is still unknown. System logs had no details on this.”

    Might have been out of memory (OOM) killer. Sounds likely given your cache issues.

    • We looked for an OOM in our logs but couldn’t find it. Also, the log lines are from the mongos nodes. The data nodes continued to run, with severely reduced throughput.

      • You can take a look at /var/log/messages. All OOM kills will be reported there. Are you running aggregations or Map reduce jobs? They use up RAM outside the cache. There isn’t any easy way to determine who/what is using the RAM outside the cache – hopefully we will get better tracing for this down the line.

        • @dharshanr:disqus we went through /var/log/messages to look for an OOM. Couldn’t find it. Looks like MongoS sent itself SIGTERM when MongoD was slow.

  • mamoos2

    You might want to give MongoRocks a try:
    https://github.com/mongodb-partners/mongo-rocks

    It’s basically using the same storage APIs that wiredTiger is using, but the storage engine itself (RocksDB) has much more production maturity, as it’s being used at Facebook in hyper-scale scenarios.

    • CleverTap

      Thanks for the heads up @mamoos2:disqus I will give MongoRocks a spin.

      • Peter Zaitsev

        If you’re looking for MongoDB Distribution which includes MongoRocks try Percona Server for MongoDB https://www.percona.com/software/mongo-database/percona-server-for-mongodb

        • Thanks @peterzaitsev:disqus Will give Percona Server for MongoDB a shot

      • niki twitty

        Hi
        We had the same problem as you.
        We dug a bit further and found that you can tune these eviction issues with the eviction_target=80,eviction_trigger=95 parameters. The cache is usually 80% full, and sometimes it starts to grow fast and everything breaks at 95%. Setting eviction_target to 50% gave us some more time to restart mongod before everything was dead. But this didn’t solve the problem.

        In the end we took the MongoRocks distribution from Percona and it has worked like a charm since then. The migration is the same as you wrote above. You can have Rocks and WiredTiger members in one replica set.

        • mamoos2

          This is exactly the same experience we had.
          Don’t use wiredTiger – use RocksDB and things will start working like a charm.

  • govindsk

    Would it help to try DocumentDB? It has a hosted, MongoDB-protocol-compliant frontend to begin with.

    • Thanks for the heads up @govindsk:disqus. What kind of latency would I expect? Off the top of my head, I would be concerned about network latency.

  • Roman Verbitsky

    You have to try 3.2. My server is 10 times smaller and easily handles 1000 batch upserts/writes per second. 3.0 was a very early alpha.

    • CleverTap

      Will be giving 3.2.10 a shot to see if the fixes documented here ( https://jira.mongodb.org/browse/SERVER-26055 ) work for our workload.

      • Nuno Khan

        Yeah curious to see if 3.2.10 fixes your issue

        • We are working with MongoDB to confirm this.

          • mamoos2

            For us, moving to 3.2.10 improved performance significantly, but it did not improve cache related stalls.

  • Alex Malafeev

    Francis, what kind of load were you doing that ~600 QPS per server was causing cache eviction and cache utilization issues? I am genuinely trying to understand that.

    We’ve run 3.0 and 3.2 (with snappy, on AWS r3.8xlarges and r3.4xlarges) in production for over a year now and have seen amazing results, with QPS peaking >100k per server, and days and weeks of >50k QPS load (on updates and reads) without a hitch, and great 99.9th percentile latency, etc. WiredTiger probably saved us >100k/month in AWS reserved billing due to snappy alone.

    What kind of workload are you doing? What are the indexes like? What does htop/mongostat/mongotop/dstat look like? What is the execution time for these ops you are doing? Can you post some example explain and slowops?

    Did you try 3.2.10?

    • Alex, the workload is mostly a search for a document on an indexed field, followed by an update to the same document; if it doesn’t exist, we create a new one. Does your working set fit in cache? If it does, you won’t run into this.

      A find is about 1.2ms on average and an update about 1.8ms, after compacting the database.

      We will be running our workload against 3.2.10 shortly. Will post stats once the runs are complete

      • Alex Malafeev

        Francis, that is precious little detail if you want help.

        Yes, we have a number of collections where the total ram of the machine is about enough to fit the indexes, and the total size on disk is 10x or more of the ram of the machine.

        What kind of workload are you doing? You should post query traces, etc. Are you sure the index is being used? What does htop/mongostat/mongotop/dstat look like? Does mongotop say you spend most time on writes or reads? What’s a document you’re accessing like? How big is the collection in terms of documents? How large is the index in total, and what other indexes are there? Can you post some example .explain and slowops? Are you compacting the database after every query?

        Stats are not created in a vacuum and depend on what you’re doing, and from the metrics you posted, my strong suspicion is that you are doing something really weird to make r3.4xlarge instances run out of memory.

  • Dennis Kashkin

    I find it very odd that WiredTiger does not want to use more than 60% of RAM (and more than 50% in the latest release). If you guys ended up buying terabytes of extra RAM and the bottleneck is the WiredTiger cache, why not try increasing the WiredTiger RAM allocation to 90%? My assumption is that WiredTiger’s cache is “smarter” than the file system cache because WT understands your data structure and your queries… I also assume that Mongo picked the 60% default with slow SATA storage in mind, and SSD-based deployments don’t need to waste too much RAM on primitive IO caching. Thoughts?

    • Jakrixa

      The reason for this is because WT cache is not the only RAM in use by Mongo. In addition to the memory allocated to WT, the filesystem cache, connections, and other resources make use of the remaining RAM.

      For the issue being experienced here, setting WT cache to 90% does not appear to be effective if it doesn’t encompass the working set. I’ve been battling the same bug on my own Mongo cluster. Short of providing enough RAM, it seems you will always run into this bug.

  • Jakrixa

    We’ve been operating a Mongo 3.0/3.2 replica set with a data size of around 8TB for almost a year. WT has been a saving grace on the data compression front. With our updates/s sitting around 17,000 at peak, we see a lot of activity in the WT cache. Your experience sounds all too familiar.

    In normal operation under high load, WT is supposed to allocate up to 80% of what it is offered. Your 240GB EC2 instance, giving 120GB to WT, would thus sit around 96GB in the cache; you saw 98 on average. Under Mongo 3.0, our mongods would sit at the 80% mark for days or weeks, but eventually they would begin to climb and hit 95-100%, which was always lethal to the mongod.

    Working with MongoDB support, we were assured that 3.2 introduced improvements which would help. The upgrade went smoothly and performance looked good the remainder of the day. Less than 24 hours later, however, WT cache blew out to 95% and took us down with it. Writes literally grind to a halt when WT cache hits 95%. In the aftermath, we went from resetting each mongod every few days or weeks to resetting nearly every single day for every single mongod. … Ugh.

    To quote one of our developers, “The mongo graphs really look like time bombs since upgrading to 3.2.”

    The good folks at MongoDB have been working with us on the issue. Current indications are that the heavy write volume is causing evictions to fail for reasons presently unknown. Why we average about 24 hours before WT cache starts to show any signs of distress is a mystery. The bug occurs at random times too and not necessarily at peak activity.

    Your issue with rsSync and WT cache going to 100% on the sync source is all too familiar. We cannot presently rsSync any new node. It is an impossibility. In fact, getting our new node online at all has proven presently impossible given all the issues Mongo is throwing at us compounded by the data size.

    Another strange issue we encountered with 3.2 was the sudden degradation in performance of one of our secondaries. Unexplained. Had to downgrade it back to 3.0 for the time being.

    If MMAP were an option we’d consider it, but 10TB of RAID SSD space x N replicas is unrealistic. The answer really appears to be more RAM. A lot more RAM.

    Merry Christmas and happy holidays. Good luck to you guys!

  • Did you try again to test WiredTiger since that?

    • Haven’t had the chance to run WT again. MMAPv1, although slower, has been serving us well.

  • Алексанр Платонов

    Hi. To create a new replica set member, use rsync from a backup server. We had some issues with the initial sync process and memory consumption on WT 3.2.11.

    • Thanks for the heads up @disqus_FJccMFLVMO:disqus For now we have reverted to MMAPv1

  • Ignacio Poletti

    Nice post! Thanks for sharing.
