The Craigslist Dilemma: A case study for big data and NoSQL solutions
It's hard to imagine just how many postings Craigslist has handled over the years. If you were in charge of archiving those posts and storing them for compliance, you'd need a 'big data' solution. So how does Craigslist manage all of its data, both the incoming postings and the ones that need archiving? With a combination of MySQL, NoSQL and a little help from the people at 10gen.
With more than 1.5 million new ads posted every day, Craigslist users have generated over a billion records, which some might even consider ‘big data.’ What’s more, legislation demands that these records can’t simply be erased or overwritten at the whim of the company: after a 60-day retention period in the live portion of the site, records must be migrated to an archival space for legislative compliance.
And how does Craigslist manage this brobdingnagian volume of data? Prior to 2011, the archive was a MySQL cluster, part of the company’s larger database infrastructure of over one hundred MySQL servers. Unfortunately, instead of making data persistence easy, the nature of MySQL created complexity. That forced Craigslist to start exploring NoSQL options that could handle a huge amount of incoming data, run the archive process at the same time, and scale up easily over time.
The 'big data' challenge
As you can imagine, Craigslist faced several challenges due to the nature and volume of the data stored on its relational MySQL servers. For example, the structure of the data had changed several times over the years. This alone made any change to the database schema a costly, prolonged nightmare: changes often meant downtime, and any alteration carried the potential for unintended consequences. And if altering the database was a challenge, just imagine how difficult introducing entirely new features became. What’s more, each change to the live database schema required a corresponding change to the entire archive, a process that took months every time. During these updates the archival process had to be put on hold, which meant stale data piled up in the live databases and slowed down the site’s performance.
The NoSQL solution
Don’t get the impression that anyone at Craigslist is slamming MySQL. MySQL is still revered as a stellar relational database, and the people in charge didn’t want to stop using it for active online postings. It was the dead postings that needed a better “graveyard.” So what was the NoSQL solution? Craigslist handed the job of archiving posts and their accompanying metadata to MongoDB, storing each post as a document rather than as a row in a relational database table. The process was relatively speedy: including the time needed to sanitize and prepare the data, migrating 1.5 billion postings to the new archive database took only about three months.
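To make the row-versus-document distinction concrete, here is a minimal sketch in plain Python. The field names (`posting_id`, `category`, `meta` and so on) are hypothetical and not taken from Craigslist's actual data model, and a plain list stands in for a MongoDB collection. What it illustrates is that postings written under different versions of a format can sit side by side in one archive without a shared schema.

```python
import json

# Hypothetical posting archived under an older format: only a few fields.
old_posting = {
    "posting_id": 1001,
    "title": "Road bike for sale",
    "body": "Lightly used, 54cm frame.",
}

# Hypothetical posting under a newer format: extra fields and nested
# metadata that a relational table would need new columns or tables for.
new_posting = {
    "posting_id": 2002,
    "title": "Studio apartment",
    "body": "Available next month.",
    "category": "housing",
    "meta": {"images": 3, "neighborhood": "Mission"},
}

# A list stands in for a document collection; no schema is declared anywhere.
archive = [old_posting, new_posting]


def fields_used(collection):
    """Return the sorted top-level field names seen across all documents."""
    names = set()
    for doc in collection:
        names.update(doc.keys())
    return sorted(names)


if __name__ == "__main__":
    for doc in archive:
        print(json.dumps(doc, sort_keys=True))
    print(fields_used(archive))
```

In a relational table, the newer posting's extra fields would typically require a schema change before the row could be stored; here the older document is simply left as it was.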
Key Benefits of a NoSQL Solution like MongoDB for Big Data
And of course, while there are obvious differences between a relational store and a NoSQL solution, there are similarities as well. After all, both systems are simply storing data for future retrieval. Jeremy Zawodny, a software engineer at Craigslist, appreciated this compatibility: “Coming from a relational background, specifically a MySQL background, a lot of the concepts carry over.... It makes it very easy to get started.”
Craigslist was able to implement the NoSQL solution in both of its data centers using servers in multi-node clusters. These clusters provide data replication and enhanced reliability, and they ensure there is no single point of failure: because the entire archive exists in each “shard,” servers can fail over without losing any data. The whole system scales readily on commodity hardware, and new machines can be added without any downtime.
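The idea of failing over without losing data can be sketched with a toy model. This is an illustration of the general concept only: the class and member names are hypothetical, writes are copied to every healthy member at once, and the promotion rule is arbitrary. It does not reproduce how MongoDB actually replicates data or elects a new primary.

```python
class ToyReplicaSet:
    """Toy model of replication with failover (not MongoDB's protocol)."""

    def __init__(self, member_names):
        # Each member keeps its own full copy of the data.
        self.copies = {name: {} for name in member_names}
        self.healthy = set(member_names)
        self.primary = member_names[0]

    def write(self, key, doc):
        # Simplification: every healthy member receives the write immediately.
        for name in self.healthy:
            self.copies[name][key] = dict(doc)

    def read(self, key):
        # Reads are served from whichever member is currently primary.
        return self.copies[self.primary].get(key)

    def fail(self, name):
        """Mark a member as down; promote another one if it was the primary."""
        self.healthy.discard(name)
        if not self.healthy:
            raise RuntimeError("no healthy members left")
        if name == self.primary:
            # Arbitrary rule for the sketch: first healthy name alphabetically.
            self.primary = sorted(self.healthy)[0]


def demo():
    cluster = ToyReplicaSet(["node-a", "node-b", "node-c"])
    cluster.write(1001, {"title": "Road bike for sale"})
    cluster.fail("node-a")  # the primary goes down
    return cluster.primary, cluster.read(1001)


if __name__ == "__main__":
    print(demo())
```

Because every member holds a copy, the posting is still readable after the original primary is marked as failed.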
Archiving now occurs seamlessly, even when the MySQL schema undergoes changes. Samantha Kosko describes how this process works: “Once a posting goes dead, MongoDB then reads into MySQL and writes that posting into a JSON-like document. By doing that, they were able to provide a schema-less design that allowed them the flexibility to archive multiple years of files without worrying about failure or flexibility in design.”
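The flow Kosko describes can be sketched as follows. This is a minimal illustration under stated assumptions, not Craigslist's actual code: Python's built-in `sqlite3` stands in for MySQL, a plain list stands in for the MongoDB collection, and the `postings` table, its columns, the `status = 'dead'` flag and the deletion of archived rows are all hypothetical.

```python
import sqlite3


def archive_dead_postings(conn, archive):
    """Copy dead postings from the relational store into `archive` as dicts.

    `conn` is a DB-API connection (sqlite3 here, standing in for MySQL);
    `archive` is a list standing in for a document collection.
    """
    cur = conn.execute("SELECT * FROM postings WHERE status = 'dead'")
    columns = [col[0] for col in cur.description]
    rows = cur.fetchall()
    for row in rows:
        # Build the document from whatever columns exist right now, so a
        # schema change on the live side needs no change to the archive.
        doc = dict(zip(columns, row))
        archive.append(doc)
        # Assumption: archived rows are removed from the live table.
        conn.execute("DELETE FROM postings WHERE id = ?", (doc["id"],))
    conn.commit()
    return len(rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postings (id INTEGER PRIMARY KEY, title TEXT, status TEXT)"
    )
    conn.execute("INSERT INTO postings VALUES (1, 'Couch', 'dead')")
    conn.execute("INSERT INTO postings VALUES (2, 'Desk', 'live')")
    archive = []
    archive_dead_postings(conn, archive)

    # The live schema changes; the archive itself is not touched.
    conn.execute("ALTER TABLE postings ADD COLUMN price INTEGER")
    conn.execute(
        "INSERT INTO postings (id, title, status, price) "
        "VALUES (3, 'Lamp', 'dead', 15)"
    )
    archive_dead_postings(conn, archive)

    remaining = conn.execute("SELECT COUNT(*) FROM postings").fetchone()[0]
    conn.close()
    return archive, remaining


if __name__ == "__main__":
    docs, live_count = demo()
    for doc in docs:
        print(doc)
    print("rows still live:", live_count)
```

The document archived before the schema change has no `price` field and the one archived afterwards does, yet both land in the same archive without any migration step.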
Follow Cameron McKenzie on Twitter (@potemcam)