How an early bet on Apache Spark paid off big

At the Spark Summit East in New York this week.


When we first looked at the Apache Spark data-processing engine in 2011, it was a little-known project inside the University of California, Berkeley’s AMPLab, familiar to people at the lab but not to the market at large.

Now, that same Apache Spark project is enjoying a meteoric rise in popularity, as the big data community becomes increasingly attuned to the capabilities and strength that Spark delivers.

At ClearStory Data, we recognized Spark’s potential very early on as the next big thing in data-processing technology. Why? Because we’d been at Hadoop companies like Cloudera and had worked deeply with Google’s MapReduce data-processing framework at Aster Data and Teradata.

In the early days of our search to create a next great company, we made our first big bet that the engine under the hood for the next era of data analytics should be Spark. Our second big bet was on delivering a super-simple user experience.

With experience from Cloudera, Aster Data and Teradata under our belts, we knew storing big data at lower costs was a massive need. As data exploded, the last-generation data platforms were cost-prohibitive. We saw first-hand that customers wanted fast-cycle data processing on large amounts of internal and external data to gain near-real-time insights into what’s happening in the business.

But frameworks like MapReduce weren’t designed for near-real-time analytics on large data volumes. MapReduce was fine for batch processing but too slow and complex to be viable for that kind of fast-cycle workload. Spark caught our eye for its promise to provide a robust, massively parallel cluster-computing platform.

When we saw Spark in action at the AMPLab, it was architecturally everything we hoped it would be: distributed, in-memory data processing that delivered speed at scale. We recognized we’d have to fill in holes and make it commercially viable for mainstream analytics use cases that demand fast time-to-insight on large volumes of data. By partnering with AMPLab, we dug in, prototyped the solution, and added the second pillar needed for next-generation data analytics: a simple-to-use front-end application.

We built ClearStory Data on this core premise: Why speed up data processing at scale if the results stay stuck in IT and can’t be consumed by the business any faster?

We began the journey of ClearStory Data by building a next-generation data-analytics solution that could access more data variety (more sources) and speed data processing via Spark to make key insights quickly consumable by the business, adding a front-end user application that anyone could use — not just IT and data jocks.

On the surface, it may look like we made a lucky bet on Spark, but it was a well-informed architectural decision based on first-hand experience. We built our back-end engine on Spark, added our own intellectual property to it, and integrated a front-end user application to speed up delivering insights to the business.

From there, we began speaking to investors, formalizing, and incorporating ClearStory Data. These few months included bringing a stellar team of back-end data architects together with front-end designers and engineers, all with consumer-application backgrounds. We married two very different types of technical DNA and put them on one team.

Companies that are adopting easy-to-use Spark-based platforms are those that compete aggressively and can’t afford to miss out on insights or delay decision making. Data sources are getting more diverse, and time requirements are shrinking. Every dollar lost in sales when a consumer buys a competitor’s product is hard to win back. With always-on data intelligence, rear-view-mirror insights are a thing of the past.

Retail managers, for example, can understand why no yogurt sold at a particular time of day. This simple-sounding insight could involve nine or ten data feeds. If the company wants this data processed and analyzed several times per day, at scale, that requires the computational power of Spark coupled with a dead-simple (not simpler) application for business users.

Early adopters by sector include consumer-packaged goods, insurance, media and entertainment, pharmaceuticals, retail, and automotive: in short, any industry where consumers wield the power and companies need to attract or keep them every day. In health care and pharmaceuticals, faster and more holistic insights can speed the diagnosis-to-cure cycle. Biosensors can provide real-time measurements of patient vitals to detect early warning signs of serious or critical symptoms, potentially saving lives and reducing medication, lab tests, and other costs.

Having led many companies for 19 years in a variety of markets, I find it’s like watching the same movie over again. The growth phases, challenges, and road to winning are identical. The first couple of years are tuning and tweaking at a pace that makes every day feel like a marathon. When all things align (product, people, market, technology choices), you hit the point of takeoff. In our case, Spark was one of those technology choices. Wrap it all together, and it makes our first visits to the Berkeley AMPLab one of the great scenes of this movie.

Sharmila Mulligan is the chief executive and founder of ClearStory Data. She has spent more than 18 years building software companies in a variety of markets. She is a board member at several startups, an advisor to numerous companies, and an active investor in early-stage companies.