If you fall, fall right – a tale of SRE critical incident management
By Yehuda Levi, Tal Valani, Ariel Pisetzky & Eli Azulai
Imagine this scenario – your data center is down. 1,500 servers are down. Each server needs to be handled and monitored, and responsibility for each one has to be divided among all teammates. Each team is tracking the status of its own services. Client-facing services are impacted. New information keeps flowing in from different channels, and the status of the outage and of the servers keeps changing. How do you get the list of affected servers? How do you put it all in one place? How do you assign responsibility for each server? What is the status of each server? How can internal clients receive ongoing status updates? What happens if a server was intentionally down before the incident? What happens if a more complex issue occurs and the time to […]
Optimize Data Center Health: Taboola employs an LSTM autoencoder for precise anomaly detection, enhancing system performance.
Failure. I need to talk about failure, and not just any failure: my failure. I need to share it with everyone in the production group and everyone in R&D. My team, my peers, my managers. The meeting will start in just a few minutes, and I am under fire to explain what went wrong, how I failed the organization and how we need to be better. General George S. Patton Jr. said “The test of success is not what you do when you are on top. Success is how high you bounce when you hit the bottom.” There is a lot to learn from that saying, and not only for people. Successful systems need to bounce back from a failure and do it well. IT systems need to be able to endure a catastrophic event and just dust themselves off. This is what we expect of our production systems at Taboola, […]
So, the firefighters are in your data center, there is no electricity, and the pager is more like a DDoS attack on your phone than anything informative. You look at your watch, multiple thoughts running through your head. Why me? Why now? What was the last DR test result? How do you pull the team out of and through this IT catastrophe and survive to write about it? This is my story, my personal fight with the IT “Murphy’s laws” and how we can all benefit from it. It was a Friday, the kind of day you know you need to be extra careful with. It’s always the end of the work week or smack in the middle of the night. (No IT catastrophe ever happens when it’s convenient for you, now does it? They always cluster and bunch around the most difficult times.) Anyway, it’s the end of the day Friday and multiple […]
Taboola is responsible for billions of daily recommendations, and we are doing everything we can to make those recommendations fit each viewer’s personal taste and interests. We do so by updating our deep-learning-based models, increasing our computational resources, improving our exploration techniques and much more. All those things, though, have one thing in common – we need to understand whether a change is for the better or not, and we need to do so while allowing many tests to run in parallel. We can think of many KPIs for new algorithmic modifications – system latency, diversity of recommendations or user interaction, to name a few – but at the end of the day, the one metric that matters most for us at Taboola is RPM (revenue per mille, or revenue per 1,000 recommendations), which indicates how much money and value we create for our customers on both sides – the […]
This post is not about K8S – nor is it about AWS. It is not about containers – nor is it about some new, “cool” technology for managing large-scale applications. Rather, this post is about how we deploy a highly sophisticated Java service, a heavy service that is very actively developed on a daily basis, to thousands of servers across our 7 data centers around the world. So what’s the problem? Isn’t it enough to take a list of servers, get the version to deploy and run it with an automation tool like Ansible? Well, it’s not as simple as it might seem. This service serves Taboola’s recommendations and responds to hundreds of thousands of requests per second. The service has to be fast – fast enough that its p95 latency stays below 500 milliseconds per request. That means we can’t have any downtime at all, or even afford slower […]
Optimizing Spark Executor Utilization: Harnessing Dynamic Allocation and Resource Management for Efficient Workload Processing.
Optimized MySQL slave replication for faster WAN connections. Explore key strategies and configurations for enhanced MySQL replication performance.
At Taboola, we deal with scale, huge scale. A small issue might turn into a disaster in a matter of hours. Rewriting and replacing an existing service with a new one is a real challenge, and doing it without causing downtime is SCARY. Reading logs is not an option. Logs are gigantic, unwieldy and span many machines. It would take hours to combine and analyze them. In this post I will share with you three graphs in Grafana that I think are a must for observing new code. Let’s start… Did I break production? You write your shiny code, you (even) test it, but how would you verify that you didn’t break the production environment? Luckily, we use Grafana, and this actually makes a big difference. My plan was to compare old code vs. new in Grafana, but where to start? You have Grafana… let’s use it! Frankly, I […]
To facilitate flexibility and technological hype, you want to work with people who know how to learn. This is much better than having someone who knows a specific programming language, because a person ‘who knows how to learn’ can learn any new language! This agility is crucial, because technology is always changing and learning is endless. My story begins two and a half years ago in Taboola Engineering, where I arrived with dozens of new employees. In fact, 50 percent of the developers were new (less than one year)! Taboola was growing, and with great growth comes a great need to learn. My goal was to create learning programs, but along the way I realized that it went far beyond this – learning brings personal development, curiosity, doubt, and insights into the organization’s working methods. The past years have been an exciting journey of many collaborations, trial and error, […]