Why did Cloudflare go down?
And why do so many cloud services keep going down?
Cloudflare went down on Nov 18, 2025. This made many services unavailable, including ChatGPT, X (Twitter), and many other major services. Services I host were unavailable as well, since I use Cloudflare as my main CDN (content delivery network).
Based on current reports, Cloudflare went down because of a change to the permissions on one of its database systems. The change caused the database to output multiple entries into a feature file used by its Bot Management system. Because the file was larger than it was supposed to be, and it had been propagated to all the machines, the software reading it failed. This file is what the machines read to keep up to date with the latest threats. To learn more, you can read the post-mortem on the Cloudflare blog.
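To make the failure mode concrete, here is a minimal sketch of the kind of guard that can stop an oversized file from taking a service down: validate the new file, and if it looks wrong, keep serving with the last known-good version. This is not Cloudflare's code. The function names, the one-feature-per-line format, and the limit of 200 are all assumptions made up for illustration.

```python
MAX_FEATURES = 200  # hypothetical limit, chosen only for this example


def parse_features(text):
    """Parse one feature name per line, ignoring blank lines (assumed format)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_features(new_text, last_good, max_features=MAX_FEATURES):
    """Return the new feature list if it passes validation.

    If the new file is empty or has more entries than expected,
    keep the last known-good list instead of failing.
    """
    features = parse_features(new_text)
    if not features or len(features) > max_features:
        return last_good
    return features


if __name__ == "__main__":
    current = ["feature_a", "feature_b"]
    # A file that doubled in size because every entry was duplicated.
    oversized = "\n".join(["feature_a", "feature_b"] * 200)
    current = load_features(oversized, current)
    print(len(current))  # still the two known-good features
```

The design choice here is to degrade gracefully: a stale feature list is usually a smaller problem than a process that cannot start.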
The cascading effect

We had a major outage at AWS about a month earlier, which was due to DNS issues affecting DynamoDB. Azure went down as well, and now Cloudflare has gone down too.
You might be thinking the internet is so fragile that it's going to break, but that's not true.
Before the cloud, and before high availability was something most teams could afford, people expected services to have downtime. You had to manage your own servers, and unless you were rich, you probably had to shut them down on nights or weekends for updates and other maintenance. Without the money for a lot of servers, you had no backups during an outage and no way to keep your service going.
Why did it happen?

The actual issue here is over-reliance on specific cloud platforms and services like Cloudflare, where too many parts of the internet depend on a single provider. These services are good, but we cannot trust that they will always be up or that there will never be issues. We need competition, and we need to be provider-agnostic in our deployments, so that whatever we build doesn't break because of a single point of failure.
What is reliability?
To be reliable is to be dependable at most, if not all, points in time.
To build reliable systems you need redundancy: no single point of failure, backups, and multiple running systems you can switch between for zero downtime.
Reliability is expensive, so people tend to look for reliable services that are cheap. When those services fail, the outage becomes the lesson in why you should not depend on them alone.
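The idea of having multiple running systems to switch between can be sketched in a few lines. The example below is only an illustration, not a production failover setup: the backend names and the `is_healthy` callback are hypothetical, and a real deployment would usually do this at the DNS or load balancer level.

```python
def first_healthy(backends, is_healthy):
    """Return the first backend that passes its health check, or None.

    `backends` is an ordered list (primary first, then fallbacks).
    `is_healthy` is any callable that returns True or False for a backend;
    a check that raises is treated the same as an unhealthy backend.
    """
    for backend in backends:
        try:
            if is_healthy(backend):
                return backend
        except Exception:
            continue
    return None


if __name__ == "__main__":
    # Hypothetical providers; pretend the primary CDN is down.
    status = {"cdn-primary": False, "cdn-secondary": True, "origin": True}
    chosen = first_healthy(["cdn-primary", "cdn-secondary", "origin"], status.get)
    print(chosen)  # cdn-secondary
```

The ordering of the list is the whole policy here: traffic goes to the primary when it is healthy and moves down the list only when it is not.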
What can you learn for your services that you run?

You will never be able to build fully reliable systems with zero downtime; you don't have the money to. You need to think about SRE and concepts like error budgets, where you decide how many issues you can allow your services to have, and FinOps, where you work out how best to manage the financial side, the paying part of your cloud bill. The goal is not only to meet your customers' needs but also to get the most out of your budget.
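To put a number on "how many issues you can allow", here is a small sketch of the usual error budget arithmetic: pick an availability target, and the remainder is the downtime you can afford. The function names and the 30-day window are assumptions for this example, not a standard API.

```python
def error_budget_minutes(slo_percent, period_days=30):
    """Minutes of downtime allowed in the period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, period_days=30):
    """Minutes of error budget left (negative means the target was missed)."""
    return error_budget_minutes(slo_percent, period_days) - downtime_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        allowed = error_budget_minutes(target)
        print(f"{target}% over 30 days allows about {allowed:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves about 43 minutes of downtime, while 99.99% leaves a little over 4 minutes. Each extra nine costs far more to achieve, which is where the budget side of the conversation comes in.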
If you want to learn more, or you are just starting out on your cloud/DevOps/SRE journey, sign up as I share more about my work, interests, and topics that will help you gain a better understanding of the world of Cloud Engineering.