VirMach/LET/ColoCrossing Down?

hostballsuser · June 24, 2019, 11:58am

My sites hosted on VirMach have been facing issues for more than an hour. They go offline for 1 or 2 minutes and then come back online.
LET is facing similar issues, and so are the VirMach billing panel and site. Looks like a DDoS attack. Or maybe an issue with ColoCrossing?

Due to this downtime I have suffered a loss of around $1993738 x 0 = 0 dollars. This is unacceptable because I am paying $1 per year for this VPS (BF 2018 deal) and I expect 100% uptime.

Anybody else facing this issue?

khaled · June 24, 2019, 12:02pm

No issues with VirMach at all. However, LET seems to be down.

uptime · June 24, 2019, 12:06pm

I’m guessing it’s something to do with Cloudflare for LET etc.

re Virmach - website seems down, my VPS are fine

root · June 24, 2019, 12:08pm

Virmach is up. LET is not down, it’s just not accessible from some locations at the moment.

uptime · June 24, 2019, 12:12pm

Update - We are continuing to work on a fix for this issue.
Jun 24, 11:43 UTC

Identified - We have identified a possible route leak impacting some Cloudflare IP ranges and are working with the network involved to resolve this.
Jun 24, 11:36 UTC

Investigating - Cloudflare is observing network related issues.
Jun 24, 11:02 UTC

Amitz · June 24, 2019, 12:12pm

Changes…?

“Okay, let’s change this place radically. Let’s take it offline. If that ain’t a change?”.
Just kidding…

dah85 · June 24, 2019, 12:14pm

Definitely a cloudflare outage. Interrupted my plex server…

uptime · June 24, 2019, 12:42pm

More than just Cloudflare, apparently, as per the “Route Leak Impacting Cloudflare” thread on Hacker News.

FHR · June 24, 2019, 12:49pm

Half of the internet is down. A $clueless_company is leaking the whole internet routing table to the clueless Verizon, who redistributes it to a lot of ISPs.

Mason · June 24, 2019, 1:07pm

Global outage for Discord as well.

vovler · June 24, 2019, 1:11pm

Huh… It’s time to go outside and enjoy nature for a few hours

Miguel · June 24, 2019, 1:16pm

Indeed. Looks like it’s going to be busy.

Mason · June 24, 2019, 1:44pm

Seems to be mostly back to normal now

Wolveix · June 24, 2019, 1:54pm

I’m not that familiar with how or why BGP leaks occur. Does anyone feel like ELI5’ing it?

FHR · June 24, 2019, 2:24pm

Lemme try. Observe this:

We can see that a $clueless_company (AS396531 / Allegheny whatever) gets internet connectivity through two providers (called upstreams in network speak): Verizon (AS701) and DQE Communications (AS33154).

So what does “get connectivity” exactly entail? $clueless_company receives a bunch of routes - in fact, probably the whole internet routing table (which is around 800k routes at the moment) from both providers. Their router imports a routing table from both of these providers and then chooses best routes based on several factors.
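To make "chooses best routes" a bit more concrete, here is a minimal sketch of two of the factors real BGP uses: higher local preference wins, and on a tie the shorter AS path wins. The `Route` class, the upstream names, and the AS numbers (from the documentation range) are all made up for illustration, and real routers go through a longer list of tie-breakers.

```python
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    upstream: str      # hypothetical label for the provider the route came from
    local_pref: int    # operator-assigned preference, higher is better
    as_path: tuple     # AS numbers the announcement has traversed

def best_route(candidates):
    """Pick the route with the highest local_pref, then the shortest AS path."""
    return max(candidates, key=lambda r: (r.local_pref, -len(r.as_path)))

# Two candidate routes for the same prefix, one from each upstream.
routes = [
    Route("203.0.113.0/24", "upstream-a", 100, (64500, 64510, 64530)),
    Route("203.0.113.0/24", "upstream-b", 100, (64501, 64530)),
]
chosen = best_route(routes)
print(chosen.upstream)
```

With equal local preference, the route via `upstream-b` is chosen because its AS path is one hop shorter.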

Now this communication/exchange of routes (which happens over a protocol called BGP) goes both ways. Basically, as a company, you want to have some IP space of your own too. So in order for the world to be able to reach your IP space, you need to export routes for this IP space to your upstreams.

Thanks to the magic of routing, every router on the internet knows that in order to reach your IP space, they need to go either through Verizon or DQE. Great!

BUT. You are supposed to FILTER what you export!!! So essentially instead of only exporting their 1 prefix (IP block/range) or so, they did this:

To put it in words - they took everything DQE provided them and swiftly exported that to Verizon. Verizon decided to play an exhibitionist and propagated this to the rest of their peers - basically all other Tier 1 ISPs (some of which accepted it - I know at least TATA, Cogent, Telia did) and their customers.

Which meant that half of the internet now learned that in order to reach e.g. CloudFlare or OVH or whatever, they can go through Verizon → $clueless_company → DQE.
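The difference between a filtered and an unfiltered export can be sketched in a few lines. This is a toy model, not router config: the prefixes and AS numbers come from the documentation ranges, and `export_to_upstream` is a hypothetical helper, not any real router's API.

```python
# Hypothetical data: the customer's own IP space (documentation range).
OWN_PREFIXES = {"198.51.100.0/24"}

# Routes learned from one upstream (stand-in for DQE in the story above).
learned_from_upstream_a = [
    {"prefix": "203.0.113.0/24", "as_path": [64500, 64510]},
    {"prefix": "192.0.2.0/24", "as_path": [64500, 64520]},
]

def export_to_upstream(own_prefixes, learned_routes, local_as, filtered=True):
    """Return the list of routes announced to the other upstream."""
    announcements = [{"prefix": p, "as_path": [local_as]} for p in sorted(own_prefixes)]
    if not filtered:
        # The leak: re-announce everything learned from the first upstream,
        # with our own AS prepended to each path.
        for route in learned_routes:
            announcements.append(
                {"prefix": route["prefix"], "as_path": [local_as] + route["as_path"]}
            )
    return announcements

good = export_to_upstream(OWN_PREFIXES, learned_from_upstream_a, 64496)
leak = export_to_upstream(OWN_PREFIXES, learned_from_upstream_a, 64496, filtered=False)
print(len(good), len(leak))
```

The filtered export announces only the customer's own prefix. The unfiltered one also announces every learned route, which is what tells the other upstream "you can reach these networks through me".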

To be honest, I’m not even mad at the $clueless_company. They made a mistake, and mistakes happen.
Who I’m pissed off at is Verizon. Their multiple faults resulted in this disaster.

1. They didn’t filter $clueless_company.
2. They didn’t put a prefix count limit on $clueless_company.
3. Their NOC did exactly nothing to mitigate this issue.
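A prefix count limit is simple in principle: the provider decides how many prefixes a customer may plausibly announce and shuts the session down when that number is exceeded. The sketch below is a toy model under that assumption. The `BgpSession` class, the limit of 10, and the prefixes are hypothetical, and real routers offer more options (warnings, automatic restart timers).

```python
class BgpSession:
    """Toy model of a BGP session with a maximum-prefix limit (illustrative only)."""

    def __init__(self, peer_name, max_prefixes):
        self.peer_name = peer_name
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.up = True

    def receive(self, prefix):
        """Accept a prefix, or tear the session down once the limit is exceeded."""
        if not self.up:
            return False
        if prefix not in self.accepted and len(self.accepted) >= self.max_prefixes:
            # Limit exceeded: drop the session and everything learned over it.
            self.up = False
            self.accepted.clear()
            return False
        self.accepted.add(prefix)
        return True

# A small customer expected to announce a handful of prefixes suddenly sends 1000.
session = BgpSession("small-customer", max_prefixes=10)
for i in range(1000):
    session.receive(f"10.{i // 256}.{i % 256}.0/24")
print(session.up, len(session.accepted))
```

Here the session goes down on the eleventh prefix, so none of the leaked routes stay in the table.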

//EDIT: Decided to make it into a blog post. Check it out if you want slightly more details

Wolveix · June 24, 2019, 2:28pm

Perfect, that makes far more sense than what I was researching. Thank you for taking the time to explain that! Yeah, shit happens, but damn Verizon should sort their shit out.

Mason · June 24, 2019, 3:19pm

@FHR → you da man! Thanks for the explanation!

hostballsuser · June 24, 2019, 3:23pm

Is this the same blunder that Pakistan made a few years back?

Mason · June 24, 2019, 3:28pm

Similar, but I think the Pakistan incident was intentional to cut off YouTube traffic and it got leaked out because of a Hong Kong telecom company not filtering it out.

Probably more similar to when European mobile internet traffic was rerouted through China earlier this month due to another BGP mistake/leak.

FHR · June 24, 2019, 5:05pm

There are subtle differences. The Pakistan incident was a “BGP Hijack”; this was a “BGP Leak”.
What’s the difference? It’s simple really.

When you do a BGP Hijack, you (either intentionally or by accident) claim that “Hey internet, IP address 1.3.3.7 is mine. Send traffic for this to me!”

This can potentially be used by malicious actors to pretend they’re popular sites (and do phishing; steal login credentials), to block DNS, to assume a block of IP addresses for sending spam…

With a BGP Leak, you (mostly accidentally, there are not many illegitimate purposes for this) claim this: “Hey internet, you can reach Google through me!”

Unless you want to snoop on traffic (and pay outrageous bandwidth bills depending on leak size), there are no illegitimate purposes for this. You just redirect traffic through yourself.
In theory, this shouldn’t affect reachability - it just increases latency. The problem is when (as we’ve seen today) the port capacity is small. We can assume all of CloudFlare’s (and others’) traffic tried to flow through something like a 10 gigabit port. That overwhelms the port and causes close to 100% packet loss.
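A rough back-of-the-envelope version of that: if more traffic arrives than the port can carry, the excess is dropped, so the loss fraction is about 1 minus capacity divided by offered load. The offered loads below are invented numbers purely for illustration (nobody in this thread knows the real figures), and the model ignores queueing and senders backing off.

```python
def packet_loss_fraction(offered_gbps, capacity_gbps):
    """Approximate fraction of traffic dropped by a saturated port."""
    if offered_gbps <= capacity_gbps:
        return 0.0
    return 1.0 - capacity_gbps / offered_gbps

# Assumed 10 Gbit/s port with three hypothetical offered loads.
for offered in (5, 100, 1000):
    loss = packet_loss_fraction(offered, capacity_gbps=10)
    print(f"{offered} Gbit/s offered: {loss:.0%} loss")
```

The further the offered load exceeds the port capacity, the closer the loss gets to 100%.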

A BGP Hijack is much harder to pull off since there are mechanisms in place (and sometimes working) which limit, or completely prevent, any impact.
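One way to see the hijack/leak difference in code: a hijack changes who appears as the origin (the last AS in the path), while a leak keeps the right origin but puts an AS that should not be providing transit in the middle of the path. The sketch below assumes a hypothetical table of expected origins and a hypothetical set of "stub" ASes. It is not how any particular validation system works in detail, just the idea that an origin check alone catches the first case and not the second.

```python
# Hypothetical registry: prefix -> AS that is supposed to originate it.
EXPECTED_ORIGIN = {"192.0.2.0/24": 64520}

def classify(prefix, as_path, stub_ases):
    """Label an announcement as 'ok', 'hijack', 'leak', or 'unknown'."""
    expected = EXPECTED_ORIGIN.get(prefix)
    if expected is None:
        return "unknown"          # nothing to compare against
    if as_path[-1] != expected:
        return "hijack"           # someone else claims to originate the prefix
    if any(asn in stub_ases for asn in as_path[:-1]):
        return "leak"             # right origin, but a stub AS is acting as transit
    return "ok"

stubs = {64496}  # assumed: an end customer that should never carry transit traffic
print(classify("192.0.2.0/24", [64500, 64520], stubs))
print(classify("192.0.2.0/24", [64500, 64499], stubs))
print(classify("192.0.2.0/24", [64501, 64496, 64500, 64520], stubs))
```

The third announcement passes the origin check and is only flagged because the path itself is inspected.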