All quiet on the cluster front
February 28, 2025

Trying my best to tell myself it is indeed solved
The dust has very much settled at this point. I’ve sort of sat and stared for a while as I can’t quite believe the stability I’ve experienced after so much instability. It’s been about two weeks since I took action once again on the hardware front, and it’s been all quiet since. I mentioned previously acquiring new hardware, but as it turns out that was not the issue. It did, however, lead to the discovery of what seems to be either the root cause or at least a partner in crime of my woe.
So, when I left things off last I was looking to perform some file system integrity checks, up to a full drive replacement. Attempting to run fsck proved troublesome from the start, and honestly I’m not sure it ever worked at all for me. On Ubuntu, you used to be able to drop a file at the root of the root filesystem called, I believe, /forcefsck. But since systemd took over the booting of the system, that does not work anymore. After several attempts anyway, I figured it could be a problem for future Evan so I went ahead and started to replace the drive. Got the drive replaced, operating system installed. All seemed fine.
At this point in the day, a Saturday, it was time to pivot into some games. Started playing Final Fantasy 14 with one of my friends, having a grand time. I know my room is getting hot which is a good test since heat was where I saw these conditions originally. As it turns out, the machine had died again, with a new drive. Honestly in that moment I elected to simply ignore my feelings and focus on enjoying my game. Which was the best call to make, since this whole thing has really been controlling my headspace to detrimental effect.
Once we wrap our gaming session up for the day, I elect to try and take a brief look at things. My indicator for health by the way is if the machine is alive on my Tailscale network. Since I do virtually all my activities over Tailscale, if it’s not alive on there, it’s pretty much dead in my eyes. I had noticed some “flapping” for lack of a better word, where the machine had died for a while, then came back, then died again. My plan, since it was late, was to go to bed obviously. But I left one of my monitors on so I could see any log messages in the morning in case issues happened again.
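A minimal sketch of how that kind of “flapping” could be spotted automatically, assuming you already have a list of periodic online/offline observations for the machine (for example, one per minute). The function names, the sample data and the threshold of three state changes are all illustrative assumptions, not part of my actual setup, and how the observations get collected is deliberately left out.

```python
from typing import Iterable


def count_transitions(samples: Iterable[bool]) -> int:
    """Count how often the state changes between consecutive samples.

    Each sample is True if the machine was reachable at that check,
    False if it was not.
    """
    transitions = 0
    previous = None
    for online in samples:
        if previous is not None and online != previous:
            transitions += 1
        previous = online
    return transitions


def is_flapping(samples: Iterable[bool], threshold: int = 3) -> bool:
    """Treat the machine as flapping once the state has changed
    at least `threshold` times within the observed window.

    The default of 3 is an arbitrary, hypothetical choice: it matches
    "died, came back, died again" and ignores a single clean outage.
    """
    return count_transitions(samples) >= threshold


# Hypothetical observations: up, up, down, down, up, down.
observed = [True, True, False, False, True, False]
print(count_transitions(observed))
print(is_flapping(observed))
```

Counting state changes rather than failed checks is the point here: one long outage is a single change, while a machine that keeps dying and coming back racks up changes quickly.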
Sure enough I had some things to look into. The main thing was seeing the interface for the NUC complaining about something exceeding two seconds. That was intriguing. Googling the error brought me to the forums for Proxmox, a well-established virtualisation tool. I found someone who had the same error, on damn near the exact same hardware as me: an Intel NUC 11 with a 2.5Gb interface. Finding someone with the same issue as you, on near identical hardware, is an almost ecstatic moment.
With much haste I scanned the forum thread for the back and forth between the OP and someone else, seeing what troubleshooting they did together. One thing that was clear to them was that there was some kind of hardware issue on the network interface. For the OP, replacing the entire switch they were using made the issue go away. Could I just have a bad port on my network switch? Or maybe the interface on the NUC has just died? Two easily solvable problems: for the former, just change the port you’re plugged into; for the latter, perhaps a USB-C to Ethernet adapter.
This is when I had a flashback. A few months ago, when playing Final Fantasy 14 and my room got hot, I would lose all network connectivity on my desktop. I’d be forcibly disconnected from the game, thinking I had some kind of Internet issue. But my phone would work fine on WiFi, and disabling / re-enabling the interface on Windows would not resolve it either. I would need to unplug and replug the ethernet cable going into the machine to get it to work again, only for the issue to reoccur after a while. So, I replaced the cable and the issue resolved. Do you see where I’m going with this?!
I replaced the ethernet cable the NUC was using. To be extra sure, I even replaced the patch cable for the NUC into the switch. And performance / uptime has been solid, he emphasises, hoping to God it isn’t a jinx. I’ve done my best to recreate temperatures in this room and I’ve even witnessed temperatures rising on the NUC, the fan kicking in and the temperatures going back down. It’s honestly been really hard to accept that this could be fixed, but I’ve forced myself to reintroduce workloads to the machine and I’ve implemented a lot more monitoring and alerting. It still needs improvements, and should be discussed in another blog post, but for now, things are all quiet on the cluster front. Save for the fans of course, those are loud.
I truly hope this is a conclusion to this blog post series, a series that I didn’t even want to turn into a series haha. It’s been really nice to focus on other projects and self hosting activities. But I plan to share more on what I will be working on next very soon. I appreciate those of you who I know have been following along and offering support! It’s going to be great to focus on building instead of debugging for a while and I hope you’ll continue to follow me on that journey.
Thank you!
You could have consumed content on any website, but you went ahead and consumed my content, so I'm very grateful! If you liked this, then you might like this other piece of content I worked on.
The previous post in this mini series
Photographer
I've no real claim to fame when it comes to good photos, which is why the header photo for this post was shot by Samuel Thompson. You can find some more photos from them on Unsplash. Unsplash is a great place to source photos for your website, presentation and more! But it wouldn't be anything without the photographers who put in the work.
Find Them On Unsplash
Support what I do
I write for the love and passion I have for technology. Just reading and sharing my articles is more than enough. But if you want to offer more direct support, then you can support the running costs of my website by donating via Stripe. Only do so if you feel I have truly delivered value, but as I said, your readership is more than enough already. Thank you :)
Support My Work