02 March 2024

And we're back. Partially.

Farm status
CPU only
Both Ryzen 5900X machines are running Rosetta beta work when it's cool enough.

Nvidia GPUs
Have been running Einstein work when it's cool enough.

Raspberry Pis
Off


Other news
As you would have gathered, most of the farm has been off since the end of last year due to the heat. We've had a few cooler days as we come to the end of summer, so I have managed to fire up some of the farm and get going again. For the moment it will run sporadically as the weather permits.

The Ampere Altra is unable to complete the installation of Debian Bookworm, so it's stuck at the moment. The installer gets through everything up to installing the GRUB bootloader, and that step fails. I raised a bug with Debian but haven't heard anything from them. Google suggests it's something to do with being in UEFI mode, but that doesn't help.

I don't have any new hardware plans at the moment for the farm apart from the Raspberry Pis.

30 December 2023

30th of December 2023

Farm status
CPU only
Off

Nvidia GPUs
Off

Raspberry Pis
On and off depending on how hot it gets.


Blackout killed UPS
We had a blackout a fortnight ago. That cost me a CyberPower 650VA UPS that only had to power the internet cable modem and the ISP's router. It appears to have totally killed the battery inside, and the UPS wouldn't power on. I have purchased another one (they are cheap), but if the power goes off for more than an hour it seems it's too stupid to shut itself down and totally drains the battery. It has a lead-acid battery, and if one of those is drained below about 50% that is the end of the battery.

Fortunately most things were off or, in the case of the Raspberry Pis, idle.


Debian 12.3 point release
Following the issues with the 12.3 point release, Debian decided to stop the release. A day later they did a 12.4 point release, which included an updated kernel that didn't have the ext4 corruption issue. They then followed that up with another kernel update a few days later due to issues with WiFi drivers.

Raspberry Pi OS still seems to be running an affected kernel that has the ext4 data corruption issue.


Parts orders
In my last post I mentioned getting Contact Frames (or Secure Frames) for my Ryzen 7900s. I decided to get Thermal Paste Guards instead. I've ordered them, so I should be able to complete the builds in the new year.


Altra server issue
I decided to upgrade my Ampere Altra to Debian Bookworm. Unfortunately the installer fails with a "grub install dummy" failed message. This seems to be related to booting in UEFI mode, which is how the server appears to be booting, so I am not sure if I need to create legacy boot media for it. I'll make another attempt when the house is empty, because of the noise the server makes.

10 December 2023

10th of December

Farm status
CPU only
Off

Nvidia GPUs
Off

Raspberry Pis
Running overnight.

For more information on the Raspberry Pis see Marks Rpi Cluster

 

File corruption bugs
Debian discovered the kernel they are pushing out in their 12.3 point release (kernel 6.1.64) has an ext4 file system corruption bug. It was fixed in the 6.1.66 kernel but Debian haven't updated to it yet.
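If you want to check a machine against those version numbers, here is a minimal Python sketch. It is only an illustration: the function names are made up for this example, and it assumes that anything from 6.1.64 up to (but not including) 6.1.66 should be treated as suspect. On Debian the upstream version usually shows up in the output of "uname -v" rather than "uname -r", so that is the kind of string it expects.

```python
import re

# Version range taken from the post: bug present in 6.1.64, fixed in 6.1.66.
# Assumption: anything in between is treated as suspect too.
FIRST_BAD = (6, 1, 64)
FIRST_FIXED = (6, 1, 66)


def parse_upstream_version(text):
    """Return the first x.y.z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def kernel_looks_affected(text):
    """True if the version found in text falls in the suspect range."""
    version = parse_upstream_version(text)
    if version is None:
        # No recognisable version, so nothing can be said either way.
        return False
    return FIRST_BAD <= version < FIRST_FIXED
```

To try it on a live machine, something like kernel_looks_affected(platform.version()) after "import platform" should work, assuming the kernel build string carries the Debian package version.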

OpenZFS also has a file corruption bug, which is fixed in OpenZFS 2.1.14 (or 2.2.2 if you are running a 2.2 version). Strangely, Debian have put OpenZFS 2.1.14 into the bookworm-backports repo. One would have thought they would include it in the Debian 12.3 point release that came out on the 9th of December, or offer it as a security fix for bookworm.

The bad news is I have a few servers with the affected version of OpenZFS. However, to trigger the bug one needs to be rewriting files on the disks faster than the underlying device(s) can keep up, which I don't do. I have applied the Debian 12.3 point release to a number of machines, so I likely have the ext4 issue. I haven't seen any problems so far, so maybe it is only an issue under some combination of conditions.
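For the OpenZFS side, a similar rough sketch in Python can tell whether a version string is at or past the fixed releases mentioned above (2.1.14 on the 2.1 branch, 2.2.2 on the 2.2 branch). The function name is hypothetical, and how other branches are handled is my assumption: anything older than 2.1 is treated as not fixed and anything newer than 2.2 as fixed.

```python
import re

# Fixed releases named in the post, keyed by branch.
FIXED_BY_BRANCH = {(2, 1): (2, 1, 14), (2, 2): (2, 2, 2)}


def openzfs_has_fix(version_text):
    """True if the x.y.z version found in version_text includes the fix."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no x.y.z version found in %r" % version_text)
    version = tuple(int(part) for part in match.groups())
    branch = version[:2]
    if branch in FIXED_BY_BRANCH:
        return version >= FIXED_BY_BRANCH[branch]
    # Assumption: branches older than 2.1 are unfixed, newer than 2.2 are fixed.
    return branch > (2, 2)
```

The version string can be whatever your package manager or the zfs tools report, as long as it contains the x.y.z version somewhere in it.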


Other news
I still haven't assembled the Ryzen 7900 machines. I need to get a couple of Contact Frames (sometimes called Secure Frames) for the CPU socket before I install them.

I went on a cruise for a few weeks so the farm was off during that period. Most of the farm is off due to hot weather at the moment. Yesterday hit 39 degrees C. Unfortunately this is one of the joys of an Australian summer coupled with global warming.


28 October 2023

Hiatus

I've had the larger crunchers powered off to try to reduce my electricity bill. It doesn't seem to have had much effect, as the last bill was almost $900 for the quarter.

This week it was cool for a few days so I got all of the x64 machines going. For the CPU only machines (a pair of Ryzen 5900X) I ran a few hours of FGRP5 work. Most of the Einstein work is now GPU-based so I fired up the GPU crunchers (four Ryzen 3600 with an RTX 3060 Ti in each) and had them running for a day. The farm is back off as things warm up again.

The Raspberry Pis continue to crunch. For more information on them see Marks Rpi Cluster