25 November 2018

25th of November

Farm status
Intel GPUs
All running Einstein O1OD1 work overnight

Nvidia GPUs
Two running Einstein O1OD1 work overnight

Raspberry Pis
Twelve running Einstein BRP4 work

Increased storage capacity
I purchased 3 WD Red 8TB disks to put into the Drobo (it's used to back up the storage server). This brings the Drobo up to 16TB of usable space.

At the same time I purchased 3 Toshiba 8TB disks to put in the storage server. I have a new disk controller on order that supports up to 8 drives, and I will swap the drives around when installing it. The idea is to have a vdev of 3 drives in raidz1, which will give single-drive redundancy and, thanks to compression, more than 16TB of disk space. I can then add more drives as needed.
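As a rough check on the capacity figure, a raidz vdev gives you the capacity of (drives minus parity drives). The sketch below assumes that simple rule and ignores ZFS metadata, padding and the TB versus TiB difference; the 1.2x compression ratio is purely a hypothetical number for illustration, not a measurement.

```python
def raidz_usable_tb(drives: int, drive_tb: float, parity: int = 1) -> float:
    """Rough usable capacity of one raidz vdev (ignores metadata and padding)."""
    if drives <= parity:
        raise ValueError("a raidz vdev needs more drives than parity disks")
    return (drives - parity) * drive_tb


def with_compression(usable_tb: float, ratio: float) -> float:
    """Effective capacity if the data compresses at the given (assumed) ratio."""
    return usable_tb * ratio


raw = raidz_usable_tb(3, 8)  # 3 x 8TB drives in raidz1
print(f"raidz1 usable: {raw:.0f}TB")
# Hypothetical 1.2x compression ratio, just to show the effect.
print(f"with 1.2x compression: {with_compression(raw, 1.2):.1f}TB")
```

The same calculation applies to each additional vdev if more drives are added later as a new group.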

Disk performance
While doing some online research I came across gnome-disk-utility, which can benchmark disk performance, so I let it loose on a few different drives that I have. Interestingly, the Seagate expansion drives had the highest transfer speeds of the hard disks, while the SSHD came in the slowest for transfer speed and second slowest for seek time. I haven't benchmarked the NVMe SSD yet.

| Drive model | Type | Interface | Avg read (MB/sec) | Avg seek (msec) |
|---|---|---|---|---|
| Samsung 850 Pro | 2.5" SSD | SATA | 521.7 | 0.05 |
| WD10j31x | 2.5" SSHD | SATA | 89.8 | 17.31 |
| WD2002FAEX | 3.5" HDD | SATA | 124.2 | 12.78 |
| WD4000F9YZ | 3.5" HDD | SATA | 137.3 | 14.32 |
| Seagate expansion 3TB | 3.5" HDD | USB 3 | 158.8 | 17.51 |
| Seagate expansion 2TB | 3.5" HDD | USB 3 | 139.6 | 15.42 |
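To double-check the rankings mentioned above, here is a small sketch that sorts the benchmark figures from the table. The numbers are copied straight from the gnome-disk-utility results; nothing else is assumed.

```python
# (model, type, average read in MB/sec, average seek in msec) from the benchmark table.
drives = [
    ("Samsung 850 Pro", "SSD", 521.7, 0.05),
    ("WD10j31x", "SSHD", 89.8, 17.31),
    ("WD2002FAEX", "HDD", 124.2, 12.78),
    ("WD4000F9YZ", "HDD", 137.3, 14.32),
    ("Seagate expansion 3TB", "HDD", 158.8, 17.51),
    ("Seagate expansion 2TB", "HDD", 139.6, 15.42),
]

# Fastest average read first.
by_read = sorted(drives, key=lambda d: d[2], reverse=True)
# Longest (slowest) average seek first.
by_seek = sorted(drives, key=lambda d: d[3], reverse=True)

for name, kind, read, seek in by_read:
    print(f"{name:<22} {kind:<5} {read:>6.1f} MB/sec {seek:>6.2f} msec")

print("Fastest hard disks:", [d[0] for d in by_read if d[1] == "HDD"][:2])
print("Slowest read:", by_read[-1][0])
print("Two slowest seeks:", [d[0] for d in by_seek[:2]])
```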

11 November 2018

11th of the 11th (month)

Farm status
Intel GPUs
All have been running Einstein O1OD1 work overnight

Nvidia GPUs
All off

Raspberry Pis
Twelve running Einstein BRP4 work

Other news
This week has been all about the file server and ZFS.

The Samsung 970 Pro SSD arrived along with the Silverstone ECM21 adapter card. My only gripe is that they didn't include a screw to hold the SSD in place, even though the card has holes for one. Cheapskates. I don't have any in my stockpile that fit, so I ended up using a cable tie. I have since gone on eBay and ordered a packet of 12 screws, which I hope are the right size.

Installation is straightforward: no drivers are needed, and Linux saw it as a disk device and gave it an nvme device name. I just used gparted to create a partition table and format it. My original idea was to split it into 2 partitions, one as a ZFS Intent Log (aka ZIL) and the other as a cache drive; however, ZFS likes to deal in drives rather than partitions, so it refused to mount the second partition. I tried it both as a ZIL and as a cache drive. A ZIL doesn't need to be big: about 8GB is enough even for a 10GbE network, because it only needs to hold 5 seconds' worth of writes.
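The 8GB figure can be sanity-checked with a little arithmetic: a full-speed link delivering writes for 5 seconds. This sketch assumes the link runs flat out at its raw line rate (8 bits per byte, no protocol overhead), which is the worst case for sizing; the 5-second window is the one quoted above.

```python
def zil_size_gb(link_gbit_per_s: float, seconds: float = 5.0) -> float:
    """Worst-case GB of writes arriving over a link in `seconds`.

    Assumes the raw line rate with no protocol overhead (8 bits per byte).
    """
    bytes_per_s = link_gbit_per_s * 1e9 / 8
    return bytes_per_s * seconds / 1e9


for link in (1, 10):
    print(f"{link:>2} Gbit/s for 5 s -> {zil_size_gb(link):.3f} GB")
```

Even at 10 Gbit/s the worst case is a little over 6GB, so an 8GB log has some headroom.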

I tested by dragging a 3.7GB test folder on and off my Drobo 5N2, deleting it from the destination before running each test.

The Drobo has 2 x 1GbE ports bonded together, 4 x 4TB WD Red hard disks and a 256GB mSATA SSD. The storage server has a 10GbE NIC and 4 x 4TB WD SE (enterprise grade) hard drives plugged into an Intel controller in JBOD mode. Both are plugged into the same Netgear switch. One 10GbE port on the switch was an uplink to a 10GbE switch, which in turn has a 1GbE link to the router. The other 10GbE port was plugged into the storage server, and the Drobo was plugged into two of the 1GbE ports. That makes the theoretical maximum speed 1 gigabit/sec (1GbE), i.e. the speed of the slowest device in the path; a single copy usually only uses one link of a bonded pair. Allowing roughly 10 bits per byte to cover protocol overhead (the raw line rate works out to 125MB at 8 bits per byte), we should be able to push about 100MB (that's right, megabytes) a second.
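Here is the conversion from link speed to MB/sec as a quick sketch. It uses the post's 10 bits per byte allowance as the default and shows the raw 8 bits per byte figure for comparison; the 10GbE line is a hypothetical for a setup where both ends could run at that speed.

```python
def max_mb_per_s(link_gbit_per_s: float, bits_per_byte: float = 10) -> float:
    """Best-case transfer rate in (decimal) MB/sec for a given link speed."""
    return link_gbit_per_s * 1000 / bits_per_byte


# A single copy is assumed to use one 1 Gbit/s link, even with the Drobo's bonded ports.
print(f"{max_mb_per_s(1, 8):.0f} MB/sec at 8 bits per byte (raw line rate)")
print(f"{max_mb_per_s(1):.0f} MB/sec at 10 bits per byte (allowing for overhead)")
# Hypothetical: both ends on 10GbE.
print(f"{max_mb_per_s(10):.0f} MB/sec if both ends had 10GbE")
```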

The best I could get was 49MB/sec, which drops down to around 20MB/sec before climbing back up; that is probably the SSD on the Drobo buffering data. That was going from the Drobo to the storage server. Setting the NVMe up as a cache drive or ZIL didn't really make any difference, regardless of which direction I was copying the files. As far as I understand it, the ZIL is hardly used because the writes are asynchronous, and the cache is only useful for data being read a second or subsequent time. If I had lots of users or random reads happening then it possibly would have made a difference, but for copying stuff in and out it doesn't change the speed. Ideally I should have used two devices that can do 10GbE so as to remove the network speed from the equation.
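To put the measured figures in context, this sketch compares them with the roughly 100MB/sec ceiling and works out how long the 3.7GB folder takes at each rate. It assumes decimal units (1GB = 1000MB) and treats each rate as if it were sustained for the whole copy, which the real transfers were not.

```python
FOLDER_MB = 3.7 * 1000      # the 3.7GB test folder in decimal MB
THEORETICAL_MB_S = 100      # 1 Gbit/s at roughly 10 bits per byte
measured = {"best": 49, "dip": 20}  # MB/sec figures observed during the copy

for label, rate in measured.items():
    efficiency = rate / THEORETICAL_MB_S
    seconds = FOLDER_MB / rate
    print(f"{label}: {rate} MB/sec = {efficiency:.0%} of the link, ~{seconds:.0f} s for 3.7GB")

print(f"ideal: ~{FOLDER_MB / THEORETICAL_MB_S:.0f} s for 3.7GB")
```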

I know the reviews recommend the Intel Optane 900P as a ZIL device, but it's AUD $529 for the 280GB model, compared to the $250 that I spent on the Silverstone adapter and Samsung 970 Pro (512GB). That's double the price for roughly half the capacity. You really need two devices for the ZIL (mirrored) to ensure data integrity as well.
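The price comparison works out as follows. This sketch uses only the prices and capacities quoted above; the Samsung figure includes the adapter card, and the mirrored-pair cost simply doubles the outlay, which is a rough assumption (a second adapter or slot may or may not be needed).

```python
# AUD price and capacity in GB as quoted; the Samsung price includes the adapter.
options = {
    "Intel Optane 900P 280GB": (529, 280),
    "Samsung 970 Pro 512GB + adapter": (250, 512),
}

per_gb = {name: price / gb for name, (price, gb) in options.items()}

for name, (price, gb) in options.items():
    # A mirrored ZIL needs two devices, so the outlay roughly doubles.
    print(f"{name}: ${per_gb[name]:.2f}/GB, about ${2 * price} for a mirrored pair")

print(f"Optane: {529 / 250:.1f}x the price for {280 / 512:.0%} of the capacity")
```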

I really liked Linus Tech Tips' NVMe-based server (a Supermicro SSG 2028R, I believe) connected to his hard-disk-based server. Just connect them with that 40GbE Mellanox controller and a short bit of fibre optic cable, and move the files off the fast storage to the slower one at lightning speed, or at least as fast as the hard disks can go. Maybe I should just buy an SSG 2028R...

If anyone has any suggestions on how to optimise it without throwing too much more hardware at it let me know. As always I'm happy to hear any comments.

03 November 2018

3rd of November

Farm status
Intel GPUs
All off

Nvidia GPUs
All off

Raspberry Pis
12 crunching Einstein BRP4 work

File server
I've converted it over to Linux, and that is when the problems started. The RAID controller wasn't recognised, even in JBOD mode, so I removed it and plugged the drives directly into the motherboard SATA ports. I set up ZFS on the 4 drives without any issues. When I copied the files back it took ages; it was showing only 75MB/sec write speed, which I think is pretty slow. Maybe the ZFS overheads are such that this is a good figure.

I managed to get the RAID controller going, but write speeds are still 75MB/sec. I have ordered a PCIe to NVMe adapter and an NVMe SSD to use as a cache and logging drive. NVMe SSDs are considerably faster than SATA, which only goes up to 6 gigabit/sec. Hopefully this will speed things up. I'm not sure if I need a second one (i.e. one for cache and another for logging).
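For a feel of the gap between SATA and NVMe, this sketch converts each link's signalling rate into usable bandwidth after line encoding. It assumes SATA III (6 Gbit/s with 8b/10b encoding) and a PCIe 3.0 x4 link (8 GT/s per lane with 128b/130b encoding); these are interface ceilings, not what any particular drive will actually sustain.

```python
def link_mb_per_s(gigatransfers: float, lanes: int, payload_bits: int, coded_bits: int) -> float:
    """Usable interface bandwidth in MB/sec after line-encoding overhead."""
    return gigatransfers * 1000 * lanes * payload_bits / coded_bits / 8


sata3 = link_mb_per_s(6, 1, 8, 10)        # SATA III: 6 Gbit/s, 8b/10b encoding
pcie3_x4 = link_mb_per_s(8, 4, 128, 130)  # PCIe 3.0 x4: 8 GT/s per lane, 128b/130b

print(f"SATA III : {sata3:.0f} MB/sec")
print(f"PCIe 3 x4: {pcie3_x4:.0f} MB/sec ({pcie3_x4 / sata3:.1f}x SATA)")
```

Both ceilings are far above the 75MB/sec being seen, so the interface alone is unlikely to be the limit.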

Networking updated
I ordered and received a couple of 10Gbit network switches (the ones with 2 x 10GbE and 8 x 1GbE ports) and put them in. I also got a full 8 x 10GbE switch to plug into the router so that I can use this higher speed down to the smaller switches and into the file server. I still have an extra 10GbE network card that I could put into the file server.

New LIGO search
Einstein have a new search they are conducting on the LIGO O1 data. It's CPU only. I did some testing to see how it performs on the Intel GPU machines using all available threads versus using half of them. It certainly does more work using all threads.

I started doing the same on the AMD machines, but after the first batch running on half the threads there isn't any new work available. Being 1st generation Ryzens, they are pretty poor on the hyperthreading front, whereas Intel have had quite a few generations to optimise their designs.