
I'm planning to build a new NAS to store a large amount of media (20 TB+). I would like to use btrfs for both the NAS and the backup (which might be a separate system; not sure yet).

  1. I want to use raid1 or raid10 to cover disk failure & bit rot
  2. I want to use one large file system with 8-15 subvolumes, for efficient space usage etc.

My issue is: it does not look like btrfs raid6 is up to scratch yet, and a single raid1 or raid10 file system will only protect me from a single disk failure. I'm worried that rebuilding my file system after a disk failure with 5 TB-10 TB disks will take days at least and expose me to total loss if another disk fails. I know I will then still have my backup, but I have the same issues there again.

  1. What are my options with btrfs for the above scenario?
  2. Is there any btrfs mode for combining disks that will only lose the files on a failed disk?
  3. Can btrfs use a backup file system rather than RAID to recover from a checksum error?
  4. What about ZFS?
  5. What about unRAID, FlexRAID, etc. for my scenario?

Thanks

2 Answers

  4. What about ZFS?

Hello Shaun,

I can't tell you much about btrfs; it's still on my to-do list. For ZFS there are a few solutions available, some with a graphical interface (they usually offer versions that are free for private use). I've also used it from the command line on Solaris, OpenIndiana and OmniOS, but for ease of use I'd recommend a dedicated NAS distribution like NexentaStor (more business-oriented, less intuitive GUI) or, in your case, probably FreeNAS (a good all-rounder, web GUI, free).

FreeNAS installation is a breeze (e.g. write the image to a USB stick (I prefer SLC-based sticks for better resilience), plug it into the mainboard, boot, configure the network on the command line and connect it to the network - after that, everything else is done via the web GUI) and the community is quite lively. And it has an easy option to install (as an isolated module) a media server (Plex Media Server) and let it see a selected directory or file system, optionally read-only.

And to me most important: you get (almost limitless) snapshots and snapshot-based replication to another box. Meaning: you can introduce a task that periodically makes snapshots and then replicates them to another box. That box doesn't have to be identical, it can be a low-cost system configuration (even based on a different system / OS) that only serves as an archive - or a full-fledged twin.
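
To give a concrete idea, a minimal command-line sketch of such a snapshot/replication task might look like this (pool, dataset and host names are placeholders; FreeNAS lets you set up the same thing as a periodic task in the GUI):

# take a snapshot of the media dataset
zfs snapshot -r tank/media@2014-09-26

# initial full replication to the backup box over ssh
zfs send -R tank/media@2014-09-26 | ssh backupbox zfs receive -F backup/media

# later runs only send what changed since the previous snapshot
zfs send -R -i tank/media@2014-09-26 tank/media@2014-10-26 | ssh backupbox zfs receive -F backup/media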

Now, when it comes to the disk configuration, some basic information is required, mainly about the kind of usage. Media files are usually large, and copying them to and from storage one by one is usually no big task for any system. What else will you need? Multiple simultaneous accesses to different media? Heavy skipping forward/backward? Basically put: how random is your read access? The same goes for write access. A single user, storing files and watching from time to time, shouldn't be a big deal. A home-theater box regularly scanning through all the media on the NAS to build an index for each file, or streaming out to 5 or 50 clients, is a totally different thing. 20 people working on separate projects, editing, cutting and merging media files, is another story completely.

The good news: ZFS can satisfy any of the above. Even all of them. But the costs will naturally vary. Let me give you some examples:

An 'entry configuration' (mainly single-user throughput) providing 24 TB might look like this:

  • one pool with a RAIDZ2 or RAIDZ3 configuration of 6 or 7 6 TB HDs respectively ('Z' is followed by the number of disks that may fail without actual data loss, max. 3)
  • 8 GB RAM (4 GB is a bit tight; with ZFS it's generally: the more, the merrier!)
  • one or more 1 GBit Ethernet ports (best to add a dedicated network for replication if needed/feasible)

This setup (about 24 TB) should suffice for mainly single-user access: big files copied serially onto the box, then read/streamed one at a time. Paired with an adequate CPU (recent generation, 2-4 cores, 2.5+ GHz) it should offer good read and write throughput, but due to the monolithic disk layout it would suffer from low IO performance (especially writing). Throughput would be expected to stay below 4x single-disk performance, and write IOPS in particular would be expected to be no more than that of a single disk (apart from cached reads, naturally). Rebuilds after a disk failure would naturally curb performance even further, but since only used blocks are rebuilt, a rebuild usually finishes much faster than a 'usual' RAID rebuild (depending on the fill rate of the pool).
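
For reference, creating such a pool from the command line is a one-liner (the disk device names are placeholders; FreeNAS builds the same layout through its volume manager):

zpool create tank raidz2 da0 da1 da2 da3 da4 da5
# or triple parity with seven disks:
zpool create tank raidz3 da0 da1 da2 da3 da4 da5 da6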

To improve parallel read performance, you can add a 'performance SSD' (high IO, good throughput) as L2ARC, an intelligent read cache that otherwise resides completely in RAM. That should greatly enhance read performance, but the L2ARC is 'emptied' on reboot, afaik. So after a reboot, it would have to gradually 'refill', based on the 'working set' of files / pattern of access.
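
Adding an L2ARC device to an existing pool is a single command (ada6 is a placeholder for the SSD's device name):

zpool add tank cache ada6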

Here's an example of a better parallel (read/write) performer:

  • one pool containing 6 mirrors of 3x 4 TB disks each (meaning each disk is mirrored TWICE for redundancy, which reduces load during a mirror rebuild: one copy can be read for re-mirroring while another serves read requests)
  • 32 GB RAM
  • 2x 200 GB+ L2ARC
  • one or more 10 GBit Ethernet ports (again, add one for replication between boxes)

This setup should offer several times the (read and write) IO of the first setup (data is spread over 6 mirrors instead of one RAIDZ device), performance during rebuilds should be much better, and rebuild time shorter (due to smaller disks). And redundancy (OK-to-fail) is 2 disks - for each mirror. Naturally you have more disks in total -> more likely to have a failed disk at some point. But a rebuild is faster and has much less of an impact.
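
Again just for illustration, such a layout could be created like this (18 placeholder device names, grouped into six three-way mirrors):

zpool create tank \
  mirror da0 da1 da2 \
  mirror da3 da4 da5 \
  mirror da6 da7 da8 \
  mirror da9 da10 da11 \
  mirror da12 da13 da14 \
  mirror da15 da16 da17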

Naturally, the IO also depends on the disks: compare 10,000 rpm with <3 ms seek time to 5,400 rpm with >12 ms seek time, not to mention SSDs with a fraction of that.

Speaking of SSDs, there is also an option for speeding things up using a separate device for 'write logging' called a SLOG (Separate LOG), usually one or more SSDs (or PCIe cards), but this is often misunderstood and thus used incorrectly. I won't delve further into the topic here except for one point: it only applies to SYNCHRONOUS transfers (write transactions are acknowledged only once the data has actually been written to stable storage, e.g. disks - in short, 'I'm finished'), as opposed to asynchronous transfers (write transactions are acknowledged as soon as the data has been received, but part or all of it may still be sitting in cache/RAM waiting to be written to stable storage - 'I shall do it ASAP'). Usually, when we're talking network shares for file storage, we're talking asynchronous transfers. Without any 'tweaks', sync writes are always slower than async ones. If you need this kind of integrity, just come back and ask for more. ;-)
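
Should you ever need it, adding a mirrored SLOG and checking the sync setting looks roughly like this (device and dataset names are placeholders):

# two SSDs as a mirrored SLOG - only benefits synchronous writes
zpool add tank log mirror ada6 ada7

# inspect / adjust the sync behaviour per dataset
zfs get sync tank/media
zfs set sync=standard tank/media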

Almost forgot: to ensure data integrity, it is best to use ECC-RAM (and compatible mainboard and CPU) to avoid data corruption due to unnoticed faulty memory. In a production environment, you would definitely not want that.

A few other features you might want to know about:

  • ZFS is generally (but not always, sigh) compatible among distributions/OSes based on the same ZFS version (if no additional 'special features' are activated)
  • several good 'inline' compression options - though probably not useful in your case (pre-compressed media, I suppose)
  • integrity checking with auto-repair
  • rebuilds after disk failure only replicate live data on the disks, not free space
  • integration with Active Directory (for business use)
  • FreeNAS has a built-in disk encryption option - best used with appropriate CPUs (acceleration) - but beware, it breaks compatibility with other distributions
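
Two of those features in command form, just to give a feel for it (pool/dataset names are placeholders):

# cheap inline compression (of little use for already-compressed media)
zfs set compression=lz4 tank/media

# a scrub re-reads everything, verifies checksums and auto-repairs from redundancy
zpool scrub tank
zpool status tank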

Ok, so much for a short write-up on a ZFS-based solution... I hope it offers more answers than it provokes new questions.

Regards, Kjartan

  • Kjartan, thank you for your detailed answer. very useful. Commented Sep 26, 2014 at 22:10

Regarding 2/3/5 - you could always use mhddfs with SnapRAID.

Basically, if uptime isn't a major concern, you can get recovery from up to six disk failures using SnapRAID. I'm currently using it on Windows with DrivePool, but I used to run Ubuntu 14.04 LTS with mhddfs and SnapRAID. What I did was:

Prepare your drives. This assumes that you labelled the data drives A00->A05 and the parity drives P00 and P01, and that they are all formatted as ext4. Your parity drives will hold the parity plus two of the three content files; the third content file will be stored on your system drive. The content files are what SnapRAID uses to track the files and check their integrity.
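
If the drives aren't labelled yet, that step would look something like this (device names are examples - check yours with lsblk before formatting anything):

sudo mkfs.ext4 -L A00 /dev/sdb1
sudo mkfs.ext4 -L A01 /dev/sdc1
# ...and so on for A02-A05 and the two parity drives
sudo mkfs.ext4 -L P00 /dev/sdh1
sudo mkfs.ext4 -L P01 /dev/sdi1
sudo mkdir -p /mnt/A00 /mnt/A01 /mnt/A02 /mnt/A03 /mnt/A04 /mnt/A05 /mnt/P00 /mnt/P01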

Get mhddfs

sudo apt-get install mhddfs

edit fstab:

# Archive
LABEL=A00 /mnt/A00 ext4 defaults 0 0
LABEL=A01 /mnt/A01 ext4 defaults 0 0
LABEL=A02 /mnt/A02 ext4 defaults 0 0
LABEL=A03 /mnt/A03 ext4 defaults 0 0
LABEL=A04 /mnt/A04 ext4 defaults 0 0
LABEL=A05 /mnt/A05 ext4 defaults 0 0

# Storage Pool
mhddfs#/mnt/A00,/mnt/A01,/mnt/A02,/mnt/A03,/mnt/A04,/mnt/A05 /media/Archive fuse defaults,allow_other 0 0

# Parity
LABEL=P00 /mnt/P00 ext4 defaults 0 0
LABEL=P01 /mnt/P01 ext4 defaults 0 0

After you download and compile SnapRAID, edit its config file like so:

parity /mnt/P00/snapraid.parity
2-parity /mnt/P01/snapraid.parity

content /mnt/P00/snapraid.content
content /mnt/P01/snapraid.content
content /mnt/snapraid.content

disk d0 /mnt/A00
disk d1 /mnt/A01
disk d2 /mnt/A02
disk d3 /mnt/A03
disk d4 /mnt/A04
disk d5 /mnt/A05

exclude *.unrecoverable
exclude Thumbs.db
exclude lost+found
exclude /Downloading/

Then run, in a terminal:

sudo snapraid sync

I suggest doing scrubs every now and then (once a month maybe?) with:

sudo snapraid scrub
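
If you want that automated, a root crontab along these lines works (paths and times are just an example; adjust to where snapraid is installed):

# nightly sync at 03:00, scrub on the 1st of each month at 04:00
0 3 * * * /usr/local/bin/snapraid sync
0 4 1 * * /usr/local/bin/snapraid scrub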

Using this method, you can add drives any time you want without resizing any RAID array. You will lose any speed gain that you might have gotten from RAID, but you get peace of mind and a simple setup. If a drive dies, just read SnapRAID's manual; it's a simple drive replace and restore (see the sketch below). I've lost drives and haven't lost any data thanks to this setup. In case you couldn't tell from the above, all of your space is pooled into a single volume at /media/Archive, and data that is added gets distributed across the drives.
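
Roughly, a replacement goes like this (assuming the failed disk was d3 in the config above - check the SnapRAID manual for your version before relying on it):

# mount the formatted replacement at the old mount point, e.g. /mnt/A03,
# then rebuild only that disk's contents from parity
sudo snapraid -d d3 fix
# verify the result and bring parity up to date again
sudo snapraid -d d3 check
sudo snapraid sync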
