I want to compare the reliability of different RAID systems built from either consumer (URE/bit = 1e-14) or enterprise (URE/bit = 1e-15) drives. The probability that a rebuild hits an unrecoverable read error (ignoring mechanical problems, which I will take into account later) is simple to compute:
error_probability = 1 - (1-per_bit_error_rate)^bit_read
Keep in mind that this is the probability of getting AT LEAST one URE, not exactly one.
Let's suppose we want 6 TB usable space. We can get it with:
RAID1 with 1+1 disks of 6 TB each. During rebuild we read back 1 disk of 6TB and the risk is: 1-(1-1e-14)^(6e12*8)=38% for consumer or 4.7% for enterprise drives.
RAID10 with 2+2 disks of 3 TB each. During rebuild we read back only 1 disk of 3TB (the one paired with the failed one!) and the risk is lower: 1-(1-1e-14)^(3e12*8)=21% for consumer or 2.4% for enterprise drives.
RAID5/RAID Z1 with 2+1 disks of 3 TB each. During rebuild we read back 2 disks of 3 TB each and the risk is: 1-(1-1e-14)^(2*3e12*8)=38% for consumer or 4.7% for enterprise drives.
RAID5/RAID Z1 with 3+1 disks of 2 TB each (often used by users of SOHO products like Synology). During rebuild we read back 3 disks of 2 TB each and the risk is: 1-(1-1e-14)^(3*2e12*8)=38% for consumer or 4.7% for enterprise drives.
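For anyone who wants to check these figures, here is a minimal Python sketch of the formula above applied to the four layouts. The function name `ure_probability` and the `layouts` table are my own illustration; it assumes decimal terabytes (1 TB = 1e12 bytes) and independent bit errors, as the formula does.

```python
import math

def ure_probability(bytes_read, ure_per_bit):
    """Probability of at least one URE while reading bytes_read bytes."""
    bits_read = bytes_read * 8
    # 1 - (1 - p)^n, evaluated with log1p/expm1 so the tiny p is not lost to rounding
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

TB = 1e12  # decimal terabyte, as used in the text

# (label, disks read back during rebuild, bytes per disk)
layouts = [
    ("RAID1 1+1, 6 TB disks", 1, 6 * TB),
    ("RAID10 2+2, 3 TB disks", 1, 3 * TB),
    ("RAID5 2+1, 3 TB disks", 2, 3 * TB),
    ("RAID5 3+1, 2 TB disks", 3, 2 * TB),
]

for label, disks_read, disk_bytes in layouts:
    consumer = ure_probability(disks_read * disk_bytes, 1e-14)
    enterprise = ure_probability(disks_read * disk_bytes, 1e-15)
    print(f"{label}: {consumer:.1%} consumer, {enterprise:.1%} enterprise")
```

The three layouts that read back 6 TB in total give the same result, because only the number of bits read enters the formula, not how they are spread across disks.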
Calculating the error probability for single-disk tolerance is easy; it is harder for systems that tolerate multiple disk failures (RAID6/Z2, RAIDZ3).
If only the first disk is used for the rebuild and the second one is read again from the beginning in case of a URE, then the error probability is the one calculated above, squared (14.5% for consumer RAID5 2+1, and the same for consumer RAID1 1+2 with 6 TB disks; the 21% of a 3 TB mirror squares to 4.5%). However, I suppose (at least in ZFS, which has full checksums!) that the second parity/available disk is read only where needed, meaning that only a few sectors have to be read. How many UREs can possibly happen on the first disk? Not many, otherwise the error probability for single-disk-tolerant systems would skyrocket even more than I calculated.
If I'm correct, a second parity disk would practically lower the risk to extremely low values.
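A rough Python sketch of the two cases, to put numbers on the intuition. It is only an estimate under assumptions that are mine, not from any RAID or ZFS documentation: 4 KiB sectors, independent errors, and a second redundancy source that is consulted only for the sectors that failed on the first pass. It reuses the 2 x 3 TB read volume of the RAID5 2+1 example.

```python
import math

def ure_probability(bits_read, ure_per_bit):
    """Probability of at least one URE while reading bits_read bits."""
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

URE_PER_BIT = 1e-14             # consumer spec from the text
BITS_FIRST_PASS = 2 * 3e12 * 8  # same read volume as the RAID5 2+1 example
SECTOR_BITS = 4096 * 8          # assumption: 4 KiB sectors

p_single = ure_probability(BITS_FIRST_PASS, URE_PER_BIT)

# Pessimistic case: after a URE, the whole volume is read again from the
# second redundancy source, so both full passes must fail.
p_full_reread = p_single ** 2

# Optimistic case: only the affected sectors are fetched again.
# Expected number of UREs on the first pass (well below one):
expected_bad_sectors = BITS_FIRST_PASS * URE_PER_BIT
# Chance that re-reading one sector from the second source also fails:
p_sector_fails_again = ure_probability(SECTOR_BITS, URE_PER_BIT)
# First-order estimate of an unrecoverable rebuild:
p_targeted = expected_bad_sectors * p_sector_fails_again

print(f"single parity:         {p_single:.1%}")
print(f"full second pass:      {p_full_reread:.1%}")
print(f"targeted sector reads: {p_targeted:.1e}")
```

Under these assumptions the targeted case comes out many orders of magnitude below the single-parity figure, which is consistent with the guess that a second parity disk makes URE-induced rebuild failure very unlikely.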
Question aside, keep in mind that manufacturers overstate the URE probability of consumer-class drives for marketing reasons (to sell more enterprise-class drives), so even consumer-class HDDs can be expected to achieve 1e-15 URE/bit read.
Some data: http://www.high-rely.com/hr_66/blog/why-raid-5-stops-working-in-2009-not/
The values I gave for enterprise drives therefore realistically apply to consumer drives too, and real enterprise drives have an even higher reliability (URE/bit = 1e-16).
As for mechanical failures, their probability is proportional to the number of disks and to the time required to rebuild.
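A small Python sketch of that proportionality, assuming a constant failure rate per disk (an exponential model). The function name and the example inputs (3% annual failure rate, 24-hour rebuild) are hypothetical values chosen only for illustration, not measured data.

```python
import math

def mechanical_failure_probability(surviving_disks, rebuild_hours, annual_failure_rate):
    """Chance that at least one surviving disk fails mechanically during the rebuild."""
    hours_per_year = 24 * 365
    # Convert the annual failure probability into a constant hourly failure rate
    rate_per_hour = -math.log1p(-annual_failure_rate) / hours_per_year
    # For small values this is roughly surviving_disks * rate_per_hour * rebuild_hours
    return -math.expm1(-surviving_disks * rate_per_hour * rebuild_hours)

# Hypothetical inputs: 3 surviving disks, 24 h rebuild, 3% annual failure rate
p = mechanical_failure_probability(3, 24, 0.03)
print(f"{p:.2e}")
```

For the small probabilities involved, doubling either the number of disks or the rebuild time roughly doubles the result.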