When the original folks designed RAID, they only considered operational failures, which is when a disk drive breaks. When an operational failure occurs, you have to repair the disk so that the RAID works. However, you can still get your data. If you have two disks with operational failures, you have unrecoverable data loss. This is bad. Fortunately, operational failures are so infrequent that you can usually repair one before another one happens.
As it turns out, disks can also have latent failures, which is when a small part of the disk (called a sector) stores the wrong value. Latent failures typically go undetected -- you only discover one when you try to access the sector and you discover that it's holding wrong values (if you care, this is done by storing ECC bits with the sector. If you don't care, just take my word for it).
When you discover a latent failure, you can correct it by using information on the remaining N drives. However, if you have an operational failure on one disk and a latent failure on another disk, you cannot recover. This is a second source of data loss. As it turns out, latent failures are much more common than operational failures, and since they go undetected, they can last a long time. As such, the major component of data loss in RAID systems is when an operational failure occurs on one disk, and there is a latent failure on another that has gone undiscovered.
To combat this, you can perform an action called scrubbing on your disks. A scrub reads all sectors on all disks to check for latent failures and repairs them. When a scrub finishes, all disks will have no latent failures. Thus, if you scrub frequently enough, you can minimize the chances of data loss.
In 2007, Elerath and Pecht published a paper entitled "Enhanced Reliability Modeling of RAID Storage Systems", in a conference called DSN: International Symposium on Dependable Systems and Networks. In their paper, they model a RAID system with all the above parameters to see whether scrubbing alone can prevent data loss reasonably. Their system has the following states:
[Figure: state diagram of the RAID system, with three yellow Available states and one red Data Loss state]
Listed, they are as follows:
The three yellow states are where you can get at your data -- these are Available states. The red (Data Loss) state is Unavailable.
There are only four events in the system:
Your job is to write the program raid.cpp, which takes 12 parameters:
UNIX> raid N seed time beta_of eta_of lambda_lf gamma_r beta_r eta_r gamma_s beta_s eta_s
The last nine parameters are the parameters that describe the four events. The other parameters are as follows:
The units of the lambda events are failures per hour. The units of the eta events are hours. The units of the gamma events are hours.
So, for example, one of the paper's simulations was:
UNIX> raid 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
This represents a system with seven data disks for ten years (87600 hours), where operational failures follow a Weibull with a characteristic life of 461386 hours and a shape parameter of 1.12. Latent failures follow an exponential with a rate of one every 9259 hours (1/0.000108003). Repair takes a minimum of 6 hours, and then follows a Weibull with a characteristic life of 12 hours and a shape parameter of two. Scrubbing occurs at a minimum of 36 hours, and after that has a characteristic life of 168 hours with a shape parameter of three.
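To make the parameters concrete, here is a minimal sketch of how times could be drawn from these distributions by inverting their CDFs. It assumes the three-parameter Weibull form with gamma as the minimum, beta as the shape and eta as the characteristic life. The function names are mine, not part of the lab, and the uniform draw `u` is supplied by the caller. Matching the reference output depends on the random number generator and call order that the lab prescribes, so this sketch will not reproduce the numbers shown here on its own.

```cpp
#include <cassert>
#include <cmath>

// Inverse-transform sample from a three-parameter Weibull (assumed form).
// u is a uniform draw in [0, 1); gamma is the minimum (location),
// beta the shape and eta the characteristic life (scale).
// CDF: F(t) = 1 - exp(-((t - gamma) / eta)^beta) for t >= gamma.
double weibull_from_uniform(double u, double gamma, double beta, double eta)
{
  assert(u >= 0.0 && u < 1.0);
  return gamma + eta * std::pow(-std::log(1.0 - u), 1.0 / beta);
}

// Inverse-transform sample from an exponential with rate lambda
// (events per hour), so the mean is 1 / lambda hours.
double exponential_from_uniform(double u, double lambda)
{
  assert(u >= 0.0 && u < 1.0);
  return -std::log(1.0 - u) / lambda;
}
```

With the repair parameters from the example (gamma 6, beta 2, eta 12), no sample can fall below 6 hours, which is what "a minimum of 6 hours" means.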
Your program needs to simulate the system. Therefore, you will have a data structure for disks, where disks are either up or under operational failure, and if they are up, you need to record how many latent failures they have. You will generate failure and repair events according to their probability distributions. You will also generate latent sector failures for each disk according to λLF and generate scrub events according to the scrub distribution function. When a scrub occurs, you eliminate all sector failures for all disks.
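As a rough illustration of the disk bookkeeping described above, here is one possible layout. The struct and function names are hypothetical; your own classes can look quite different.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-disk record: a disk is either up or under operational
// failure, and an up disk carries a count of latent sector failures.
struct Disk {
  bool up = true;
  int latent_failures = 0;
};

// Record one latent sector failure; these are only generated for up disks.
void add_latent_failure(Disk &d)
{
  assert(d.up);
  d.latent_failures++;
}

// A scrub reads every sector on every disk and repairs what it finds,
// so afterwards no disk has a latent failure.
void scrub_all(std::vector<Disk> &disks)
{
  for (Disk &d : disks) d.latent_failures = 0;
}

// Total latent failures currently present on disks that are up.
int latent_on_up_disks(const std::vector<Disk> &disks)
{
  int total = 0;
  for (const Disk &d : disks) {
    if (d.up) total += d.latent_failures;
  }
  return total;
}
```

A system with N data disks would hold N+1 of these records, for example `std::vector<Disk> disks(N + 1);`.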
When you initialize your system, you should generate events in the following order:
Then you start the simulation. You will emit a line for every event that you process. That line will contain:
Each of the above is separated by a space. If your program ever enters the Data Loss state, your program should end. Otherwise, the last event should be Simulation Over and you print the state of the system.
One quirk of the program (discussed more below) is that when a failure occurs to a disk, you need to stop generating latent sector failures for that disk. When the disk is repaired, you start generating latent sector failures again. Do that before you generate the next operational failure event for that disk. In this way, the output of your program should match mine exactly.
You should error check each of the parameters:
The next step should be to implement a Run() method for the simulator, plus a Print_Event() method. I have that method as part of the Simulation_System class and not the Event class, because that makes it easier for me to access the Simulation_System variables, which I have protected. Now, implement the event processor, which removes events from the event queue, sets the simulation system time, and processes the event. Of course, the only event we process is the Simulation_Over event. At this point, I just have it print the time and exit. Test:
UNIX> raid_step_1 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
87600.000 Simulation_Over
UNIX>
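If it helps to see the shape of this step, here is a minimal sketch of an event loop that only knows how to stop at Simulation_Over. It keeps events in a std::multimap keyed by time, which is an assumption on my part, and it uses a free function and a plain struct rather than the Simulation_System and Event classes mentioned above. The exact output spacing is also a guess.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical event record; the lab's own Event class may differ.
struct Event {
  std::string type;   // e.g. "Simulation_Over"
  int disk;           // -1 when the event is not tied to a disk
};

// Events keyed by time; a multimap keeps them sorted and allows ties.
using EventQueue = std::multimap<double, Event>;

// Pops events in time order until Simulation_Over is processed.
// Returns the simulation time at which the loop stopped.
double run(EventQueue &eq)
{
  double now = 0.0;
  while (!eq.empty()) {
    auto first = eq.begin();
    now = first->first;            // advance the simulation clock
    Event e = first->second;
    eq.erase(first);               // remove the event before processing it
    std::printf("%.3f %s\n", now, e.type.c_str());
    if (e.type == "Simulation_Over") break;
  }
  return now;
}
```

Events scheduled after the end of the simulation simply stay on the queue, unprocessed.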
My next step was to implement the disks and to write code that prints the state of
the simulation. Note, you don't need this for the final program, but it will help
you debug. This is raid_step_2:
UNIX> raid_step_2 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
87600.000 Simulation_Over N+1-W&C
--------------------------
Simulation State: N+1-W&C
Time: 87600
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
UNIX>
Now it's time to generate latent sector failures. You do this with an exponential, and you
do it for each disk, starting with Disk 0. In my code, I print out the events as I generate
them, and the code for processing the event simply exits:
UNIX> raid_step_3 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
Generated LF Event for disk 0 at time 1734.47
Generated LF Event for disk 1 at time 12832.1
Generated LF Event for disk 2 at time 938.281
Generated LF Event for disk 3 at time 18923.6
Generated LF Event for disk 4 at time 7972.93
Generated LF Event for disk 5 at time 14266.7
Generated LF Event for disk 6 at time 10909.8
Generated LF Event for disk 7 at time 4259.87
938.281 Latent_Sector_Failure N+1-W&C
Event not implemented yet
UNIX>
Note, your latent failure times should match mine exactly. This is because you need
to generate events in the exact same order as I do.
Now it's time to process the LF events. I'm not going to say how to do this -- study up on the lecture notes. Note, I'm doing no repair and no scrubbing, so disks simply accumulate latent sector failures. I also beefed up my system state printer to print all the events on the event queue, and I print the state at the beginning of the event processing loop. Check out the output (r4-out.txt):
UNIX> raid_step_4 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r4-out.txt
UNIX> head -n 42 r4-out.txt
--------------------------
Simulation State: N+1-W&C
Time: 87600
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
938.281 Latent_Sector_Failure 2
1734.468 Latent_Sector_Failure 0
4259.873 Latent_Sector_Failure 7
7972.935 Latent_Sector_Failure 4
10909.753 Latent_Sector_Failure 6
12832.073 Latent_Sector_Failure 1
14266.656 Latent_Sector_Failure 5
18923.603 Latent_Sector_Failure 3
87600.000 Simulation_Over
--------------------------
938.281 Latent_Sector_Failure 2 N+1-W&C -> >=1-SF
--------------------------
Simulation State: >=1-SF
Time: 938.281
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 1
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
1734.468 Latent_Sector_Failure 0
4259.873 Latent_Sector_Failure 7
7972.935 Latent_Sector_Failure 4
10909.753 Latent_Sector_Failure 6
12832.073 Latent_Sector_Failure 1
14266.656 Latent_Sector_Failure 5
18923.603 Latent_Sector_Failure 3
20111.011 Latent_Sector_Failure 2
87600.000 Simulation_Over
UNIX> tail -n 21 r4-out.txt
87600.000 Simulation_Over
--------------------------
Simulation State: >=1-SF
Time: 87600
Disk 0 Up: 1 LF: 8
Disk 1 Up: 1 LF: 7
Disk 2 Up: 1 LF: 11
Disk 3 Up: 1 LF: 10
Disk 4 Up: 1 LF: 10
Disk 5 Up: 1 LF: 5
Disk 6 Up: 1 LF: 11
Disk 7 Up: 1 LF: 10
87625.452 Latent_Sector_Failure 7
89427.896 Latent_Sector_Failure 4
92845.404 Latent_Sector_Failure 5
94719.573 Latent_Sector_Failure 0
97374.679 Latent_Sector_Failure 6
99230.452 Latent_Sector_Failure 3
107978.718 Latent_Sector_Failure 2
109327.601 Latent_Sector_Failure 1
--------------------------
UNIX> grep ' -> ' r4-out.txt | wc
72 432 4824
UNIX>
Ok -- from the above, you can see that the same failure events are generated
and put on the event queue. The first failure event is on disk #2, and
when that is processed, a new failure event is generated for
time 20111.011. At the end, you can see that there are still failure
events on the queue which are unprocessed, because the simulation is over.
When you grep for " -> ", you see 72 failure events, and if you sum up
the values of "LF", their sum is 72. Excellent.
Next, it's time to generate a scrub event. We do so after we generate the initial latent sector failure events. I don't implement the code to process it though. A quick test shows that the first scrub event is at 250.132. Make sure your times match mine.
UNIX> raid_step_5 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
--------------------------
Simulation State: N+1-W&C
Time: 87600
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
250.132 Scrub
938.281 Latent_Sector_Failure 2
1734.468 Latent_Sector_Failure 0
4259.873 Latent_Sector_Failure 7
7972.935 Latent_Sector_Failure 4
10909.753 Latent_Sector_Failure 6
12832.073 Latent_Sector_Failure 1
14266.656 Latent_Sector_Failure 5
18923.603 Latent_Sector_Failure 3
87600.000 Simulation_Over
--------------------------
250.132 Scrub
Event not implemented yet
UNIX>
Now we implement scrubbing. Again, this should be straightforward.
Lines 115 to 160 show the first sector failure occurring in disk 2, and
how it is cleaned up by the scrub at time 1033.209. The output is in
r6-out.txt:
UNIX> raid_step_6 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r6-out.txt
UNIX> sed -n 115,160p r6-out.txt
938.281 Latent_Sector_Failure 2 N+1-W&C -> >=1-SF
--------------------------
Simulation State: >=1-SF
Time: 938.281
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 1
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
1033.209 Scrub
1734.468 Latent_Sector_Failure 0
3728.991 Latent_Sector_Failure 2
4259.873 Latent_Sector_Failure 7
7972.935 Latent_Sector_Failure 4
10909.753 Latent_Sector_Failure 6
12832.073 Latent_Sector_Failure 1
14266.656 Latent_Sector_Failure 5
18923.603 Latent_Sector_Failure 3
87600.000 Simulation_Over
--------------------------
1033.209 Scrub >=1-SF -> N+1-W&C
--------------------------
Simulation State: N+1-W&C
Time: 1033.21
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
1202.671 Scrub
1734.468 Latent_Sector_Failure 0
3728.991 Latent_Sector_Failure 2
4259.873 Latent_Sector_Failure 7
7972.935 Latent_Sector_Failure 4
10909.753 Latent_Sector_Failure 6
12832.073 Latent_Sector_Failure 1
14266.656 Latent_Sector_Failure 5
18923.603 Latent_Sector_Failure 3
87600.000 Simulation_Over
--------------------------
UNIX>
Again, make sure your times match mine exactly. They should as long as you generate
those initial events in the correct order.
Now, let's generate operational failures, but not process them. I do this below, and as you see, the first operational failure is after the simulation time:
UNIX> raid_step_7 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | head -n 30
--------------------------
Simulation State: N+1-W&C
Time: 87600
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
250.132 Scrub
938.281 Latent_Sector_Failure 2
1734.468 Latent_Sector_Failure 0
4259.873 Latent_Sector_Failure 7
7972.935 Latent_Sector_Failure 4
10909.753 Latent_Sector_Failure 6
12832.073 Latent_Sector_Failure 1
14266.656 Latent_Sector_Failure 5
18923.603 Latent_Sector_Failure 3
87600.000 Simulation_Over
158132.214 Operational_Failure 4
220116.693 Operational_Failure 2
249081.263 Operational_Failure 5
288343.629 Operational_Failure 1
590663.738 Operational_Failure 3
609884.969 Operational_Failure 0
662515.741 Operational_Failure 6
804729.278 Operational_Failure 7
--------------------------
UNIX>
However, if you try a seed of 1, you get an operational failure at 40921.139:
UNIX> raid_step_7 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | head -n 30
--------------------------
Simulation State: N+1-W&C
Time: 87600
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
16.374 Latent_Sector_Failure 5
223.414 Scrub
393.709 Latent_Sector_Failure 0
1923.553 Latent_Sector_Failure 6
3791.118 Latent_Sector_Failure 3
5611.313 Latent_Sector_Failure 1
7717.702 Latent_Sector_Failure 4
16672.710 Latent_Sector_Failure 2
40921.139 Operational_Failure 4
43050.179 Latent_Sector_Failure 7
80859.489 Operational_Failure 3
87600.000 Simulation_Over
93196.900 Operational_Failure 6
218364.387 Operational_Failure 1
228921.239 Operational_Failure 0
399812.784 Operational_Failure 2
411047.679 Operational_Failure 7
1235197.318 Operational_Failure 5
--------------------------
UNIX> raid_step_7 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168
...
--------------------------
40921.139 Operational_Failure 4
Event not implemented yet
UNIX>
Now, start implementing operational failures. First, get the state transitions
right, and then generate the correct repair event. However, don't implement repair yet:
UNIX> raid_step_8 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168
....
40921.139 Operational_Failure 4 N+1-W&C -> N-W&C
--------------------------
Simulation State: N-W&C
Time: 40921.1
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
40937.686 Repair 4
41111.416 Scrub
43050.179 Latent_Sector_Failure 7
43146.259 Latent_Sector_Failure 3
44688.617 Latent_Sector_Failure 6
46319.019 Latent_Sector_Failure 4
55472.131 Latent_Sector_Failure 2
55577.832 Latent_Sector_Failure 5
57943.514 Latent_Sector_Failure 1
63009.763 Latent_Sector_Failure 0
80859.489 Operational_Failure 3
87600.000 Simulation_Over
93196.900 Operational_Failure 6
218364.387 Operational_Failure 1
228921.239 Operational_Failure 0
399812.784 Operational_Failure 2
411047.679 Operational_Failure 7
1235197.318 Operational_Failure 5
--------------------------
40937.686 Repair 4
Event not implemented yet
UNIX>
Good -- now, before moving on, we have an issue. Disk 4 is down, but there is a Latent_Sector_Failure
event generated for it. That makes no sense, especially if the sector failure falls while the disk
is down. So, what you need to do is delete that event from the event queue. You'll regenerate it
when the disk is repaired. Raid_step_9 performs this deletion. You'll have to retain a pointer
to the event's iterator in each Disk instance. Make sure you manage memory properly here.
I deleted the Latent_Sector_Failure event.
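Here is a small sketch of that deletion, assuming the event queue is a std::multimap keyed by time (the names are hypothetical). Each disk remembers the iterator returned when its latent failure was scheduled, so the right event can be erased even when two events share a timestamp. In this sketch the events are stored by value, so there is nothing to free; if your queue holds pointers to heap-allocated events, delete the event before erasing its entry.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using EventQueue = std::multimap<double, std::string>;  // time -> event label

// Hypothetical disk record that remembers where its pending
// Latent_Sector_Failure event sits in the queue.
struct Disk {
  bool up = true;
  bool has_pending_lf = false;
  EventQueue::iterator pending_lf;   // valid only if has_pending_lf
};

// Schedule the disk's next latent failure and remember its iterator.
void schedule_lf(EventQueue &eq, Disk &d, double when)
{
  assert(d.up && !d.has_pending_lf);
  d.pending_lf = eq.emplace(when, "Latent_Sector_Failure");
  d.has_pending_lf = true;
}

// On an operational failure, mark the disk down and cancel its pending
// latent failure. Erasing by iterator removes exactly that event, even if
// another event shares the same timestamp.
void fail_disk(EventQueue &eq, Disk &d)
{
  d.up = false;
  if (d.has_pending_lf) {
    eq.erase(d.pending_lf);
    d.has_pending_lf = false;
  }
}
```

Multimap iterators stay valid while other elements are inserted or erased, which is what makes storing them safe. Remember to clear the stored iterator when the latent failure event is actually processed, since that also removes it from the queue.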
UNIX> raid_step_9 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168
UNIX>
40921.139 Operational_Failure 4 N+1-W&C -> N-W&C
--------------------------
Simulation State: N-W&C
Time: 40921.1
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 0 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
40937.686 Repair 4
41111.416 Scrub
43050.179 Latent_Sector_Failure 7
43146.259 Latent_Sector_Failure 3
44688.617 Latent_Sector_Failure 6
55472.131 Latent_Sector_Failure 2
55577.832 Latent_Sector_Failure 5
57943.514 Latent_Sector_Failure 1
63009.763 Latent_Sector_Failure 0
80859.489 Operational_Failure 3
87600.000 Simulation_Over
93196.900 Operational_Failure 6
218364.387 Operational_Failure 1
228921.239 Operational_Failure 0
399812.784 Operational_Failure 2
411047.679 Operational_Failure 7
1235197.318 Operational_Failure 5
--------------------------
40937.686 Repair 4
Event not implemented yet
Good -- the Latent_Sector_Failure is gone. Ok -- let's implement repair.
Repair will set the state back to "N+1 Working & Clean", and it will generate
a new latent failure event and a new operational failure event.
Make sure you do it in that order. In raid_step_A, I generate
the new events, print the state and exit -- take a look:
UNIX> raid_step_A 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | tail -n 61
40921.139 Operational_Failure 4 N+1-W&C -> N-W&C
--------------------------
Simulation State: N-W&C
Time: 40921.1
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 0 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
40937.686 Repair 4
41111.416 Scrub
43050.179 Latent_Sector_Failure 7
43146.259 Latent_Sector_Failure 3
44688.617 Latent_Sector_Failure 6
55472.131 Latent_Sector_Failure 2
55577.832 Latent_Sector_Failure 5
57943.514 Latent_Sector_Failure 1
63009.763 Latent_Sector_Failure 0
80859.489 Operational_Failure 3
87600.000 Simulation_Over
93196.900 Operational_Failure 6
218364.387 Operational_Failure 1
228921.239 Operational_Failure 0
399812.784 Operational_Failure 2
411047.679 Operational_Failure 7
1235197.318 Operational_Failure 5
--------------------------
40937.686 Repair 4 N-W&C -> N+1-W&C
--------------------------
Simulation State: N+1-W&C
Time: 40937.7
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
41111.416 Scrub
41459.894 Latent_Sector_Failure 4
43050.179 Latent_Sector_Failure 7
43146.259 Latent_Sector_Failure 3
44688.617 Latent_Sector_Failure 6
55472.131 Latent_Sector_Failure 2
55577.832 Latent_Sector_Failure 5
57943.514 Latent_Sector_Failure 1
63009.763 Latent_Sector_Failure 0
80859.489 Operational_Failure 3
87600.000 Simulation_Over
93196.900 Operational_Failure 6
218364.387 Operational_Failure 1
228921.239 Operational_Failure 0
399812.784 Operational_Failure 2
411047.679 Operational_Failure 7
1235197.318 Operational_Failure 5
1252031.348 Operational_Failure 4
--------------------------
UNIX>
As you can see, the repair created two new events -- the Latent_Sector_Failure at time
41459.894, and the Operational_Failure at time 1252031.348.
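The ordering matters because each new event consumes random numbers. A hedged sketch of a repair handler that makes the order explicit is below. The two delay generators are passed in as callbacks purely for illustration, and I am assuming that both new events are scheduled relative to the repair time and that the repaired disk comes back with no latent failures.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

using EventQueue = std::multimap<double, std::string>;  // time -> event label

// Hypothetical disk record.
struct Disk {
  bool up = true;
  int latent_failures = 0;
};

// Bring a failed disk back and schedule its next two events.
// The latent failure delay is drawn before the operational failure delay.
void repair_disk(EventQueue &eq, Disk &d, double now,
                 const std::function<double()> &next_lf_delay,
                 const std::function<double()> &next_of_delay)
{
  assert(!d.up);
  d.up = true;
  d.latent_failures = 0;                    // assumed: repaired disk is clean
  double lf_time = now + next_lf_delay();   // drawn first
  double of_time = now + next_of_delay();   // drawn second
  eq.emplace(lf_time, "Latent_Sector_Failure");
  eq.emplace(of_time, "Operational_Failure");
}
```

Swapping the two draws would still produce a valid simulation, but the event times would no longer line up with the reference output.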
We're almost done. However, before going on, let's think about the case where we have a latent sector failure and we're in the "N Working & Clean" state. This means we have to go to the data loss state. Make sure you have that case implemented, along with the proper code to deal with getting an operational failure when you're in the "≥1 Sector Failure" state. Test, test, test. The raid_verbose program prints out the state of the system before each event. The raid program simply prints the events, and that is the one which your program has to match.
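One way to keep these cases straight is to derive the state from two facts rather than hand-coding every transition. The sketch below is my own reading of the sample runs, not the lab's required design. It classifies the system from the number of disks under operational failure and from whether any disk that is still up holds a latent sector failure. Latent failures on a disk that has just failed do not count, since that disk gets rebuilt.

```cpp
#include <cassert>
#include <string>

// Classify the system from two facts: how many disks are under operational
// failure, and whether any disk that is still up holds a latent sector
// failure. State labels follow the program's sample output.
std::string classify(int disks_down, bool latent_on_up_disk)
{
  assert(disks_down >= 0);
  if (disks_down >= 2) return "Data-Loss";
  if (disks_down == 1) return latent_on_up_disk ? "Data-Loss" : "N-W&C";
  return latent_on_up_disk ? ">=1-SF" : "N+1-W&C";
}
```

For example, an operational failure on the only disk that holds a latent failure leaves one disk down and no latent failures on the remaining disks, so the result is N-W&C rather than data loss.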
Here are some good examples with raid_verbose and raid. Run it with a seed of 8 and you get data loss with a sector failure followed by an operational failure:
UNIX> raid 7 8 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | tail
67030.379 Latent_Sector_Failure 3 N+1-W&C -> >=1-SF
67080.075 Scrub >=1-SF -> N+1-W&C
67218.335 Scrub N+1-W&C -> N+1-W&C
67436.399 Scrub N+1-W&C -> N+1-W&C
67619.237 Scrub N+1-W&C -> N+1-W&C
67641.410 Latent_Sector_Failure 2 N+1-W&C -> >=1-SF
67798.062 Scrub >=1-SF -> N+1-W&C
68007.887 Scrub N+1-W&C -> N+1-W&C
68048.479 Latent_Sector_Failure 5 N+1-W&C -> >=1-SF
68078.932 Operational_Failure 2 >=1-SF -> Data-Loss
UNIX> raid_verbose 7 8 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r-out-08.txt
UNIX> tail -n 32 r-out-08.txt
68048.479 Latent_Sector_Failure 5 N+1-W&C -> >=1-SF
--------------------------
Simulation State: >=1-SF
Time: 68048.5
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 1 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 1
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
68078.932 Operational_Failure 2
68204.706 Scrub
68432.905 Latent_Sector_Failure 3
68927.632 Latent_Sector_Failure 7
69787.407 Latent_Sector_Failure 5
69993.567 Latent_Sector_Failure 4
70643.753 Latent_Sector_Failure 2
78296.389 Latent_Sector_Failure 0
80980.939 Latent_Sector_Failure 6
87224.610 Latent_Sector_Failure 1
87600.000 Simulation_Over
527848.763 Operational_Failure 1
533624.845 Operational_Failure 4
564553.233 Operational_Failure 0
637065.024 Operational_Failure 7
688310.478 Operational_Failure 6
809074.943 Operational_Failure 5
1119991.579 Operational_Failure 3
--------------------------
68078.932 Operational_Failure 2 >=1-SF -> Data-Loss
UNIX>
Run it with a seed of 126 and you get data loss with an operational failure
followed by a sector failure:
UNIX> raid 7 126 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | tail
16431.317 Latent_Sector_Failure 1 N+1-W&C -> >=1-SF
16566.950 Scrub >=1-SF -> N+1-W&C
16822.776 Scrub N+1-W&C -> N+1-W&C
16840.946 Latent_Sector_Failure 2 N+1-W&C -> >=1-SF
17018.701 Scrub >=1-SF -> N+1-W&C
17279.807 Scrub N+1-W&C -> N+1-W&C
17444.582 Scrub N+1-W&C -> N+1-W&C
17606.732 Operational_Failure 2 N+1-W&C -> N-W&C
17612.592 Scrub N-W&C -> N-W&C
17614.251 Latent_Sector_Failure 7 N-W&C -> Data-Loss
UNIX> raid_verbose 7 126 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r-out-126.txt
UNIX> tail -n 32 r-out-126.txt
--------------------------
17612.592 Scrub N-W&C -> N-W&C
--------------------------
Simulation State: N-W&C
Time: 17612.6
Disk 0 Up: 1 LF: 0
Disk 1 Up: 1 LF: 0
Disk 2 Up: 0 LF: 0
Disk 3 Up: 1 LF: 0
Disk 4 Up: 1 LF: 0
Disk 5 Up: 1 LF: 0
Disk 6 Up: 1 LF: 0
Disk 7 Up: 1 LF: 0
17614.251 Latent_Sector_Failure 7
17625.196 Repair 2
17861.024 Scrub
18465.175 Latent_Sector_Failure 1
19339.598 Latent_Sector_Failure 0
20112.740 Latent_Sector_Failure 3
24367.099 Operational_Failure 7
31394.891 Latent_Sector_Failure 6
56342.535 Latent_Sector_Failure 5
61687.690 Latent_Sector_Failure 4
83392.317 Operational_Failure 3
87600.000 Simulation_Over
151580.748 Operational_Failure 6
271051.588 Operational_Failure 0
329718.300 Operational_Failure 1
350806.455 Operational_Failure 5
384234.038 Operational_Failure 4
--------------------------
17614.251 Latent_Sector_Failure 7 N-W&C -> Data-Loss
UNIX>
If we modify the operational failure rate, we see two operational failures
causing data loss:
UNIX> raid 7 3 87600 1.12 4613 0.000108003 6 2 12 36 3 168
93.258 Operational_Failure 2 N+1-W&C -> N-W&C
109.223 Repair 2 N-W&C -> N+1-W&C
185.206 Scrub N+1-W&C -> N+1-W&C
407.436 Operational_Failure 7 N+1-W&C -> N-W&C
408.000 Operational_Failure 5 N-W&C -> Data-Loss
UNIX>
And finally, the following call shows a latent sector failure followed
by an operational failure to the same disk, which is not a data loss event:
UNIX> raid 7 272 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r-out-272.txt
UNIX> sed -n 90,100p r-out-272.txt
13382.944 Scrub N+1-W&C -> N+1-W&C
13627.772 Scrub N+1-W&C -> N+1-W&C
13865.283 Scrub N+1-W&C -> N+1-W&C
14105.195 Scrub N+1-W&C -> N+1-W&C
14174.163 Latent_Sector_Failure 7 N+1-W&C -> >=1-SF
14226.446 Operational_Failure 7 >=1-SF -> N-W&C
14239.088 Repair 7 N-W&C -> N+1-W&C
14352.994 Scrub N+1-W&C -> N+1-W&C
14460.291 Scrub N+1-W&C -> N+1-W&C
14717.011 Scrub N+1-W&C -> N+1-W&C
14848.550 Scrub N+1-W&C -> N+1-W&C
UNIX>
Have fun, and start early! In the grading script, tests 70 through 80 test
your error checking. The others test various features. For example, if you
only generate latent sector failures, then some of the examples will run
correctly.