CS302 -- Lab 5 -- Simulation: Failures in a RAID System

A RAID system is a storage system composed of N+1 disks, where N of them hold data, and the extra disk holds parity information, calculated from the data. If any single disk fails, its contents can be calculated from the remaining N disks. This is a nice thing because you get storage capacity by using multiple disks, but you also get fault tolerance when a disk fails.
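To make the "calculated from the remaining N disks" part concrete, here is a minimal sketch that assumes the parity is a bytewise XOR of the data blocks, as in RAID 4 and RAID 5. This writeup doesn't say which parity code is used, so treat XOR as an assumption. The function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Parity block: bytewise XOR of all data blocks (assumed to have equal length).
std::vector<uint8_t> compute_parity(const std::vector<std::vector<uint8_t>> &data)
{
  std::vector<uint8_t> parity(data.empty() ? 0 : data[0].size(), 0);
  for (const auto &block : data) {
    for (std::size_t i = 0; i < parity.size(); i++) parity[i] ^= block[i];
  }
  return parity;
}

// Rebuild the data block at index `lost` by XOR-ing the parity block with
// every surviving data block.
std::vector<uint8_t> reconstruct(const std::vector<std::vector<uint8_t>> &data,
                                 const std::vector<uint8_t> &parity,
                                 std::size_t lost)
{
  std::vector<uint8_t> out = parity;
  for (std::size_t d = 0; d < data.size(); d++) {
    if (d == lost) continue;               // pretend this disk is gone
    for (std::size_t i = 0; i < out.size(); i++) out[i] ^= data[d][i];
  }
  return out;
}
```

With three data blocks, calling reconstruct(data, compute_parity(data), 1) gives back data[1] without ever reading it. Lose two blocks, though, and there isn't enough information left, which is the data loss case described next.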

When the original folks designed RAID, they only considered operational failures, which is when a disk drive breaks. When an operational failure occurs, you have to repair the disk so that the RAID works. However, you can still get your data. If you have two disks with operational failures, you have unrecoverable data loss. This is bad. Fortunately, operational failures are so infrequent that you can usually repair one before another one happens.

As it turns out, disks can also have latent failures, which is when a small part of the disk (called a sector) stores the wrong value. Latent failures typically go undetected -- you only discover one when you try to access the sector and you discover that it's holding wrong values (if you care, this is done by storing ECC bits with the sector. If you don't care, just take my word for it).

When you discover a latent failure, you can correct it by using information on the remaining N drives. However, if you have an operational failure on one disk and a latent failure on another disk, you cannot recover. This is a second source of data loss. As it turns out, latent failures are much more common than operational failures, and since they go undetected, they can last a long time. As such, the major component of data loss in RAID systems is when an operational failure occurs on one disk, and there is a latent failure on another that has gone undiscovered.

To combat this, you can perform an action called scrubbing on your disks. A scrub reads all sectors on all disks to check for latent failures and repairs them. When a scrub finishes, all disks will have no latent failures. Thus, if you scrub frequently enough, you can minimize the chances of data loss.

In 2007, Elerath and Pecht published a paper entitled "Enhanced Reliability Modeling of RAID Storage Systems", in a conference called DSN: International Conference on Dependable Systems and Networks. In their paper, they model a RAID system with all the above parameters to see whether scrubbing alone can prevent data loss reasonably. Their system has the following states:

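Listed, they are as follows:

  1. N+1 Working & Clean: all N+1 disks are working and none has a latent sector failure.
  2. N Working & Clean: one disk has an operational failure, and the other N are working with no latent sector failures.
  3. ≥1 Sector Failure: all disks are working, but at least one has a latent sector failure.
  4. Data Loss: data cannot be recovered.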

The three yellow states are where you can get at your data -- these are Available states. The red (Data Loss) state is Unavailable.

There are only four events in the system:

  1. Operational Failure: This renders a disk unavailable. If all the disks were working (N+1 Working & Clean), an operational failure takes you to the N Working & Clean state. It causes Data Loss if there is already a broken disk (N Working & Clean) or if there is a sector failure on another disk (≥1 Sector Failure). If you are in the ≥1 Sector Failure state, it's possible for an operational failure not to result in Data Loss: that happens when the operational failure hits the only disk in the system that has latent sector failures, in which case the system goes to the N Working & Clean state. That is why there is an asterisk by the two Operational Failure events coming from the ≥1 Sector Failure state. Operational Failure events are assumed to follow a Weibull distribution with three parameters: the shape, βOF, the lifetime, ηOF, and the minimum, γOF, which is equal to zero.

  2. Latent Sector Failure: These take the system from the N+1 Working & Clean state to the ≥1 Sector Failure state, or they keep the system in the ≥1 Sector Failure state. Additionally, a Latent Sector Failure takes the system from the N Working & Clean state to the Data Loss state, because an operational failure and a sector failure mean that you cannot reconstruct data. We model latent sector failures with an exponential parameterized by λLF.

  3. Repair events repair an operational failure, and thus take the system from the N Working & Clean state to the N+1 Working & Clean state. We model repairs using a three-parameter Weibull with a minimum value γR (since it takes a certain amount of time to respond to a failure), a shape parameter βR and a lifetime ηR.

  4. Scrub events repair the latent failures on all disks, and take the system from the ≥1 Sector Failure state to the N+1 Working & Clean state. If the system is in either Working & Clean state, then the scrub has no effect. We model scrubbing with another Weibull, with parameters γS (the minimum scrub time), βS (shape) and ηS (lifetime).
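The four events and four states above boil down to a small transition table. Here is a sketch of one way to write it down. The enum names and the only_lf_disk_failed flag are my own invention for illustration, and the real program still has to track individual disks to know when that flag is true.

```cpp
// N+1 Working & Clean, N Working & Clean, >=1 Sector Failure, Data Loss.
enum class State { N1_WC, N_WC, SF, DATA_LOSS };
enum class EventType { OP_FAILURE, LATENT_FAILURE, REPAIR, SCRUB };

// only_lf_disk_failed matters only for OP_FAILURE in the SF state: it is true
// when the disk that just failed was the only disk holding latent failures
// (the asterisk case described in the list above).
State next_state(State s, EventType e, bool only_lf_disk_failed = false)
{
  switch (e) {
    case EventType::OP_FAILURE:
      if (s == State::N1_WC) return State::N_WC;
      if (s == State::SF && only_lf_disk_failed) return State::N_WC;
      return State::DATA_LOSS;
    case EventType::LATENT_FAILURE:
      if (s == State::N1_WC || s == State::SF) return State::SF;
      return State::DATA_LOSS;
    case EventType::REPAIR:
      return (s == State::N_WC) ? State::N1_WC : s;
    case EventType::SCRUB:
      return (s == State::SF) ? State::N1_WC : s;   // no effect elsewhere
  }
  return s;
}
```

For example, next_state(State::N_WC, EventType::LATENT_FAILURE) is State::DATA_LOSS, while next_state(State::N_WC, EventType::SCRUB) leaves the state alone.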

Your job is to write the program raid.cpp, which takes 12 parameters:

UNIX> raid N seed time beta_of eta_of lambda_lf gamma_r beta_r eta_r gamma_s beta_s eta_s

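The last nine parameters are the parameters that describe the four events. The other parameters are as follows:

  1. N: the number of data disks, so the system has N+1 disks in total.
  2. seed: the seed for the random number generator, passed to srand48().
  3. time: the duration of the simulation, in hours.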

The units of lambda_lf are failures per hour. The units of the eta and gamma parameters are hours.

So, for example, one of the paper's simulations was:

UNIX> raid 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168

This represents a system with seven data disks simulated for ten years (87600 hours), where operational failures follow a Weibull with a characteristic lifetime of 461386 hours and a shape parameter of 1.12. Latent failures follow an exponential with a rate of one every 9259 hours. Repair takes a minimum of 6 hours, and then follows a Weibull with a lifetime of 12 hours and a shape parameter of two. Scrubbing takes a minimum of 36 hours, and after that has a lifetime of 168 hours with a shape parameter of three.
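If you want to check the arithmetic behind that description, here is a small sketch. The helper names are hypothetical, and the Weibull mean uses the standard formula γ + η·Γ(1 + 1/β), which is a textbook fact rather than something from this writeup.

```cpp
#include <cmath>

const double HOURS_PER_YEAR = 24.0 * 365.0;   // 8760, ignoring leap days

double hours_to_years(double hours) { return hours / HOURS_PER_YEAR; }

// Mean time between events for an exponential with rate lambda.
double exponential_mean(double lambda) { return 1.0 / lambda; }

// Mean of a three-parameter Weibull: gamma + eta * Gamma(1 + 1/beta).
double weibull_mean(double gamma, double beta, double eta)
{
  return gamma + eta * std::tgamma(1.0 + 1.0 / beta);
}
```

Here hours_to_years(87600) is 10, exponential_mean(0.000108003) is about 9259 hours, and weibull_mean(6, 2, 12) is about 16.6 hours for a repair. Note that the Weibull lifetime η is not the same thing as the mean: for the operational failure parameters the mean comes out somewhat below 461386 hours.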

Your program needs to simulate the system. Therefore, you will have a data structure for disks, where disks are either up or under operational failure, and if they are up, you need to record how many latent failures they have. You will generate failure and repair events according to their probability distributions. You will also generate latent sector failures for each disk according to λLF and generate scrub events according to the scrub distribution function. When a scrub occurs, you eliminate all sector failures for all disks.

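When you initialize your system, you should generate events in the following order:

  1. The Simulation_Over event, at the simulation time.
  2. A Latent_Sector_Failure event for each disk, starting with disk 0.
  3. The first Scrub event.
  4. An Operational_Failure event for each disk.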

Then you start the simulation. You will emit a line for every event that you process. That line will contain:

Each of the above is separated by a space. If your program ever enters the Data Loss state, your program should end. Otherwise, the last event should be Simulation Over and you print the state of the system.

One quirk of the program (discussed more below) is that when a failure occurs to a disk, you need to stop generating latent sector failures for that disk. When the disk is repaired, you start generating latent sector failures again. Do that before you generate the next operational failure event for that disk. In this way, the output of your program should match mine exactly.

You should error check each of the parameters:


Exponentials and Weibulls

See the lecture notes on random number generation. Use the code there for generating both exponentials and Weibulls.
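For reference, this is the math behind both generators, written as inverse-transform sampling. It is only a sketch under my own assumptions: the function names are made up, and the lecture-notes code may transform the uniform draw differently (for instance using u where this uses 1 - u), so use the lecture code, not this, if you want your output to match.

```cpp
#include <cmath>

// u is a uniform draw in [0, 1); in the lab it would come from drand48().

// Exponential with rate lambda: invert the CDF F(t) = 1 - exp(-lambda * t).
double exponential_from_uniform(double u, double lambda)
{
  return -std::log(1.0 - u) / lambda;
}

// Three-parameter Weibull with minimum gamma, shape beta and lifetime eta:
// invert F(t) = 1 - exp(-((t - gamma) / eta)^beta).
double weibull_from_uniform(double u, double gamma, double beta, double eta)
{
  return gamma + eta * std::pow(-std::log(1.0 - u), 1.0 / beta);
}
```

A Weibull draw can never be smaller than gamma, which is why gamma is called the minimum: with the scrub parameters, every value is at least 36.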

Do This In Steps

Build up your simulation much like the example simulation in the lecture. You will have classes for each Disk, a class for Events and a class for the Simulation_System. Obviously, disks need to record whether they are working or failed, and whether they have latent sector failures. You'll also need states for the system. Your system should have an event queue. First write code to read in all the variables, initialize the data structures, and error check. My code has the Simulation_System constructor take argc and argv as parameters. It makes the srand48() call and puts the initial Simulation_Over event on the event queue. Test this.
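Here is a rough sketch of how those pieces might fit together. It is not my code: the member names, the use of a multimap keyed on time, storing events by value, and a constructor that takes already-parsed values instead of argc and argv are all assumptions made to keep the example short.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

enum class EventType { SIMULATION_OVER, LATENT_FAILURE, SCRUB, OP_FAILURE, REPAIR };

struct Event {
  EventType type;
  int disk;        // -1 for events not tied to a disk (Scrub, Simulation_Over)
};

typedef std::multimap<double, Event> EventQueue;   // keyed on event time

struct Disk {
  bool up = true;
  int latent_failures = 0;
};

class Simulation_System {
  public:
    Simulation_System(int n_data_disks, double sim_time)
      : disks(n_data_disks + 1), now(0.0)
    {
      // The Simulation_Over event goes on the queue first.
      eq.insert(std::make_pair(sim_time, Event{EventType::SIMULATION_OVER, -1}));
    }

    // Remove events in time order until Simulation_Over; returns the final time.
    double Run()
    {
      while (!eq.empty()) {
        EventQueue::iterator it = eq.begin();
        now = it->first;
        Event e = it->second;
        eq.erase(it);
        if (e.type == EventType::SIMULATION_OVER) break;
        // Other event types would be processed here.
      }
      return now;
    }

    std::size_t Disk_Count() const { return disks.size(); }

  protected:
    std::vector<Disk> disks;
    EventQueue eq;
    double now;
};
```

A multimap keeps events sorted by time and allows two events at the same time, which is why it is a natural fit for an event queue. Constructing Simulation_System(7, 87600.0) gives eight disks, and Run() returns 87600 because Simulation_Over is the only event.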

The next step should be to implement a Run() method for the simulator, plus a Print_Event() method. I have that method as part of the Simulation_System class and not the Event class, because that makes it easier for me to access the Simulation_System variables, which I have protected. Now, implement the event processor, which removes events from the event queue, sets the simulation system time, and processes the event. Of course, the only event we process is the Simulation_Over event. At this point, I just have it print the time and exit. Test:

UNIX> raid_step_1 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
     87600.000 Simulation_Over     
UNIX> 
My next step was to implement the disks and to write code that prints the state of the simulation. Note, you don't need this for the final program, but it will help you debug. This is raid_step_2:
UNIX> raid_step_2 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
     87600.000 Simulation_Over      N+1-W&C
--------------------------
Simulation State: N+1-W&C
Time: 87600
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
UNIX> 
Now it's time to generate latent sector failures. You do this with an exponential, and you do it for each disk, starting with Disk 0. In my code, I print out the events as I generate them, and the code for processing the event simply exits:
UNIX> raid_step_3 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
Generated LF Event for disk 0 at time 1734.47
Generated LF Event for disk 1 at time 12832.1
Generated LF Event for disk 2 at time 938.281
Generated LF Event for disk 3 at time 18923.6
Generated LF Event for disk 4 at time 7972.93
Generated LF Event for disk 5 at time 14266.7
Generated LF Event for disk 6 at time 10909.8
Generated LF Event for disk 7 at time 4259.87
       938.281 Latent_Sector_Failure N+1-W&C
Event not implemented yet
UNIX> 
Note, your latent failure times should match mine exactly. They will, as long as you generate events in the exact same order as I do.

Now it's time to process the LF events. I'm not going to say how to do this -- study up on the lecture notes. Note, I'm doing no repair and no scrubbing, so disks simply accumulate latent sector failures. I also beefed up my system state printer to print all the events on the event queue, and I print the state at the beginning of the event processing loop. Check out the output (r4-out.txt):

UNIX> raid_step_4 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r4-out.txt
UNIX> head -n 42 r4-out.txt
--------------------------
Simulation State: N+1-W&C
Time: 87600
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
       938.281 Latent_Sector_Failure   2
      1734.468 Latent_Sector_Failure   0
      4259.873 Latent_Sector_Failure   7
      7972.935 Latent_Sector_Failure   4
     10909.753 Latent_Sector_Failure   6
     12832.073 Latent_Sector_Failure   1
     14266.656 Latent_Sector_Failure   5
     18923.603 Latent_Sector_Failure   3
     87600.000 Simulation_Over          
--------------------------
       938.281 Latent_Sector_Failure   2  N+1-W&C   ->   >=1-SF   
--------------------------
Simulation State: >=1-SF
Time: 938.281
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 1
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
      1734.468 Latent_Sector_Failure   0
      4259.873 Latent_Sector_Failure   7
      7972.935 Latent_Sector_Failure   4
     10909.753 Latent_Sector_Failure   6
     12832.073 Latent_Sector_Failure   1
     14266.656 Latent_Sector_Failure   5
     18923.603 Latent_Sector_Failure   3
     20111.011 Latent_Sector_Failure   2
     87600.000 Simulation_Over          
UNIX> tail -n 21 r4-out.txt
     87600.000 Simulation_Over          
--------------------------
Simulation State: >=1-SF
Time: 87600
  Disk 0 Up: 1 LF: 8
  Disk 1 Up: 1 LF: 7
  Disk 2 Up: 1 LF: 11
  Disk 3 Up: 1 LF: 10
  Disk 4 Up: 1 LF: 10
  Disk 5 Up: 1 LF: 5
  Disk 6 Up: 1 LF: 11
  Disk 7 Up: 1 LF: 10
     87625.452 Latent_Sector_Failure   7
     89427.896 Latent_Sector_Failure   4
     92845.404 Latent_Sector_Failure   5
     94719.573 Latent_Sector_Failure   0
     97374.679 Latent_Sector_Failure   6
     99230.452 Latent_Sector_Failure   3
    107978.718 Latent_Sector_Failure   2
    109327.601 Latent_Sector_Failure   1
--------------------------
UNIX> grep ' -> ' r4-out.txt | wc
      72     432    4824
UNIX> 
Ok -- from the above, you can see that the same failure events are generated and put on the event queue. The first failure event is on disk #2, and when that is processed, a new failure event is generated for time 20111.011. At the end, you can see that there are still failure events on the queue which are unprocessed, because the simulation is over. When you grep for " -> ", you see 72 failure events, and if you sum up the values of "LF", their sum is 72. Excellent.

Next, it's time to generate a scrub event. We do so after we generate the initial latent sector failure events. I don't implement the code to process it though. A quick test shows that the first scrub event is at 250.132. Make sure your times match mine.

UNIX> raid_step_5 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168
--------------------------
Simulation State: N+1-W&C
Time: 87600
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
       250.132 Scrub                    
       938.281 Latent_Sector_Failure   2
      1734.468 Latent_Sector_Failure   0
      4259.873 Latent_Sector_Failure   7
      7972.935 Latent_Sector_Failure   4
     10909.753 Latent_Sector_Failure   6
     12832.073 Latent_Sector_Failure   1
     14266.656 Latent_Sector_Failure   5
     18923.603 Latent_Sector_Failure   3
     87600.000 Simulation_Over          
--------------------------
       250.132 Scrub                    
Event not implemented yet
UNIX> 
Now we implement scrubbing. Again, this should be straightforward. Lines 115 to 160 of the output (r6-out.txt) show the first sector failure occurring on disk 2, and how it is cleaned up by the scrub at time 1033.209:
UNIX> raid_step_6 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r6-out.txt
UNIX> sed -n 115,160p r6-out.txt
       938.281 Latent_Sector_Failure   2  N+1-W&C   ->   >=1-SF   
--------------------------
Simulation State: >=1-SF
Time: 938.281
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 1
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
      1033.209 Scrub                    
      1734.468 Latent_Sector_Failure   0
      3728.991 Latent_Sector_Failure   2
      4259.873 Latent_Sector_Failure   7
      7972.935 Latent_Sector_Failure   4
     10909.753 Latent_Sector_Failure   6
     12832.073 Latent_Sector_Failure   1
     14266.656 Latent_Sector_Failure   5
     18923.603 Latent_Sector_Failure   3
     87600.000 Simulation_Over          
--------------------------
      1033.209 Scrub                      >=1-SF    ->   N+1-W&C  
--------------------------
Simulation State: N+1-W&C
Time: 1033.21
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
      1202.671 Scrub                    
      1734.468 Latent_Sector_Failure   0
      3728.991 Latent_Sector_Failure   2
      4259.873 Latent_Sector_Failure   7
      7972.935 Latent_Sector_Failure   4
     10909.753 Latent_Sector_Failure   6
     12832.073 Latent_Sector_Failure   1
     14266.656 Latent_Sector_Failure   5
     18923.603 Latent_Sector_Failure   3
     87600.000 Simulation_Over          
--------------------------
UNIX> 
Again, make sure your times match mine exactly. They should as long as you generate those initial events in the correct order.

Now, let's generate operational failures, but not process them. I do this below, and as you see, the first operational failure is after the simulation time:

UNIX> raid_step_7 7 0 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | head -n 30
--------------------------
Simulation State: N+1-W&C
Time: 87600
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
       250.132 Scrub                    
       938.281 Latent_Sector_Failure   2
      1734.468 Latent_Sector_Failure   0
      4259.873 Latent_Sector_Failure   7
      7972.935 Latent_Sector_Failure   4
     10909.753 Latent_Sector_Failure   6
     12832.073 Latent_Sector_Failure   1
     14266.656 Latent_Sector_Failure   5
     18923.603 Latent_Sector_Failure   3
     87600.000 Simulation_Over          
    158132.214 Operational_Failure     4
    220116.693 Operational_Failure     2
    249081.263 Operational_Failure     5
    288343.629 Operational_Failure     1
    590663.738 Operational_Failure     3
    609884.969 Operational_Failure     0
    662515.741 Operational_Failure     6
    804729.278 Operational_Failure     7
--------------------------
UNIX> 
However, if you try a seed of 1, you get an operational failure at 40921.139:
UNIX> raid_step_7 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | head -n 30
--------------------------
Simulation State: N+1-W&C
Time: 87600
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
        16.374 Latent_Sector_Failure   5
       223.414 Scrub                    
       393.709 Latent_Sector_Failure   0
      1923.553 Latent_Sector_Failure   6
      3791.118 Latent_Sector_Failure   3
      5611.313 Latent_Sector_Failure   1
      7717.702 Latent_Sector_Failure   4
     16672.710 Latent_Sector_Failure   2
     40921.139 Operational_Failure     4
     43050.179 Latent_Sector_Failure   7
     80859.489 Operational_Failure     3
     87600.000 Simulation_Over          
     93196.900 Operational_Failure     6
    218364.387 Operational_Failure     1
    228921.239 Operational_Failure     0
    399812.784 Operational_Failure     2
    411047.679 Operational_Failure     7
   1235197.318 Operational_Failure     5
--------------------------
UNIX> raid_step_7 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168 
...
--------------------------
     40921.139 Operational_Failure     4
Event not implemented yet
UNIX> 
Now, start implementing operational failures. First, get the state transitions right, and then generate the correct repair event. However, don't implement repair yet:
UNIX> raid_step_8 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168
....
     40921.139 Operational_Failure     4  N+1-W&C   ->   N-W&C    
--------------------------
Simulation State: N-W&C
Time: 40921.1
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
     40937.686 Repair                  4
     41111.416 Scrub                    
     43050.179 Latent_Sector_Failure   7
     43146.259 Latent_Sector_Failure   3
     44688.617 Latent_Sector_Failure   6
     46319.019 Latent_Sector_Failure   4
     55472.131 Latent_Sector_Failure   2
     55577.832 Latent_Sector_Failure   5
     57943.514 Latent_Sector_Failure   1
     63009.763 Latent_Sector_Failure   0
     80859.489 Operational_Failure     3
     87600.000 Simulation_Over          
     93196.900 Operational_Failure     6
    218364.387 Operational_Failure     1
    228921.239 Operational_Failure     0
    399812.784 Operational_Failure     2
    411047.679 Operational_Failure     7
   1235197.318 Operational_Failure     5
--------------------------
     40937.686 Repair                  4
Event not implemented yet
UNIX>
Good -- now, before moving on, we have an issue. Disk 4 is down, but there is still a Latent_Sector_Failure event on the event queue for it. That makes no sense, especially if the sector failure falls while the disk is down. So, what you need to do is delete that event from the event queue. You'll regenerate it when the disk is repaired. Raid_step_9 performs this deletion. You'll have to retain a pointer to the event's iterator in each Disk instance. Make sure you manage memory properly here. I deleted the Latent_Sector_Failure event.
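Here is a sketch of that deletion, assuming the event queue is a multimap keyed on time and that each Disk stores the iterator itself rather than a pointer to it. The struct layout and the function names schedule_lf and cancel_lf are hypothetical. Events are stored by value here, so there is nothing to free; if your queue holds Event pointers, you also need to delete the event when you erase it.

```cpp
#include <map>
#include <utility>
#include <vector>

struct Event { int disk; };                       // hypothetical minimal event
typedef std::multimap<double, Event> EventQueue;  // keyed on event time

struct Disk {
  bool up = true;
  bool lf_pending = false;      // true while lf_it points at a live queue entry
  EventQueue::iterator lf_it;   // this disk's pending Latent_Sector_Failure
};

// Put a latent sector failure on the queue and remember where it landed.
void schedule_lf(EventQueue &eq, Disk &d, int id, double when)
{
  d.lf_it = eq.insert(std::make_pair(when, Event{id}));
  d.lf_pending = true;
}

// Remove the disk's pending latent sector failure, if it has one.
void cancel_lf(EventQueue &eq, Disk &d)
{
  if (!d.lf_pending) return;
  eq.erase(d.lf_it);            // erasing by iterator leaves other iterators valid
  d.lf_pending = false;
}
```

Erasing by iterator removes exactly that one event, even if another event happens to have the same time, which is the reason to store the iterator instead of searching the queue by time.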
UNIX> raid_step_9 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168
....
     40921.139 Operational_Failure     4  N+1-W&C   ->   N-W&C    
--------------------------
Simulation State: N-W&C
Time: 40921.1
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 0 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
     40937.686 Repair                  4
     41111.416 Scrub                    
     43050.179 Latent_Sector_Failure   7
     43146.259 Latent_Sector_Failure   3
     44688.617 Latent_Sector_Failure   6
     55472.131 Latent_Sector_Failure   2
     55577.832 Latent_Sector_Failure   5
     57943.514 Latent_Sector_Failure   1
     63009.763 Latent_Sector_Failure   0
     80859.489 Operational_Failure     3
     87600.000 Simulation_Over          
     93196.900 Operational_Failure     6
    218364.387 Operational_Failure     1
    228921.239 Operational_Failure     0
    399812.784 Operational_Failure     2
    411047.679 Operational_Failure     7
   1235197.318 Operational_Failure     5
--------------------------
     40937.686 Repair                  4
Event not implemented yet
Good -- the Latent_Sector_Failure is gone. Ok -- let's implement repair. Repair will set the state back to "N+1 Working & Clean", and it will generate a new latent failure event and a new operational failure event. Make sure you do it in that order. In raid_step_A, I generate the new events, print the state and exit -- take a look:
UNIX> raid_step_A 7 1 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | tail -n 61
     40921.139 Operational_Failure     4  N+1-W&C   ->   N-W&C    
--------------------------
Simulation State: N-W&C
Time: 40921.1
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 0 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
     40937.686 Repair                  4
     41111.416 Scrub                    
     43050.179 Latent_Sector_Failure   7
     43146.259 Latent_Sector_Failure   3
     44688.617 Latent_Sector_Failure   6
     55472.131 Latent_Sector_Failure   2
     55577.832 Latent_Sector_Failure   5
     57943.514 Latent_Sector_Failure   1
     63009.763 Latent_Sector_Failure   0
     80859.489 Operational_Failure     3
     87600.000 Simulation_Over          
     93196.900 Operational_Failure     6
    218364.387 Operational_Failure     1
    228921.239 Operational_Failure     0
    399812.784 Operational_Failure     2
    411047.679 Operational_Failure     7
   1235197.318 Operational_Failure     5
--------------------------
     40937.686 Repair                  4  N-W&C     ->   N+1-W&C  
--------------------------
Simulation State: N+1-W&C
Time: 40937.7
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
     41111.416 Scrub                    
     41459.894 Latent_Sector_Failure   4
     43050.179 Latent_Sector_Failure   7
     43146.259 Latent_Sector_Failure   3
     44688.617 Latent_Sector_Failure   6
     55472.131 Latent_Sector_Failure   2
     55577.832 Latent_Sector_Failure   5
     57943.514 Latent_Sector_Failure   1
     63009.763 Latent_Sector_Failure   0
     80859.489 Operational_Failure     3
     87600.000 Simulation_Over          
     93196.900 Operational_Failure     6
    218364.387 Operational_Failure     1
    228921.239 Operational_Failure     0
    399812.784 Operational_Failure     2
    411047.679 Operational_Failure     7
   1235197.318 Operational_Failure     5
   1252031.348 Operational_Failure     4
--------------------------
UNIX> 
As you can see, the repair created two new events -- the Latent_Sector_Failure at time 41459.894, and the Operational_Failure at time 1252031.348.

We're almost done. However, before going on, let's think about the case where we have a latent sector failure and we're in the "N Working & Clean" state. This means we have to go to the data loss state. Make sure you have that case implemented, along with the proper code to deal with getting an operational failure when you're in the "≥1 Sector Failure" state. Test, test, test. The raid_verbose program prints out the state of the system before each event. The raid program simply prints the events, and that is the one which your program has to match.
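The operational failure case is the fiddly one, so here is a sketch of the check at the disk level. The names are hypothetical, and clearing the failed disk's latent failure count at failure time (rather than at repair time) is my assumption; either way, the disk has to be clean once it is repaired.

```cpp
#include <cstddef>
#include <vector>

enum class State { N1_WC, N_WC, SF, DATA_LOSS };

struct Disk {
  bool up = true;
  int latent_failures = 0;
};

// State after disk `failed` suffers an operational failure. Data is lost if
// any OTHER disk is already down or holds a latent sector failure.
State after_operational_failure(std::vector<Disk> &disks, std::size_t failed)
{
  for (std::size_t i = 0; i < disks.size(); i++) {
    if (i == failed) continue;
    if (!disks[i].up || disks[i].latent_failures > 0) return State::DATA_LOSS;
  }
  disks[failed].up = false;
  disks[failed].latent_failures = 0;   // assumption: cleared here, not at repair
  return State::N_WC;
}
```

If disk 7 is the only disk with a latent failure and disk 7 is the one that fails, this returns State::N_WC. If disk 5 has the latent failure and disk 2 fails, it returns State::DATA_LOSS.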

Here are some good examples with raid_verbose and raid. Run it with a seed of 8 and you get data loss with a sector failure followed by an operational failure:

UNIX> raid 7 8 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | tail 
     67030.379 Latent_Sector_Failure   3  N+1-W&C   ->   >=1-SF   
     67080.075 Scrub                      >=1-SF    ->   N+1-W&C  
     67218.335 Scrub                      N+1-W&C   ->   N+1-W&C  
     67436.399 Scrub                      N+1-W&C   ->   N+1-W&C  
     67619.237 Scrub                      N+1-W&C   ->   N+1-W&C  
     67641.410 Latent_Sector_Failure   2  N+1-W&C   ->   >=1-SF   
     67798.062 Scrub                      >=1-SF    ->   N+1-W&C  
     68007.887 Scrub                      N+1-W&C   ->   N+1-W&C  
     68048.479 Latent_Sector_Failure   5  N+1-W&C   ->   >=1-SF   
     68078.932 Operational_Failure     2  >=1-SF    ->   Data-Loss
UNIX> raid_verbose 7 8 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r-out-08.txt
UNIX> tail -n 32 r-out-08.txt
     68048.479 Latent_Sector_Failure   5  N+1-W&C   ->   >=1-SF   
--------------------------
Simulation State: >=1-SF
Time: 68048.5
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 1 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 1
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
     68078.932 Operational_Failure     2
     68204.706 Scrub                    
     68432.905 Latent_Sector_Failure   3
     68927.632 Latent_Sector_Failure   7
     69787.407 Latent_Sector_Failure   5
     69993.567 Latent_Sector_Failure   4
     70643.753 Latent_Sector_Failure   2
     78296.389 Latent_Sector_Failure   0
     80980.939 Latent_Sector_Failure   6
     87224.610 Latent_Sector_Failure   1
     87600.000 Simulation_Over          
    527848.763 Operational_Failure     1
    533624.845 Operational_Failure     4
    564553.233 Operational_Failure     0
    637065.024 Operational_Failure     7
    688310.478 Operational_Failure     6
    809074.943 Operational_Failure     5
   1119991.579 Operational_Failure     3
--------------------------
     68078.932 Operational_Failure     2  >=1-SF    ->   Data-Loss
UNIX> 
Run it with a seed of 126 and you get data loss with an operational failure followed by a sector failure:
UNIX> raid 7 126 87600 1.12 461386 0.000108003 6 2 12 36 3 168 | tail 
     16431.317 Latent_Sector_Failure   1  N+1-W&C   ->   >=1-SF   
     16566.950 Scrub                      >=1-SF    ->   N+1-W&C  
     16822.776 Scrub                      N+1-W&C   ->   N+1-W&C  
     16840.946 Latent_Sector_Failure   2  N+1-W&C   ->   >=1-SF   
     17018.701 Scrub                      >=1-SF    ->   N+1-W&C  
     17279.807 Scrub                      N+1-W&C   ->   N+1-W&C  
     17444.582 Scrub                      N+1-W&C   ->   N+1-W&C  
     17606.732 Operational_Failure     2  N+1-W&C   ->   N-W&C    
     17612.592 Scrub                      N-W&C     ->   N-W&C    
     17614.251 Latent_Sector_Failure   7  N-W&C     ->   Data-Loss
UNIX> raid_verbose 7 126 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r-out-126.txt
UNIX> tail -n 32 r-out-126.txt
--------------------------
     17612.592 Scrub                      N-W&C     ->   N-W&C    
--------------------------
Simulation State: N-W&C
Time: 17612.6
  Disk 0 Up: 1 LF: 0
  Disk 1 Up: 1 LF: 0
  Disk 2 Up: 0 LF: 0
  Disk 3 Up: 1 LF: 0
  Disk 4 Up: 1 LF: 0
  Disk 5 Up: 1 LF: 0
  Disk 6 Up: 1 LF: 0
  Disk 7 Up: 1 LF: 0
     17614.251 Latent_Sector_Failure   7
     17625.196 Repair                  2
     17861.024 Scrub                    
     18465.175 Latent_Sector_Failure   1
     19339.598 Latent_Sector_Failure   0
     20112.740 Latent_Sector_Failure   3
     24367.099 Operational_Failure     7
     31394.891 Latent_Sector_Failure   6
     56342.535 Latent_Sector_Failure   5
     61687.690 Latent_Sector_Failure   4
     83392.317 Operational_Failure     3
     87600.000 Simulation_Over          
    151580.748 Operational_Failure     6
    271051.588 Operational_Failure     0
    329718.300 Operational_Failure     1
    350806.455 Operational_Failure     5
    384234.038 Operational_Failure     4
--------------------------
     17614.251 Latent_Sector_Failure   7  N-W&C     ->   Data-Loss
UNIX> 
If we modify the operational failure rate, we see two operational failures causing data loss:
UNIX> raid 7 3 87600 1.12 4613 0.000108003 6 2 12 36 3 168
        93.258 Operational_Failure     2  N+1-W&C   ->   N-W&C    
       109.223 Repair                  2  N-W&C     ->   N+1-W&C  
       185.206 Scrub                      N+1-W&C   ->   N+1-W&C  
       407.436 Operational_Failure     7  N+1-W&C   ->   N-W&C    
       408.000 Operational_Failure     5  N-W&C     ->   Data-Loss
UNIX> 
And finally, the following call shows a latent sector failure followed by an operational failure to the same disk, which is not a data loss event:
UNIX> raid 7 272 87600 1.12 461386 0.000108003 6 2 12 36 3 168 > r-out-272.txt
UNIX> sed -n 90,100p r-out-272.txt
     13382.944 Scrub                      N+1-W&C   ->   N+1-W&C  
     13627.772 Scrub                      N+1-W&C   ->   N+1-W&C  
     13865.283 Scrub                      N+1-W&C   ->   N+1-W&C  
     14105.195 Scrub                      N+1-W&C   ->   N+1-W&C  
     14174.163 Latent_Sector_Failure   7  N+1-W&C   ->   >=1-SF   
     14226.446 Operational_Failure     7  >=1-SF    ->   N-W&C    
     14239.088 Repair                  7  N-W&C     ->   N+1-W&C  
     14352.994 Scrub                      N+1-W&C   ->   N+1-W&C  
     14460.291 Scrub                      N+1-W&C   ->   N+1-W&C  
     14717.011 Scrub                      N+1-W&C   ->   N+1-W&C  
     14848.550 Scrub                      N+1-W&C   ->   N+1-W&C  
UNIX> 
Have fun, and start early! Tests 70 through 80 of the grading script test your error checking. The others test various features. For example, if you only generate latent sector failures, then some of the examples will run correctly.

Citation

[Elerath07] J. G. Elerath and M. Pecht, "Enhanced Reliability Modeling of RAID Storage Systems," DSN 2007: International Conference on Dependable Systems and Networks, Edinburgh, UK, IEEE, 2007.