$Id: README,v 1.2 2001/06/21 23:07:06 dwmw2 Exp $
$Log: README,v $
Revision 1.2  2001/06/21 23:07:06  dwmw2
Initial import to MTD CVS

Revision 1.1  2001/06/11 19:34:40  vipin
Added README file to dir.


This is the README file for the "checkfs" power fail test program.
By: Vipin Malik

NOTE: This program requires an external "power cycling box"
connected to one of the com ports of the system under test. 
This power cycling box should wait for a random amount of time 
after it receives an "ok to power me down" message over the
serial port, and then yank power to the system under test.
(The box that I rigged up and tested with waits anywhere from
0 to ~40 seconds.)

It should then restore power after a few seconds and wait for the
message again.


ABOUT:

This program's primary purpose is to test the reliability
of various file systems under Linux.

SETUP:

You need to set up the file system you want to test and run the
"makefiles" program ONCE. This creates a set of files that are
required by the "checkfs" program.

Also copy the "checkfs" executable program to the same dir.

Then you need to make sure that the program "checkfs" is called
automatically on startup. You can customise the operation of
the "checkfs" program by passing it various cmd line arguments.
Run "checkfs -?" for more details.

****NOTE*******
Make sure that you call the checkfs program only after you have
mounted the file system you want to test (this is obvious), but
also after you have run any "scan" utilities that check for and
fix file system errors. e2fsck is one such utility for the
ext2 file system. For an automated setup you of course need to
make these scan programs run without user interaction (the -f -y
flags for e2fsck, for example).

File systems like JFFS and JFFS2 do not have any such external
utilities and you may call "checkfs" right after you have mounted
the respective file system under test.

There are two ways you can mount the file system under test:

1. Mount your root fs on a "standard" fs like ext2 and then
mount the file system under test (which may be ext2 on another
partition or device) and then run "checkfs" on this mounted
partition OR

2. Put the fs under test on the device that holds your root fs,
so that it IS your root fs, and run "checkfs" on the root device
(i.e. "/"). You can of course still run checkfs in a separate dir
under your "/" root dir.

I have found the second method to be a particularly stringent
arrangement (and thus preferred when you are trying to break
something).

Using this arrangement I was able to find that JFFS clobbered
some "sister" files on the root fs even though "checkfs" would
run fine through all its own check files.

(I found this out when one of the clobbered sister files happened
to be /bin/bash. The system refused to run rc.local, thus
preventing my "checkfs" program from being launched :)

"checkfs":

The "formatting" reliability of the fs as well as the file data integrity
of files on the fs can be checked using this program.

"Formatting" reliability can only be checked via an indirect method.
If there are severe formatting reliability issues with the file system,
they will most likely cause other system failures that will prevent this
program from running successfully on power up. This will prevent
an "ok to power me down" message from going out to the power cycling
black box and so prevent power from being turned off again.

File data reliability is checked more directly. A fixed number of
files are created in the current dir (using the program "makefiles").

Each file has a random number of bytes in it (set by using the
-s cmd line flag). The number of "ints" in the file is stored as the
first "int" in it (note: 0 length files are not allowed). Each file
is then filled with random data and a 16 bit CRC appended at the end.

When "checkfs" is run, it goes through all the files (which have
predetermined file names), one at a time, and checks the number of
"ints" in each file as well as the ending CRC.
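
The file layout and the check described above can be sketched roughly as
follows. This is an illustration only, not the code from "makefiles" or
"checkfs". The function names are hypothetical, the CRC variant
(CRC-16/CCITT-FALSE) is an assumption since only "16 bit CRC" is stated
here, and it is also an assumption that the leading count includes the
count "int" itself. The image is built in memory rather than on disk to
keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF).
 * ASSUMPTION: the real program may use a different 16 bit CRC. */
uint16_t crc16(const unsigned char *buf, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Build one check file image: [n_ints][random ints...][crc16].
 * ASSUMPTION: n_ints counts the leading count int as well.
 * Returns the image length in bytes, or 0 if it does not fit in cap. */
size_t build_check_image(unsigned char *out, size_t cap, int n_ints)
{
    assert(n_ints > 0);                      /* 0 length files are not allowed */
    size_t payload = (size_t)n_ints * sizeof(int);
    size_t total = payload + sizeof(uint16_t);
    if (cap < total)
        return 0;

    int v = n_ints;
    memcpy(out, &v, sizeof v);               /* first int holds the count */
    for (int i = 1; i < n_ints; i++) {       /* the rest is random data */
        v = rand();
        memcpy(out + (size_t)i * sizeof(int), &v, sizeof v);
    }
    uint16_t crc = crc16(out, payload);
    memcpy(out + payload, &crc, sizeof crc); /* CRC appended at the end */
    return total;
}

/* Returns 1 if the stored count matches the length and the CRC is good,
 * 0 if the image looks corrupt. */
int check_image(const unsigned char *img, size_t len)
{
    int n_ints;
    uint16_t stored;

    if (len < sizeof(int) + sizeof(uint16_t))
        return 0;
    memcpy(&n_ints, img, sizeof n_ints);
    if (n_ints <= 0 ||
        (size_t)n_ints * sizeof(int) + sizeof(uint16_t) != len)
        return 0;
    memcpy(&stored, img + len - sizeof stored, sizeof stored);
    return crc16(img, len - sizeof stored) == stored;
}
```

A file that was only partly rewritten when power failed would normally fail
either the count check or the CRC check in check_image().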

The program exits if the number of corrupt files is greater
than a user specified parameter (set by using the -e cmd line flag).

If the number of corrupt files does not exceed this parameter, the corrupt
files are repaired and operation resumes as explained below.
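
The -e rule above can be sketched as a small decision function. This is
an illustration only; the names "scan_verdict" and "judge_scan" are
hypothetical and do not come from the checkfs sources.

```c
#include <assert.h>
#include <stddef.h>

/* What to do once the startup scan has finished. */
enum scan_verdict {
    SCAN_STOP,             /* too many corrupt files: stop and wait for the user */
    SCAN_REPAIR_AND_RESUME /* within the -e limit: repair and carry on */
};

/* file_ok[i] is nonzero when check file i passed its count and CRC check.
 * max_errors is the value given with the -e flag.
 * The number of corrupt files is stored in *n_corrupt if it is non-NULL. */
enum scan_verdict judge_scan(const int *file_ok, size_t n_files,
                             int max_errors, int *n_corrupt)
{
    assert(file_ok != NULL || n_files == 0);

    int bad = 0;
    for (size_t i = 0; i < n_files; i++)
        if (!file_ok[i])
            bad++;
    if (n_corrupt)
        *n_corrupt = bad;

    /* Exit only when the count is GREATER than the limit, so -e 1
     * tolerates exactly one corrupt file and -e 0 tolerates none. */
    return (bad > max_errors) ? SCAN_STOP : SCAN_REPAIR_AND_RESUME;
}
```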

The idea behind allowing a user specified number of corrupt files is as
follows:

If you are testing for "formatting" reliability of a fs, and for
the data reliability of "other" files present on the fs, use -e 1.
"Other" files are defined as sister files on the fs that are not being
written to by the "checkfs" test program.

As mentioned, in this case you would set -e 1, allowing at most 1 file
to be corrupt after each power fail. This would be the file
that was probably being written to when power failed (so its CRC was not
updated to reflect the new data being written). You would check file
systems like ext2 etc. with such a configuration.
(You have no hope that these file systems guarantee that either your
new data or your old data is present in the file if power fails during
the write. Such a guarantee is called "roll back and recover".)

With JFFS2 I tested for such "roll back and recover" file data reliability
by setting -e 0 and making sure that all writes to the file being
updated are done in a *single* write().

This is how I found that JFFS2 does NOT (yet) support this functionality.
(There was a great debate over whether this was a bug, a feature that was
lacking, or even an issue at all. See the mtd archives for more details.)

In other words, JFFS2 will partially update a file on FLASH even before
the write() command has completed, thus leaving partly old data and partly
new data in your file if power fails in the middle of a write().

This is bad functionality if you are updating a binary structure or a
CRC protected file (as in our case).
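
Updating a file in a *single* write() might look roughly like the sketch
below. It is an illustration only and not the checkfs code: the function
name is hypothetical, and it assumes the complete new contents (count,
data and CRC) have already been built in one buffer and are the same
length as the old contents, so the file is overwritten in place. A single
write() is what the test relies on; as noted above, it does not by itself
guarantee that the fs applies the update all at once.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Overwrite the file at path with the complete image using exactly one
 * write() call. Returns 0 on success, -1 on any error or short write. */
int write_image_once(const char *path, const void *image, size_t len)
{
    assert(path != NULL && image != NULL);

    /* No O_TRUNC: truncating first would itself be a separate update
     * that could leave an empty file behind if power failed. */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, image, len);   /* the one and only write() */
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```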


If All Files Check Out OK:

On the startup scan, if there are no more errors than allowed by the -e flag,
an "ok to power me down" message is sent via the specified com port.

The actual format of this message will depend on the format expected
by the power cycling box that will receive this message. One may customise
the actual message that goes out in the "do_pwr_dn()" routine in "comm.c".

This routine is called with an open file descriptor for the com port that
this message needs to go out over and the count of the current power
cycle (in case your power cycling box can display/log this count).
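
A routine of this kind might look roughly like the sketch below. It is
not the actual "do_pwr_dn" from "comm.c": the function name and the
message text "PWR_DN cycle=N" are made up for illustration, and a real
power cycling box will expect its own format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Send a hypothetical "ok to power me down" message, including the
 * current power cycle count, over an already open descriptor.
 * Returns 0 on success, -1 on error or short write. */
int do_pwr_dn_sketch(int fd, int cycle_count)
{
    char msg[64];

    assert(fd >= 0);   /* the caller passes an open com port descriptor */

    /* ASSUMPTION: message format invented for this example. */
    int len = snprintf(msg, sizeof msg, "PWR_DN cycle=%d\r\n", cycle_count);
    if (len < 0 || (size_t)len >= sizeof msg)
        return -1;

    return (write(fd, msg, (size_t)len) == (ssize_t)len) ? 0 : -1;
}
```

Because the routine only needs a file descriptor, it can be tried out on
a pipe or a regular file before a serial port is wired up.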

After this message has been sent out, the checkfs program goes into
a while(1) loop of writing new data (with CRC) into each of the
"check files" in the dir, one file at a time.

Its life comes to a sudden end when power is asynchronously pulled from
under its feet (by your external power cycling box).

It comes back to life when power is restored and the system boots and
checkfs is called from the rc.local script file.

The cycle then repeats until a problem is detected, at which point
the "ok to power me down" message is not sent and the cycle stops,
waiting for the user to examine the system.