$Id: README,v 1.2 2001/06/21 23:07:06 dwmw2 Exp $

$Log: README,v $
Revision 1.2  2001/06/21 23:07:06  dwmw2
Initial import to MTD CVS

Revision 1.1  2001/06/11 19:34:40  vipin
Added README file to dir.


This is the README file for the "checkfs" power fail test program.

By: Vipin Malik

NOTE: This program requires an external "power cycling box" connected
to one of the com ports of the system under test. This power cycling
box should wait for a random amount of time after it receives an "ok
to power me down" message over the serial port, and then yank power
to the system under test. (The box that I rigged up and tested with
waits anywhere from 0 to ~40 seconds.) It should then restore power
after a few seconds and wait for the message again.

ABOUT:
This program's primary purpose is to test the reliability of various
file systems under Linux.

SETUP:
You need to set up the file system you want to test and run the
"makefiles" program ONCE. This creates a set of files that are
required by the "checkfs" program. Also copy the "checkfs" executable
program to the same dir.

Then you need to make sure that the program "checkfs" is called
automatically on startup. You can customise the operation of the
"checkfs" program by passing it various cmd line arguments. Run
"checkfs -?" for more details.

****NOTE*******
Make sure that you call the checkfs program only after you have
mounted the file system you want to test (this is obvious), but also
after you have run any "scan" utilities to check for and fix any file
system errors. e2fsck is one such utility for the ext2 file system.
For an automated setup you of course need to run these scan programs
in standalone mode (the -f -y flags for e2fsck, for example). File
systems like JFFS and JFFS2 do not have any such external utilities,
and you may call "checkfs" right after you have mounted the
respective file system under test.

There are two ways you can mount the file system under test:

1. Mount your root fs on a "standard" fs like ext2, then mount the
   file system under test (which may be ext2 on another partition or
   device) and run "checkfs" on this mounted partition.

OR

2. Make the fs AND device under test your root fs and run "checkfs"
   on the root device (i.e. "/"). You can of course still run checkfs
   under a separate dir under your "/" root dir.

I have found the second method to be a particularly stringent
arrangement (and thus preferred when you are trying to break
something). Using this arrangement I was able to find that JFFS
clobbered some "sister" files on the root fs even though "checkfs"
would run fine through all its own check files. (I found this out
when one of the clobbered sister files happened to be /bin/bash. The
system refused to run rc.local, thus preventing my "checkfs" program
from being launched :)

"checkfs":
Both the "formatting" reliability of the fs and the data integrity of
files on the fs can be checked using this program.

"Formatting" reliability can only be checked via an indirect method.
If there are severe formatting reliability issues with the file
system, they will most likely cause other system failures that
prevent this program from running successfully on a power up. This
will prevent the "ok to power me down" message from going out to the
power cycling black box and prevent power from being turned off
again.

File data reliability is checked more directly. A fixed number of
files are created in the current dir (using the program "makefiles").
Each file has a random number of bytes in it (set by using the -s cmd
line flag). The number of "ints" in the file is stored as the first
"int" in it (note: 0 length files are not allowed). Each file is then
filled with random data and a 16 bit CRC appended at the end.
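For concreteness, here is a minimal sketch (in C, and NOT the actual
makefiles/checkfs source) of how one such check file could be written
in the layout just described. The function name, the MAX_INTS limit,
the file permissions and the CRC-16 polynomial are illustrative
assumptions only; the real program may differ in all of these details.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_INTS 1024            /* assumed limit; checkfs sets the size via -s */

    /* 16 bit CRC (CRC-16/ARC, used here purely for illustration). */
    static unsigned short crc16(const unsigned char *p, size_t len)
    {
        unsigned short crc = 0;
        size_t i;
        int b;

        for (i = 0; i < len; i++) {
            crc ^= p[i];
            for (b = 0; b < 8; b++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
        }
        return crc;
    }

    /* Write one check file: count int, then the random ints, then the CRC. */
    static int write_check_file(const char *name)
    {
        int image[MAX_INTS + 2];         /* count + data + room for the CRC */
        int n = (rand() % MAX_INTS) + 1; /* 0 length files are not allowed  */
        size_t bytes = (size_t)(n + 1) * sizeof(int);
        unsigned short crc;
        int fd, i, ret = -1;

        image[0] = n;                    /* first int = number of ints */
        for (i = 1; i <= n; i++)
            image[i] = rand();           /* random file data */
        crc = crc16((unsigned char *)image, bytes);
        memcpy((unsigned char *)image + bytes, &crc, 2); /* 16 bit CRC at the end */

        fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        /* the whole image goes out in a single write() call */
        if (write(fd, image, bytes + 2) == (ssize_t)(bytes + 2))
            ret = 0;
        close(fd);
        return ret;
    }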
When "checkfs" is run, it runs through all the files (which have
predetermined file names), one at a time, and checks the number of
"ints" in each as well as the ending CRC. The program exits if the
number of corrupt files is greater than a user specified parameter
(set by using the -e cmd line flag). If the number of corrupt files
is less than this parameter, the corrupt files are repaired and
operation resumes as explained below.

The idea behind allowing a user specified number of corrupt files is
as follows:

If you are testing for "formatting" reliability of a fs, and for the
data reliability of "other" files present on the fs, use -e 1.
"Other" files are defined as sister files on the fs that are not
being written to by the "checkfs" test program. As mentioned, in this
case you would set -e 1, i.e. allow at most 1 file to be corrupt each
time after a power fail. This would be the file that was probably
being written to when power failed (so its CRC was not updated to
reflect the new data being written). You would check file systems
like ext2 etc. with such a configuration, as you have no hope that
these file systems guarantee that either your new data or your old
data will be present in the file if power failed during the write.
(That guarantee is called "roll back and recover".)

With JFFS2 I tested for such "roll back and recover" file data
reliability by setting -e 0 and making sure that all writes to the
file being updated are done in a *single* write(). This is how I
found that JFFS2 does NOT (yet) support this functionality. (There
was a great debate about whether this was a bug, a missing feature,
or an issue at all. See the mtd archives for more details.) In other
words, JFFS2 will partially update a file on FLASH even before the
write() call has completed, thus leaving part old data, part new data
in your file if power failed in the middle of a write(). This is bad
behaviour if you are updating a binary structure or a CRC protected
file (as in our case).

If All Files Check Out OK:
On the startup scan, if there are fewer errors than specified by the
-e flag, an "ok to power me down" message is sent out via the
specified com port. The actual format of this message will depend on
the format expected by the power cycling box that will receive it.
One may customise the actual message that goes out in the
"do_pwr_dn()" routine in "comm.c". This routine is called with an
open file descriptor for the com port that the message needs to go
out over, and the count of the current power cycle (in case your
power cycling box can display/log this count).

After this message has been sent out, the checkfs program goes into a
while(1) loop of writing new data (with CRC), one file at a time,
into all the "check files" in the dir. Its life comes to a sudden end
when power is asynchronously pulled from under its feet (by your
external power cycling box). It comes back to life when power is
restored, the system boots, and checkfs is called from the rc.local
script file. The cycle then repeats until a problem is detected, at
which point the "ok to power me down" message is not sent and the
cycle stops, waiting for the user to examine the system.
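As a rough illustration only (the real code lives in comm.c and must
be matched to your own power cycling box), a do_pwr_dn() style
routine could look something like the sketch below. The message text
is an assumption, NOT the format the shipped program uses; the sketch
simply shows the shape of the call: an already open com port file
descriptor plus the current power cycle count.

    #include <stdio.h>
    #include <unistd.h>

    /* Sketch of a do_pwr_dn() style routine: fd is the already open com
     * port, cycle_cnt the current power cycle count.  Customise the
     * message to whatever your power cycling box expects.              */
    int do_pwr_dn(int fd, int cycle_cnt)
    {
        char msg[64];
        int len;

        /* illustrative message only */
        len = snprintf(msg, sizeof(msg),
                       "ok to power me down, cycle %d\n", cycle_cnt);
        if (write(fd, msg, len) != len)
            return -1;               /* message did not make it out */
        return 0;
    }

Keeping the message format in one small routine like this is what
makes it easy to adapt checkfs to a different power cycling box
without touching the rest of the test logic.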