Table of Contents
This document describes the design and provides notes on a program to split an original file in to three or more equal sized components and to reassemble the components in to the original file using RAID technology.
This document should be read in conjunction with the source code.
The program will use RAID 5 (XOR) technology to protect data against loss or unauthorized access. There are many possible uses for the programme one example would be to split a file and store each of the components on different third party data storage facilities. In this example RAID 5 technology would ensure the original file could still be recovered in the event of one of the third party storage facilities being unavailable. In this example security is improved by ensuring that none of the third party data storage facilities holds sufficient data to re-construct the original file.
The file to be split in to three or more component files.
One of three or more files in to which the original file is split.
Part of the original file; an array of N bytes read from the original file, which is to be split in to N+1 component files. The array may be rotated.
An array of M bytes consisting of M -1 consecutive bytes read from the original file and one parity byte, where M is the number of component files. The array is constructed by reading one byte from each component file.
Each of the components will be the same size, and hold the same portion of data and parity as each other. Each component will be as redundant as the others; no component will be any more significant than any other.
Because the original file may not divide in to the number of component files required, some of the components will be padded with random data to ensure they are all the same size. The trailer section will contain information about the padding and the programme will anticipate this.
The components will begin with a header section containing information about the component and it's relationship to the other components and original file. A future requirement may be to provide options to nullify this information in which case the user will need to maintain this information separately in order to reconstruct the original file.
The programme will work in a Unix like environment using a command line interface. It is assumed that the original file will be made available on standard in and when reconstructed from the components the original file will be provided on standard out.
The program will be designed to process the original file or components in a single pass. Files can be piped in to or out of gsecraif.
Component files will be processed “in parallel” i.e. non sequentially. All the component files will be opened “simultaneously” and data will be read or written to the the files “simultaneously”. This is an inevitable requirement if arbitrarily large files are to be processed in a single pass.
For added security, before calculating parity, the data in the original file will be bit rotated thus ensuring that none of the components contain any of the original data. During re-assembly the data from the components will be reverse bit rotated.
The programme will limit the number of component files to 255 this simplifies the design and facilitates testing by avoiding the possibility of a very large number of component files.
The programme could be used in a cascading manner with a file split in to a number of temporary files which are then further split using the programme a second time. There would be no need to retain the temporary files and reconstruction would be done by the two stages above in reverse. The number of component files created is only limited by time and system resources.
As well as increasing the number of component files, cascading could also be used to increase redundancy, for example nine files created in a single pass could survive the loss of a single file where as nine files consisting of three groups of three created from three temporary files could survive the loss of any three files from nine.
Before calculating parity, the bytes will be bit rotated.
In the example below the three bytes are rotated 3 bits
Each bit is represented by a letter (with a value of 1 or 0)
Before rotation:
abcdefgh ijklmnop qrstuvwx
After rotations:
defghijk lmnopqrs tuvwxabc
Each new byte contains bits from two of the original bytes.
The rotating can be accomplished in numerous ways. In this example the bytes are copied to two temporary arrays.
First each byte is right shifted 5 bits and the result goes one byte to the left, except for the first byte with goes to the last byte:
original |
abcdefgh
|
ijklmnop
|
qrstuvwx
|
temp1 |
00000ijk
|
00000qrs
|
00000abc
|
Each byte is then left shifted 3 bits
original |
abcdefgh
|
ijklmnop
|
qrstuvwx
|
temp2 |
defgh000
|
lmnop000
|
tuvwx000
|
The results are ORed together to provide the desired result.
temp1 |
00000ijk
|
00000qrs
|
00000abc
|
temp2 |
defgh000
|
lmnop000
|
tuvwx000
|
temp1 OR temp2 |
defghijk
|
lmnopqrs
|
tuvwxabc
|
To Reverse the above example
Each byte is left shifted 5 bits and the result goes one byte to the right, except for the last byte which goes to the first byte
Each byte is then right shifted 3 bits
The results are ORed together to provide the desired result.
There is no point in rotating a multiple of 8 bits; this would simply be byte swapping. Similarly there is no point rotating more than 7 bits as this would be equivalent to byte swapping and rotating 7 or less bits.
Parity is calculated by successive XOR operations
Note that XOR is both an associative and a commutative operation. http://en.wikipedia.org/wiki/XOR
Byte number | Parity |
1 | Set value of parity to that of byte 1 |
2 | Parity is parity xor byte 2 |
3 | P = P xor byte 3 |
4 | P = P xor byte 4 |
5 | P = P xor byte 5 |
N | P = P xor byte N |
The diagram below shows xor “triangles” (any two points on the triangle are xor'ed to produce the third):
1 |\ | \ 2--p2 /| / | 3--p3 /| / | 4--p4 /| / | 5--p5
Data is be striped across the files to spread the parity and data evenly.
In the example below data is read from five files. Each successive byte is represented by successive letters and the parity byte by the symbol ⊕
A | B | C | D | ⊕ |
F | G | H | ⊕ | E |
K | L | ⊕ | I | J |
P | ⊕ | M | N | O |
⊕ | Q | R | S | T |
U | V | W | X | ⊕ |
When the programme is combining component files to produce the original, the stripes are read from the component files as above and then “de-striped” Eg:
A | B | C | D | ⊕ |
E | F | G | H | ⊕ |
I | I | K | L | ⊕ |
M | N | O | P | ⊕ |
Q | R | S | T | ⊕ |
U | V | W | X | ⊕ |
The header contains the following items of information:
version number (I.e version of file format) 1 byte
number of component files, 1 byte
identity of the component file I.e which file (number) this is, 1 byte
bit rotation value (how many bits were rotated) 1 byte
The trailer contains a value indicating the number of padding bytes added to the original file in order to make the component files the same size. The trailer is 1 byte. Each component file needs to have the same trailer reporting the number of padding bytes; any given component file may be missing so it is not possible to rely upon a scheme where each component file reports the padding in just that component file.
Stripes are constructed by reading one byte from each of the component files.
Although the last byte of each component file should be identical as described above, it is not possible to use that fact to determine the end of the component files. The original file may contain a succession of identical bytes which would result in identical bytes in the component files; although the parity byte would be different, the component file containing the parity byte may be missing, therefore a stripe containing identical bytes does not necessarily indicate the end of the component files. The only way to determine the end of the component files is to try to read beyond the end and obtain the EOF status. This affects the way in which the trailer must be anticipated and the way data is read and processed.
When reconstructing the original file three stripes are held in memory. When the first stripe is read it is not possible to determine if it the last. Similarly when the second stripe is read it is not possible to determine if it the last and therefore the trailer and the preceding stripe can not be processed because the manner in which it is processed depends upon whether it precedes the trailer. When the third stripe is read, it may or may not be the trailer, however the preceding (second) stripe is not the trailer and so the stripe before that (first) can be processed in the normal way.
When an attempt to read a stripe fails, the end of the component files has been reached. In this case, the preceding (second) stripe was the trailer and this information can be used to process the stripe before that (first).
Once the header information has been read and processed the original file can be reconstructed. This is done in to stages, the first stage is to process the initial stripes and initiate the process. The second stage is to repeat a loop reading and processing data until the end of the component files is reached.
Read stripe 1 (which could be trailer) if we fail then we have corrupt component files
Attempt to read stripe 2 if EOF is encountered then stripe 1 was the trailer (0 length component files which is OK) and the original is a null file, other wise stripe 1 is a data stripe.
Attempt to read stripe 3 and process as above.
Read stripes until the end of the component files is encountered.
Process stripes normally until EOF is reached.
combine files (default is to split)
tell programme there are N files
tell programme the bit rotation value is R
the prefix for the file name is <prefix>, files will be called prefix00n up to prefix255
debug level D
print hashes every 1K
help
Debug levels
0 nothing – default because std err goes to the same as std out by default and std out is used
1 report basic details and key error messages
2 trace progress – lots of output
Autotools were used to facilitate the build process.
Autoscan is used
The main source file, gsecraif.cpp is edited to have: #include "config.h"
configure.ac is edited to have:
AC_INIT([Gsecraif], [0.01], [info@localhost]) AM_INIT_AUTOMAKE(gsecraif, 0.01) AC_CONFIG_SRCDIR([src/gsecraif.cpp]) AC_CONFIG_HEADERS([config.h]) … AC_CONFIG_FILES([doc/Makefile man/Makefile src/Makefile]) AC_OUTPUT
Makefile.am has
AUTOMAKE_OPTIONS = foreign SUBDIRS = src doc man src/Makefile.am # what flags you want to pass to the C compiler & linker CFLAGS = --pedantic -Wall -std=c99 -O2 LDFLAGS = # this lists the binaries to produce, the (non-PHONY, binary) targets bin_PROGRAMS = gsecraif gsecraif_SOURCES = gsecraif.cpp comb_functions.cpp int2asciistr.cpp parity.cpp parse_args.cpp print_help.cpp print_out.cpp process_stripe2.cpp rotate.cpp split_functions.cpp
Installation is achieved by running the configure script followed by make.
Scripts have been provided to facilitate the use of gsecraif and to test gsecraif.
The BASH script splitwiz.sh is a “wizard” to assist with splitting a file. The script prompts for the name of a file to split and then for the various options gsecraif can use. The script then uses the unix utility cat to pipe the specified file in to gsecraif passing the options chosen to gesecraif.
The BASH script tester.sh is a shell script to test gsecraif. The script requires a single parameter which is the name of a file to be used for testing. The file used for testing is not changed by the script or gsecraif. The test file is split in to a number of component files (the test file remains intact). The component files are re-assembled in to a temporary recovery file which is compared with the original. One of the component files is deleted at random and gesecraif is used again to reassemble the recovery file (gsecraif uses parity) which again is compared with the original. The process is repeated with the number of components increasing from three to 255 and each time the rotated bits increasing from zero to seven; every way in which the file can be split by gsecraif is tested.
The script supertester.py is a Python script that generates files, containing random data, of increasing size and calls tester.sh to test gsecraif with each generated file.
This section contains notes on specific parts of the source code.
When combining files a byte is read from each component file to form a stripe if one of the component files is missing then the missing byte is recovered using parity.
Once a stripe has been obtained it needs to be “de-striped” that is the rotation of bytes that occurred when the original file was split needs to be reversed. See notes above.
If a byte was rotated n places to the right it now needs to be rotated n places to the left. Rotating n places to the left is equivalent to rotating SIZE -n places to the right, where SIZE is the size of the array.