Copyright © 2014, 2015 Timothy Evans
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
Table of Contents
This document describes the design and provides notes on a program to split an original file in to three or more equal sized components and to reassemble the components in to the original file using RAID technology.
This document should be read in conjunction with the source code.
The program will use RAID 5 (XOR) technology to protect data against loss or unauthorized access. There are many possible uses for the programme one example would be to split a file and store each of the components on different third party data storage facilities. In this example RAID 5 technology would ensure the original file could still be recovered in the event of one of the third party storage facilities being unavailable. In this example security is improved by ensuring that none of the third party data storage facilities holds sufficient data to re-construct the original file.
The file to be split in to three or more component files.
One of three or more files in to which the original file is split.
Part of the original file; an array of N bytes read from the original file, which is to be split in to N+1 component files. The array may be rotated.
An array of M bytes consisting of M -1 consecutive bytes read from the original file and one parity byte, where M is the number of component files. The array is constructed by reading one byte from each component file.
Each of the components will be the same size, and hold the same portion of data and parity as each other. Each component will be as redundant as the others; no component will be any more significant than any other.
Because the original file may not divide in to the number of component files required, some of the components will be padded with random data to ensure they are all the same size. The trailer section will contain information about the padding and the programme will anticipate this.
The components will begin with a header section containing information about the component and it's relationship to the other components and original file. A future requirement may be to provide options to nullify this information in which case the user will need to maintain this information separately in order to reconstruct the original file.
The programme will work in a Unix like environment using a command line interface. It is assumed that the original file will be made available on standard in and when reconstructed from the components the original file will be provided on standard out.
The program will be designed to process the original file or components in a single pass. Files can be piped in to or out of gsecraif.
Component files will be processed “in parallel” i.e. non sequentially. All the component files will be opened “simultaneously” and data will be read or written to the the files “simultaneously”. This is an inevitable requirement if arbitrarily large files are to be processed in a single pass.
For added security, before calculating parity, bytes from the original file will be processed as a string of bits that will be re-ordered, thus ensuring that none of the components contain any of the original data. During re-assembly, the data from the components will also be processed as a string of bits and the re-ordering process will be reversed.
The programme will limit the number of component files to 255 this simplifies the design and facilitates testing by avoiding the possibility of a very large number of component files.
The programme could be used in a cascading manner with a file split in to a number of temporary files which are then further split using the programme a second time. There would be no need to retain the temporary files and reconstruction would be done by the two stages above in reverse. The number of component files created is only limited by time and system resources.
As well as increasing the number of component files, cascading could also be used to increase redundancy, for example nine files created in a single pass could survive the loss of a single file where as nine files consisting of three groups of three created from three temporary files could survive the loss of any three files from nine.
The default method used to re-order the bits is by bit rotation.
Before calculating parity, the bytes will be bit rotated.
In the example below the three bytes are rotated 3 bits
Each bit is represented by a letter (with a value of 1 or 0)
Before rotation:
abcdefgh ijklmnop qrstuvwx
After rotations:
defghijk lmnopqrs tuvwxabc
Each new byte contains bits from two of the original bytes.
The rotating can be accomplished in numerous ways. In this example the bytes are copied to two temporary arrays.
First each byte is right shifted 5 bits and the result goes one byte to the left, except for the first byte with goes to the last byte:
original |
abcdefgh
|
ijklmnop
|
qrstuvwx
|
temp1 |
00000ijk
|
00000qrs
|
00000abc
|
Each byte is then left shifted 3 bits
original |
abcdefgh
|
ijklmnop
|
qrstuvwx
|
temp2 |
defgh000
|
lmnop000
|
tuvwx000
|
The results are ORed together to provide the desired result.
temp1 |
00000ijk
|
00000qrs
|
00000abc
|
temp2 |
defgh000
|
lmnop000
|
tuvwx000
|
temp1 OR temp2 |
defghijk
|
lmnopqrs
|
tuvwxabc
|
To Reverse the above example
Each byte is left shifted 5 bits and the result goes one byte to the right, except for the last byte which goes to the first byte
Each byte is then right shifted 3 bits
The results are ORed together to provide the desired result.
There is no point in rotating a multiple of 8 bits; this would simply be byte swapping. Similarly there is no point rotating more than 7 bits as this would be equivalent to byte swapping and rotating 7 or less bits.
An alternative method of re-ordering the bits is bit transposition.
Bytes from the original file are processed as an 8 x N bit matrix. The matrix is transposed and stored in a compact form before being stored in the component files.
Parity is calculated by successive XOR operations
Note that XOR is both an associative and a commutative operation. http://en.wikipedia.org/wiki/XOR
Byte number | Parity |
1 | Set value of parity to that of byte 1 |
2 | Parity is parity xor byte 2 |
3 | P = P xor byte 3 |
4 | P = P xor byte 4 |
5 | P = P xor byte 5 |
N | P = P xor byte N |
The diagram below shows xor “triangles” (any two points on the triangle are xor'ed to produce the third):
1 |\ | \ 2--p2 /| / | 3--p3 /| / | 4--p4 /| / | 5--p5
Data is be striped across the files to spread the parity and data evenly.
In the example below data is read from five files. Each successive byte is represented by successive letters and the parity byte by the symbol ⊕
A | B | C | D | ⊕ |
F | G | H | ⊕ | E |
K | L | ⊕ | I | J |
P | ⊕ | M | N | O |
⊕ | Q | R | S | T |
U | V | W | X | ⊕ |
When the programme is combining component files to produce the original, the stripes are read from the component files as above and then “de-striped” Eg:
A | B | C | D | ⊕ |
E | F | G | H | ⊕ |
I | I | K | L | ⊕ |
M | N | O | P | ⊕ |
Q | R | S | T | ⊕ |
U | V | W | X | ⊕ |
The header contains the following items of information:
version number (I.e version of file format) 1 byte
number of component files, 1 byte
identity of the component file I.e which file (number) this is, 1 byte
bit rotation value (how many bits were rotated) 1 byte
The trailer contains a value indicating the number of padding bytes added to the original file in order to make the component files the same size. The trailer is 1 byte. Each component file needs to have the same trailer reporting the number of padding bytes; any given component file may be missing so it is not possible to rely upon a scheme where each component file reports the padding in just that component file.
Stripes are constructed by reading one byte from each of the component files.
Although the last byte of each component file should be identical as described above, it is not possible to use that fact to determine the end of the component files. The original file may contain a succession of identical bytes which would result in identical bytes in the component files; although the parity byte would be different, the component file containing the parity byte may be missing, therefore a stripe containing identical bytes does not necessarily indicate the end of the component files. The only way to determine the end of the component files is to try to read beyond the end and obtain the EOF status. This affects the way in which the trailer must be anticipated and the way data is read and processed.
When reconstructing the original file three stripes are held in memory. When the first stripe is read it is not possible to determine if it the last. Similarly when the second stripe is read it is not possible to determine if it the last and therefore the trailer and the preceding stripe can not be processed because the manner in which it is processed depends upon whether it precedes the trailer. When the third stripe is read, it may or may not be the trailer, however the preceding (second) stripe is not the trailer and so the stripe before that (first) can be processed in the normal way.
When an attempt to read a stripe fails, the end of the component files has been reached. In this case, the preceding (second) stripe was the trailer and this information can be used to process the stripe before that (first).
Once the header information has been read and processed the original file can be reconstructed. This is done in to stages, the first stage is to process the initial stripes and initiate the process. The second stage is to repeat a loop reading and processing data until the end of the component files is reached.
Read stripe 1 (which could be trailer) if we fail then we have corrupt component files
Attempt to read stripe 2 if EOF is encountered then stripe 1 was the trailer (0 length component files which is OK) and the original is a null file, other wise stripe 1 is a data stripe.
Attempt to read stripe 3 and process as above.
Read stripes until the end of the component files is encountered.
Process stripes normally until EOF is reached.
combine files (default is to split)
tell programme there are N files
tell programme the bit rotation value is R
tell programme to transpose the bits
the prefix for the file name is <prefix>, files will be called prefix00n up to prefix255
debug level D
print hashes every 1K
help
Debug levels
0 nothing – default because std err goes to the same as std out by default and std out is used
1 report basic details and key error messages
2 trace progress – lots of output
Autotools were used to facilitate the build process.
Autoscan is used
The main source file, gsecraif.cpp is edited to have: #include "config.h"
configure.ac is edited to have:
AC_INIT([Gsecraif], [0.01], [info@localhost]) AM_INIT_AUTOMAKE(gsecraif, 0.01) AC_CONFIG_SRCDIR([src/gsecraif.cpp]) AC_CONFIG_HEADERS([config.h]) … AC_CONFIG_FILES([doc/Makefile man/Makefile src/Makefile]) AC_OUTPUT
Makefile.am has
AUTOMAKE_OPTIONS = foreign SUBDIRS = src doc man src/Makefile.am # what flags you want to pass to the C compiler & linker CFLAGS = --pedantic -Wall -std=c99 -O2 LDFLAGS = # this lists the binaries to produce, the (non-PHONY, binary) targets bin_PROGRAMS = gsecraif gsecraif_SOURCES = gsecraif.cpp comb_functions.cpp int2asciistr.cpp parity.cpp parse_args.cpp print_help.cpp print_out.cpp process_stripe2.cpp rotate.cpp split_functions.cpp
Installation is achieved by running the configure script followed by make.
Scripts have been provided to facilitate the use of gsecraif and to test gsecraif.
The BASH script splitwiz.sh is a “wizard” to assist with splitting a file. The script prompts for the name of a file to split and then for the various options gsecraif can use. The script then uses the unix utility cat to pipe the specified file in to gsecraif passing the options chosen to gesecraif.
The BASH script tester.sh is a shell script to test gsecraif. The script requires a single parameter which is the name of a file to be used for testing. The file used for testing is not changed by the script or gsecraif. The test file is split in to a number of component files (the test file remains intact). The component files are re-assembled in to a temporary recovery file which is compared with the original. One of the component files is deleted at random and gesecraif is used again to reassemble the recovery file (gsecraif uses parity) which again is compared with the original. The process is repeated with the number of components increasing from three to 255 and each time the rotated bits increasing from zero to seven; every way in which the file can be split by gsecraif is tested.
The script supertester.py is a Python script that generates files, containing random data, of increasing size and calls tester.sh to test gsecraif with each generated file.
This section contains notes on specific parts of the source code.
When combining files a byte is read from each component file to form a stripe if one of the component files is missing then the missing byte is recovered using parity.
Once a stripe has been obtained it needs to be “de-striped” that is the rotation of bytes that occurred when the original file was split needs to be reversed. See notes above.
If a byte was rotated n places to the right it now needs to be rotated n places to the left. Rotating n places to the left is equivalent to rotating SIZE -n places to the right, where SIZE is the size of the array.
For gsecraif it can be useful to regard data as a matrix of bits. Within C++ there is the bitset class, that, like the array class, has a fixed size. A class of Bitmatrix has been defined that does not have a fixed size; the size can be established at run time.
The smallest unit of storage is the byte (8 bits), so a means must be found of storing the bits in a bit matrix within a vector of bytes. A structure also needs to record the number of rows and columns in the bit matrix.
There are many possible ways in which a bit matrix could be implemented, three are considered here. In each of the schemes below the number of rows and columns are stored as integers within a structure.
In this scheme a matrix of bytes is established. The number of rows of bytes is equal to the number of rows of bits; the number of columns of bytes is sufficient to hold the columns of bits (number of bit columns divided by 8 rounded up to the nearest multiple of 8). In this scheme a 3 x 13 bit matrix would have the bits stored in a 3 x 2 byte matirx e.g.
000xxxxx xxxxxxxx 000xxxxx xxxxxxxx 000xxxxx xxxxxxxx
The first three bits in this example are undefined but in practice would probably be set to 0. The recording of the number of bit columns as an integer can be used to construct an appropriate mask to process the bits.
This scheme is the same as the one above, except that the least significant bits of the least significant bytes in a row hold any surplus undefined bits. The example above would be held as:
xxxxxxxx xxxxx000 xxxxxxxx xxxxx000 xxxxxxxx xxxxx000
This scheme is used within the Bitmatrix class, together with the scheme below:
This scheme stores the bits in sequence in a vector of bytes. If the number of bits is not a multiple of 8 then any surplus undefined bits occur as the least significant bits of the last byte. This is illustrated by a 3 x 9 matrix of bits:
The matrix is:
@abcdefgh ijklmnopq rstuvwxyz
This is stored in compact form as:
@abcdefg hijklmno pqrstuvw xyz00000
Information about the number of rows and columns can be used to convert between storage forms.