Copyright © 2014, 2015 Timothy Evans
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
This document discusses security with gsecraif.
The envisaged uses of gsecraif are data transfer and the storage of data on third party storage systems; these uses are discussed in the user guide. As well as simply splitting files, gsecraif is designed to provide security against data loss and unauthorised access.
This document discusses the two aspects of security, protection against data loss and protection against unauthorised access; it explains the need for this security and how it is implemented in gsecraif.
It should be made clear that gsecraif is primarily intended to protect against data loss; protection against data corruption is implicit rather than explicit as explained below.
With a conventional file splitting tool, it is necessary to have all the component files present in order to reconstruct the original file; if any of the component files is lost or becomes corrupted, the original file cannot be reconstructed in full. Gsecraif can reconstruct an original file even if one of the component files is missing. This implies that if a component file is found to be corrupted, it can be deleted and gsecraif used to reconstruct the original file (or the component file, as explained in the user guide).
Gsecraif is not designed to detect or recover from corrupt component files. It is recommended that appropriate check-sums (e.g. MD5 / SHA-2) are produced for each component file and stored with the corresponding component file.
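The check-sum recommendation above could be followed along these lines. This is a minimal sketch rather than part of gsecraif: the function names and the `.sha256` file suffix are assumptions made for illustration, and SHA-256 (a member of the SHA-2 family) is chosen only as an example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, block_size=65536):
    """Return the SHA-256 hex digest of a file, read in blocks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def write_checksum(path):
    """Store the digest beside the component file as '<name>.sha256' (hypothetical naming)."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha256")
    # "digest, two spaces, file name" is the line layout used by the sha256sum tool.
    sidecar.write_text(f"{sha256_of(path)}  {path.name}\n")
    return sidecar


def verify_checksum(path):
    """Return True if the file still matches the digest recorded beside it."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha256")
    recorded = sidecar.read_text().split()[0]
    return recorded == sha256_of(path)
```

Calling `write_checksum` on each component file after splitting, and `verify_checksum` before reconstruction, would indicate which component (if any) should be discarded and rebuilt.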
If an original file is split into component files using any splitting tool, including gsecraif, it is, of course, possible to keep multiple copies of the components to guard against loss or corruption; however, this can be relatively “expensive” in terms of storage media or where third party storage facilities are used. Gsecraif uses redundancy within the component files to provide security, which means it is not necessary to rely upon multiple copies of component files as the only means of ensuring security (although it may still be desirable to maintain some duplication). The redundancy within the component files is provided using RAID 5 (XOR) technology as explained below.
When gsecraif splits a file into components, it “stripes” the data across the component files in the same way that data is striped across disks in a RAID 5 disk array. Information about RAID disk systems can be found at:
http://en.wikipedia.org/wiki/RAID
Because gsecraif uses the “RAID 5” approach to “striping” data across component files, the minimum number of files into which gsecraif can split an original file is three. Gsecraif reads data from the original file a few bytes at a time, processes the bytes, calculates a parity value using the XOR function, and writes the data to the component files. One of the advantages of this method is its relative simplicity. Other RAID technologies are briefly mentioned below. An important reason for choosing the “RAID 5” approach is that it facilitates providing security against unauthorised access, as discussed below.
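The way XOR parity allows one missing component to be rebuilt can be illustrated with a short sketch. It is deliberately simplified and is not gsecraif's actual file layout: it assumes bytes are dealt round-robin to N-1 data components, keeps the parity in a fixed last component, pads with zero bytes, and performs none of the bit reordering described later. All function names are hypothetical.

```python
from functools import reduce


def xor_bytes(chunks):
    """XOR together equal-length byte strings, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def split_with_parity(data, n):
    """Deal bytes round-robin into n-1 data components plus one parity component."""
    if n < 3:
        raise ValueError("at least three components are needed")
    width = n - 1
    # Pad with zero bytes so the data divides evenly (an assumed padding scheme).
    padded = data + bytes(-len(data) % width)
    components = [padded[i::width] for i in range(width)]
    components.append(xor_bytes(components))  # parity = XOR of all data components
    return components


def rebuild_missing(components):
    """Recreate the single component given as None by XOR-ing the others."""
    present = [c for c in components if c is not None]
    return xor_bytes(present)


parts = split_with_parity(b"hello world!", 3)
damaged = [parts[0], None, parts[2]]          # pretend the second file was lost
assert rebuild_missing(damaged) == parts[1]   # XOR of the survivors restores it
```

Because XOR is its own inverse, the same `rebuild_missing` call works whether the lost component held data or parity.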
As discussed at the beginning of this section, gsecraif is primarily intended to protect against data loss; protection against data corruption is implicit rather than explicit. This design is in keeping with the general Unix philosophy of having specific tools to do specific jobs and using different tools together to achieve a particular task. There are many tools available to check for and guard against data corruption, and this function has been excluded from gsecraif for simplicity.
Gsecraif does have some limited features that can reveal obvious data corruption. By design, gsecraif ensures that all the component files are the same size. The last byte of each component file should have the same value. These two facts can provide obvious checks, however they are not substitutes for the use of other tools. It is strongly recommended that a check-sum is produced for each component file.
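The two obvious checks mentioned above (equal sizes and an identical last byte) could be automated roughly as follows. This is an illustrative sketch with a hypothetical function name; as the text says, it is no substitute for proper check-sums.

```python
import os


def quick_check(paths):
    """Return a list of obvious problems; an empty list means none were found."""
    sizes = set()
    last_bytes = set()
    for path in paths:
        size = os.path.getsize(path)
        sizes.add(size)
        with open(path, "rb") as handle:
            if size > 0:
                handle.seek(-1, os.SEEK_END)  # jump to the final byte
            last_bytes.add(handle.read(1))    # b"" for an empty file
    problems = []
    if len(sizes) > 1:
        problems.append("component files differ in size")
    if len(last_bytes) > 1:
        problems.append("component files end with different byte values")
    return problems
```

An empty result only means that these two cheap checks passed; corruption elsewhere in a file would go unnoticed.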
There is often much discussion about the security aspects of various check-sum / hashing algorithms, however in this context it is probably sufficient to simply have a high degree of confidence that the tool used would reveal any data corruption.
Information on the traditional MD5 can be found at:
http://en.wikipedia.org/wiki/MD5
Information on SHA-2 is available at:
A full discussion of RAID technologies is outside the scope of this document, however some non-RAID 5 technologies are briefly discussed below for reference.
RAID 1 is also known as disk mirroring and involves duplicating data. As discussed above, any file splitting utility can be used to split a file, with duplicate copies of the component files then kept for redundancy. There are numerous hybrid RAID configurations, some of which are vendor specific; one popular configuration combines RAID 5 and RAID 1. The analogy with Gsecraif would be to use Gsecraif to split a file into three or more components and keep two or more copies of each component file.
There are RAID technologies that provide configurations that can tolerate the failure of more than one disk, a popular example of which is RAID 6 (which can tolerate the loss of two disks). There are also technologies that, in theory at least, allow for the failure of arbitrary numbers of disks (such approaches often involve non-trivial mathematics). In principle it would be possible to implement analogous technology in a file splitting tool; a tool could split a file into a number of component files such that the loss of more than one component could be tolerated. However, the technology is significantly more complex and Gsecraif was designed to be as simple as possible.
It should be noted that RAID type technologies have been applied beyond disk redundancy, in areas such as Redundant Arrays of Independent Nodes (RAIN):
http://en.wikipedia.org/wiki/Reliable_array_of_independent_nodes
Security is often associated with protecting against unauthorised access. That is one of the main design goals of Gsecraif and is discussed here. It is important to understand what is meant by protecting against unauthorised access in the context of the envisaged use of Gsecraif.
The envisaged uses of gsecraif are data transfer and the storage of data on third party storage systems; these uses are discussed in the user guide. Gsecraif is not designed to protect against unauthorised access to the original file; anyone who has access to all the component files, or to all except one of them, has complete access to the original file.
Gsecraif is designed to protect against unauthorised access by someone with access to one of the component files. Some security is provided in the case where someone has access to a number of the component files, from one to N-2 (where N is the number of component files), but security diminishes rapidly the more component files someone has access to. To understand why this is important, it is useful to consider any file splitting utility. If a document is split into a number of component files, anyone with access to one or more components will typically have access to a portion of the original document, and thus access to more information than the author may wish. An intelligent reader may be able to infer more information about the original document than the author would wish; it is this problem that Gsecraif is designed to address.
Gsecraif is designed to ensure that anyone with access to one of the component files, obtained from media being used to transfer data or held on third party data storage systems, will find it impractical to infer information about the original file.
Gsecraif does not use encryption. Of course, a file could be encrypted before it is split using Gsecraif, and the component files could also be encrypted. The idea of Gsecraif is to remove the need to rely upon encryption, which would be the usual way of safeguarding data held on transfer media or on third party storage systems.
There are a number of disadvantages of encryption, some of which are briefly described below:
There may be legal restrictions on the use of encryption both in terms of the laws of a particular country and legal issues restricting how a commercial encryption product may be used.
Encryption also introduces potential difficulties for the user in that passwords or keys need to be maintained in a secure manner.
Perhaps the most serious problem with encryption in this context is that once data is entrusted to a third party, they may have unlimited time to crack the encryption and future developments or discoveries may render the encryption weak.
Disk systems store data in blocks of various sizes and RAID 5 disk systems follow this scheme. In principle a file splitting programme could use the same approach. Each component file created by a file splitter could contain blocks of data from the original file. The weakness of this approach from a security perspective is that each component file contains unaltered portions of the original file; in the case of a plain text file, access to a component file would provide access to parts of the original document, analogous to reading a heavily censored document. A reader may be able to infer more information. The problem described here is similar to that of simply splitting a file using a simple splitting tool.
Gsecraif splits an original file on a byte by byte basis. However, simply distributing bytes between the component files could be even less secure than splitting into blocks. The problem is best understood in terms of a simple text file. Although no component file would contain any continuous text, each would contain letters from the words in the document. An aptitude for crosswords may be all that is required to reconstruct portions of the original text.
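The weakness of plain byte distribution is easy to see in a small sketch. The function below is hypothetical and simply deals whole bytes round-robin; it is the naive scheme being warned against, not what gsecraif does.

```python
def deal_bytes(data, count):
    """Deal whole bytes round-robin into `count` pieces (the naive, leaky scheme)."""
    return [data[i::count] for i in range(count)]


pieces = deal_bytes(b"attack at dawn", 2)
# Every piece still consists of readable letters taken from the original words.
assert pieces == [b"atc tdw", b"taka an"]
```

Each piece holds every other letter of the message, which is the kind of material a crossword solver could work from.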
To enhance security, Gsecraif effectively processes an original file as a collection of bits that are distributed between the component files. Each component file contains the same sized portion of bits from the original file, but no component contains any bytes from the original file. This is achieved by reordering the bits of several bytes as the original file is read. Reordering is achieved either by rotating the bits or transposing the bits.
The original file is processed by reading N-1 bytes at a time, where N is the number of component files into which the original is to be split. The N-1 bytes are processed as a string of bits 8 x (N-1) in length, and the bits are shifted a specified number of places (default 3) to the right, with the bits that “fall off” the right re-inserted on the left.
Bit rotation is a well-known, simple, and relatively insecure form of data obfuscation; a well known example of rotation-based obfuscation is ROT-13 ( http://en.wikipedia.org/wiki/Rot-13 ), which rotates letters rather than bits. The difference here is that the processed data is split among several files, and the aim is to provide security against unauthorised access by someone with access to one of the component files. Each byte in a component file contains bits from two bytes in the original file, but as some bits of those two bytes are missing, they cannot be reconstructed from a single component file.
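The rotation step can be sketched as below. This is an illustration only: it assumes the first byte read is the leftmost (most significant) part of the bit string, and the function names are hypothetical rather than taken from gsecraif.

```python
def rotate_right(block, places=3):
    """Rotate the bits of a byte string right; bits falling off re-enter on the left."""
    width = 8 * len(block)
    if width == 0:
        return block
    places %= width
    value = int.from_bytes(block, "big")  # assumed order: first byte is leftmost
    mask = (1 << width) - 1
    rotated = ((value >> places) | (value << (width - places))) & mask
    return rotated.to_bytes(len(block), "big")


def rotate_left(block, places=3):
    """Undo rotate_right by rotating the remaining distance round the bit string."""
    width = 8 * len(block)
    if width == 0:
        return block
    return rotate_right(block, width - (places % width))


# Two bytes, as read when splitting into three components:
# 00000000 00000101 becomes 10100000 00000000 after rotating right by 3.
assert rotate_right(b"\x00\x05") == b"\xa0\x00"
assert rotate_left(rotate_right(b"hi")) == b"hi"
```

After a rotation by 3, the first output byte holds the last three bits of the final input byte followed by the first five bits of the first input byte, so no output byte equals an original byte unless the bit pattern happens to coincide.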
The original file is processed by reading N-1 bytes at a time, where N is the number of component files into which the original is to be split. The N-1 bytes are processed as a matrix of 8 x (N-1) bits.
Numerous references describe the mathematics of matrices and transposition (including http://en.wikipedia.org/wiki/Transpose )
If a file is split into 9 components, then 8 bytes are read at a time. In this case the bytes are treated as an 8x8 bit matrix, and it is easy to see how, following a transposition of this matrix, the data is stored as 8 bytes distributed among the component files (with a parity byte). It is also easy to envisage how this is done for any multiple of 8 bytes. Where the number of bytes is not a multiple of 8, the resulting transposed matrix is stored in a compact form. This is best explained by example. If a file is split into 3 components (the default), then 2 bytes are read at a time. These are treated as a 2 by 8 matrix of bits:
a | b | c | d | e | f | g | h |
i | j | k | l | m | n | o | p |
Each letter represents a bit.
This is transposed:
a | i |
b | j |
c | k |
d | l |
e | m |
f | n |
g | o |
h | p |
The above matrix of bits is stored in compact form in two bytes:
a | i | b | j | c | k | d | l |
e | m | f | n | g | o | h | p |
In the above example each component byte contains four bits from an original byte.
Three bytes would be processed as follows:
a | b | c | d | e | f | g | h |
i | j | k | l | m | n | o | p |
q | r | s | t | u | v | w | x |
Each letter represents a bit.
This is transposed:
a | i | q |
b | j | r |
c | k | s |
d | l | t |
e | m | u |
f | n | v |
g | o | w |
h | p | x |
The above matrix of bits is stored in compact form in three bytes:
a | i | q | b | j | r | c | k |
s | d | l | t | e | m | u | f |
n | v | g | o | w | h | p | x |
In the above example each component byte contains two or three bits from an original byte.
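The transposition and compact storage shown in the two examples above can be sketched as follows. This is an illustration under stated assumptions, not gsecraif's source: bits are taken most significant first, and the function names are hypothetical.

```python
def transpose_block(block):
    """Transpose a len(block) x 8 bit matrix and pack the result row by row."""
    n = len(block)
    # Row r of the input matrix holds the bits of byte r, most significant first.
    bits = [[(byte >> (7 - col)) & 1 for col in range(8)] for byte in block]
    # Reading the input column by column equals reading the transpose row by row.
    stream = [bits[row][col] for col in range(8) for row in range(n)]
    out = bytearray()
    for start in range(0, len(stream), 8):
        value = 0
        for bit in stream[start:start + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def untranspose_block(block):
    """Reverse transpose_block, recovering the original bytes."""
    n = len(block)
    stream = [(byte >> (7 - i)) & 1 for byte in block for i in range(8)]
    out = bytearray()
    for row in range(n):
        value = 0
        for col in range(8):
            value = (value << 1) | stream[col * n + row]
        out.append(value)
    return bytes(out)


# Two-byte case: bits a..h all set and bits i..p all clear give "a i b j c k d l" = 10101010.
assert transpose_block(b"\xff\x00") == b"\xaa\xaa"
assert untranspose_block(transpose_block(b"abc")) == b"abc"
```

With three input bytes where only bits a to h are set, the packed output is 10010010, 01001001, 00100100, which matches the positions of a to h in the three-byte layout above.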
Some general notes on the usage of gsecraif with respect to security are provided here.
For security against unauthorised access, it is important to ensure that the component files are kept separate and to prevent access to more than one component file. If an unauthorised person has access to more than one component file, then they may have access to a portion of the bytes in the original file (from which information could be inferred). If a person has access to two or more “adjacent” component files, then the bits can be rotated (where rotation was used) to regain original bytes.
Access to all the component files or all except one of the component files gives complete access to the original file.
Access to only one component file gives no access to any of the original bytes in the original file; this is the design goal of gsecraif.
As the number of “adjacent” component files someone has access to increases, the portion of bytes in the original file the person has access to increases.
The situation is illustrated in the table below. Each row shows how the percentage of bytes in the original file that a person has access to increases with the number of “adjacent” component files the person has access to, for a file that has been split into a given number of component files.
# files | 1 accessed | 2 accessed | 3 accessed | 4 accessed | 5 accessed | 6 accessed |
---|---|---|---|---|---|---|
3 | 0% | 100% | 100% | N/A | N/A | N/A |
4 | 0% | 16% | 100% | 100% | N/A | N/A |
5 | 0% | 15% | 30% | 100% | 100% | N/A |
6 | 0% | 13% | 26% | 40% | 100% | 100% |
The values above are approximate because an original file might not divide exactly in to the component files and some padding may be used.
From a general security perspective, the smaller the portion of the original file that a person has access to, the less they are likely to be able to infer about the original file. This suggests that splitting a file in to a greater number of component files could offer greater protection against unauthorised access, as each component file represents a smaller portion of the original. However this has to be considered along with practical considerations.
Obviously it is important to protect each component file as much as possible. Component files must be kept separate.
Finally, it should be pointed out that gsecraif can be used in conjunction with encryption. For greatest security, an original file could be encrypted before being split with gsecraif, and then each component file could be encrypted, preferably with a different encryption method.