Cracking Gsecraif

Timothy Evans

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".


Table of Contents

Introduction
Example 1 - Rotation -r 0
Example 2 - Rotation -r 3
Frequency Analysis
Further analysis
Conclusion
A. Example text
B. Partial Recovered text
C. GNU Free Documentation License

Introduction

The security document describes how gsecraif provides security both in terms of preventing data loss through RAID 5 techniques and in terms of preventing unauthorised access through obfuscation. Preventing unauthorised access is achieved without encryption by using bit rotation or transposition. To better understand how gsecraif provides security and perhaps more importantly to better understand it's limitations, the process of inferring information from a single component file is considered in this document. This is precisely the process gsecraif is designed to guard against.

The process of inferring information about an original file from one of the component files is considered using an example file. The utility gsecraif can split an original file in to several components; there can be between three and two hundred and fifty five components. Obviously the more components a file is spit in to the smaller is the portion of the original file that each component represents. However it is envisaged that a file will often be split in to three components and this is the example used in this document. Gsecraif can of course split any kind of file and usually access to a single component file would provided no information initially about what kind of file the original was (unless there was some clue from the component file name (e.g. split005.doc) . For this example, the original file is a plain ASCII text file. This provides a good example for illustration purposes; the principals would apply to other types of file, but other file types may be more challenging to process.

Here an example file is split in to three component files (split000 split001 and split002) and the process of trying to infer information about the original file from one of the components is considered. The situation would be the same for any of the three component files, in this document the middle component, split001 is used as an example.

Two situations are considered. Gsecraif can take a parameter of -r to specify the number of bits (0 to 7) that are used for rotation. For the first example the rotation value for -r is set to 0 which is not recommended (as will be illustrated below). In the second example the rotation value for -r is set to 3 which makes inferring information about the original file more difficult.

Gsecraif can also reorder the bits of an original file by transposition (this is specified with the -t parameter). The process of inferring information from a component file, where transposition has been used, is left as an exercise for the interested reader.

Example 1 - Rotation -r 0

In this example the original file is split using the parameter -r 0 This is not recommended and would not normally be used, however there may be special circumstances when this is required. This sets the rotation to zero bits and thus instructs gsecraif not to carry out any bit rotation.

In this example the bytes from the original file along with the generated parity bytes are distributed among the component files. This is illustrated below:

Split000 Split001 Split002
C h
p a
t e
r space
new line I
new line M
y space
a f
t h
e r
' s
space f
a m
l i
y space
n a
e m
space b
e I
g n

⊕ = parity

If any of the component files is examined (perhaps using a hex editor) it will be seen to contain bytes from the original file. Here the middle component file is considered. The first ten bytes are shown in the table below:

  Byte hex value Comment ASCII character
header 01 File format version
03 Number of component files
01 Component file id
00 Bit rotation value
data 68 h
11 Parity value
74 t
20 space character
43 Parity value
0A new line

How the first few bytes of the component file relate to the original file is shown below. The bytes present in the component file are highlighted and those missing are shown struck out.

C h ap t er I

My fa t he r 's fa m il y n a me be i ng Pi r ri p , a nd my Ch r is i ta n n a me Ph i li p ,

It is apparent that no special decryption techniques are required to infer information about the original file; all that may be required is an aptitude for crosswords. The process of inferring information about the original file could be automated using a variety of methods; for example a dictionary file could be used and (treating the missing bytes as “wild card” characters) each word could be tested in successive positions in the file.

To assist the process of inferring missing data, the missing component files could be replaced with component files consisting of appropriate header and trailer information and filled with bytes with value 0. gsecraif could then be used to combine the files to produce a version of the original file to work with. An example of this is shown in the Appendix Appendix B, Partial Recovered text .

It is worth noting that in this example inferring information about the original file is achieved using one of the component files, knowledge that the original file is English text and a knowledge of English and perhaps items such as a dictionary file. There may be several inputs in to the process and there is no violation of information theory; no information is conjured in to existence from nothing. If there were errors such as incorrect characters in the original file that were not saved in the component file, it is difficult to imagine how the incorrect characters might be inferred.

This example uses English text, but illustrates the principals that would apply to other types of file, if perhaps with more difficulty.

Example 2 - Rotation -r 3

It is recommended that gsecraif is used with a non-zero rotation value and this is the default. In this example the rotation value is set to 3. Each component file consists of bytes made up of five bits from one of the original bytes and three bits from another.

If any of the component files is examined (perhaps using a hex editor) it will be seen that none of the bytes directly correspond to bytes from the original file. Here the middle component file is considered. The first ten bytes are shown in the table below:

  Byte hex value Comment ASCII character
header 01 File format version
03 Number of component files
01 Component file id
00 Bit rotation value
data 6D m
22 Parity value
AE  
44 D
68 Parity value
A1  

The header entries are the same as before, but the data values are different. The first few data values are also shown in the first column of the table below Table 1, “split001 rotation -r 3” .

As explained in other documentation the bytes are read from the original file two at a time in this case and bit rotated three places to the right:

  Byte 1 Byte 2
Original bytes abcdefgh ijklmnop
Bit rotated bytes nopabcde fghijklm

Each lower case letter represents a bit (1 or 0).

In this example only one component file is available which corresponds to only having one of the bit rotated bytes above. The bits can be rotated back, but this will leave bits in the original bytes undefined. This is illustrated below with byte 2 available and byte 1 missing:

  Byte 1 Byte 2
Bit rotated bytes   fghijklm
Original bytes ?????fgh ijklm???

The second original byte has fewer missing bits and will be considered here. As there are three missing bits there are eight combinations for the possible bits (000, 001, 010, 011, 100, 101, 110, 111). The missing bits come from other bytes and so there is no correspondence between the missing bits in one byte and the missing bits in another.

If a particular set of bits (such as 000) is considered then this will be correct on average about one time in eight, perhaps depending upon the nature of the file, but instances where this is correct will be randomly distributed. If files consisting of appropriate header and trailer information and filled with bytes with value 0 are created as mentioned above, these could be used to assist with constructing a version of the original file. Gsecraif could be used to combine the files to produce a version of the original. In this case the bytes obtained from component file 1 (with the missing bits set to zero) are shown in the second column headed “000” in the table below Table 1, “split001 rotation -r 3” . Each cell contains the hex value of the byte, followed by the character it represents (where possible) and a third number that is explained later. Where the byte value is correct (the missing bits are 000) the character is highlighted.

The values in the second column are only one of eight possibilities and the other possibilities are shown in the other columns headed with the possible bit combinations in the table below Table 1, “split001 rotation -r 3” .

Unlike the situation with no bit rotation where bytes from the original file were recovered, each byte is now replaced with eight possibilities. This makes inferring information about the original file more difficult, however further analysis is possible.

Table 1. split001 rotation -r 3

split001 000 001 010 011 100 101 110 111
6D


68
h
6.1


69
i
7.0


6A
j
0.2


6B
k
0.8


6C
l
4.0


6D
m
2.4


6E
n
6.7


6F 
o
7.5

22
AE


70
p
1.9


71
q
0.1


72
r
6.0


73
s
6.3


74
t
9.1


75
u
2.8


76
v
1.0


77
w
2.4

44


20
space

68

⊕ = parity

Frequency Analysis

Frequency analysis can be used to help infer information about the original file from the possible bytes provided by a single component file. A full discussion of frequency analysis is beyond the scope of this document; further information about the subject is available from many sources including Wikipedia .

One of many sources of information about frequency analysis is: http://www.counton.org/explorer/codebreaking/frequency-analysis.php

This provides the following table showing the relative percentages with which each letter is found in typical English text.

Table 2. Percentages with which letters occur in English text

A B C D E F G H I J K L M
8.1 1.5 2.8 4.3 12.7 2.2 2.0 6.1 7.0 0.2 0.8 4.0 2.4
N O P Q R S T U V W X Y Z
6.7 7.5 1.9 0.1 6.0 6.3 9.1 2.8 1.0 2.4 0.2 2.0 0.1

Using this (or similar) information it is possible to identify which of the possible eight values above is more likely to be correct.

In the above table, Table 1, “split001 rotation -r 3” the third value in each cell is the percentage value taken from the table above. The highest value (i.e. the most likely value) is shown in italic and the number highlighted. Some of the cells contain values that are both correct and the most likely from frequency analysis.

Further analysis

So far the bytes with three missing bits have been considered. Further information could be inferred from the bytes with five bits missing, but this would be more difficult. It may even be possible to infer some information from the parity bytes; anyone determined to infer as much as possible would use all information available.

Conclusion

It is hoped that having considered the process of trying to infer information about an original file from one of the component files, the reader has gained a better understanding of what gsecraif can do and it's limitations. It should be apparent why using gsecraif without reordering the bits either by rotation or transposition is not generally recommended and the above examples should serve to emphasis that gsecraif is not a panacea for all security requirements. However it should be apparent that gsecraif can provide useful data obfuscation without encryption.

In considering the above examples it should be borne in mind that the fact that the example original text was known to be English text and a knowledge of English would be used in inferring information about the original. It is envisaged that gsecraif would not often be used to split plain ascii text files; it is a general purpose splitting tool that could be used to split anything from audio files to compressed backup files.

The disadvantage of simply relying upon encryption, is that it is an “all or nothing” arrangement. Data is completely secure unless the encryption is broken at which point it is completely compromised. Often it is not the encryption mathematics that is “cracked” but a vulnerability in the implementation that is exploited.

If gsecraif is used to split a file in to three component files then someone with access to one of the components has access to one third of the original bits and that will never change. If the original file is split in to more components then the portion decreases. The missing bits from the other components can not be conjured in to existence, however although missing information can not be recovered, it is possible to infer information about an original file from a component. However this will always be difficult.

If a high level of security is required then gsecraif can be used in conjunction with encryption where this is permitted. As explained in the security document the original file and the components could be encrypted using different encryption techniques.