Introduction
In information theory, the amount of bits required to represent a sequence is referred to as the entropy of that sequence. For example, if we wanted to store the result of a single coin flip, be it heads or tails, we'd only need 1 bit. A value of 1 for heads, and a value of 0 for tails. Subsequently, if we wanted to store the result of two coin flips, we'd need two bits, one to store the outcome of each flip. However, if all we wanted to know was whether a single coin out of the coins we flipped landed on heads, we'd only ever need 1 bit (either one coin landed on heads, or no coins landed on heads).
This general idea of omitting information unnecessary to answer the question being asked can be extended to finding randomly generated strings in a large body of text. Similar to the coin flip example, the higher the probability of an individual sequence occurring (say, at least one coin flip out of 100 flips coming up as tails), the fewer the amount of bits required to represent it. Extending this idea to the English language, by accounting for real world biases towards certain characters, we should be able to identify strings that do not factor in these biases (randomly generated strings).
In this post we'll take a look at the math behind calculating entropy, and then build a simple application that identifies and plots high entropy strings within a large body of text.
Laying the Groundwork
Let's take a look at a short sequence of strings:
aardvark, abadon, abdicate, pvreskzgoqrxyjfsqylc, abide, ability, abject
Which of these strings doesn't belong?
At a glance, it should be clear that pvreskzgoqrxyjfsqylc
is the odd one out, and if we look at the source of each of these sequences, it should be easy to see why. aardvark, abadon, abdicate, abide, ability
and abject
were all pulled from the first page of the nearest dictionary I could find, and pvreskzgoqrxyjfsqylc
was randomly generated using a simple Python function.
|
|
>>> generate_random_string(20)
pvreskzgoqrxyjfsqylc
However, mathematically, what actually makes pvreskzgoqrxyjfsqylc
stand out from the rest of the strings in the set? Humans are exceptional at recognizing strings and sequences (this one holds meaning, this one does not), but a computer has no latent ability to understand that an aardvark
is a small nocturnal mammal that feeds mostly on ants and termites, and that pvreskzgoqrxyjfsqylc
is a meaningless jumble of characters.
Thankfully, there are two readily available heuristics that a computer can understand that we can use to compare aardvark
and pvreskzgoqrxyjfsqylc
without any outside knowledge of the physical world. The first is the length of these two strings.
>>> strings = ["aardvark", "abadon", "abdicate", "pvreskzgoqrxyjfsqylc", "abide", "ability", "abject"]
>>> average_length = sum([len(s) for s in strings]) / len(strings)
>>> average_length
8.571428571428571
At 20 characters long, pvreskzgoqrxyjfsqylc
is more than twice the length of the average string in our set of strings, while aardvark
, at 8 characters long, is about as close to the average length as any string can get. Given that most of the words in the English language are far shorter than 20 characters in length, this dramatic difference is ultimately something we'll want to take into account.
The second heuristic we can use is the probability of each individual character occurring within a given string. In order to determine this, we'll need to build a simple probability distribution table. This is easily accomplished by iterating through our set of strings, and counting how many times each character occurs.
|
|
>>> strings = ["aardvark", "abadon", "abdicate", "pvreskzgoqrxyjfsqylc", "abide", "ability", "abject"]
>>> pd_table = generate_pd_table(strings)
>>> print_pd_table(pd_table)
a: █████████████████████████████████ 16.67%
b: ████████████████ 8.33%
d: █████████████ 6.67%
e: █████████████ 6.67%
i: █████████████ 6.67%
r: █████████████ 6.67%
c: ██████████ 5.00%
t: ██████████ 5.00%
y: ██████████ 5.00%
j: ██████ 3.33%
k: ██████ 3.33%
l: ██████ 3.33%
o: ██████ 3.33%
q: ██████ 3.33%
s: ██████ 3.33%
v: ██████ 3.33%
f: ███ 1.67%
g: ███ 1.67%
n: ███ 1.67%
p: ███ 1.67%
x: ███ 1.67%
z: ███ 1.67%
h: 0.00%
m: 0.00%
u: 0.00%
w: 0.00%
At a glance, it's easy to tell that some letters occur more frequently than others. Given that the majority of strings in our set were all chosen from the English dictionary in alphabetical order, a bias towards vowels like a
and e
, and consonants like b
and d
is to be expected. Additionally, as can be exhibited by the following probability distribution table generated from a relatively comprehensive list of words in the English language, these dramatic differences in probability are not isolated to our comparatively small set of strings.
|
|
>>> strings = ingest_text("english.txt")
>>> pd_table = generate_pd_table(strings)
>>> print_pd_table(pd_table)
e: ███████████████████████ 11.61%
s: █████████████████ 8.91%
i: █████████████████ 8.76%
a: ███████████████ 7.61%
r: ██████████████ 7.38%
n: █████████████ 6.94%
t: █████████████ 6.65%
o: ████████████ 6.14%
l: ██████████ 5.38%
c: ████████ 4.05%
d: ███████ 3.67%
u: ██████ 3.34%
g: █████ 2.88%
p: █████ 2.82%
m: █████ 2.75%
h: ████ 2.22%
b: ███ 1.96%
y: ███ 1.69%
f: ██ 1.37%
v: ██ 1.04%
k: █ 0.87%
w: █ 0.86%
z: 0.44%
x: 0.29%
j: 0.19%
q: 0.18%
With our two heuristics defined, let's take a look at an equation that will combine them to calculate the randomness, or entropy, of a given string.
Calculating Entropy
At the core of the algorithm we're going to develop is a relatively simple equation known as Shannon Entropy.
\[H(x) = -\sum\limits^n_{i=1} P(x_i)\ log_2\ P(x_i)\]
In this equation,
\(H(x)\) is the calculated entropy of a given sequence x.
\(P(x_i)\) is the probability of the \(ith\) value in the sequence \(x\) occurring.
By explicitly taking into account the probability of each individual character occurring, and summing each character across the entire length of the string, Shannon Entropy properly factors in both of our heuristics, and should serve as a decent tool to find randomly generated strings.
|
|
>>> entropy("quux", pd_table)
8013.109717001295
>>> entropy("wysiwyg", pd_table)
2552.1539238359765
>>> entropy("elephant", pd_table)
724.69216099031
>>> entropy("apple", pd_table)
518.3318569640584
>>> entropy("appleapple", pd_table)
1036.6637139281167
As expected, strings with a higher abundance of rarer characters like q
and w
, as well as longer strings (apple
vs appleapple
) result in a higher entropy value. Additionally, a discerning eye has no doubt already noticed a slight difference between our implementation and the equation for Shannon Entropy defined above. In our implementation, we're using the inverse of \(P(x_i)\) rather than \(P(x_i)\) itself; this simply inverts the range of our outputs so that lower entropy strings tend towards zero, and higher entropy strings tend towards infinity. Ultimately, this is a relatively cosmetic change.
Using Entropy To Find Randomly Generated Strings
Let's try to use our new algorithm to solve an actual problem that all software engineers have likely faced at some point in their career: randomly generated secrets such as SSH keys ending up in source control. If you've never personally encountered this issue, then all this may seem like a lot of work to solve a manufactured problem, but a quick stroll through Github will reveal more than a handful of popular projects that aim to solve this very problem.
Let's don our deerstalker caps and tobacco pipes and see if we can use our algorithm to find randomly generated secrets that have been maliciously hidden within a copy of The Adventures of Sherlock Holmes from The Gutenberg Library. Within this book, I've hidden the following lines of text:
SECRET = pvreskzgoqrxyjfsqylc
API_KEY = yyczyxcycyqyzxycqyzxc
For the sake of simplicity, I've taken the liberty of converting all of the text to lowercase, and removing all forms of punctuation, special characters and numbers. However, accounting for these different characters isn't particularly difficult, and is something you should certainly do if you seek to take this code from the theoretical to the practical. Real randomly generated secrets are nearly certain to contain a mixture of lowercase, uppercase and numerical characters.
|
|
>>> text = ingest_text("english.txt")
>>> pd_table = generate_pd_table(text)
>>> text = ingest_text("sherlock_holmes.txt")
>>> entropies = generate_entropies(text, pd_table)
>>> print_highest_entropies(entropies, 10)
1 | 27337.20 | yyczyxcycyqyzxycqyzxc | [9197]
2 | 22826.51 | pvreskzgoqrxyjfsqylc | [8675]
3 | 8244.75 | exquisite | [9137]
4 | 7148.93 | squeezed | [4703]
5 | 7057.88 | jumpingwhich | [9720]
6 | 6968.12 | jabez | [1152, 1195, 1227, 1266, 1401, 1489, 1600, 1704, 2116]
7 | 6796.49 | jezail | [8803]
8 | 6479.26 | quickly | [4524, 4956, 5500, 9873, 10972]
9 | 6289.97 | disqualify | [1396]
10 | 6252.42 | jewellery | [9906]
Our algorithm managed to not only pick out our randomly generated secrets with extremely high confidence, but also returned what lines these secrets were found on so tracking down the offending strings and removing them is a breeze. With a little bit of additional work, we could easily extend our algorithm to examine entire directories, or even entire git repos.
On that note, if you're considering using the approach to find secrets in your own repos, it's extremely important that you make sure that the source you use to build you probability distribution table is something that actually resembles the files you're ultimately going to be scanning. A dictionary containing all of the words in the English language is an excellent source to build our probability distribution table when our goal is to scan a book written in English, but a rather poor source if we want to scan source code. For one, there are all sorts of special symbols and characters that appear in programming languages that are going to be totally absent from an English dictionary, such as _
, %
and *
. Of the characters that are present in both an English dictionary and a source code file, the frequency of those characters are liable to be very different.
Visualizing Entropy
Lets see if we can generate some neat visuals by programatically building a heatmap that illustrates the entropies of the words in our text. Doing so should be easy enough, we just need to normalize the values of our entropies between 0 and 1 (0 being the smallest entropy we found, and 1 being the largest), and then color the pixels in an image corresponding to the characters in our text. In order to accomplish this, I used the Python Imaging Library Pillow. In regards to color choice, a simple black and white gradient should do the trick (white for low entropy, black for high entropy), but we can also make things a bit prettier by mapping individuals colors to set intervals instead.
|
|
To test our heatmap generator, I spliced some randomly generated strings into a brief excerpt taken from The Adventures of Sherlock Holmes.
He was, I take it, the most perfect reasoning and observing machine that
the world has seen, but as a lover he would have placed himself in a
false position. He never spoke of the softer passions, save with a gibe
and a sneer. They were admirable things for the observer—excellent for
drawing the veil from men’s motives and actions. But for the trained
reasoner to admit such intrusions into his own delicate and finely
adjusted temperament was to introduce a distracting factor which might
throw a doubt upon all his mental results. Grit in a sensitive
instrument, or a crack in one of his own high-power lenses, would not
be more disturbing than a strong emotion in a nature such as his. And
yet there was but one woman to him, and that woman was the late Irene
Adler, of dubious and questionable memory.
rujodkyurynvfcvaxgnr iplypzttmliqbgxsccey aaedwtrwpvkivdivbfgq
zqepfdlrswfwhefknkyr zubewlwdfjtvwkesxkcz midgnlpvlqdudjshabpx
rxqhhvnlxsntexvvdugw cirtinqkpdzmpvmbfywg
I had seen little of Holmes lately. My marriage had drifted us away
from each other. My own complete happiness, and the home-centred
interests which rise up around the man who first finds himself master
of his own establishment, were sufficient to absorb all my attention,
while Holmes, who loathed every form of society with his whole Bohemian
of the singular tragedy of the Atkinson brothers at Trincomalee, and
finally of the mission which he had accomplished so delicately and
successfully for the reigning family of Holland. Beyond these signs of
his activity, however, which I merely shared with all the readers of
the daily press, I knew little of my former friend and companion.
oboyanehimthdhhyftvl uesabsvwhkchqxdbxnpz lqltjyknqrdhywvjotkl
forvccrgqslozjlynjkg uwmeeymfzazfngprabow nfydgurtxdrbwfbodkom
One night—it was on the twentieth of March, 1888—I was returning from a
journey to a patient (for I had now returned to civil practice), when
my way led me through Baker Street. As I passed the well-remembered
door, which must always be associated in my mind with my wooing, and
with the dark incidents of the Study in Scarlet, I was seized with a
keen desire to see Holmes again, and to know how he was employing his
extraordinary powers. His rooms were brilliantly lit, and, even as I
looked up, I saw his tall, spare figure pass twice in a dark silhouette
against the blind. He was pacing the room swiftly, eagerly, with his
head sunk upon his chest and his hands clasped behind him.
>>> dictionary = ingest_text("english.txt")
>>> pd_table = generate_pd_table(dictionary)
>>> text = ingest_text("sherlock_holmes_abridged.txt")
>>> entropies = generate_entropies(text, pd_table)
>>> generate_heatmap("heatmap.png", text, entropies)
A black and white heatmap generated using a linear range of values from 0 to 1.
A full color heatmap generated using a linear range of values from 0 to 1, but broken up into 10 distinct color intervals.
It should be easy to immediately point out where the high entropy blocks of text are located.