WARNING: This rant isn't organized at all, just enjoy. I'm not necessarily writing this for the reader's understanding, more so that I can just dump everything out of my head like
a child does with a piggy bank flipped over to dump out the coins. Enjoy :)
The original idea came from "Haha, I love it when these numbers spell this word" or "Haha, my PIN number (fun fact: Personal Identification Number number) says this phrase"
(my favorite being the KFCJEW meme, where a randomly generated code happened to spell exactly that), which settled into the general idea of finding funny substrings within
random alphabetic codes. The first experiment with this was actually a non-published Scratch project, which generated a random string of monocase letters and repeated that until
it found one that contained the target word within it. Very quickly I noticed that longer targets take significantly longer to find. Yes, I know how probability works... ish.
Of course, requiring more and more events to all come out a certain way gets dramatically less likely with every event you add, but that was the initial learning. Soon after,
I experimented further in Python and C#, making a pretty simple C# program to explore the concept. It generated the random strings using RNGCryptoServiceProvider, for the sake
of high-quality randomness. The initial program did not work very well. What it did first was run a simple while loop: it took a target string as input, generated a random
string, and looped until it found one containing the target. It also kept track of how long that took. Later I tried repeating it 100 times, to average out the testing and such.
The issue with this, though, was that I forgot to reset the count each loop, which led to wildly incorrect numbers. A few months later (the same day I am writing this, in fact),
I revisited it. About the first thing I did was parallelize it, using Parallel.For(). I dealt a bit with things like that, with parallelization, and I got MEGA into statistics.
I don't know anything about statistics, but because this is randomness, that sort of thing started to creep in; I believe it is destined anyway when you start messing with randomness.
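To give a rough idea of what the core loop does, here is a little sketch of it. It's Python rather than the actual C# (the real thing uses RNGCryptoServiceProvider), so treat it as an approximation of the idea rather than the real code:

    import secrets
    import string

    ALPHABET = string.ascii_uppercase  # monocase letters, like the original

    def attempts_until_found(target, length=20):
        # generate random strings until one contains the target, and report how many it took
        attempts = 0
        while True:
            attempts += 1
            candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
            if target in candidate:
                return attempts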
So, I went through a much deeper analysis of this program. In short, because I am somewhat lazy in terms of writing, I added a lot of info printing, and I fixed a few simple
errors. And then I added Python to it. There is plot.py, which graphs the statistics data with matplotlib and does some processing on it, specifically giving the mean, median,
90th percentile, and standard deviation. I'm not entirely sure how to interpret most of this, as I do not have a statistics education; I just like numbers. One thing I can
interpret pretty well, though, is graphs. The important thing I have done here is graph this data in matplotlib, which has been fascinating.
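For the curious, the processing side of plot.py boils down to something like this (a sketch of the idea, with the data filename made up, not the exact script):

    import numpy as np

    # one attempt count per trial, as written out by the C# program
    data = np.loadtxt("results.txt")  # the filename here is just an example

    print("mean:              ", np.mean(data))
    print("median:            ", np.median(data))
    print("90th percentile:   ", np.percentile(data, 90))
    print("standard deviation:", np.std(data))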
The first graphs produced were of very bad quality, with many code errors and such. Specifically, one thing I did was analyze the data sorted from lowest to highest, and then
analyze it chronologically. Since it was full of errors, you, the reader, can probably guess that this retelling of events is not chronological. It all happened in a blur, as programming tends to do.
However, first I fixed the issue that all threads in this for loop were accessing a single global variable counting attempts. This is bad, of course, as it massively throws off
the numbers and their accuracy, so I fixed it. Another important thing: initially, each thread ended once it successfully found a random string of 20 characters (by the way,
that matches the original attempt as well; all tests were done with a 20-character string) containing the specified substring, which was supplied as a terminal argument. This
made sense, but it massively skewed the data toward the exponential. Sorting the attempt counts per thread by amount, the graph climbed higher and higher, because the later a
thread finished, the more attempts it had gone through, and the fewer threads were left actually performing the operation. This is bad because it does not accurately represent
the data, and so on. As well, graphing the attempt data chronologically showed a graph that was erratic and random, as such a thing should be, but with an average that steadily
increased as a function of time, which is not how this should work. The second issue I fixed was also pretty simple: each thread did not reset its count upon successfully
finding a string. This wouldn't be an issue if each thread quit once it found one anyway, but they do not quit anymore, so it is an issue: each thread can count across more than
one find, which also throws off the results. Before fixing this, the graph was even more exponential, which is not great for data analysis, although I do not remember the first
results through matplotlib before fixing the errors. So, hopefully, the code is good now. I could optimize this further at some point, but I am not counting time anyway; I know it is a slightly unreliable stat, or so I believe.
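If it helps, the shape of the fix is simply that every trial owns its own counter instead of all of them sharing one, and that counter starts from zero each time. In the C# version that means the count is no longer one shared global inside the Parallel.For loop; sketched in Python (so again not the actual code, and the target string here is just an example), the idea is:

    import secrets
    import string
    from concurrent.futures import ProcessPoolExecutor

    ALPHABET = string.ascii_uppercase

    def run_trial(target, length=20):
        # the counter is local to this one trial, so nothing is shared between
        # trials and the count effectively resets to zero every time
        attempts = 0
        while True:
            attempts += 1
            candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
            if target in candidate:
                return attempts

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            # 1000 independent trials, one attempt count recorded per trial
            counts = list(pool.map(run_trial, ["KFC"] * 1000))
        print(sorted(counts))

Every trial runs to completion and gets recorded, so the earlier skew from threads finishing and dropping out over time cannot happen either.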
So, then, I went on to more or less analyze the data.
Again, I used matplotlib to render the graph for looking at trends and stuff, in a simple Python script that reads the data from a file output by the C# code, optionally
processes it (handling stuff like sorting, derivatives, etc.), and graphs it. That's roughly the pipeline there, pretty simple, you've heard the deal. Another thing now: graphs.
I haven't necessarily prepared graphs, as I should have, because I did not necessarily think I'd be showing them to many people, but I have descriptions and results. If people
want, I can publish the source code for this whole project, but for now, I don't care. One thing that should be mentioned here: I don't necessarily understand the math behind my
results. I do, however, understand my results, ish. After finally fixing everything, I started experimenting more with matplotlib and this randomness, and such. I first analyzed
the values over time, which produced a relatively uninteresting graph, where everything was basically stable. There were some small outliers, some spikes in the data, which made
some sense, but I still wanted to explore it further, because I wondered if there was still some major issue with how the attempt data was handled. As far as I know, there
isn't. Such outliers are incredibly rare; even in a run of 1000 trials, only about 3-5 of them were that high. But after fixing everything, the graphs were very exponential-looking.
The situation there is that roughly 90% of the graph is linear. Sorting each trial's attempt count from lowest to highest gives this graph, since taking the data chronologically
is effectively just noise. The sorted graph, however, increases at a linear rate as a function of each item's place after sorting, which makes logical sense, since sorting
something tends to make it increase like that. However, the graph before fixing those issues was significantly distorted. At this point I decided to also analyze the derivative,
which shows how the sorted data changes from one point to the next.
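In code terms, that part of the processing is roughly this (again a sketch along the lines of plot.py, not the script itself, and the filename is just an example):

    import numpy as np
    import matplotlib.pyplot as plt

    counts = np.loadtxt("results.txt")    # attempts per trial from the C# program
    sorted_counts = np.sort(counts)       # lowest to highest
    derivative = np.diff(sorted_counts)   # how the sorted curve changes from one rank to the next

    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.plot(sorted_counts)
    top.set_ylabel("attempts (sorted)")
    bottom.plot(derivative)
    bottom.set_ylabel("step-to-step change")
    bottom.set_xlabel("trial rank")
    plt.show()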
It's basically only relevant for the sorted data. As I expected and predicted, the derivative stayed more or less the same shape. It followed roughly the same curve as the
original, except that, because calculus, the linear part of the sorted data graph became constant, which is exactly what I got. The last 10% or so, the part with the outliers,
followed basically the same curve, just with slightly lower numbers. For most of the function the derivative was very low and stayed consistently low. Of course the derivative
does steadily increase toward the end, exponentially, but it is slightly more stable than the original curve. I tried a sample size of 10,000 trials, and it ends up looking more
or less just like... a corner. A near 90-degree bend, and such. Which makes sense, sort of. The derivative gets a bit unstable at such higher sample sizes. More accurately, it
was like a corner where the outer edge was 90 degrees, but the inner edge of the corner looked more rounded, and the derivative between the inside and outside took on some
pretty diverse values, so that in matplotlib it looked more or less filled in. With such a curve, people who know what these are should expect the median to be much lower than
the mean, which it was. Of course, the original non-derivative curve was about the same, just with the first part being a steady linear increase, as opposed to the flat,
constant nature of the derivative, which makes sense because calculus. I also tried a sample size of 100 trials, with the substring being "sillyy", as I needed to test a
6-character string. This ended up taking incredibly long, of course, due to the exponential nature of random string searching getting more difficult the longer the substring.
The stable portion of the graph this time was significantly less stable: it resembled a simple exponential curve much more, rather than a piecewise combination of linear and exponential.
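As a back-of-envelope sanity check on why the 6-letter target hurts so much: a 20-character string only has 15 places where a 6-letter substring can start, and each placement matches with probability 1 in 26^6, so, ignoring overlap effects, the expected number of strings to generate is roughly 26^6 / 15, somewhere around 20 million, versus only about a thousand for a 3-letter target. This is a rough estimate, not something measured from my runs:

    # rough expected attempts to find a k-letter target inside an n-letter random string,
    # treating the search as a geometric process and ignoring overlap corrections
    def expected_attempts(k, n=20, alphabet=26):
        positions = n - k + 1
        return alphabet**k / positions

    print(f"{expected_attempts(6):,.0f}")  # about 20.6 million
    print(f"{expected_attempts(3):,.0f}")  # about a thousand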
That's about all I have to say. Probably. I did a lot of learning with this, and I am very happy with it. I recognize my findings aren't documented well, barely at all, but
either way, I still had a lovely time, and I think it was very good to write about it. Again, if anyone wants to know more about this, just ask.