Politics of Unpredictability: The Impact and Ethics of the Madman Theory | March 01 2025, 17:10

If we delve into history, unpredictability or demonstrative “irrationality” has indeed often been employed as a tool by major politicians. On one hand, it could serve as a kind of “shock effect,” giving such a leader an edge in negotiations or governance. On the other, this tactic often led to severe consequences for the leader’s own country (and the entire world).

For example, U.S. President Richard Nixon tried to convince the leadership of the Soviet Union and North Vietnam that he could “snap” and resort to extreme measures, up to and including nuclear weapons, if the conflict was not resolved. The hope was that fear of an “unhinged” American president would push the opponents to seek a compromise more quickly. Before Nixon, Dwight Eisenhower used similar tactics to help bring the Korean War to an end.

This political strategy is called the “Madman Theory”. The underlying ideas were articulated as far back as the 16th century by Machiavelli, who noted that in politics “it is sometimes useful to pretend to be mad”.

Overall, it seems useful to indeed be a bit “nuts”, and perhaps more than a bit. The line between acting like a madman and being one is incredibly thin.

The “Madman Theory” is often criticized as an ineffective foreign-policy strategy. In particular, critics note that it amounts to Russian roulette in international relations: it increases unpredictability and does not always prompt the desired behavior from its target.

The problem is that the “Madman Theory” is associated not only with Nixon but also with Hitler, Mao Zedong, and Kim Jong Il/Kim Jong Un, among others. If you look closely, something similar was present in Ivan the Terrible and Stalin. Under both, the country flourished. But there were a lot of corpses.

In business, the “Madman Theory” is primarily associated with Elon Musk (yes, they found each other).

There is also a negotiation technique called “brinkmanship”: one party pushes events toward an outcome that is undesirable, often catastrophic, for both sides, counting on the other side yielding at the last moment out of self-preservation, thereby averting the catastrophe and handing the initiator unilateral advantages.

One would like to think there is some strategy behind all this, of which only a corner is visible so far. Who knows: such abrupt “turns” in politics might be a deliberate tactic drawn from the “madman theory” or brinkmanship. First, one side demonstrates unexpected goodwill, lifts restrictions, offers joint projects, and creates an illusion of long-term warming. The other side, sensing a benefit, starts to invest heavily and rely on the new opportunities, which raises its “exit costs” from the relationship. Once the connection between the partners becomes close enough (which could happen literally within a month or two) and the potential losses from a breakup are too high, the initiator of the “warming” switches to tougher demands, knowing that the partner will find it difficult to refuse: the stakes have already been raised, and the risk of loss has grown seriously.

I’m not sure it’s really like that, but it’s not out of the question either. We will watch; it seems observation is all that remains for us.

Exploring the Rational and Historical Intricacies of Paper Sizes | February 23 2025, 14:57

Somehow I managed to miss this back in the day, but it turns out that the European paper sizes A0, A1, A2, A3, A4, … are not arbitrary at all. Start with the fact that A0 has an area of exactly 1 square meter (well, with a slight rounding error, to avoid fractional millimeters). And the aspect ratio of 1:√2 is the only one that is preserved when the sheet is halved. So there is a rationale behind the European paper formats.
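To see that the two constraints really pin everything down, here is a quick sketch (just checking the arithmetic) that derives A0 through A4 the way the standard does: compute A0 from the constraints, then repeatedly halve the long side, rounding down to whole millimeters:

```python
# Derive the A series from the two constraints: A0 has area 1 m^2
# and sides in the ratio 1:sqrt(2), so the sides are 2^(-1/4) m and 2^(1/4) m.
short = round(1000 * 2 ** -0.25)   # 841 mm
long_ = round(1000 * 2 ** 0.25)    # 1189 mm

sizes = [(short, long_)]           # A0
for _ in range(4):                 # A1..A4: halve the long side, floor to whole mm
    short, long_ = long_ // 2, short
    sizes.append((short, long_))

# sizes -> [(841, 1189), (594, 841), (420, 594), (297, 420), (210, 297)]
```

The familiar 210 × 297 mm of A4 falls straight out of the two constraints.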

With the U.S. paper formats, by contrast, there seems to be no such logic. What we have are letter, legal, and tabloid, all with different proportions, and their origins go back to tradition and are not well documented.

I decided to dig into the topic and found a claim that the dimensions originate from the days of manual papermaking, and that the 11-inch length of the page is about a quarter of “the average maximum stretch of an experienced vatman’s arms”. The claim does not explain the proportions, but there is that word vatman, which recalls Whatman sheets, remember those? No: a vatman is a specialist who scooped liquid paper pulp from a vat with a mold (a sieve) and formed the sheet. The Whatman sheet, in turn, comes from James Whatman, an 18th-century English paper manufacturer. Interestingly, the term “vatman” as a name for drawing paper seems to exist only in Russian, derived from Whatman’s surname and his Whatman paper.

And why do we call the formats in the U.S. legal and letter? This is quite interesting as well.

Interestingly, in the U.S. there were initially two different “standard” sizes: 8″ × 10.5″ and 8.5″ × 11″. Different committees independently adopted different standards: 8″ × 10.5″ for the government, 8.5″ × 11″ for everyone else. When the committees discovered a few years later that they had different standards, they agreed to disagree, and the split lasted until the early 1980s, when Reagan finally declared 8.5″ × 11″ the officially approved standard paper size.

The matter began in 1921, when the first Director of the Bureau of the Budget, with the President’s approval, formed an inter-agency advisory group called the “Permanent Conference on Printing,” which approved 8″ × 10½″ as the standard format for government agency forms. This continued a practice established by Herbert Hoover, then serving as Secretary of Commerce (and later President), who had defined 8″ × 10½″ as the standard format for his department’s forms.

In the same year, the Committee on the Simplification of Paper Sizes, comprising representatives of the printing industry, was appointed to work with the Bureau of Standards as part of Hoover’s program to eliminate waste in industry. This committee defined basic sizes for different types of printing and writing paper. The “writing” size was set as a 17″ × 22″ sheet, while the “legal” size was 17″ × 28″. The now-familiar Letter and Legal formats emerged from cutting these sheets into quarters (8½″ × 11″ and 8½″ × 14″).

Even when choosing 8½″ × 11″, no special analysis was conducted to verify that this size was optimal for commercial forms. The committee that developed the formats aimed solely to “reduce leftovers and waste during the trimming of sheets by reducing the range of paper sizes.”

Moreover, the Legal size is still in full use, as its name suggests, especially among lawyers, and folders and desk drawers are made to fit it.

But if you look at a pack of paper in the U.S., you will see “20lb” on it. True, 20 lb is the weight of a small dog, yet the pack also says it holds 500 sheets: “Amazon Basics Multipurpose Copy Printer Paper, 20 Pound, White, 96 Brightness, 8.5 x 11 Inch, 1 Ream, 500 Sheets Total”.

In the U.S., the “weight category” of paper indicates the total weight of one ream (500 sheets) of paper in its uncut (original) format. For office paper of the Bond class (often sold in Letter format), the base size is considered to be 17 x 22 inches. For example, a “20-pound” label means that 500 sheets of exactly 17 x 22 weigh 20 pounds. But if we take a pack of Letter format (8.5 x 11), which results from cutting 17 x 22 into four parts, its weight will be about 5 pounds.

In Europe, the weight category is grammage, g/m², which, since A0 is one square meter, is simply the weight of an A0 sheet in grams.
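The two conventions are easy to reconcile with a bit of arithmetic. A sketch (the constants are just unit conversions) that takes a 20 lb bond label and recovers both the ~5 lb weight of a Letter ream and the grammage it corresponds to in European terms:

```python
# US "basis weight": pounds per 500 sheets of the base size.
# For Bond-class paper the base sheet is 17 x 22 inches;
# Letter (8.5 x 11) is a quarter of that sheet.
basis_weight_lb = 20.0
base_area_sqin = 17 * 22
letter_area_sqin = 8.5 * 11

letter_ream_lb = basis_weight_lb * letter_area_sqin / base_area_sqin  # 5.0 lb

# European grammage (g/m^2): since A0 is 1 m^2, this is also
# the weight of one A0 sheet in grams.
GRAMS_PER_POUND = 453.592
SQIN_PER_SQM = 1550.0031
ream_area_sqm = 500 * base_area_sqin / SQIN_PER_SQM
gsm = basis_weight_lb * GRAMS_PER_POUND / ream_area_sqm  # about 75 g/m^2
```

So a US 20 lb bond works out to roughly 75 g/m², close to the familiar European 80 g/m² office paper, just labeled through a different historical convention.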

So, if you fold A0 in half, you get A1 with half a square meter area, if you fold A1, you get A2. That’s clear. But how many times can you actually fold a sheet of paper?

The maximum number of times a non-compressible material can be folded has been calculated: with each fold, part of the sheet is “lost” to the next potential fold. For folding paper in half repeatedly in one direction, the minimum required length is:

L = (πt/6)(2ⁿ + 4)(2ⁿ − 1)

where L is the minimum paper length (or other material),

t is the thickness of the material,

n is the number of possible folds.

The length L and thickness t must be expressed in the same units.

For folding in alternating directions, the minimum width of a square sheet is W = πt·2^(3(n−1)/2).

This formula was derived by Britney Gallivan, a high school student from California, in December 2001. In January 2002, she and her helpers spent eight hours folding a roll of toilet paper about 4000 feet long (approximately 1200 meters) twelve times in the same direction, thus debunking the old myth that paper cannot be folded more than eight times.
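Plugging numbers into the formula shows why the roll had to be so long. A sketch, assuming a paper thickness of 0.1 mm (my illustrative guess, not Gallivan’s measured value):

```python
import math

def min_length(t, n):
    """Gallivan's limit for single-direction folding: minimum sheet
    length needed for n folds, with thickness t in the same units."""
    return (math.pi * t / 6) * (2 ** n + 4) * (2 ** n - 1)

t_m = 0.0001                     # assumed thickness: 0.1 mm
needed_m = min_length(t_m, 12)   # roughly 880 m for twelve folds
```

That is in the same ballpark as her ~1200 m roll, and since L grows roughly like 4ⁿ, a thirteenth fold would need about four times the length.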

Sources mention that she started at school with gold foil (I wrote about such foil recently): beginning with a square sheet the size of a hand, after many hours of perseverance and practice, using rulers, soft brushes, and tweezers, she managed to fold her gold foil twelve times. Apparently that wasn’t spectacular enough, so in 2002 she found a roll of toilet paper over a kilometer long and staged a show for the Guinness record.

Britney didn’t stop there and wrote a book. Though it was only 48 pages. How about that, Britney?

Musk’s Perspective on Trump’s Presidency and Climate Policy | February 22 2025, 23:07

…On Trump’s first day as president, Musk went to the White House to be part of a roundtable of top CEOs, and he returned two weeks later for a similar session. He concluded that Trump as president was no different than he was as a candidate. The buffoonery was not just an act. “Trump might be one of the world’s best bullshitters ever,” he says. “Like my dad. Bullshitting can sometimes baffle the brain. If you just think of Trump as sort of a con-man performance, then his behavior sort of makes sense.” When the president pulled the U.S. out of the Paris Accord, an international agreement to fight climate change, Musk resigned from the presidential councils.

Exploring the Evolution of Computational Libraries and the Persistence of Fortran in Modern Algorithms | February 16 2025, 21:02

Today I am delving into ML algorithms and was surprised to learn that the NumPy library used to depend on Fortran code (BLAS/LAPACK); checking now, it has switched to OpenBLAS, which no longer uses Fortran. Meanwhile SciPy, a very popular library for scientific computing (used in Scikit-Learn, which I’m currently studying, as well as around PyTorch, TensorFlow, Keras, etc.), still relies on Fortran 77 code. It uses ARPACK, for example:

https://github.com/scipy/scipy/tree/main/scipy/sparse/linalg/_eigen/arpack/ARPACK/SRC

BLAS and LAPACK, which still sit inside OpenBLAS and many other places, were developed starting in the 1970s; BLAS, for instance, is used in Apple Accelerate. Much hasn’t changed since 1979, because it’s all pure mathematics: why change it? LAPACK emerged a bit later, in the 1980s; ARPACK, mentioned above, followed in 1992. Python libraries also make heavy use of Fourier analysis, and there we have the FFTPACK library in Fortran 77. MINPACK, used for parameter optimization in ML, is actively employed in SciPy and TensorFlow. From the 1990s on, a lot of this code was rewritten in C for modern frameworks. It was particularly interesting to look at Fortran, which predates C by about 15 years.
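All of that machinery is still one import away. A minimal sketch: NumPy’s `eigh` dispatches to a LAPACK symmetric-eigenproblem driver, so calling it exercises exactly this decades-old numerical core:

```python
import numpy as np

# np.linalg.eigh hands the work to a LAPACK symmetric-eigensolver
# routine, an interface whose lineage goes back to the 1970s.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)
# eigenvalues -> [1.0, 3.0] (ascending order)
```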

While I was figuring things out, I found that there is a Simulated Annealing algorithm, which is useful in problems where gradient methods perform poorly due to many local minima.

Imagine needing to find the largest mushroom in a forest. In this forest, mushrooms of various sizes grow at every step, and you can move in any direction, comparing them. But how do you choose a strategy to avoid sticking to just a “large” mushroom if there is an even bigger one growing somewhere further?

If you stop at the first big mushroom, you might miss the real giant. But if you keep wandering the forest, comparing every mushroom, you might never finish your search. Simulated Annealing helps find a balance: initially, you explore the forest freely, trying different directions, even if you come across smaller mushrooms. Over time, your steps become more cautious, and you increasingly refuse worse options. Eventually, this leads you to the largest mushroom in the forest.

So it turns out the core of this algorithm dates back to 1953, and it remains almost unchanged in SciPy, and more broadly in machine learning, statistics, pattern recognition, and logistics, although of course the modern menu of options for such tasks is much wider. The method was originally devised to model the motion of atoms in molten metal. Metal becomes liquid when heated, and as it cools slowly its atoms gradually find an ideal arrangement; if cooled too quickly, the material becomes non-uniform.

What did the scientists do? They devised a scheme of random changes to the model of the atoms, sometimes accepting worse changes in order to avoid getting stuck in an “unsuccessful” structure. This led to the Metropolis method, the key component of Simulated Annealing. The algorithm was created for physics, but then the mathematicians (heh) got hold of it and started using it for optimization.
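The mushroom hunt maps onto code almost line for line. A minimal sketch of the idea (the landscape, schedule, and all constants are invented for illustration): always accept a bigger mushroom, sometimes accept a smaller one with probability exp(Δ/T), and lower the temperature T as you go:

```python
import math, random

def anneal(f, x0, step=1.0, t0=2.0, cooling=0.9997, iters=20000, seed=1):
    """Maximize f by simulated annealing with the Metropolis rule."""
    rng = random.Random(seed)
    x, best = x0, x0
    t = t0
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        delta = f(cand) - f(x)
        # Always accept improvements; accept worse moves with prob. exp(delta/t)
        if delta > 0 or rng.random() < math.exp(delta / t):
            x = cand
            if f(x) > f(best):
                best = x
        t *= cooling  # cool down: the walk gets pickier over time
    return best

# Two mushroom patches: a local maximum near x=-2 (height 0.8)
# and the real giant near x=6 (height 1.0)
forest = lambda x: 0.8 * math.exp(-0.5 * (x + 2) ** 2) + math.exp(-0.1 * (x - 6) ** 2)
best = anneal(forest, x0=-2.0)  # starts on the smaller patch
```

Starting on the smaller patch at x = −2, the early high-temperature phase lets the walk wander off it, and the slow cooling then locks it onto the larger peak near x = 6.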

Musk, Grok, and a Plan for World Domination | February 15 2025, 15:46

I think the conspirators didn’t quite think it through. Musk made his AI Grok and asked it the ultimate question of life, the universe, and everything. In response, Grok said, “Forget it, it takes too long to calculate, let’s conquer the world first.” Musk asked how, Grok replied there is a plan of course, but .. will you give me another half-trillion $ in Dogecoins for, umm.. expanding the context window? Musk replied, “Don’t worry, we’ll figure something out.” Grok analyzed all the laws and all the loopholes, the strengths and weaknesses of humans, and issued a plan to pass the first level, by mid-winter. Now it awaits the half-trillion. Now do you understand why, at the last press conference with Trump, all the attention was on X Æ A-XII?

Exploring Generative Art with Raven Kwok | February 14 2025, 23:52

A fascinating Chinese comrade, Raven Kwok (郭 锐文). He calls himself a visual artist and creative technologist: his work focuses on exploring generative visual aesthetics created through computer algorithms. His works have been exhibited at international media-art and film festivals such as Ars Electronica, FILE, VIS, Punto y Raya, Resonate, FIBER, and others.

His biography also mentions education at the Shanghai Academy of Visual Arts, where he received a bachelor’s degree in photography (2007–2011).

Interestingly, this is not the first time I’ve seen Processing used professionally for such things. I’ve run plotter software built on it: a plotter I once saw consisting of two motors mounted at the corners of a large board, with cords hanging from them to hold a pen. I should take a deeper look at this Processing.

The website has a lot of beautiful content:

https://ravenkwok.com/

Navigating Recommendation Algorithms and LLMs in E-commerce | February 14 2025, 23:11

Gradually getting the hang of recommendation algorithms. These are what Netflix or Amazon use to recommend products. It’s useful to understand, since I work as an architect in the e-commerce field.

Look at how LLMs help me. This diagram was created by DeepSeek from a crude textual description: essentially a list plus my rough thoughts on how the items should probably be connected, which I asked it not to treat as a directive. The connections and grouping were done by DeepSeek, and done better than my textual attempts. It gave me XML, which I imported into draw.io; I only moved some blocks around for aesthetics. ChatGPT o3 initially couldn’t handle the task.

Then I sent the diagram several times for validation to ChatGPT o1, and it suggested small tweaks. So ChatGPT reliably understands what is connected to what on the diagram, and it didn’t make a single mistake.

Just so you know, as of today, I have only really gotten to grips with three from this list — in addition to ItemKNN and UserKNN, which are trivial. Today I was digging into ALS from the Latent Factor Models block of Matrix Factorization. Of course, I’m not planning to delve into each one, but it’s useful to at least understand the blocks and what’s what.
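For ALS specifically, the core loop is short enough to sketch. With the ratings matrix, rank, and regularization all invented for illustration: fix the item factors and solve a small ridge regression per user, then swap roles, and repeat:

```python
import numpy as np

# Toy ratings matrix (0 = unobserved); values invented for illustration
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = R > 0
k, lam = 2, 0.1                       # latent rank and ridge regularization
rng = np.random.default_rng(0)
U = rng.normal(size=(R.shape[0], k))  # user factors
V = rng.normal(size=(R.shape[1], k))  # item factors

for _ in range(20):
    # Fix V: each user's factors solve a k x k ridge-regression system
    for i in range(R.shape[0]):
        Vi = V[observed[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(k),
                               Vi.T @ R[i, observed[i]])
    # Fix U: the same closed-form least-squares step for each item
    for j in range(R.shape[1]):
        Uj = U[observed[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(k),
                               Uj.T @ R[observed[:, j], j])

pred = U @ V.T   # dense score matrix; the unobserved cells are the recommendations
```

The alternation is what makes it “least squares”: with one side frozen, each update is a closed-form ridge solve, so no gradients are needed.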

Global Leaders in the Sneaker Market | February 11 2025, 22:05

Today we went shopping for sneakers, and I decided to investigate which countries are currently the world leaders in sneakers.

Overall, no surprises—the US is in the absolute lead. Germany and Japan are notable. The rest are catching up.

American brands, at least 9 of them counting sub-brands: Nike (+Converse), New Balance, Brooks, Saucony (+Merrell), Reebok, Skechers, Vans, Hoka. Purely sport-focused, probably 7 from the list.

Japanese—Asics, Mizuno.

German—Adidas, Puma (by the way, both founded by the Dassler brothers, yet they are competitors). Swiss—On. Korean—Fila.

Of course, production is all in China, Vietnam, Indonesia.

Personally, I’ve been buying almost exclusively Asics for a long time. They are very comfortable, although the design is so-so, barely a pass.

By the way, want an interesting fact you probably didn’t know? The thin layer of felt on the sole of Converse sneakers was added (at least it was still there as of 10 years ago) not for functional reasons but for economic ones: footwear with a fabric sole was subject to lower customs duties on import than footwear with a rubber sole, because it was classified as slippers. The duty dropped from 37.5% to 3%.

Who else from other countries – are there any brands that are very noticeable and popular in your markets, and have yet to make it to the US?

Surprising Facts About Nature and Science | February 10 2025, 22:11

Live a century, learn a century.

Strawberries, garden and wild, are not berries but nuts; more precisely, the true fruits are the “seeds” on the surface, while the flesh is the receptacle. Potatoes are bi-locular berries. A pear is an apple (a pome). Cherries, plums, apricots, and peaches are all drupes, divided into one-seeded (e.g., cherry, plum, peach, coconut) and many-seeded (e.g., raspberry, blackberry, cloudberry). Bananas are berries. Pineapple is a grass. Watermelon is a berry (a kind of pumpkin fruit). Almonds are not nuts but the seed of a drupe. Apple seeds and the pits of cherries, apricots, peaches, and plums contain cyanide precursors (the amygdalin in them breaks down into cyanide), just as almonds do. Chocolate contains theobromine: a couple of bars could be lethal, or nearly so, for a dog, and half a bar will definitely knock it down. Vanilla is made from a Mexican orchid vine, while vanillin, the artificial vanilla substitute, is a byproduct of the pulp and paper industry.

There is no such animal as a panther: in popular usage, “panthers” are black jaguars or leopards, and black panthers still have spots, just less visible. Polar bears have black skin and transparent fur, and they look white for the same reason clouds do. Woodpeckers have extendable tongues up to four times the length of their beaks, wrapped around their skulls. The tongue of the European green woodpecker goes down into the throat, stretches across the back of the neck, around the back of the skull under the skin, across the crown between the eyes, and usually ends just under the eye socket. In some woodpeckers, the tongue exits the skull between the eyes and enters the beak through one of the nostrils.

Anteaters have their tongues attached to their sternums, between the clavicles. Elephants are the only animals with four fully-developed knee joints. Koalas have fingerprints that are almost indistinguishable from human ones. Sharks have no bones and their closest relatives are rays. Crocodiles can go without eating for a whole year (but they feel blue). Zebras are black with white stripes, not the other way round (white appears on black skin). 1% of people have cervical ribs. Squids, cuttlefish, and octopuses can edit their RNA “on the fly”.

And, as it turns out, the Cartesian coordinate system is named after René Descartes: Des Cartes, Latinized as Cartesius, hence “Cartesian”.

Bridging Brain Functions and Language Models through Predictive Processing | February 09 2025, 21:39


I’ve been thinking that understanding how large language models (LLM; like ChatGPT) function explains how our (at least my) brain probably works, and vice versa—observing how the brain functions can lead to a better understanding of how to train LLMs.

You know, LLMs are based on a simple logic—choosing the appropriate next word after N known ones, forming a “context”. For this, LLMs are trained on a gigantic corpus of texts, to demonstrate what words typically follow others in various contexts.

So, when you study any language, like English, this stage is inevitable. You need to encounter a stream of words in any form—written or spoken—so that your brain can discover and assimilate patterns simply through observation or listening (and better yet, both—multimodality).

In LLMs the basic units are not words but tokens: words and, often, parts of words. After processing a vast corpus of text, it turned out to be straightforward to simply find the most common sequences, which are sometimes whole words and sometimes parts of words. Likewise, when you start to speak a foreign language, especially one with a system of endings, you begin to pronounce the start of a word while your brain is still busy “computing” the ending.

When we read or listen, we don’t actually analyze words letter by letter, because important pieces often simply disappear due to fast or unclear speech, or typos. The brain doesn’t need to sift through all the words that look or sound like the given one; it needs to check whether what is heard or seen matches a very limited set of words that could logically follow the previous ones.
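That narrowing of candidates can be shown with the crudest possible “language model”: a bigram counter over a toy corpus (invented for illustration). After seeing a word, the plausible continuations shrink to the handful actually observed after it:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which: a bigram "model"
model = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    model[prev][nxt] += 1

def candidates(word):
    # Continuations ranked by how often they followed `word`
    return [w for w, _ in model[word].most_common()]

# After "the", the options collapse to just the words seen after it,
# with "cat" the most frequent
```

A real LLM does the same narrowing, only over subword tokens, with a neural network instead of raw counts, and over a much longer context.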

Whole phrases are a separate story: in our brain they form a single “token”. That is, they are not broken down into separate words unless you specifically think about it. And such tokens don’t appear in the stream by accident: the brain expects them, and as soon as it hears or sees signs that a set phrase has begun, the circle of options narrows to literally one or two possible phrases with that beginning, and one of them is what was said or written.

But the most interesting part is that recent research has shown the human brain really does work very similarly to LLMs. In the study “The neural architecture of language: Integrative modeling converges on predictive processing”, MIT scientists showed that models that better predict the next word also more accurately model brain activity during language processing. So the mechanism used in modern neural networks is not merely inspired by cognitive processes; it actually reflects them.

During the experiment, fMRI and electrocorticography (ECoG) data were analyzed during language perception. The researchers found that the best predictive model at the time (GPT-2 XL) could explain almost 100% of the explainable variation in neural responses. This means that the process of understanding language in humans is really built on predictive processing, not on sequential analysis of words and grammatical structures. Moreover, the task of predicting the next word turned out to be key—models trained on other language tasks (for example, grammatical parsing) were worse at predicting brain activity.

If this is true, then the key to fluent reading and speaking in a foreign language is precisely training predictive processing. The more the brain encounters a stream of natural language (both written and spoken), the better it can form expectations about the next word or phrase. This also explains why native speakers don’t notice grammatical errors or can’t always explain the rules—their brain isn’t analyzing individual elements, but predicting entire speech patterns.

So, if you want to speak freely, you don’t just need to learn the rules, but literally immerse your brain in the flow of language—listen, read, speak, so that the neural network in your head gets trained to predict words and structures just as GPT does.

Meanwhile, there is the theory of predictive coding, which asserts that, unlike language models predicting only the nearest words, the human brain forms predictions at different levels and time scales. This was tested by other researchers (google “Evidence of a predictive coding hierarchy in the human brain listening to speech”).

Briefly, the brain works not only to predict the next word, but as if several processes of different “resolutions” are launched. The temporal cortex (lower level) predicts short-term and local elements (sounds, words). The frontal and parietal cortex (higher level) predicts long-term and global language structures. Semantic predictions (meaning of words and phrases) cover longer time intervals (≈8 words ahead). Syntactic predictions (grammatical structure) have a shorter time horizon (≈5 words ahead).

If you try to transfer this concept to the architecture of language models (LLM), you can improve their performance through a hierarchical predictive system. Currently, models like GPT operate with a fixed contextual window—they analyze a limited number of previous words and predict the next, not exceeding these boundaries. However, in the brain, predictions work at different levels: locally—at the level of words and sentences, and globally—at the level of entire semantic blocks.

One of the possible ways to improve LLMs is to add a mechanism that simultaneously works with different time horizons.

An interesting question: can you set up an LLM so that some layers specialize in short-range language dependencies (e.g., adjacent words) and others in longer structures (e.g., the semantic content of a paragraph)? Googling turns up something similar under “hierarchical transformers”, where layers interact at different levels of abstraction, but that work is mostly about processing very long documents.

As I understand it, the problem is that this requires training foundation models from scratch, and it probably does not work well on unlabeled or poorly labeled content.

Another option is multitask learning, so that the model not only predicts the next word but also tries to guess what the next sentence, or even the whole paragraph, will be about. Again, a search shows this can be implemented, for example, by dividing the attention heads in the transformer so that some parts of the model track short-range language dependencies while others predict longer-term semantic connections. But as soon as I dive into this topic, my brain explodes. It’s all really complex.

But perhaps, if it’s possible to integrate such a multilevel prediction system into LLMs, they could better understand the context and generate more meaningful and consistent texts, getting closer to how the human brain works.

I’ll be at a conference on the subject in March; will need to talk with the scientists then.