Exploring the Magic of Neural Networks in Letter Prediction and Visualization | December 14 2025, 23:35

I am currently experimenting with training simple neural networks – primarily to automate the existing toolkit, and some things just seem like magic.

There is a database of 32,000 names. There is a neural network filled with random numbers. I start training, with only this list of names as input. The first layer of the neural network is embeddings, and I set the number of dimensions to 2 for easy visualization. And after 200,000 iterations of training, the system clearly separates vowels from consonants, and for some reason, places the letter “q” slightly apart from other consonants. It seems that this is because the letter ‘q’ almost exclusively predicts the letter ‘u’ (Queen, Quincy, Quentin).

It also very reliably separates vowels and consonants in Russian names. There, the letters b and l sit somewhat apart from the other consonants, as do the soft and hard signs (well, that’s understandable).

I wonder how it works. If trained on a normal corpus of texts, the difference would be very clear. Why are vowels separated from consonants? Apparently, from the network’s mathematical perspective, ‘a’ and ‘o’ serve the same function: they “trigger” the prediction of the consonant following them, so the alternation of vowels and consonants is to blame. But damn, it’s interesting 🙂

And since the model can predict the next letters, you might try running it on Russian. On a model with 30-dimensional embeddings, it invents names like: Byaketta, Afsena, Erakey, Zasbat, Daraya, Gaiomahad, Rain, Razhul, Gzhatsiy, Reben, Vureb, Durodira, Turuzhul, Regravgava, Razsan, Gabila, Avganzh, Raksi, Khalebkokhorta, Rather. The model – for those who understand – is this: input of 6×33 characters (because we take up to 6 characters of context), encoded into embeddings of 60, goes to a layer of 100 neurons, and from there back to 33 characters. Some nonsense, but at least it’s clear how it all works at all levels.
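For those who want to see the shape of such a model, here is a minimal forward-pass sketch in plain Python. It is not my training code: the weights are random, there are no biases, and the tanh activation is an assumption. The post mentions both 30-dimensional embeddings and “embeddings of 60”, so the sketch assumes 10 values per character, which gives 60 after concatenating the 6 context characters.

```python
import math
import random

VOCAB = 33      # characters in the alphabet, from the post
CONTEXT = 6     # characters of context, from the post
EMB_DIM = 10    # assumption: 6 x 10 = 60 concatenated embedding values
HIDDEN = 100    # hidden layer size, from the post

rng = random.Random(0)

def rand_matrix(rows, cols, scale=0.1):
    """Matrix of small random numbers, i.e. an untrained network."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

emb = rand_matrix(VOCAB, EMB_DIM)              # one vector per character
w1 = rand_matrix(CONTEXT * EMB_DIM, HIDDEN)    # 60 -> 100
w2 = rand_matrix(HIDDEN, VOCAB)                # 100 -> 33

def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def next_char_probs(context_ids):
    """Forward pass: character ids -> probabilities of the next character."""
    assert len(context_ids) == CONTEXT
    x = [v for idx in context_ids for v in emb[idx]]   # concatenate embeddings
    h = [math.tanh(v) for v in matvec(x, w1)]          # hidden layer
    logits = matvec(h, w2)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]        # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

probs = next_char_probs([0, 0, 0, 1, 2, 3])
print(len(probs), round(sum(probs), 6))
```

Training would then nudge emb, w1 and w2 so that the probability of the actual next letter goes up; the rows of emb are the points that end up on the 2D picture when EMB_DIM is set to 2.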

Modern Reading: More Words, Digital Shifts, and Surprising Data Insights from 2008 | December 14 2025, 22:33

An interesting study caught my eye, dating back to 2009. According to it, the modern human indeed reads significantly more than in the past, although the format of this reading has changed. The study suggests that in 2008, an average American consumed about 100,000 words a day (approximately a quarter of “War and Peace”) – this is an approximate number of words that passed through consciousness per day (via ears or eyes), calculated based on activity chronometry. This is 140% more than in 1980.

Therefore, contrary to the myth about the degradation of reading, at least in 2008 we processed 2.4 times as much textual information as our parents’ generation (140% more means 2.4 times the 1980 level). Moreover, the study only considered information consumed outside of work (at home, in transit, during leisure).

The structure of reading has changed too: in 1960, 26% of words came from paper, and by 2008 this share had fallen to 9%. However, digital media (internet, email, social networks) not only compensated for this decline but also tripled the total reading time. The reason is the internet, which is predominantly a textual environment (web surfing, email).

But it’s interesting that although the Internet accounts for 25% of consumed words, it makes up only 2% of bytes (since video on the internet in 2008 was of low quality). That is, they estimated the information flow from different channels and converted it into bytes 🙂 Radio accounted for 19% of the time but generated only 0.3% of bytes (as audio requires less data). Voice communication (telephone) accounted for only 5% of words and a negligible share of bytes, but it was the only fully interactive channel before the internet era. TV remained the main source of information in 2008 both by time (41% of all hours) and by quantity of words (45%); in terms of data volume (bytes), however, television was only second (35%), behind computer games.

Now about games, quite interesting. The main finding from the report: Games generated (or did in 2008) 55% of all “bytes” consumed by households. Meanwhile, they only accounted for 8% of user time. This is quite a controversial topic in their report.

Those 100,500 words are an assessment of actual words that a person either read or heard. This is not a metaphorical “equivalent,” but an attempt to calculate the verbal information precisely. They took the consumption time of each medium and multiplied it by the average word inflow rate for that channel. Reading (books, newspapers, internet texts): 240 words per minute. Email and web surfing: 240 words per minute. Television (dialogues in shows/movies): 153 words per minute. Radio: 80 words per minute (less because of many pauses and music). Music: 41 words per minute (song lyrics).
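The method is easy to sketch in code. The rates below are the ones quoted above; the minutes per channel are a made-up day for illustration, not figures from the study.

```python
# Words per minute for each channel, as quoted in the post.
WORDS_PER_MINUTE = {
    "reading": 240,
    "email_and_web": 240,
    "television": 153,
    "radio": 80,
    "music": 41,
}

def words_consumed(minutes_by_channel):
    """Sum minutes * words-per-minute over all channels."""
    return sum(WORDS_PER_MINUTE[channel] * minutes
               for channel, minutes in minutes_by_channel.items())

# Hypothetical day, NOT data from the study: minutes spent on each channel.
example_day = {
    "reading": 30,
    "email_and_web": 60,
    "television": 240,
    "radio": 60,
    "music": 30,
}

print(words_consumed(example_day))
```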

Link in the comments

Harnessing GPU Power Beyond Machine Learning: A Data Processing Experiment | December 13 2025, 01:16

Torturing my supercomputer. Illustration that the GPU is not just for machine learning and some complex math.

My script takes a thick English dictionary (Webster) and multiplies it by 30, creating a list of 12 million words. Then, the algorithm looks through all 12 million words and replaces all the vowels with asterisks using regex. To add more load, a “word length” column is added, and then we take words longer than 10 letters and find the most frequent (top 5).

So, in Python this is

df['masked'] = df['text'].str.replace(r'[aeiou]', '*', regex=True)

df['len'] = df['masked'].str.len()

res = df[df['len'] > 10]['masked'].value_counts().head(5)

and this code is executed first through the main processor, then through a GPU.
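For anyone without pandas at hand, the same three steps can be mirrored in plain Python. This is only a sketch of the logic, not the benchmark script: top_masked is a name I made up, and the word list is a tiny stand-in for the 12-million-row dictionary.

```python
import re
from collections import Counter

VOWELS = re.compile(r"[aeiou]")

def top_masked(words, min_len=10, top=5):
    """Mask vowels, keep entries longer than min_len, return the most common."""
    masked = (VOWELS.sub("*", w) for w in words)
    return Counter(m for m in masked if len(m) > min_len).most_common(top)

# Tiny made-up word list standing in for the real dictionary.
sample = ["blackstone.", "blackstone.", "sir w. scott.", "cat", "dictionary"]
print(top_masked(sample))
```

Note that the pattern only matches lowercase vowels, so it relies on the entries being lowercased beforehand.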

The main processor (I have the top-tier Intel Core Ultra 9 285K) completes this task in 24 seconds, while the Nvidia RTX 5090 does it in 0.51 seconds. That’s a 46x difference!

[Pandas CPU] Top Patterns:

masked

s*r w. sc*tt. 23280

s*r t. br*wn*. 23220

j*r. t*yl*r. 16140

bl*ckst*n*. 10860

b***. & fl. 10830

Name: count, dtype: int64

[Pandas CPU] Computation Time: 23.5596 sec.

Transferring data to GPU…

Transfer complete in 1.16s

— Running Benchmark: cuDF GPU —

[cuDF GPU] Top Patterns:

masked

s*r w. sc*tt. 23280

s*r t. br*wn*. 23220

j*r. t*yl*r. 16140

bl*ckst*n*. 10860

b***. & fl. 10830

Name: count, dtype: int64

[cuDF GPU] Computation Time: 0.5108 sec.

TOTAL SPEEDUP: 46.12x
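The 46x figure compares computation time only. The log also reports the one-off transfer to the GPU, so a second, more conservative ratio can be worked out from the same numbers. A small sketch of that arithmetic, using only the values printed above:

```python
cpu_seconds = 23.5596      # [Pandas CPU] Computation Time
gpu_seconds = 0.5108       # [cuDF GPU] Computation Time
transfer_seconds = 1.16    # "Transfer complete in 1.16s"

compute_only = cpu_seconds / gpu_seconds
with_transfer = cpu_seconds / (gpu_seconds + transfer_seconds)

print(f"compute only: {compute_only:.2f}x")
print(f"including transfer: {with_transfer:.2f}x")
```

Even with the transfer counted, the GPU comes out roughly 14 times faster, and the transfer cost is paid once however many operations follow.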

Stages of Understanding Scientific Papers | December 10 2025, 19:38

As I periodically read scientific papers on my topic, I will try to articulate the levels of understanding the truth.

Level 0: “Read Later Folder” Downloaded the PDF, the title sounds genius, the abstract seems like the solution to all my problems. The file is forever buried in the ~/Downloads/Papers/ToRead folder.

Level 1: “Sumerian Cuneiform” Don’t understand anything at all. Random symbols, they’ve run out of Greek letters. “Orthogonal extrapolation of cognitive entropy within a quasi-stationary discourse inevitably induces a bifurcation of transcendental synergism.” Such materials really lower self-esteem. Most often from this level, you either fall back to zero, or gradually move to the second level.

Level 2: “Illusion of Competence” The Abstract is clear, the Introduction reads like a good detective story. But as soon as the main section starts, the text turns into a pumpkin. I can’t paraphrase it in my own words, only in general phrases: “Well, they trained a neural net… kind of.”

Level 3: “Formulas where needed and where not” The Abstract is clear, the first half of the article is also okay (architecture, pictures). But then comes formula (4), where “magic” happens. I take the authors’ word for it that equation (3) leads to (4) because, of course, I won’t check it. Beyond that — sheer horror and belief in a miracle.

Level 4: “Goldfish Effect” While reading — everything is crystal clear. The logic is solid, conclusions are obvious, the authors are smart. I close the tab, someone asks me, “What was the article about?” — and I freeze. My mind goes blank. If you take away the paper, I can’t reproduce even the idea because there essentially isn’t an idea, there is a process.

Level 5: “Armchair Expert” Everything’s clear, I can retell the essence over a beer. I know that Input transforms into Output, but the “black box” inside is still black. Give me a computer, I wouldn’t be able to reproduce even the skeleton because, it turns out, the article lacks half of the important stuff.

Level 6: “Critic-Practitioner” Everything is clear, I can recount, understand how to reproduce (even without their code). I see where they cut corners. I definitely know that the “state-of-the-art” result is achieved only thanks to a lucky seed or dataset and this strange trick in preprocessing, mentioned in the footnote on page 12.

Level 7: “Deconstructor” Hooray, I’ve understood everything and implemented it myself. It works worse than in the article, but I know why. However, I understand this work better than the second author (who just made charts). I see that all this complex mathematics over 5 pages boils down to two paragraphs in the middle.

Level 8: “Nirvana” The article is trivial. The idea is derivative, it was all done in the ’90s by Schmidhuber, just named differently. The formulas are overcomplicated to look important. I can write the same in 10 lines of code and it will work faster. Reject.

If anything — I’m stuck somewhere between 2 and 4.

Comparing US and Russian Higher Education Systems through Credit Hours | December 10 2025, 17:35

Regarding education in the USA and the USSR/Russia. My degree in the USA is evaluated as a Master of Science degree in Computer Science. My younger colleagues say that a Russian university degree is rarely recognized as a Master’s these days, and often hardly qualifies even for a Bachelor’s. I decided to look at the numbers and was very surprised.

To earn a bachelor’s degree in the USA, you need to spend about 2000 hours in classrooms/laboratories. In terms of credits, this equals 120 credit hours. One credit usually equals 1 hour (50 minutes) of lectures per week for a semester (15 weeks). Laboratory work has a different coefficient (often 2–3 hours in the lab count as 1 credit), so the actual number of classroom hours is slightly higher (closer to 2000+).

So, my diploma states that I spent 7908 hours in classes over five years. That’s about four times as many as a typical student in the USA. Based on the numbers, it turns out that I spent about 2000 hours on math, physics, and English alone over five years, with a total of 42 subjects.
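The arithmetic behind these figures, as a rough sketch. It uses only the numbers mentioned above and ignores the lab coefficient, which is why the lecture-only total comes out below 2000.

```python
CREDITS_FOR_BACHELOR = 120   # from the post
WEEKS_PER_SEMESTER = 15      # from the post
MINUTES_PER_CLASS_HOUR = 50  # from the post

# One credit = one 50-minute class hour per week for one semester.
class_hours = CREDITS_FOR_BACHELOR * WEEKS_PER_SEMESTER
clock_hours = class_hours * MINUTES_PER_CLASS_HOUR / 60

us_estimate = 2000           # the post's rounded figure, including labs
diploma_hours = 7908         # hours listed in my diploma

print(class_hours)                            # lecture-only class hours
print(clock_hours)                            # the same in 60-minute hours
print(round(diploma_hours / us_estimate, 2))  # ratio to the US estimate
```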

A colleague shared that his Russian bachelor’s diploma lists 3140 academic hours, which is less than half of mine. And can you share how many hours are in your diploma?

Year of graduation, university, specialty, and the number of hours? I’m curious about the range of variation.

Theremin Tones at Splean’s Concert: A Musical Blend | December 05 2025, 23:29

A thereminvox at a Splean concert yesterday. It turns out that this seemingly borrowed word doesn’t exist in English. There the instrument is simply called a theremin, because the family name of Lev Termen had French roots and was spelled Theremin. The thereminvox was nicely incorporated into the arrangement, although it was played quite simply by a musician from Rostov, and the thereminvox itself had only one antenna.

Among the musicians, Meshcheryakov, the drummer, really stole the show. The most melancholic was the guitarist, Vadim Sergeyev. He just stared into the crowd, almost motionless, but performed his part very precisely. Evidently, professionalism doesn’t wear off.

The Maddening Ambiguity of Mathematical Notation | December 02 2025, 15:30

If someone tells you that mathematics is an exact science, don’t believe them. Since I’m currently into data science as a hobby, I’m studying all sorts of things from different books, and my brain is exploding: how can this happen in a science where every little detail has to fit into a system or be thrown out? That holds until you get to notation. There it’s a complete mess, a set of dialects.

Take, for example, logarithms. The “standard” for how to denote a logarithm depends on which room of the university you are in. In calculus and number theory, log(x) almost always means the natural logarithm ln(x) with base e. The derivative of e^x equals e^x, so it’s “natural”, and they’re too lazy to write ln. Yet where decimal logarithms come up (as in engineering), log(x) suddenly becomes decimal and ln(x) is the one with base e, while in computer science log(x) often means base 2.
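Programming languages had to pick one reading each, which shows the ambiguity concretely. A minimal sketch using Python’s standard math module, where log is natural unless a base is passed explicitly:

```python
import math

x = 1000.0

natural = math.log(x)          # natural logarithm, base e
decimal = math.log10(x)        # decimal logarithm
binary = math.log2(x)          # binary logarithm
explicit = math.log(x, 10)     # base given as a second argument

print(natural, decimal, binary, explicit)
```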

The expected value E takes its argument in square brackets, as in E[X]. Meanwhile, the same square brackets in computer science are used for the 0/1 indicator (the Iverson bracket): [P] is 1 if the statement P is true and 0 otherwise.

Or if you see a vector: is it a column or a row? In classical mathematics, a vector is always a column. To multiply it by weights, we transpose it and write x^T w: a T after the vector and then w for the weights. But in many papers, vectors are thought of as rows. And if you see y = xW+b, then x is not a column, because otherwise the dimensions wouldn’t match up. x here is a row. But in the next paper they write Wx+b. And there x is a column 🙂
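The two conventions give the same numbers as long as the weight matrix is transposed along with the vector. A small sketch in plain Python with made-up values; matmul and transpose are helper names introduced just for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

# Made-up numbers: 3 inputs, 2 outputs.
W = [[1, 2], [3, 4], [5, 6]]      # shape 3 x 2
b = [10, 20]

x_row = [[1, 0, 2]]               # shape 1 x 3, row convention
y_row = [v + bias for v, bias in zip(matmul(x_row, W)[0], b)]        # y = xW + b

x_col = transpose(x_row)          # shape 3 x 1, column convention
W_t = transpose(W)                # shape 2 x 3
y_col = [row[0] + bias for row, bias in zip(matmul(W_t, x_col), b)]  # y = (W^T)x + b

print(y_row, y_col)
```

So the W in a paper that writes Wx+b is the transpose of the W in a paper that writes xW+b, which is easy to miss when comparing the two.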

Angle brackets ⟨a, b⟩. For the dot product, the symbol “⋅” is used, but it is hard to see, especially on a whiteboard, and I very often see that mathematicians use angle brackets for dot product. In general, angle brackets are used for the generalized concept of inner product, where the scalar product is a special case. ⟨a, b⟩ signifies a certain abstract way to multiply a and b and get a number. Meanwhile, in quantum mechanics this would be written as ⟨a|b⟩. And for the scalar product, some use a circle with a dot or x in a circle.

And just for the sake of it, in Russia tangent is tg, while in the USA it’s tan. There’s also tan^-1 and arctan, which are the same thing, even though x^-1 generally means 1/x.
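The two readings of the superscript give different numbers, which a couple of lines with Python’s math module make visible. The value of x is arbitrary.

```python
import math

x = 0.5
inverse_function = math.atan(x)   # tan^-1(x) read as arctan
reciprocal = 1 / math.tan(x)      # (tan x)^-1 read as 1/tan(x), i.e. cot(x)

print(inverse_function, reciprocal)
```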

Navigating Complexity: The Challenge of Wikipedia’s Expert-Driven Content | November 26 2025, 01:06

Wikipedia has one big problem. Well, or we have it with Wikipedia. If you go to almost any Wikipedia page about a relatively complex mathematical or physical concept, you often suddenly don’t want to read it any further. Formally everything is correct there, but the explanation is given through concepts, often even more complex than the concept being explained. Besides, there is often a lot of unnecessary information — what is formally/academically/taxonomically part of the topic, but essentially “pollutes” the first impression.

This problem arises because the authors of Wikipedia (often mathematicians) prioritize rigor and completeness rather than didactics and comprehensibility.

In the English-speaking environment, this is sometimes called “Drift into pedantry”. Articles are often written by experts for experts, not for those who are trying to learn the subject from scratch.

Let’s take, for example, a “tensor”. Imagine a student who has heard that tensors are used in machine learning (Google TensorFlow) or physics and wants to understand the essence.

What the reader expects (intuition): “A tensor is a table of numbers (or some sort of data container) that describes the properties of an object and correctly changes if we rotate the coordinate system”

What Wikipedia provides: “A tensor (from Latin tensus, ‘strained,’ as per the classical layout of mechanical stress at the sides of a deformable cube, see illustration) — is a layout (arrangement in space) of numbers (components), used in mathematics and physics as a special type of multi-index object, possessing mathematical properties.” The article immediately starts listing ranks, covariance and contravariance of indices. This is formally correct but it “pollutes” the first impression.

The illustration at the very top is captioned like this: “Mechanical stress, deforming a cube with faces perpendicular to the coordinate axes, in classic elasticity theory is described by the Cauchy stress tensor, which links 2 indices: the normal vector to the face with the stress vector T (force per unit area); there are 3 directions of normals and 3 directions of stress components, which gives a 2nd rank tensor, 3×3, consisting of 9 components.”

Formally — not a single error. In fact — it’s a wall of text that requires knowledge of linear algebra just to read the definition.

It’s as if you asked “What is an apple?”, and you were responded with: “An apple is a fruit of plants from the subfamily Amygdaloideae or Spiraeoideae, featuring an epicarp, mesocarp, and endocarp, often participating in Newton’s gravitational experiments.”

On one hand, it seems like with the emergence of LLMs, Wikipedia is no longer necessary. There are LLMs such as ChatGPT, which essentially paraphrase everything that is in Wikipedia in the required form. But they can do it because they were trained on Wikipedia, and undoubtedly Wikipedia was given much more weight during training than other internet junk. If there was no Wikipedia in the training set, it would be much more difficult. Meanwhile, Wikipedia is constantly edited, and LLMs and Google rely on precisely it when answering questions.

Therefore, on the one hand, it seems to me that it is high time for Wikipedia to switch to generating articles from expert-curated data and packaging knowledge in whatever format the reader needs, for example, as questions and answers. On the other hand, that would lose the whole idea of the encyclopedia as master data for LLMs/RAG.

The paradox is that an LLM is, in essence, the only “interface” that was able to read these pedantic Wikipedia definitions, “understand” them (through thousands of examples of code and articles) and translate them back into human language. Wikipedia has become an excellent database for robots, but a poor textbook for people.

The Inner Mechanics of Old Rotary Phones | November 25 2025, 00:59

When I was little, I took apart old telephones many times, and only now, in my grey years, did I realize that I had never wondered how they worked. And they worked in a very interesting way.

Let’s start with the dial. The phone is connected to the network by two wires. The dial is a rotary one. While you wind the dial, the pulse contacts are disabled, and when you release it, the dial runs back and sends a series of interruptions (pulses) to the line. But how was it made to return at a constant speed (10 pulses per second)?

It operated based on a centrifugal friction governor. The mechanics (gearbox) accelerated the governor’s axle to thousands of revolutions per minute. Two weights with friction pads (consider them brakes) were seated on the axle. The centrifugal force pressed them against the stationary drum, creating a braking effort. This is a direct heir to Watt’s centrifugal governor, allowing the mechanism to work stably regardless of how sharply you released the disk.
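With the return speed fixed at 10 pulses per second, the time a number takes to dial is simple arithmetic. A small sketch, assuming the common mapping where digit n sends n pulses and 0 sends ten; pauses between digits and the time spent winding the dial are ignored, and the phone number is made up.

```python
PULSES_PER_SECOND = 10  # return speed of the dial, from the post

def pulses_for_digit(digit):
    """Assumed mapping: digit n sends n pulses, and 0 sends ten."""
    return 10 if digit == 0 else digit

def pulse_time(number):
    """Seconds spent sending pulses, ignoring pauses between digits."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    return sum(pulses_for_digit(d) for d in digits) / PULSES_PER_SECOND

print(pulse_time("555-0199"))
```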

Next. The Central Office connected you with a friend. You both speak at the same time, and sound travels both ways over the same two wires (why two wires and not four is clear enough). Well, okay, but why don’t you hear yourself too loudly, given that the microphone feeds sound into the very line the “speaker” listens to?

I couldn’t answer quickly. Went googling. So, it turns out that a special differential transformer (a hybrid coil) was responsible for this. There, the current from the microphone branches: part goes into the line to the friend, and part goes into the “balance circuit” (a resistor-capacitor network inside the phone) that mimics the line impedance. The transformer coils are wound in opposition: the magnetic fluxes from the current in the line and the current in the balance circuit cancel each other in the coil that feeds the speaker. Engineers purposely left the balance imperfect, keeping some sidetone, a quiet sound of one’s own voice, so the phone wouldn’t seem “dead.” The incoming signal from the friend, however, has nothing to cancel it (there is silence on your side), so it passes freely to the speaker.

Now about the microphone. At that time there were no transistors in phones, but the signal was loud. The secret is in the design of the microphone: it’s a carbon microphone. Essentially, it is a box with carbon powder and a movable diaphragm. The sound from your mouth compresses and decompresses the powder, changing its resistance. The microphone does not generate current but modulates the powerful current coming from the Central Office. Essentially, it worked as an amplifier. Over time, the carbon powder compacted and the volume dropped, hence the habit of tapping the handset to “shake up” the powder.

The speaker was normal, electromagnetic. Although not quite. If there were only an electromagnet inside (without a permanent magnet), the phone would horribly distort the voice. An electromagnet attracts iron regardless of the polarity of the current. If you supply a sine wave (voice), the diaphragm would be attracted during both the positive and the negative half-waves. Result: the frequency of the sound would have doubled, and you would hear not the voice of a friend, but an unintelligible high-frequency buzzing. The permanent magnet solves this problem: It creates “preload.” The diaphragm is always attracted to the magnet with medium force. When the “plus” of the signal arrives, the magnetic field strengthens and the diaphragm flexes more. When the “minus” arrives, the field weakens and the diaphragm springs back.
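The frequency doubling can be checked numerically. The sketch below assumes the pull on the diaphragm is proportional to the square of the magnetic field (which is what “attracts regardless of polarity” amounts to), uses a made-up bias strength for the permanent magnet, and measures how much of the force sits at the original tone and at twice its frequency.

```python
import cmath
import math

N = 400          # samples in the analysis window
CYCLES = 5       # the "voice" tone completes 5 cycles in the window

def amplitude(signal, k):
    """Amplitude of the component with k cycles per window (one DFT bin)."""
    acc = sum(s * cmath.exp(-2j * math.pi * k * n / N)
              for n, s in enumerate(signal))
    return 2 * abs(acc) / N

# Field produced by the signal current alone: a pure sine wave.
field = [math.sin(2 * math.pi * CYCLES * n / N) for n in range(N)]

BIAS = 5.0       # made-up strength of the permanent magnet's field
force_plain = [f ** 2 for f in field]             # electromagnet only
force_biased = [(BIAS + f) ** 2 for f in field]   # with a permanent magnet

for name, force in (("plain", force_plain), ("biased", force_biased)):
    print(name,
          round(amplitude(force, CYCLES), 3),       # at the original frequency
          round(amplitude(force, 2 * CYCLES), 3))   # at double the frequency
```

Without the bias, nothing is left at the original frequency and everything lands at the doubled one. With the bias, the original tone dominates and the doubled component shrinks to a small distortion.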

In modern speakers, the force strictly depends on the direction of the current. Plus pushes, minus pulls. Therefore, the frequency doubling, which old phone engineers feared, physically cannot occur here. The diaphragm doesn’t need “preload” by a magnet, it just needs to hang in peace.

Interestingly, the principle of old electromagnetic capsules (metal diaphragm + armature) is used now in the most expensive in-ear headphones: google “balanced armature headphones” (prices around $500).

The voltage in the telephone network was negative – minus 48/60 volts. Plus was grounded, and the “live” wire was the minus. Why? It turns out, this is protection against electrochemical corrosion. The cables lie in moist earth. If there were a “plus” (anode) on the wire, upon insulation damage, copper would dissolve (electrolysis) and the cable would rot. With “minus” (cathode), metal ions, on the contrary, tend to settle on the conductor from the soil, which prolonged the cable’s life by decades.