MachineLearning – Hi, I'm Rauf Aliev.

The Crucial Role of Data Quality Oversight in Development Projects | May 06 2026, 16:07

Almost every development project features a dedicated functional testing automation team, yet surprisingly, a similar emphasis on Data Quality is rarely found. Regardless of whether data comes from external integrations, from users, or is generated by the system itself, it often remains without proper control simply because no one considers it important. Later the team struggles with the consequences, which accumulate like a snowball. The longer such issues persist, the harder they are to resolve, eventually leading to a situation where people just resign themselves to the “irreparable” state of the database. It is much better to identify these problems at the moment they arise, while the technical debt has not yet become insurmountable, than to figure out later how to keep them from crashing everything.

In essence, there needs to be a constant “supervisor” over all types of databases used by the system (relational, NoSQL, search indexes, or graph databases): a layer of data quality checks running on top of the system’s processes. Of course, there must be clear rules: specifically what to check and which flags to use to mark specific anomalies.
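A minimal sketch of what such rules and flags could look like, assuming records arrive as plain dictionaries. The field names (`id`, `email`, `price`), the rule functions, and the flag strings are all hypothetical illustrations, not taken from a specific project.

```python
# Each rule returns a flag string when a record looks wrong, or None when it passes.
def missing_email(record):
    return "MISSING_EMAIL" if not record.get("email") else None


def negative_price(record):
    price = record.get("price")
    if isinstance(price, (int, float)) and price < 0:
        return "NEGATIVE_PRICE"
    return None


RULES = [missing_email, negative_price]


def check(records, rules=RULES):
    """Return {record id: [flags]} for every record that breaks at least one rule."""
    report = {}
    for record in records:
        flags = []
        for rule in rules:
            flag = rule(record)
            if flag is not None:
                flags.append(flag)
        if flags:
            report[record["id"]] = flags
    return report
```

A report like this is what the responsible person would feed into the development and support workflow; new checks are added by appending functions to `RULES`.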

There must be a responsible party for the process (a human, not AI), who will integrate these reports into the development and support workflows. Many data integrity issues cannot just be resolved through an interface — they require the engineering team to develop scripts for mass correction and data cleansing.

Incidentally, this also leads into the realm of anomaly detection (outlier detection): machine learning and LLMs can identify subtle “bad” patterns that traditional rule-based systems might miss.

What do you think about this? Are similar mechanisms implemented in your processes?

Harnessing Chat Data for Semantic Q&A Search | April 30 2026, 04:05

In one evening, I built a simple utility that exports a year and a half of the Natural Language Processing chat (65,000 messages) and converts it into question-answer pairs with semantic search. Clicking on a search result (on the left) opens the dialogue in the chat. The messages that answer the question are highlighted, and at the top, the original phrasing of the question is highlighted as well.

How it works: the system assumes that people mainly reply to messages that are relatively recent. If several replies are made to one message, it is likely useful and caught the interest of others in the chat. The system takes messages starting from the one many have replied to and ending with the last one in the reply-to chain, and among such pieces it keeps those with at least 3 reply-tos to the original question. In essence, it cuts a piece from the chat starting with a popular question, placing the bottom cut where what follows is most likely irrelevant. Such blocks can overlap each other, for example if someone asked a question while others were replying to something else.

So, if user A asked what the weather was like, and they received answers like “good,” “bad,” “rain,” and there were five messages without a reply-to, and then someone replied to “rain” with the question “why rain”, and five more people replied to this question, then the first question about the weather makes it into the system – the piece ends with 13 messages.
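The cutting step can be sketched roughly like this. It is a simplified reconstruction from the description, not the utility’s actual code: the message format (dicts with `id` and `reply_to`, listed in chat order) and the function name `extract_blocks` are assumptions.

```python
from collections import defaultdict


def extract_blocks(messages, min_replies=3):
    """Cut chat pieces that start at a popular message and end at the last
    message of its reply-to chain. `messages` must be in chat order."""
    position = {m["id"]: i for i, m in enumerate(messages)}
    children = defaultdict(list)
    for m in messages:
        if m["reply_to"] in position:
            children[m["reply_to"]].append(m["id"])

    blocks = []
    for m in messages:
        if len(children[m["id"]]) < min_replies:
            continue  # not enough direct replies to count as a popular question
        # Walk the whole reply tree to find the latest message in the chain.
        last = position[m["id"]]
        stack = [m["id"]]
        while stack:
            current = stack.pop()
            last = max(last, position[current])
            stack.extend(children[current])
        # Unrelated messages in between are kept; everything after `last` is cut.
        blocks.append(messages[position[m["id"]]:last + 1])
    return blocks
```

Because every popular message produces its own block, a follow-up question that also gathers enough replies yields a second, overlapping block.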

Afterwards, these pieces are summarized into question-answer pairs.

It turns out quite cool.

P.S. In the screenshot, the search query has nothing to do with the search result because I foolishly took the screenshot after I changed the query but before I hit send.

Navigating the Depths of High-Dimensional Spaces | April 13 2026, 23:17

I am now working a lot with high-dimensional vectors, and some things that I hadn’t fully realized before are really starting to tickle my brain. Our 3D intuition doesn’t just not work there—it lies.

It turns out that any two random vectors in high-dimensional space are almost certainly nearly perpendicular to each other. Almost all the space is one continuous “equator”.

Much of machine learning is built on exactly this. If your embeddings suddenly show high cosine similarity (for example, 0.8), this is not a statistical error but a powerful signal. It’s almost impossible for random vectors to line up like this in a 1000-dimensional world.
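This is easy to try numerically. A small sketch, assuming “random vector” means independent Gaussian coordinates (one common choice; the function names are mine):

```python
import math
import random


def random_cosine(dim, rng):
    """Cosine similarity of two independent random Gaussian vectors."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_stats(dim, pairs=200, seed=0):
    """Mean and maximum of |cosine| over many random pairs."""
    rng = random.Random(seed)
    values = [abs(random_cosine(dim, rng)) for _ in range(pairs)]
    return sum(values) / len(values), max(values)


for dim in (3, 1000):
    mean_abs, max_abs = cosine_stats(dim)
    print(f"{dim:>5}D: mean |cos| = {mean_abs:.3f}, max |cos| = {max_abs:.3f}")
```

In 3D the absolute cosine is spread all over the range from 0 to 1, while in 1000D it stays within a few hundredths of zero, so a value like 0.8 cannot plausibly be an accident.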

In such spaces, almost all the mass of data is concentrated in an extremely thin surface layer. The “insides” of objects are mathematically empty.

This is easy to verify with a thought experiment. Take the “skin” of a multidimensional sphere with a thickness of just 1% of the radius. The volume of the sphere is proportional to the radius raised to the power of its dimensionality.

• In three-dimensional space, the pulp (0.99 of the radius) occupies 97% of the volume: just raise 0.99 to the third power.

• In 1000D, the pulp occupies just 0.0043% (0.99 to the power of 1000 is about 0.000043).
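The arithmetic behind both bullets fits in a few lines. The function name and the extra 100D row are mine; the formula is just the ratio of volumes, (0.99r)^d / r^d = 0.99^d.

```python
def inner_fraction(dim, shell=0.01):
    """Fraction of a dim-dimensional ball's volume lying below the outer shell,
    where `shell` is the shell thickness as a fraction of the radius."""
    return (1.0 - shell) ** dim


for dim in (3, 100, 1000):
    # The 1000D value comes out at roughly 4.3e-5 as a fraction of the volume.
    print(f"{dim:>5}D: inner part = {inner_fraction(dim):.6%} of the volume")
```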

You can look at it differently. For a point to be close to the origin, its coordinates must be close to the origin along every axis. If even one axis has a large value, that’s it, the point is gone. If you pick points at random, the probability that all coordinates are below a given value at once decreases as dimensionality grows, and it decreases quickly.

All the “meat” of the data always ends up in the skin. Any sample in High-D is essentially a set of boundary values.

For white noise in high dimensions, the distances to the closest and to the farthest neighbor become almost the same. The concept of “closeness” simply degrades.
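A quick way to see this effect, assuming “white noise” means independent Gaussian coordinates and using 200 random points (both are my choices for the sketch):

```python
import math
import random


def far_to_near_ratio(dim, points=200, seed=0):
    """Ratio of the farthest to the nearest distance from one random query
    point to a cloud of random points."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    distances = []
    for _ in range(points):
        point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        distances.append(math.dist(query, point))
    return max(distances) / min(distances)


for dim in (2, 10, 1000):
    print(f"{dim:>5}D: farthest / nearest = {far_to_near_ratio(dim):.2f}")
```

In 2D the farthest neighbor is many times farther away than the nearest one; in 1000D the ratio drops close to 1, which is exactly the degradation of “closeness”.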

Smartfolio.me: Revolutionizing Knowledge Organization with Advanced Features | March 19 2026, 04:01

My creation – the knowledge organization tool Smartfolio.me – has gained new features. I’m attaching a five-minute video overview.

It’s like Google Docs, but you can embed documents within each other, creating a network of connected knowledge, and these documents can be PDFs and regular texts.

Upload a PDF, the program converts it into images, and you can highlight any sections right on the pages to leave a comment or ask a question.

If something in the text is unclear, you highlight the area and press “elaborate” — the LLM will detail everything thoroughly, taking into account the context of the entire document, and the explanation will stay linked to the highlighted fragment.

You can simply cut out a piece from a PDF, and the LLM extracts clean text or a ready-made formula from it.

In the PDF window, there is now a small panel — all comments and explanations are immediately visible there, so you can quickly jump to the necessary parts.

You can cut out a diagram or graph from a PDF, copy it as a picture, and paste it into your text. It is cropped automatically “on the fly” and saved in the database, not as a copy but as a link to the page with crop parameters.

If you delete the page link in the text, it won’t disappear completely but will go into a special list, from where you can reattach it somewhere else or delete it finally. The same document can be inserted in several places. If you add a comment to it, it updates everywhere where this document is linked.

Mathematics is fully supported — LaTeX formulas can be not only viewed but also clicked to adjust them in the editor.

You can generate formulas by description. Just write in words what formula you need (for example, “binomial distribution”), and the system itself outputs the ready formula code.

Now there is a system of plugins – essentially isolated experimental functions separate from the main program. For instance, there is a plugin that recursively collects all subpages into one long document — convenient if you need to read or print everything at once.

Or consider the “YouTube Transcript Cleaning” plugin. If you have a messy lecture transcript from YouTube, the plugin will add punctuation, split it into paragraphs, and create neat headers.

If you insert a link to a website, it opens in a column next to it — you can read the source and simultaneously take your notes. However, some websites do not allow embedding on foreign pages. The system recognizes such sites, and they open in a new tab.

The left panel with the list of pages can be hidden or resized with the mouse, so it doesn’t take up space on the screen.

You can simply copy and paste an image or screenshot, and it will not just insert, but also upload to the database.

It supports working from a mobile phone. On the phone, the interface switches to a single-column mode for convenient reading and commenting on the go.

Multiple databases and multiple LLMs are supported: you can connect different ones and switch between them.

Crafting Nabokov’s Dictionary: A Multilingual Lexical Journey | March 15 2026, 18:30

I’m reading Nabokov and decided to take a break to create a convenient app “Nabokov’s Dictionary” and am considering selling it on Amazon as a book. Essentially, it looks like this (see screenshot) – definitions of complex words in English, Russian, German, and French, in the same order they appear in the original book.

Would you buy such a book?

To make the definitions accurate, I also wrote an aligner: a program that matches sentences and paragraphs in English with their translations into Russian (Nabokov’s own). When a word’s definition is created, it uses not only the knowledge of the LLM but also the author’s Russian translation.

How the algorithm works deserves a separate discussion (I invented it myself because nothing I found online worked the way I needed). It first finds the longest sentences and matches each with its pair through cosine similarity of embedding vectors produced by the multilingual e5 model. These sentences become anchors. Then, assuming that errors are almost excluded for long sentences, it finds the longest sentence between the anchors, and everything repeats recursively. There are many situations where a sentence in Russian has no equivalent in English and vice versa, where a sentence is split into two, or where two are merged into one. The algorithm handles this as best it can. The resulting alignment quality is quite good, to the extent that errors can hardly be found (though they are likely still there). Either way, the alignment is only needed as context for translating words, so rare errors are not a big deal.
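The recursive anchoring idea can be sketched as follows. This is my own simplified reconstruction from the description, not the actual aligner: the `similarity` argument stands in for cosine similarity of multilingual e5 embeddings, the `min_score` threshold is an assumption, and split or merged sentences are not handled (unmatched sentences are simply skipped).

```python
def align(src, tgt, similarity, min_score=0.5):
    """Return a list of (src_index, tgt_index) pairs, in document order."""
    pairs = []

    def recurse(s_lo, s_hi, t_lo, t_hi):
        if s_lo >= s_hi or t_lo >= t_hi:
            return
        # Try source sentences from longest to shortest until one finds a match.
        for i in sorted(range(s_lo, s_hi), key=lambda k: -len(src[k])):
            j = max(range(t_lo, t_hi), key=lambda k: similarity(src[i], tgt[k]))
            if similarity(src[i], tgt[j]) >= min_score:
                # (i, j) becomes an anchor; align what lies before and after it.
                recurse(s_lo, i, t_lo, j)
                pairs.append((i, j))
                recurse(i + 1, s_hi, j + 1, t_hi)
                return
        # No sentence in this window has a counterpart: leave it unaligned.

    recurse(0, len(src), 0, len(tgt))
    return pairs


def word_overlap(a, b):
    """Toy stand-in for embedding similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)
```

With a real embedding model, `word_overlap` would be replaced by a function that embeds both sentences and returns their cosine similarity. Each anchor splits the text into two smaller windows, so later and shorter sentences are only compared against a narrow range of candidates.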

Revolutionizing Research: Introducing a Web-Based Notebook Integrated with AI and PDF Support | February 19 2026, 16:19

I’ve further developed a new tool for myself for working with information and organizing it. The main idea is a web-based notebook for research, studying subjects, working on them, integrated with AI and PDF support.

The main problem with typical PDF readers and notes is that the context is lost as soon as you switch to a new tab. In my tool, each text fragment or PDF becomes a node in a “live” hypertext tree, which I can access from multiple computers at any time.

Work process:

– Contextual AI. I can ask the AI to clarify complex passages right within the document. The explanation stays right where the question was asked. Moreover, it is a separate document, linked to the specific spot in the source. When clicked, you see both the original and the explanation on the screen at the same time.

– Panels instead of windows. If the explanation itself requires clarification, a new panel opens to the right. This allows for an endless chain of queries, never losing the place in the original text. That is, you see several panels at once, and unnecessary ones can be closed.

– PDF support. I can upload a PDF, select an area on the page (e.g., a complex diagram or a list of authors), and the LLM instantly extracts data, supplements, or explains them. The explanation is attached to the spot where it was requested, just like with non-PDFs.

– Nested annotations. My comments are not just static text. They can contain their own PDFs, links, and further sub-tasks for AI, maintaining a depth of nesting that reflects how we actually think.

This is not just a file storage system, but an “engine” for building knowledge.

The tool suits me personally very well, but perhaps it only solves my specific tasks. What do you think, would something like this be useful to others? Would it be useful to you? Should I develop the project into a fully-fledged product and give it to other users for testing?

Exploring LLMs and AI: Connecting Neural Processors to Natural Language Learning | February 15 2026, 15:41

Some thoughts on LLMs and artificial intelligence in general. And, at the end, a bit about neuromorphic processors and Intel Loihi.

As you all know, fundamentally LLMs operate on the principle of “propose the likely next word using the context from the previous N words”: the word then enters the context, and the process repeats all over again for the next word. The context is also processed with the importance of each word taken into account.
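The loop itself is simple enough to show on a toy. In the sketch below a plain table of counts plays the role of the model; a real LLM replaces that table with a trained neural network, so this illustrates only the “predict, append, repeat” cycle. The function names are mine.

```python
from collections import Counter, defaultdict


def train(tokens, n=2):
    """Count which word follows each context of n consecutive words."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        table[context][tokens[i + n]] += 1
    return table


def generate(table, prompt, n=2, max_new=10):
    """Repeatedly propose the most frequent next word and append it to the context."""
    out = list(prompt)
    for _ in range(max_new):
        context = tuple(out[-n:])
        if context not in table:
            break  # this context was never seen, nothing to propose
        next_word = table[context].most_common(1)[0][0]
        out.append(next_word)
    return out


table = train("the cat sat on the mat".split())
print(" ".join(generate(table, ["the", "cat"])))
```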

Now let’s think about how children were taught languages in primitive societies. There were no alphabets and no formal grammar. Yet the grammar itself, according to estimates based on observations of the small languages of small peoples, was quite complex. Simple grammar is a modern phenomenon, appearing once a language has spread to millions and billions of speakers.

That is, a child’s brain had to reconstruct grammar in its neurons simply from the flow of speech from those around and through testing the understanding of what was said. It’s likely that the child was corrected if they spoke incorrectly, but somehow this grammar and sound extraction had to settle in the brain—and here the same mechanism as in LLMs is used: which words/sounds go next in what context is determined by latent and uninterpretable rules, which each person in childhood creates in their brain in their own way. That is, roughly speaking, it trains the ML model every time from scratch on the flow of speech from those around. A child does not know what a “case” is, but feels what ending is statistically more likely in a given context.

Actually, modern cognitive science (Karl Friston’s theory) asserts that the brain is literally a “prediction machine.” We constantly generate hypotheses about the next sound or word and correct them when they don’t match (prediction error).

The peculiarity of LLMs is that their teachers are texts and images, while for a child’s brain it is the living world around them. If all the speech a child hears were digitized, its volume wouldn’t even be enough to train a very weak model. An LLM sees the word “apple” next to the word “red.” A child sees an apple, feels its smell, taste, and weight, and simultaneously hears the sound. This “stitching” of different sensory channels allows building neural connections thousands of times faster than on plain text. That is, modern LLMs take a brute-force approach, observing the speech of billions rather than just an immediate environment. A good question is how the human brain manages to learn from a relatively small dataset. However, it’s a big question whether this dataset is really small: lip movements, facial expressions, and context, for example, contribute a lot to building this neural network in the biological brain.

About the context: unlike LLMs, a child understands the speaker’s intention. If mom looks at a cup and says “hot,” the child’s brain limits the search space of meanings to one cup. And if he didn’t understand, he’ll get burned and remember.

One might assume, of course, that the brain already has a ready network at birth. It’s true, but science can’t yet explain it properly. Our entire genetic program has about 20,000 genes encoding proteins, and these 20,000 are responsible for everything—where and how the lungs, heart, bones, blood should be built, and they themselves are of mind-boggling complexity, and somewhere among 3 billion nucleotides and 20,000 genes this information must be recorded.

Apparently, genes encode not a map but an algorithm of self-assembly. Essentially, the architecture of the neural network is built dynamically, and this process begins long before birth. Then it is calibrated by all the signals received by the unborn child, and by the time of birth, there is already a somewhat tuned network in the brain.

It’s likely that the child’s brain is millions of neural networks of different “architectures” that evolve and merge in the learning process. Unlike in LLMs, where learning and usage are strictly separated in time, here they happen at the same time. But most importantly, the brain, although the most energy-consuming organ in the body, consumes very little energy in absolute terms, especially compared to the current hardware “candidates to replace it.”

In the last few years, there has been active development in the field of neuromorphic systems (for example, the old IBM TrueNorth processor and the actively developing Intel Loihi). In conventional AI, neurons transmit numbers (0.15, 0.88…). In neuromorphic systems, they transmit “spikes” (impulses), as in the living brain; the architecture is called a Spiking Neural Network (SNN). A few years ago, Intel released Loihi 2. It is fully programmable. Neurons on Loihi can change their connections (synapses) right during operation. It supports plasticity, the very biological mechanism by which the connection between neurons is strengthened if they often “fire” together. But the main thing is that it consumes very little power.

In this architecture, the model can continue learning “on the fly” right during operation, without forgetting old data (Continual Learning). Besides that—extreme energy efficiency.
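To make “spikes” and “plasticity” a bit more concrete, here is a textbook-style toy: a leaky integrate-and-fire neuron and a bare-bones Hebbian weight update. It is not the Loihi programming model or the Lava API, and all names and constants are illustrative assumptions.

```python
def simulate_lif(inputs, leak=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron: return the time steps at which it spikes."""
    potential = 0.0
    spikes = []
    for step, current in enumerate(inputs):
        potential = potential * leak + current  # old charge leaks, new input adds up
        if potential >= threshold:
            spikes.append(step)  # the neuron emits an impulse, not a number
            potential = 0.0      # and resets after firing
    return spikes


def hebbian_update(weight, pre_spiked, post_spiked, rate=0.1, max_weight=1.0):
    """Strengthen a synapse only when both neurons fire in the same time step."""
    if pre_spiked and post_spiked:
        weight = min(max_weight, weight + rate)
    return weight


print(simulate_lif([0.3] * 12))
```

Between spikes the neuron produces no output at all, and the weight update needs only local information about two neurons; both properties are part of why such hardware can be so frugal.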

Loihi 2 cannot multiply matrices as modern GPUs do, so completely new software has to be written for it (and this is moving very slowly). There is no PyTorch or TensorFlow: for Loihi, only the Lava framework is available today. And the 1 million neurons of a single Loihi 2 chip are very few for LLMs. Therefore, Intel creates systems like Hala Point, an array of 1152 Loihi 2 processors containing up to 1.15 billion neurons. Theoretically, in terms of performance per watt, such a system can surpass traditional GPUs by 10–50 times when working with AI models.

Experimental LLMs are already being launched on Loihi 2 (for example, models with 370 million parameters). They are not yet going to replace ChatGPT in the cloud, but theoretically, they are the future for “smart” robots and gadgets that need to understand human speech while running off a small battery.

We’ll observe. It might turn out to be a dud, or it could be another major revolution.

My Ambitious 2026 Plan: From Galapagos Travel to Academic Achievements and Creative Pursuits | January 20 2026, 04:44

My plan for 2026:

– Travel to the Galápagos Islands, Ecuador for a week (summer)

– Finish and release a book on Information Retrieval (also summer, progressing slowly, first couple of chapters are already written. Already spent about 50-100 hours on this, the easy part)

– Release at least one scientific paper, probably on Data Mining (spring). Ideally, submit it somewhere to a journal (challenging). Already spent about 30 hours on this topic, a lot left to do.

– Make a step towards a PhD. Find professors, visit universities, understand the cost and assess my capabilities and resources.

– Continue studying fundamental mathematics and not die (linear algebra, calculus, probability theory, statistics, classical ML). In 2025, I spent about 200-400 hours on this topic.

– Continue studying Deep Learning and reach the “can teach” level. In 2025, I spent about 100-200 hours on this topic.

– Continue studying Data Mining/NLP.

– Update my book on RecSys, releasing version 2.0 with updates and corrections (autumn 2026)

– Make noticeable progress in painting and playing the piano. Specifically, learn Schubert’s serenade (Ständchen, D 889) completely and create at least one canvas that I wouldn’t be ashamed to give as a gift.

Navigating the Future: Embracing Earth’s Magnetic Field as a GPS Alternative | January 10 2026, 17:41

I learned today that a technology for navigating by the Earth’s magnetic field exists and is actively used. It serves as a replacement for or an extension of GPS.

For example, there is the Scandinavian ferry Express 5 of Bornholmslinjen, which insures against GPS problems (which do happen) by using MagNav navigation. Unlike GPS, the Earth’s magnetic field cannot be jammed or spoofed—it simply exists. The ferry follows the same route, and generally, navigation could even be achieved through household fishing sonars.

But there are a few startups that use this technology for indoor navigation, where GPS signals cannot reach. It’s claimed that the navigation accuracy is within 1 meter. That’s more interesting.

Examples: GiPStech, Oriient, Mapsted.

The basis of this technology is a process called magnetic fingerprinting. Engineers or mapping robots go through a building with a smartphone, recording unique distortions of the magnetic field at every point. These distortions are created by the steel frame of the building, rebar in the walls, and large electrical equipment. A database is formed where each coordinate (x, y, z) corresponds to its unique magnetic field vector (intensity, inclination, declination).

The collected data is uploaded to the provider’s cloud platform. There, it is cleaned of noise and “stitched” together with the digital floor plan. When a user walks through a shopping center, their smartphone reads data from the built-in magnetometer in real time. Special software (an SDK) compares the current readings with those stored in the database. For accuracy to be within 1–2 meters, the system relies not only on the magnetic field. It uses sensor fusion: combining magnetic field data with inertial sensors (the accelerometer counts steps, the gyroscope determines turns) and sometimes Wi-Fi/Bluetooth signals for rough localization.
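The matching step can be pictured as a nearest-neighbor lookup in the fingerprint database. This is a deliberately naive sketch, not how any of the vendors’ SDKs work: the data layout, the `locate` function, and the `max_step` motion constraint (a crude stand-in for real sensor fusion) are all assumptions.

```python
import math


def locate(fingerprints, reading, previous=None, max_step=2.0):
    """Return the mapped position whose stored field vector is closest to the reading.

    fingerprints: dict mapping (x, y, z) position -> (bx, by, bz) field vector.
    previous, max_step: if the last known position is given, only positions
    within max_step meters of it are considered.
    """
    candidates = list(fingerprints)
    if previous is not None:
        nearby = [p for p in candidates if math.dist(p, previous) <= max_step]
        candidates = nearby or candidates  # fall back to the whole map if empty
    return min(candidates, key=lambda p: math.dist(fingerprints[p], reading))
```

The motion constraint shows why fusion matters: two distant spots in a building can have nearly identical magnetic signatures, and knowing that the user could only have taken a couple of steps rules out the wrong one.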

This technology is certainly being actively implemented for drones. The main technical difficulties there are dealing with the drone’s own interference and accounting for the fact that the magnetic field changes, which requires constant map updates. Electrical systems and motors create strong magnetic fields, which “drown out” the natural background of the Earth. However, various filtering algorithms (including neural networks) are used to “subtract” motor interference from the overall sensor readings in real time. From what I understand, at high altitudes (kilometers), the magnetic field is “smoother”, so the accuracy is lower (about 1–5 km). But if several drones fly together and exchange signals, each of them can achieve very good accuracy. Additionally, a group of drones can measure the gradient (rate of change) of the magnetic field in space, tying location not to absolute values but to relative ones. Essentially, using a group of drones turns the navigation system from a set of individual receivers into a distributed phased array antenna, capable of filtering global interference and working with much weaker useful signals. Considering that small drones capable of staying airborne for long periods can be released into the air by the hundreds (and cost pennies), this is quite a promising area for the military.

There’s an interesting startup, Zerokey. They make QUANTUM RTLS 2.0, a device that provides spatial accuracy down to 1.5 mm. It’s used in production, for example. Their video shows a “watch” on a worker’s hand that monitors whether something on a table is being assembled correctly. Here the principle is ultrasonic, and presumably these “watches” are paired with stationary sensors, with the position then computed by multilateration.

Exploring ASML’s Advanced Chip-Making Equipment with Veritasium | January 02 2026, 00:47

Veritasium released a very cool report yesterday from ASML about the equipment used to print chips for your little phones, cameras, and laptops.

For those who aren’t familiar with the process: first, a monocrystal is grown from ultra-pure silicon and cut into thin wafers. Then multiple layers of thin dielectrics, conductors, and semiconductors are repeatedly applied to the wafer surface, each time shaping the necessary areas using photolithography, etching, and ion doping, eventually creating billions of transistors and the metallic paths connecting them. Finally, the wafer is tested, cut into individual dies, and packaged into casings, making them into finished microchips.

This process had a limitation – the width of the paths and the distance to the next one are limited by the wavelength of the light used, and reducing it is difficult because there’s nothing to focus such a beam with – lenses simply absorb/reflect everything. In EUV lithography (extreme ultraviolet), the wavelength is 13.5 nm. This is virtually soft X-ray radiation.

The video explains details about the ASML machine costing 400 million dollars. Instead of refracting lenses, highly complex systems of reflecting mirrors are used. These mirrors are the smoothest surfaces ever created by humanity. If the mirror of this machine were enlarged to the size of the Earth, the largest bump on it would not be thicker than a playing card. To enable the mirrors to reflect X-rays, up to 76 alternating layers of tungsten and carbon, each less than a nanometer thick, are applied. All this is done by Zeiss. In addition, this mirror has a controlled curvature—it is constantly adjusted by robots with precision up to picoradians. The precision of the mirror control is so high that if a laser were mounted on it, directed at the Moon, the system could choose on which exact side of a 10-cent coin lying on the moon’s surface to hit with the beam.
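The coin comparison can be sanity-checked with one division. The coin size is my assumption (a US dime, about 17.9 mm across; the text only says “10-cent coin”), and the small-angle approximation is used.

```python
COIN_DIAMETER_M = 0.0179         # assumption: a US dime, about 17.9 mm across
EARTH_MOON_DISTANCE_M = 3.844e8  # mean Earth-Moon distance, about 384,400 km

# Small-angle approximation: angle in radians = size / distance.
coin_angle_rad = COIN_DIAMETER_M / EARTH_MOON_DISTANCE_M
coin_angle_prad = coin_angle_rad * 1e12

print(f"The coin spans about {coin_angle_prad:.0f} picoradians as seen from Earth")
```

Seen from Earth, the whole coin covers only a few tens of picoradians, so choosing a side of it does require steering at the picoradian level mentioned above.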

But. We don’t have a “light bulb” that emits light in the EUV range.

To generate this light, a laser “shoots” at a droplet of molten tin the size of a white blood cell, traveling at 250 km/h. The first pulse flattens the droplet into a disc, the second and third turn this “disc” into plasma – and all this occurs within just 20 microseconds. When hit by the laser, the droplet heats up to 220,000 Kelvin — approximately 40 times hotter than the surface of the Sun. This plasma emits that very necessary light. And it does so 50,000 times a second. They say it’s been brought up to 100,000. Imagine, at a hundred thousand laser shots per second, it never misses a single one. All this happens in a deep vacuum. To clean the mirrors from tin particles, the chamber is constantly blown with hydrogen at a speed of 360 km/h — faster than a Category 5 hurricane. This process is described by the same formula (Taylor-von Neumann) that describes a nuclear explosion or supernova explosion.
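A few of these numbers can be cross-checked against each other. The Sun’s surface temperature (about 5,772 K) is the only value not taken from the text, and the droplet spacing assumes one droplet per cycle moving at constant speed.

```python
DROPLETS_PER_SECOND = 50_000    # from the text
DROPLET_SPEED_KMH = 250         # from the text
PLASMA_TEMPERATURE_K = 220_000  # from the text
SUN_SURFACE_K = 5_772           # assumption: effective temperature of the Sun's surface

interval_us = 1e6 / DROPLETS_PER_SECOND            # time budget per droplet
droplet_speed_ms = DROPLET_SPEED_KMH / 3.6         # km/h -> m/s
spacing_mm = droplet_speed_ms / DROPLETS_PER_SECOND * 1000
temperature_ratio = PLASMA_TEMPERATURE_K / SUN_SURFACE_K

print(f"Time per droplet: {interval_us:.0f} microseconds")
print(f"Distance between consecutive droplets: {spacing_mm:.2f} mm")
print(f"Plasma vs. Sun surface: {temperature_ratio:.0f}x hotter")
```

The time budget per droplet at 50,000 per second matches the 20 microseconds quoted for the pulse sequence, and the temperature ratio agrees with “approximately 40 times”.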

The machine layers the chip with an error margin of no more than five atoms, while the matrix swings back and forth with accelerations of 20 g.

A single High-NA machine is transported in 250 containers on 25 trucks and seven Boeing 747 aircraft.

Link to the video – in the comments. Or search on YouTube on the channel veritasium.