I wonder whether there is an agent that can take an Excel workbook significantly larger than its context window and start documenting its essence. Here are several tabs. Here, on tab 5, is a table with a million rows and five columns. The columns are as follows. We sample random rows from the table: this column looks like numbers, and that one like surnames. We assume the first column contains numbers everywhere, so we write code that checks this assumption and, at the same time, computes the min/max and the set of unique values. It turns out there are few values, only five. We record that. Now we check the surnames. Yes, these are just strings, and a fresh sample shows that they are indeed surnames. Here’s a formula. We see what it points to. And so on. And this column has an unclear purpose. We look at the data: these are numbers from 0 to 1. We measure the mean and the spread. We ask the user; maybe they will provide some comments. They did. It turned out to be a KPI assigned to this user by an external system. We record that. And so on. Documentation emerges. Later, once the documentation exists, one can ask the agent to perform operations on all of this, since the LLM now more or less understands the purpose of the data and how the pieces relate, and can form and verify hypotheses about outliers.
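The "sample, guess, then verify on the full column" step could look roughly like the sketch below. It is only an illustration under assumptions: the column is already loaded as a Python list, and the names `profile_column` and `_to_float` are hypothetical, not part of any existing agent.

```python
import random
import statistics


def _to_float(value):
    """Return value as a float, or None if it does not parse as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def profile_column(values, sample_size=20, max_unique=50, seed=0):
    """Guess the column type from a random sample, then verify on all rows."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))

    # Hypothesis from the sample: numeric if every sampled value parses.
    if not all(_to_float(v) is not None for v in sample):
        return {"type": "text", "rows": len(values), "sample": sample[:5]}

    # Verify the hypothesis on the whole column and collect statistics.
    parsed = [_to_float(v) for v in values]
    numbers = [p for p in parsed if p is not None]
    unique = set(numbers)
    return {
        "type": "numeric" if len(numbers) == len(values) else "mostly numeric",
        "rows": len(values),
        "non_numeric": len(values) - len(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.fmean(numbers),
        "stdev": statistics.pstdev(numbers),
        "unique_count": len(unique),
        # Keep the full set only when it is small enough to document.
        "unique": sorted(unique) if len(unique) <= max_unique else None,
    }
```

Only the small summary dictionary would be passed to the LLM and written into the documentation; the million rows never enter the context window.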
Category: Programming
The Crucial Role of Data Quality Oversight in Development Projects | May 06 2026, 16:07
Almost every development project has a dedicated automation team for functional testing, yet a comparable emphasis on data quality is surprisingly rare. Whether data comes from external integrations, from users, or is generated by the system itself, it often goes uncontrolled simply because no one considers it important. Later the team struggles with the consequences, which accumulate like a snowball. The longer such issues persist, the harder they are to resolve, until people simply resign themselves to the “irreparable” state of the database. It is much better to catch these problems the moment they arise, while the technical debt is still manageable, than to figure out later how to keep them from crashing everything.
In essence, there needs to be a constant “supervisor” over every type of database the system uses (relational, NoSQL, search indexes, graph databases): a layer of data quality checks on top of the processes. Of course, there must be clear rules: what exactly to check and which flags to use to mark specific anomalies.
There must be a person responsible for the process (a human, not an AI) who integrates these reports into the development and support workflows. Many data integrity issues cannot be resolved through an interface alone; they require the engineering team to write scripts for bulk correction and data cleansing.
Incidentally, this also shades into anomaly detection (outlier detection): machine learning and LLMs can identify subtle “bad” patterns that traditional rule-based systems might miss.
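To make "clear rules and flags" concrete, here is a minimal sketch of a rule-based check layer. Everything in it is an assumption for illustration: the record fields (`id`, `email`, `amount`, `kpi`), the flag names, and the `audit` function are hypothetical and not taken from a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    flag: str                            # marker attached to a bad record
    description: str                     # human-readable meaning of the flag
    is_violated: Callable[[dict], bool]  # True when the record breaks the rule


# Hypothetical rules; a real project would define its own.
RULES = [
    Rule("MISSING_EMAIL", "email is empty", lambda r: not r.get("email")),
    Rule(
        "NEGATIVE_AMOUNT",
        "amount is below zero",
        lambda r: r.get("amount") is not None and r["amount"] < 0,
    ),
    Rule(
        "KPI_OUT_OF_RANGE",
        "kpi is outside [0, 1]",
        lambda r: r.get("kpi") is not None and not 0 <= r["kpi"] <= 1,
    ),
]


def audit(records, rules=RULES):
    """Return {record id: [flags]} for every record that breaks a rule."""
    report = {}
    for record in records:
        flags = [rule.flag for rule in rules if rule.is_violated(record)]
        if flags:
            report[record["id"]] = flags
    return report
```

A scheduled job could run `audit` over fresh rows from each data store and hand the report to the person responsible for the process.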
What do you think about this? Are similar mechanisms implemented in your processes?

Harnessing Chat Data for Semantic Q&A Search | April 30 2026, 04:05
In one evening, I built a simple utility that exports a year and a half of the Natural Language Processing chat (65,000 messages) and converts it into question-answer pairs with semantic search. Clicking on a search result (on the left) opens the dialogue in the chat. The messages that answer the question are highlighted, and so is the original phrasing of the question at the top.
How it works: the system assumes that people mostly reply to messages from the relatively recent past. If a message gets several replies, it is probably useful and caught the interest of others in the chat. The system takes the span of messages that starts with a message many people replied to and ends with the last message in its reply-to chain, and keeps only the spans in which the original question has at least 3 reply-tos. In essence, it cuts a piece out of the chat that starts with a popular question and ends where whatever follows is most likely irrelevant. Such blocks can overlap, for example if someone asked a question while others were still replying to something else.
So, if user A asked what the weather was like, and they received answers like “good,” “bad,” “rain,” and there were five messages without a reply-to, and then someone replied to “rain” with the question “why rain”, and five more people replied to this question, then the first question about the weather makes it into the system – the piece ends with 13 messages.
Afterwards, these pieces are summarized into question-answer pairs.
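The block-cutting step described above can be sketched as follows. This is one reading of the description, not the utility's actual code: messages are assumed to be a chronological list of dictionaries with hypothetical `id` and `reply_to` keys, replies are assumed to point only to earlier messages, and `extract_blocks` is an illustrative name.

```python
from collections import defaultdict


def extract_blocks(messages, min_replies=3):
    """Cut chat spans that start at a popular message and end at the last
    message of its reply-to chain. `messages` is in chronological order."""
    position = {m["id"]: i for i, m in enumerate(messages)}

    # Map each message id to the ids of its direct replies.
    children = defaultdict(list)
    for m in messages:
        if m["reply_to"] is not None and m["reply_to"] in position:
            children[m["reply_to"]].append(m["id"])

    blocks = []
    for m in messages:
        if len(children[m["id"]]) < min_replies:
            continue  # not popular enough to start a block
        # Walk the whole reply tree to find the latest message in the chain.
        last = position[m["id"]]
        stack = [m["id"]]
        while stack:
            current = stack.pop()
            last = max(last, position[current])
            stack.extend(children[current])
        # The block keeps everything in between, threaded or not.
        blocks.append(messages[position[m["id"]]:last + 1])
    return blocks
```

Each returned block would then be passed to the summarization step that produces the question-answer pair.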
It turns out quite cool.
P.S. In the screenshot, the search query has nothing to do with the search result because I foolishly took the screenshot after I changed the query but before I hit send.

Misadventures in Keyboard Layouts: Searching for Gremlin, Finding Surprises | April 28 2026, 20:33
This is me typing the word gremlin without switching the keyboard layout. I wanted to read about the query language for graph databases; I need it for work. Google is full of surprises, it really is.

CPU vs GPU: A Speed Challenge in Embedding Creation | April 11 2026, 18:08
For certain tasks, the difference between a CPU and a GPU is simply astounding. For example, I need to create many (millions of) embeddings with the BGE M3 model. On my quite powerful 24-core Intel Core Ultra 9 285K processor, creating 500 embeddings takes 45.85 seconds, while on an NVIDIA 5090 GPU the same task completes in just 0.36 seconds. It is so fast that I wrote this benchmark specifically to find out whether my GPU was being used at all. In test mode, the program that sends requests to TEI does so too rarely (roughly a couple of times per second), so the GPU load graphs stay practically at zero.
--- Testing http://localhost:8080/embed --- <-- CPU version
Requests completed: 500
Total time: 45.85 sec
Throughput: 10.90 req/sec
Average latency (Avg Latency): 4386.11 ms
P95 latency: 5021.88 ms
--- Testing http://localhost:8090/embed --- <-- GPU version (NVIDIA 5090)
Requests completed: 500
Total time: 0.36 sec
Throughput: 1398.69 req/sec
Average latency (Avg Latency): 31.38 ms
P95 latency: 53.18 ms
========================================
RESULT: http://localhost:8090/embed is 99.22% faster
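For reference, the figures in such a report can be derived from per-request latencies and total wall time as in the sketch below. It is not the benchmark itself: `summarize` and `compare` are hypothetical names, the nearest-rank P95 is an assumed method, and the RESULT line is presumably the reduction in total time.

```python
def summarize(latencies_ms, total_time_s):
    """Compute the report figures for one endpoint."""
    n = len(latencies_ms)
    ordered = sorted(latencies_ms)
    # Nearest-rank P95: the smallest value with at least 95% of samples
    # at or below it (integer arithmetic avoids float rounding surprises).
    rank = (95 * n + 99) // 100
    return {
        "requests": n,
        "throughput_rps": n / total_time_s,
        "avg_latency_ms": sum(ordered) / n,
        "p95_latency_ms": ordered[rank - 1],
    }


def compare(slow_total_s, fast_total_s):
    """Express the gap both as time saved and as a speedup factor."""
    return {
        "time_saved_pct": (slow_total_s - fast_total_s) / slow_total_s * 100,
        "speedup": slow_total_s / fast_total_s,
    }


# With the rounded totals from the report above:
result = compare(45.85, 0.36)
print(f"{result['time_saved_pct']:.1f}% less time, {result['speedup']:.0f}x speedup")
```

With the rounded totals this gives about 99.2% less time, which is the same as roughly a 127-fold speedup; the second figure conveys the gap more intuitively.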

Celebrating a Milestone: Rauf Aliyev’s Programming Qualification from 1994 | March 21 2026, 13:54
Mom sent it. This was given to me when I graduated from school. The education was quite good back then, at least. Part of the science classes were conducted at the institute.

Crafting Nabokov’s Dictionary: A Multilingual Lexical Journey | March 15 2026, 18:30
I’m reading Nabokov and decided to take a break to build a handy app, “Nabokov’s Dictionary”, and I am considering selling it on Amazon as a book. Essentially, it looks like this (see screenshot): definitions of difficult words in English, Russian, German, and French, in the order they appear in the original book.
Would you buy such a book?
To make the definitions accurate, I also wrote an aligner: a program that matches English sentences and paragraphs with their Russian translations (Nabokov’s own). When a word’s definition is generated, it draws not only on the LLM’s knowledge but also on the author’s Russian translation. How the algorithm works deserves a separate discussion (I invented it myself because nothing I found online worked the way I needed). It first finds the longest sentences and matches each with its counterpart by cosine similarity of embedding vectors produced by the multilingual e5 model. These sentences become anchors. Then, on the assumption that errors are almost impossible for long sentences, it finds the longest sentence between two anchors, and everything repeats recursively. There are many cases where a Russian sentence has no English equivalent and vice versa, where one sentence is split into two, or where two are merged into one. The algorithm handles these as best it can. The resulting alignment is quite good, to the point that errors are hard to find (though some are likely still there). Either way, the alignment is only needed as context for translating words, so rare errors are not a big deal.
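The recursive anchoring idea can be sketched roughly as below. This is a simplified reconstruction, not the author's aligner: it takes precomputed embedding vectors (in practice they would come from a multilingual model such as e5), picks one anchor at a time, uses an assumed threshold `min_sim`, and leaves split or merged sentences unaligned.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align(src, tgt, src_vecs, tgt_vecs, min_sim=0.8):
    """Return sorted (source index, target index) pairs of aligned sentences."""
    pairs = []

    def recurse(s_lo, s_hi, t_lo, t_hi):
        if s_lo >= s_hi or t_lo >= t_hi:
            return
        # Try source sentences from longest to shortest until one anchors.
        by_length = sorted(range(s_lo, s_hi), key=lambda k: len(src[k]), reverse=True)
        for i in by_length:
            j = max(range(t_lo, t_hi), key=lambda k: cosine(src_vecs[i], tgt_vecs[k]))
            if cosine(src_vecs[i], tgt_vecs[j]) >= min_sim:
                recurse(s_lo, i, t_lo, j)          # align everything before the anchor
                pairs.append((i, j))               # the anchor itself
                recurse(i + 1, s_hi, j + 1, t_hi)  # align everything after it
                return
        # No sentence in this span matched well enough: leave it unaligned.

    recurse(0, len(src), 0, len(tgt))
    return pairs
```

Because each anchor splits both texts into a "before" and an "after" part, the resulting pairs never cross, and a sentence with no counterpart simply stays out of the result.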

From MS-DOS to Modern CAD: My Journey with Bazis Soft | March 06 2026, 17:43
My first job as a programmer: a real office in Kolomna, and paid. It was 1993, or maybe even a year earlier, when I was in the 10th or 11th grade. And this company still exists, and the people I worked with are still there! Natalya Bakulina, Pavel Bunakov, Nikolai Kaskevich. Imagine that. Moreover, they started back in 1986, that is, 40 years ago! I can hardly think of other commercial companies of that age in Russia. When I joined, they ran MS-DOS and wrote in Turbo Pascal, but they had started many years before me on the SM-1420 computer, though back then the company was not entirely commercial. At the time of my arrival, their system competed with AutoCAD on the market and, locally, with “Kompas”. I made an installer that installed from 5.25″ and 3.5″ disks, to give you a sense of the era. Later they switched to Delphi and Windows. After that, they narrowed their focus from CAD for engineering to CAD for furniture, where they still hold a very strong position.

Revolutionizing Research: Introducing a Web-Based Notebook Integrated with AI and PDF Support | February 19 2026, 16:19
I’ve further developed a new tool for myself for working with information and organizing it. The main idea is a web-based notebook for research, studying subjects, working on them, integrated with AI and PDF support.
The main problem with typical PDF readers and notes is that the context is lost as soon as you switch to a new tab. In my tool, each text fragment or PDF becomes a node in a “live” hypertext tree, which I can access from multiple computers at any time.
Work process:
– Contextual AI. I can ask the AI to clarify complex passages right within the document. The explanation stays right where the question was asked. Moreover, it is a separate document, linked to the specific spot in the source. When clicked, you see both the original and the explanation on the screen at the same time.
– Panels instead of windows. If the explanation itself requires clarification, a new panel opens to the right. This allows for an endless chain of queries, never losing the place in the original text. That is, you see several panels at once, and unnecessary ones can be closed.
– PDF support. I can upload a PDF, select an area on the page (e.g., a complex diagram or a list of authors), and the LLM instantly extracts data, supplements, or explains them. The explanation is attached to the spot where it was requested, just like with non-PDFs.
– Nested annotations. My comments are not just static text. They can contain their own PDFs, links, and further sub-tasks for AI, maintaining a depth of nesting that reflects how we actually think.
This is not just a file storage system, but an “engine” for building knowledge.
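As a rough illustration of the "live" tree, each explanation can be modeled as a child node anchored to a span of its parent. The sketch below is an assumption about the data model, not the tool's real schema: the `Node` class and its fields are hypothetical, and anchors are simplified to character spans (a PDF anchor would need a page and a rectangle instead).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    title: str
    text: str
    parent: Optional["Node"] = None
    anchor: Optional[tuple] = None  # (start, end) span in the parent's text
    children: list = field(default_factory=list)

    def annotate(self, start, end, title, text):
        """Attach an explanation (or comment) to a span of this node's text."""
        child = Node(title, text, parent=self, anchor=(start, end))
        self.children.append(child)
        return child

    def anchored_text(self):
        """The fragment of the parent that this node explains."""
        start, end = self.anchor
        return self.parent.text[start:end]

    def panels(self):
        """Nodes from the root down to this one: the panels shown left to right."""
        chain = []
        node = self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain[::-1]


doc = Node("Paper", "Embeddings are compared by cosine similarity.")
note = doc.annotate(27, 44, "AI explanation", "Cosine similarity measures the angle between vectors.")
deeper = note.annotate(0, 17, "Follow-up", "It ranges from -1 to 1.")
print([n.title for n in deeper.panels()])
```

Closing a panel then corresponds to dropping the tail of the chain, while the nodes themselves stay in the tree.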
The tool suits me personally very well, but perhaps it only solves my specific tasks. What do you think, would something like this be useful to others? Would it be useful to you? Should I develop the project into a fully-fledged product and give it to other users for testing?

Interactive Text Enhancer: A Tool for Embedding Clarifications | February 12 2026, 16:11
I whipped up this thing in just an hour. Do you think anyone besides me needs it?
Here’s the idea. Take any text, a Wikipedia article for example. Highlight any fragment, say, something unclear. The LLM produces an explanation and instantly inserts a box right into the text; clicking the box opens the explanation. The explanation may contain something unclear too. We highlight that part with the mouse, and a box appears there as well. This continues until everything is clear. All the boxes stay in the text, so you can always return to them. And if the idea was unclear to me, it may be unclear to others too, so a ready-made link with explanations will come in very handy. The result can be shared with colleagues.
The explanation uses not just the highlighted fragment but also its context. Otherwise, for example, the highlighted word Terrier would yield text about a dog breed rather than about the search engine.
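Passing the context along with the fragment might look like the sketch below. The function name `build_prompt`, the window size, and the prompt wording are assumptions for illustration; the tool's real prompt is not shown in this post.

```python
def build_prompt(text, start, end, window=300):
    """Build an LLM prompt from the highlighted span plus surrounding text."""
    fragment = text[start:end]
    # Take up to `window` characters on each side of the selection.
    left = max(0, start - window)
    right = min(len(text), end + window)
    context = text[left:right]
    return (
        "Explain the highlighted fragment as it is used in the context.\n"
        f"Fragment: {fragment}\n"
        f"Context: {context}"
    )


text = "We indexed the collection with Terrier and compared retrieval quality."
start = text.index("Terrier")
print(build_prompt(text, start, start + len("Terrier")))
```

With the surrounding sentence in the prompt, the model has enough signal to pick the search-engine sense of the word.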
