AI – Page 2 – Hi, I'm Rauf Aliev.

Exploring Automated Documentation of Large Excel Datasets | May 06 2026, 22:28

I wonder whether there is an agent that can take an Excel workbook significantly larger than the context window and start documenting its essence. Here are several tabs. Here on tab 5 there is a table with a million rows and five columns. The columns are as follows. We sample random rows: this column looks like numbers, and that one looks like surnames. We assume the first column is numeric everywhere, so we write code that checks this assumption and at the same time calculates the min/max and the set of unique values. It turns out there are few values, only five. We record it. Now we check the surnames. Yes, these are just strings, and a new sample shows that they are indeed surnames. Here’s a formula. We see where it points. And so on. And this column has an unclear purpose. We look at the data: these are numbers from 0 to 1. We measure the mean and the spread. We ask the user, maybe they’ll provide some comments. They do. It turns out to be a KPI issued to this user by an external system. We record it. And so on. Documentation emerges. Later, once the documentation exists, one can ask for operations on all of this, since the LLM now more or less understands the purpose of the data and how they are connected, and can form hypotheses about outliers and verify them.
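The "check the assumption and compute min/max and unique values" step could look roughly like the sketch below. It is only an illustration: `profile_column` and its parameters are hypothetical names, and the column is assumed to arrive as a plain iterable of cell values (for example, streamed row by row from the workbook), so that the whole table never has to fit in memory.

```python
import random
from numbers import Number


def profile_column(values, sample_size=1000, max_unique=20, seed=0):
    """Profile one column in a single pass with bounded memory."""
    rng = random.Random(seed)
    sample = []               # reservoir sample that can be shown to the LLM
    uniques = set()           # tracked only while the set stays small
    too_many_uniques = False
    count = non_numeric = 0
    lo = hi = None

    for value in values:
        count += 1
        # Reservoir sampling: every value has the same chance of being kept.
        if len(sample) < sample_size:
            sample.append(value)
        else:
            j = rng.randrange(count)
            if j < sample_size:
                sample[j] = value

        # Check the "everything here is a number" assumption.
        if isinstance(value, Number) and not isinstance(value, bool):
            lo = value if lo is None else min(lo, value)
            hi = value if hi is None else max(hi, value)
        else:
            non_numeric += 1

        # Stop tracking unique values once there are clearly too many.
        if not too_many_uniques:
            uniques.add(value)
            if len(uniques) > max_unique:
                too_many_uniques = True
                uniques.clear()

    return {
        "count": count,
        "all_numeric": count > 0 and non_numeric == 0,
        "min": lo,
        "max": hi,
        "unique_values": None if too_many_uniques else set(uniques),
        "sample": sample,
    }


rng = random.Random(1)
column = [rng.choice([1, 2, 3, 4, 5]) for _ in range(100_000)]
report = profile_column(column)
print(report["all_numeric"], report["min"], report["max"], report["unique_values"])
```

The resulting dictionary is small regardless of the table size, so it is the kind of thing an agent could write into the documentation or pass back to the LLM together with the sample.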

The Crucial Role of Data Quality Oversight in Development Projects | May 06 2026, 16:07

Almost every development project features a dedicated functional testing automation team, yet surprisingly, a similar emphasis on Data Quality is rarely found. Regardless of whether data comes from external integrations, from users, or is generated by the system itself, it often remains without proper control simply because no one seems to consider it important. Later the team struggles with the consequences, which grow like a snowball. The longer such issues persist, the harder they are to resolve, eventually leading to a situation where people just resign themselves to the “irreparable” state of the database. It is much better to identify these problems at the moment they arise, while the technical debt has not yet become insurmountable, rather than figuring out later how to keep them from bringing everything down.

In essence, there needs to be a constant “supervisor” over all types of databases used by the system (relational, NoSQL, search indexes, or graph databases): a layer of data quality checks on top of the processes. Of course, there must be clear rules: specifically, what to check and which flags to use to mark specific anomalies.
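As a minimal sketch of what "clear rules and flags" might mean in practice: each rule pairs a flag name with a check, and an audit pass reports which records carry which flag. The `Rule` class, the flag names, and the record fields below are all made up for illustration; a real system would read from the actual databases.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Rule:
    flag: str                             # label attached to an anomalous record
    is_violated: Callable[[dict], bool]   # returns True when the record is bad


# Hypothetical rules for a hypothetical "customer" record.
RULES = [
    Rule("MISSING_EMAIL", lambda r: not r.get("email")),
    Rule("NEGATIVE_BALANCE", lambda r: (r.get("balance") or 0) < 0),
    Rule("UNKNOWN_STATUS", lambda r: r.get("status") not in {"active", "blocked"}),
]


def audit(records: Iterable[dict], rules=RULES) -> dict:
    """Return {flag: [ids of records that violate the rule]}."""
    report = {rule.flag: [] for rule in rules}
    for record in records:
        for rule in rules:
            if rule.is_violated(record):
                report[rule.flag].append(record.get("id"))
    return report


customers = [
    {"id": 1, "email": "a@example.com", "balance": 10, "status": "active"},
    {"id": 2, "email": "", "balance": -5, "status": "active"},
    {"id": 3, "email": "c@example.com", "balance": 0, "status": "zombie"},
]
print(audit(customers))
```

A report of this shape is easy to feed into a ticket or a dashboard, which is where the responsible human comes in.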

There must be a responsible party for the process (a human, not AI), who will integrate these reports into the development and support workflows. Many data integrity issues cannot just be resolved through an interface — they require the engineering team to develop scripts for mass correction and data cleansing.

Incidentally, this also leads into the realm of anomaly (outlier) detection: machine learning and LLMs can be used to identify subtle “bad” patterns that traditional rule-based systems might miss.

What do you think about this? Are similar mechanisms implemented in your processes?

Harnessing Chat Data for Semantic Q&A Search | April 30 2026, 04:05

In one evening, I built a simple utility that exports a year and a half of the Natural Language Processing chat (65,000 messages) and converts it into question-answer pairs with semantic search. Clicking on a search result (on the left) opens the dialogue in the chat. The messages that answer the question are highlighted, and at the top the original phrasing of the question is highlighted as well.

How it works: the system assumes that people mainly reply to messages from the recent past. If one message gets several replies, it is probably useful and caught the interest of others in the chat. The system takes a span of messages that starts at a message many people replied to and ends at the last message in its reply-to chain, and keeps only the spans whose original question has at least 3 reply-tos. In essence, it cuts a piece out of the chat starting at a popular question, so that whatever follows the bottom cut is most likely irrelevant. Such blocks can overlap each other, for example if someone asked a question while others were replying to something else.

So, if user A asked what the weather was like, and they received answers like “good,” “bad,” “rain,” and there were five messages without a reply-to, and then someone replied to “rain” with the question “why rain”, and five more people replied to this question, then the first question about the weather makes it into the system – the piece ends with 13 messages.
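One possible reading of this cutting rule, as a sketch rather than the actual utility: count direct replies per message, and for every message with at least 3 of them, cut from that message to the last message anywhere in its reply-to tree. The message format (dicts with `id` and `reply_to`) and the function name are assumptions, and the real tool may count replies or place the bottom cut differently.

```python
from collections import defaultdict


def popular_blocks(messages, min_replies=3):
    """messages: chronological list of dicts with 'id' and optional 'reply_to'.

    Returns (start_index, end_index) pairs, both inclusive.
    """
    index = {m["id"]: i for i, m in enumerate(messages)}
    children = defaultdict(list)
    for m in messages:
        parent = m.get("reply_to")
        if parent in index:
            children[parent].append(m["id"])

    blocks = []
    for m in messages:
        if len(children[m["id"]]) < min_replies:
            continue
        # Walk the reply-to tree under this message to find its last descendant.
        last, stack = index[m["id"]], [m["id"]]
        while stack:
            for child in children[stack.pop()]:
                last = max(last, index[child])
                stack.append(child)
        blocks.append((index[m["id"]], last))
    return blocks


chat = [
    {"id": 1, "text": "question"},
    {"id": 2, "text": "answer A", "reply_to": 1},
    {"id": 3, "text": "answer B", "reply_to": 1},
    {"id": 4, "text": "answer C", "reply_to": 1},
    {"id": 5, "text": "unrelated"},
    {"id": 6, "text": "follow-up", "reply_to": 4},
]
print(popular_blocks(chat))
```

Messages without a reply-to that fall between the start and the end of a block are kept inside it, which matches the idea of cutting a contiguous piece out of the chat.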

Afterwards, these pieces are summarized into question-answer pairs.

It turns out quite cool.

P.S. In the screenshot, the search query has nothing to do with the search result because I foolishly took the screenshot after I changed the query but before I hit send.

Misadventures in Keyboard Layouts: Searching for Gremlin, Finding Surprises | April 28 2026, 20:33

This is me typing the word gremlin, without switching the keyboard layout. Wanted to read about the query language for graph databases, need it for work. Google surprises, it does surprise

Tesla Robots Gradually Taking to the Streets | April 25 2026, 05:37

Tesla robots are slowly being kicked out onto the street. I rode by on my bike today. Too bad they’re not turned on

Shiba Inu at Work: Turning Daily Moments into Cozy Companionship | April 23 2026, 01:49

How to occupy a dog

Navigating the Depths of High-Dimensional Spaces | April 13 2026, 23:17

I am now working a lot with high-dimensional vectors, and some things that I hadn’t fully realized before are really starting to tickle my brain. Our 3D intuition doesn’t just not work there—it lies.

It turns out that any two random vectors in high-dimensional space are almost certainly nearly perpendicular to each other. Almost all the space is one continuous “equator”.

Much of machine learning is built on exactly this. If your embeddings suddenly show high cosine similarity (for example, 0.8), this is not a statistical error but a powerful signal. It’s almost impossible to converge like this by chance in a 1000-dimensional world.
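This is easy to try out. The sketch below draws pairs of random Gaussian vectors (one reasonable way to model "random directions", chosen here for illustration) and reports the largest absolute cosine similarity seen in 3 and in 1000 dimensions.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_cosines(dim, pairs=200, seed=0):
    """Cosine similarities between independent random Gaussian vectors."""
    rng = random.Random(seed)
    result = []
    for _ in range(pairs):
        u = [rng.gauss(0, 1) for _ in range(dim)]
        v = [rng.gauss(0, 1) for _ in range(dim)]
        result.append(cosine(u, v))
    return result


for dim in (3, 1000):
    largest = max(abs(c) for c in random_cosines(dim))
    print(f"{dim}D: largest |cos| over 200 random pairs = {largest:.3f}")
```

In 1000 dimensions the typical cosine of a random pair is on the order of 1/sqrt(1000), about 0.03, so a value like 0.8 is dozens of standard deviations away from chance.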

In such spaces, almost all the mass of data is concentrated in an extremely thin surface layer. The “insides” of objects are mathematically empty.

This is easy to verify with a thought experiment. Take the “skin” of a multidimensional sphere with a thickness of just 1% of the radius. The volume of a sphere is proportional to its radius raised to the power of its dimensionality.

• In three-dimensional space, the pulp (the inner 0.99 of the radius) occupies 97% of the volume: you raise 0.99 to the third power.

• In 1000D, the pulp occupies just 0.0043% (0.99 raised to the 1000th power is about 0.000043).
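The arithmetic behind the two bullets can be checked in a couple of lines. Since volume scales as radius to the power of the dimension, the share of the volume inside 0.99 of the radius is simply 0.99 raised to that power. The helper name `core_fraction` is just for illustration.

```python
def core_fraction(dim, skin=0.01):
    """Share of a dim-dimensional ball's volume that lies inside radius (1 - skin)."""
    return (1 - skin) ** dim


for dim in (3, 100, 1000):
    core = core_fraction(dim)
    print(f"{dim:>5}D  pulp: {core:.6%}  skin: {1 - core:.6%}")
```

For 1000 dimensions the pulp fraction comes out at roughly 4.3e-5 of the total volume, so practically everything sits in the 1% skin.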

You can also look at it differently. For a point to be close to the origin, its coordinates along all axes must be close to zero. If even one axis has a large value, that’s it, the point is gone. If you pick points at random, the probability that all coordinates are simultaneously below some threshold decreases as dimensionality grows, and it decreases quickly.

All the “meat” of the data always ends up in the skin. Any sample in High-D is essentially a set of boundary values.

For white noise in high dimensions, the distances to the closest and the farthest neighbor become almost the same. The concept of “closeness” simply degrades.
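A small experiment makes this visible. The sketch below takes one random query point and 200 random Gaussian points (a stand-in for white noise, chosen for illustration) and compares the distance to the farthest neighbor with the distance to the nearest one.

```python
import math
import random


def far_to_near_ratio(dim, points=200, seed=0):
    """Farthest-neighbor distance divided by nearest-neighbor distance."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    distances = []
    for _ in range(points):
        p = [rng.gauss(0, 1) for _ in range(dim)]
        distances.append(math.dist(query, p))
    return max(distances) / min(distances)


for dim in (2, 10, 100, 1000):
    print(f"{dim:>5}D  farthest / nearest = {far_to_near_ratio(dim):.2f}")
```

In 2D the ratio is large, because some points land very close to the query. As the dimension grows, the ratio drifts toward 1, which is exactly the sense in which "closeness" stops being informative.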

CPU vs GPU: A Speed Challenge in Embedding Creation | April 11 2026, 18:08

When working with certain tasks, the difference between a CPU and a GPU is simply astounding. For example, I need to create many (millions of) embeddings with the BGE M3 model. Running this on my quite powerful 24-core Intel Core Ultra 9 285K processor takes 45.85 seconds to create 500 embeddings, while on an NVIDIA 5090 GPU the same task is completed in just 0.36 seconds. It is so fast that I wrote this benchmark specifically to figure out whether my GPU is being utilized at all. The program that sends requests to TEI is not very active in test mode (roughly a couple of requests per second), so the GPU load graphs stay practically at zero.

— Testing http://localhost:8080/embed — <– CPU version

Requests completed: 500

Total time: 45.85 sec

Throughput: 10.90 req/sec

Average latency (Avg Latency): 4386.11 ms

P95 latency: 5021.88 ms

— Testing http://localhost:8090/embed — <– GPU version (NVIDIA 5090)

Requests completed: 500

Total time: 0.36 sec

Throughput: 1398.69 req/sec

Average latency (Avg Latency): 31.38 ms

P95 latency: 53.18 ms

========================================

RESULT: http://localhost:8090/embed is 99.22% faster
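A benchmark of this kind could be sketched roughly as follows. This is not the original script: the function names, the concurrency of 48, and the test sentence are assumptions, and the request body `{"inputs": ...}` follows what TEI's `/embed` endpoint accepts as far as I know. P95 is computed with the simple nearest-rank method.

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def embed_once(url, text):
    """POST one text to a TEI-style /embed endpoint; return latency in ms."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()
    return (time.perf_counter() - start) * 1000


def summarize(latencies_ms, total_seconds):
    """Turn raw per-request latencies into the figures shown in the report."""
    ordered = sorted(latencies_ms)
    # Nearest-rank P95: the smallest value with at least 95% of samples at or below it.
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {
        "requests": len(ordered),
        "total_sec": total_seconds,
        "throughput": len(ordered) / total_seconds,
        "avg_latency_ms": statistics.mean(ordered),
        "p95_latency_ms": ordered[p95_index],
    }


def run_benchmark(url, requests=500, concurrency=48, text="hello world"):
    """Fire `requests` calls using `concurrency` worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: embed_once(url, text), range(requests)))
    return summarize(latencies, time.perf_counter() - start)


# Example (needs a running server):
# print(run_benchmark("http://localhost:8080/embed"))
```

Threads are enough here because the client only waits on the network. The point of sending many requests concurrently is to keep the server busy, otherwise a fast GPU looks idle.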

Smartfolio.me: Revolutionizing Knowledge Organization with Advanced Features | March 19 2026, 04:01

My creation – the knowledge organization tool Smartfolio.me – has gained new features. I’m attaching a five-minute video overview.

It’s like Google Docs, but you can embed documents within each other, creating a network of connected knowledge, and these documents can be PDFs and regular texts.

Upload a PDF, the program converts it into images, and you can highlight any sections right on the pages to leave a comment or ask a question.

If something in the text is unclear, you highlight the area and press “elaborate” — the LLM will detail everything thoroughly, taking into account the context of the entire document, and the explanation will stay linked to the highlighted fragment.

You can simply cut out a piece from a PDF, and the LLM extracts clean text or a ready-made formula from it.

In the PDF window, there is now a small panel — all comments and explanations are immediately visible there, so you can quickly jump to the necessary parts.

You can cut out a diagram or graph from a PDF, copy it as a picture, and paste it into your text. It is cropped automatically “on the fly” and saved in the database not as a copy but as a link to the page with crop parameters.

If you delete the page link in the text, it won’t disappear completely but will go into a special list, from where you can reattach it somewhere else or delete it finally. The same document can be inserted in several places. If you add a comment to it, it updates everywhere where this document is linked.

Mathematics is fully supported — LaTeX formulas can be not only viewed but also clicked to adjust them in the editor.

You can generate formulas by description. Just write in words what formula you need (for example, “binomial distribution”), and the system itself outputs the ready formula code.

Now there is a system of plugins – essentially isolated experimental functions separate from the main program. For instance, there is a plugin that recursively collects all subpages into one long document — convenient if you need to read or print everything at once.

Or consider the “YouTube Transcript Cleaning” plugin. If there is a dirty lecture text from YouTube, the plugin will punctuate, paragraph, and create neat headers.

If you insert a link to a website, it opens in a column next to it — you can read the source and simultaneously take your notes. However, some websites do not allow embedding on foreign pages. The system recognizes such sites, and they open in a new tab.
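How Smartfolio.me actually recognizes such sites is not described here. One common way to guess, shown below purely as an illustration, is to look at the response headers that browsers use for this decision: `Content-Security-Policy` with a `frame-ancestors` directive and, if that is absent, `X-Frame-Options`. The function name and the conservative handling of host lists are assumptions.

```python
def can_embed(headers):
    """Guess from response headers whether a page may be shown in a foreign iframe."""
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # A frame-ancestors directive takes precedence over X-Frame-Options.
    csp = normalized.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.split()
        if parts and parts[0] == "frame-ancestors":
            # Conservative: only a wildcard is treated as "anyone may embed".
            return "*" in parts[1:]

    xfo = normalized.get("x-frame-options", "").strip()
    return xfo not in ("deny", "sameorigin")


print(can_embed({"X-Frame-Options": "DENY"}))
print(can_embed({"Content-Type": "text/html"}))
```

When the check says no, opening the link in a new tab is the natural fallback.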

The left panel with the list of pages can be hidden or resized with the mouse, so it doesn’t take up space on the screen.

You can simply copy and paste an image or screenshot, and it will not just be inserted but also uploaded to the database.

It supports working from a mobile phone. On the phone, the interface switches to a single-column mode for convenient reading and commenting on the go.

Multiple databases are supported: you can connect different databases and different LLMs and switch between them.

Exploring Multilingual Vocabulary in Nabokov’s Works with Apple Books | March 15 2026, 23:20

Man, it’s really convenient. Just sitting here reading.

The usage pattern is as follows: I hold the phone in my hands. There, in Apple Books, is the companion book for the one I’m reading. You see an unfamiliar word, and it will likely be in the word list for the chapter. The definition takes into account the translation by Nabokov himself. Then you look a couple of words ahead, put the phone down, and continue reading. You encounter those words while they are still in your short-term memory, and hooray, you understand. During a break, you load the next couple of words into your brain. You do have to hold the phone and flip through it, since each page contains 4-5 definitions.

Now, every word has definitions in English (interpretation), French, and German. Consequently, I can publish four books.

Overall, my level of English matches what my app predicts about which words will be challenging. But someday I’ll need the same for French, and it will require an assessment of the difficulty level for each word because even some basic words will be unclear to me. I’m not sure that a book with basic words will be handy. With rare ones – definitely handy.