DataScience – Hi, I'm Rauf Aliev.

Exploring Automated Documentation of Large Excel Datasets | May 06 2026, 22:28

I wonder whether there is an agent that can take an Excel workbook significantly larger than the context window and start documenting its essence. Here are several tabs. Here, on tab 5, is a table with a million rows and five columns. The columns are as follows. We sample random rows from the table: this column looks like numbers, and that one like surnames. We assume the first column is numeric everywhere, so we write code that checks this assumption and, at the same time, calculates the min/max and the set of unique values. It turns out there are few values, only five. We record that. Now we check the surnames. Yes, these are just strings, and a new sample shows that they are indeed surnames. Here’s a formula. We see where it points. And so on. And this column has an unclear purpose. We look at the data: these are numbers from 0 to 1. We measure the mean and the spread. We ask the user; maybe they’ll provide some comments. They do. It turns out to be a KPI issued to this user by an external system. We record that. And so on. Documentation emerges. Later, once the documentation exists, one can ask the LLM to perform operations on all this, since it now more or less understands the purpose of the data and how they are connected, and it can form hypotheses for detecting outliers and verify them.
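
The "check the assumption, then compute min/max and unique values" step could look roughly like the sketch below. It is only an illustration: the names `to_number` and `profile_column`, the `max_unique` cutoff, and the sample KPI values are all assumptions of mine, and the column is passed in as a plain Python iterable rather than read from a real workbook.

```python
import statistics
from typing import Any, Iterable, Optional


def to_number(value: Any) -> Optional[float]:
    """Return the value as a float, or None if it is not numeric."""
    if isinstance(value, bool):  # bool is a subclass of int; treat it as non-numeric
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return None
    return None


def profile_column(values: Iterable[Any], max_unique: int = 20) -> dict:
    """Scan a whole column once: test the 'all numeric' assumption and collect stats."""
    numbers = []
    unique = set()
    rows = 0
    non_numeric = 0
    for value in values:
        rows += 1
        # Stop growing the set once we know there are more than max_unique values.
        if len(unique) <= max_unique:
            unique.add(value)
        number = to_number(value)
        if number is None:
            non_numeric += 1
        else:
            numbers.append(number)
    few_unique = len(unique) <= max_unique
    return {
        "rows": rows,
        "all_numeric": rows > 0 and non_numeric == 0,
        "non_numeric": non_numeric,
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
        "mean": statistics.fmean(numbers) if numbers else None,
        "spread": statistics.pstdev(numbers) if numbers else None,
        "unique_values": set(unique) if few_unique else None,
    }


# Hypothetical column of KPI-like values between 0 and 1.
kpi_column = [0.12, 0.5, 0.93, 0.5, 0.12]
report = profile_column(kpi_column)
surname_report = profile_column(["Smith", "Jones", "Smith"])
```

The agent would record `report` in the documentation and, if `all_numeric` came back false, fall back to treating the column as text and sampling again.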

The Crucial Role of Data Quality Oversight in Development Projects | May 06 2026, 16:07

Almost every development project has a dedicated functional test automation team, yet a similar emphasis on data quality is surprisingly rare. Whether data comes from external integrations, from users, or is generated by the system itself, it often goes unchecked simply because no one considers it important. Later the team struggles with the consequences, which accumulate like a snowball. The longer such issues persist, the harder they are to resolve, until people simply resign themselves to the “irreparable” state of the database. It is much better to catch these problems the moment they arise, while the technical debt is still manageable, than to figure out later how to keep them from bringing everything down.

In essence, there needs to be a constant “supervisor” over every type of database the system uses (relational, NoSQL, search indexes, or graph databases): a layer of data quality checks on top of the processes. Of course, there must be clear rules that state specifically what to check and which flags to use to mark each kind of anomaly.
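
As a rough illustration of "clear rules and flags", here is a minimal sketch in which each rule pairs a flag name with a check, and a runner collects the ids of offending records. Everything in it is hypothetical: the `Rule` class, the flag names, and the record fields (`id`, `email`, `total`) are made up for the example and do not describe any particular system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    flag: str                              # marker attached to an anomaly
    description: str                       # human-readable meaning of the flag
    is_violated: Callable[[Record], bool]  # True when the record breaks the rule


def negative_total(record: Record) -> bool:
    total = record.get("total")
    return isinstance(total, (int, float)) and total < 0


# Hypothetical rule set; a real one would be agreed on with the data owners.
RULES = [
    Rule("MISSING_EMAIL", "email is empty or absent", lambda r: not r.get("email")),
    Rule("NEGATIVE_TOTAL", "order total is below zero", negative_total),
]


def run_checks(records: Iterable[Record], rules: List[Rule]) -> Dict[str, list]:
    """Return a report mapping each flag to the ids of records that violate it."""
    report: Dict[str, list] = {rule.flag: [] for rule in rules}
    for record in records:
        for rule in rules:
            if rule.is_violated(record):
                report[rule.flag].append(record["id"])
    return report


records = [
    {"id": 1, "email": "a@example.com", "total": 10},
    {"id": 2, "email": "", "total": -5},
    {"id": 3, "email": "b@example.com", "total": -1},
]
report = run_checks(records, RULES)
```

The same runner could be pointed at rows from a relational table, documents from a NoSQL store, or entries from a search index, as long as each source is adapted to yield plain records.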

There must be a responsible party for the process (a human, not AI), who will integrate these reports into the development and support workflows. Many data integrity issues cannot just be resolved through an interface — they require the engineering team to develop scripts for mass correction and data cleansing.

Incidentally, this also leads into anomaly detection (outlier detection): machine learning and LLMs can identify subtle “bad” patterns that traditional rule-based systems might miss.

What do you think about this? Are similar mechanisms implemented in your processes?

Harnessing Chat Data for Semantic Q&A Search | April 30 2026, 04:05

In one evening, I built a simple utility that takes a year and a half of the Natural Language Processing chat (65,000 messages) and converts it into question-answer pairs with semantic search. Clicking a search result (on the left) opens the dialogue in the chat. The messages that answer the question are highlighted, and the original phrasing of the question is highlighted at the top as well.

How it works: the system assumes that people mainly reply to messages from the relatively recent past. If one message gets several replies, it is probably useful and caught the interest of others in the chat. The system takes the messages starting from the one that many people replied to and ending with the last message in its reply-to chain, and among such pieces it keeps those with at least 3 reply-tos to the original question. In essence, it cuts a piece out of the chat starting with a popular question, so that whatever follows the bottom cut is most likely irrelevant. Such blocks can overlap, for example if someone asked a question while others were still replying to something else.

So, if user A asked what the weather was like, and they received answers like “good,” “bad,” “rain,” and there were five messages without a reply-to, and then someone replied to “rain” with the question “why rain”, and five more people replied to this question, then the first question about the weather makes it into the system – the piece ends with 13 messages.
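
One possible reading of this cutting rule, as a minimal sketch: a message with at least 3 direct replies becomes the start of a block, and the block runs up to the last message reachable through reply-to links. The `Message` class, the `extract_blocks` name, and the tiny sample chat are my own assumptions; the actual utility may count and cut differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    id: int
    text: str
    reply_to: Optional[int] = None  # id of the message this one replies to


def extract_blocks(messages: List[Message], min_replies: int = 3) -> List[List[Message]]:
    """Cut out chat slices that start at a popular question.

    `messages` must be in chronological order.
    """
    position = {m.id: i for i, m in enumerate(messages)}
    children = defaultdict(list)
    for m in messages:
        if m.reply_to in position:
            children[m.reply_to].append(m.id)

    blocks = []
    for m in messages:
        if len(children[m.id]) < min_replies:
            continue  # not popular enough to start a block
        # Walk the whole reply tree to find the latest message in the chain.
        last = position[m.id]
        stack = [m.id]
        while stack:
            current = stack.pop()
            last = max(last, position[current])
            stack.extend(children[current])
        # Keep everything between the question and the end of its chain,
        # including messages in between that have no reply-to at all.
        blocks.append(messages[position[m.id]: last + 1])
    return blocks


chat = [
    Message(1, "How do I tokenize this?"),
    Message(2, "Try whitespace splitting", reply_to=1),
    Message(3, "Use a subword tokenizer", reply_to=1),
    Message(4, "Anyone going to the meetup?"),
    Message(5, "Depends on the language", reply_to=1),
    Message(6, "Why does the language matter?", reply_to=5),
    Message(7, "Good morning"),
]
blocks = extract_blocks(chat)
```

Because every popular message starts its own block, a popular follow-up question inside a block produces a second, overlapping block, which matches the overlap described above.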

Afterwards, these pieces are summarized into question-answer pairs.

It turns out quite cool.

P.S. In the screenshot, the search query has nothing to do with the search result because I foolishly took the screenshot after I changed the query but before I hit send.

Navigating the Lexical Complexity of Nabokov’s “Lolita” | April 02 2026, 15:56

I’ve finished the first version of a dictionary-style book on Nabokov’s “Lolita”. The chart shows how the complexity of the vocabulary is distributed across the pages of the book. The lower chart averages over 25 sentences and shows the number of complex words on the vertical axis, with colors indicating their complexity/rarity (purple is the most complex, red is less complex, yellow even less so). But I have already removed two levels, and overall, for a foreigner, all five levels are challenging. In the book, level 3 is marked with a dashed line, level 4 with a simple frame, and level 5 with a double frame. Currently, there are 5794 words, of which 541 are fifth level, 1070 are fourth, 1883 are third, 1393 are second, and 54 are first (the simplest ones). Since the first version came to 1148 pages, the dictionary will need to be trimmed significantly by removing whatever can be dispensed with. This mainly applies to the first and second levels, and to some words from the third and fourth. Word rarity is calculated in three ways: with an LLM and with two word-frequency lists for the English-language corpus (300K words).
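
The 25-sentence averaging behind the lower chart can be sketched as a trailing moving average over per-sentence counts of complex words. This is an assumption about how the smoothing works (the chart may well use a centered window or a different scheme), and `rolling_mean` is a hypothetical helper name.

```python
from typing import List, Sequence


def rolling_mean(counts: Sequence[float], window: int = 25) -> List[float]:
    """Trailing moving average; the first few points use a shorter window."""
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, count in enumerate(counts):
        running += count
        if i >= window:
            running -= counts[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical complex-word counts for four consecutive sentences.
smoothed = rolling_mean([1, 2, 3, 4], window=2)
```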

Not all words are complex. For instance, in the sentence “With the ebb of lust, an ashen sense of awfulness, abetted by the realistic drabness of a gray neuralgic day, crept over me and hummed within my temples.” someone well acquainted with English might not know the words ebb, abet, and drabness, while everything else is familiar; but lower the bar for the reader, and the dictionary might not be very useful in such cases.

Or consider the sentence:

Homo pollex of science, with all its many sub-species and forms; the modest soldier, spic and span, quietly waiting, quietly conscious of khaki’s viatric appeal; the schoolboy wishing to go two blocks; the killer wishing to go two thousand miles; the mysterious, nervous, elderly gent, with brand-new suitcase and clipped mustache; a trio of optimistic Mexicans; the college student displaying the grime of vacational outdoor work as proudly as the name of the famous college arching across the front of his sweatshirt; the desperate lady whose battery has just died on her; the clean-cut, glossy-haired, shifty-eyed, white-faced young beasts in loud shirts and coats, vigorously, almost priapically thrusting out tense thumbs to tempt lone women or sadsack salesmen with fancy cravings.

My browser even highlights four words here.

I have definitions of the words in English, German, French, and Russian. I’ve run into the issue that different words from the text count as complex in different languages, yet my word list is the same for all of them. So I’ll have to mark, for example, French words in the English text separately so that they are not included in the French version, since the reader there knows, for instance, what quel mot means.

Overall, this weekend I’ll be manually removing about half, and then I can make the cover and list it on Amazon.

When the Night Lit Up: Unraveling the Mystery of a Superbolt Storm | March 21 2026, 12:55

We had a thunderstorm last night. The whole county is buzzing because everyone thinks something exploded just before midnight: several posts in a row on social media. In short, it was thunder, just a rarer kind than usual. It was caused by a 401 kA lightning strike, dubbed the Wild House Shaker. A typical lightning strike is 30 kA. If the numbers are to be believed, 401 kA really is a damn lot, more than 13 times the typical figure. They will likely say we haven’t had such lightning here for decades.

Attaching an interesting map.

The points on the map show superbolts: lightning strikes with an energy of at least 1 MJ. Red points are particularly powerful superbolts with an energy of more than 2 MJ. That is, superbolts mostly occur in the northeastern part of the Atlantic and in the Mediterranean Sea, and less frequently in the Andes, off the coast of Japan, and near South Africa.

This is what the page I took the map from says (translation):

“New work shows that superbolts most often occur over the Mediterranean Sea, the northeastern Atlantic, and over the Andes, as well as in smaller amounts to the east of Japan, in tropical oceans, and near the southern tip of Africa. Unlike regular lightning, superbolts often strike over water.

“Ninety percent of lightning occurs over land,” said Holzworth (that’s the main guy on lightning at the University of Washington).

“But superbolts mostly arise over water, right up to the coastline. For example, in the northeastern Atlantic, the distribution maps of superbolts clearly show the outlines of the coasts of Spain and England.”

“The average energy of a discharge over water is higher than over land—that we knew,” he said. “But we did not expect such a stark difference.”

The season for superbolts also does not match the usual patterns of lightning. Regular lightning most often occurs in the summer—the three main so-called “lightning chimneys” coincide with summer thunderstorms over America, Africa south of the Sahara, and Southeast Asia. However, superbolts, which are more common in the Northern Hemisphere, occur in both hemispheres from November to February.

The reason for such a distribution remains a mystery. In some years, there are significantly more superbolts than in others: the end of 2013 was record-breaking, and the end of 2014 was the second largest, while in other years such events were much less frequent.

“We speculate that this may be related to sunspots or cosmic rays, but we will leave that for future research,” said Holzworth.

“For now, we are just demonstrating that there is a previously unknown pattern.”

Smartfolio.me: Revolutionizing Knowledge Organization with Advanced Features | March 19 2026, 04:01

My creation – the knowledge organization tool Smartfolio.me – has gained new features. I’m attaching a five-minute video overview.

It’s like Google Docs, but you can embed documents within each other, creating a network of connected knowledge, and these documents can be PDFs and regular texts.

Upload a PDF, and the program converts it into images; you can then highlight any section right on the pages to leave a comment or ask a question.

If something in the text is unclear, you highlight the area and press “elaborate” — the LLM will detail everything thoroughly, taking into account the context of the entire document, and the explanation will stay linked to the highlighted fragment.

You can simply cut out a piece from a PDF, and the LLM extracts clean text or a ready-made formula from it.

In the PDF window, there is now a small panel — all comments and explanations are immediately visible there, so you can quickly jump to the necessary parts.

You can cut out a diagram or graph from a PDF, copy it as a picture, and paste it into your text. It is cropped “on the fly” and saved in the database not as a copy but as a link to the page with crop parameters.
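
To illustrate the idea of storing a link with crop parameters instead of a copied picture, here is a minimal sketch. The `CropRef` class, its field names, and the use of page-relative fractions are all assumptions made for the example; they do not describe Smartfolio's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CropRef:
    """A reference to a region of a rendered PDF page (hypothetical schema)."""
    page_id: str   # identifies the stored page image
    x: float       # left edge as a fraction of page width
    y: float       # top edge as a fraction of page height
    width: float   # region width as a fraction of page width
    height: float  # region height as a fraction of page height

    def pixel_box(self, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
        """Resolve the crop to (left, top, right, bottom) pixels at render time."""
        left = round(self.x * page_width)
        top = round(self.y * page_height)
        right = round((self.x + self.width) * page_width)
        bottom = round((self.y + self.height) * page_height)
        return left, top, right, bottom


diagram = CropRef(page_id="paper-42:page-3", x=0.25, y=0.5, width=0.5, height=0.25)
box = diagram.pixel_box(page_width=1000, page_height=2000)
```

Keeping the crop relative to the page means the same reference stays valid if the page is re-rendered at a different resolution.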

If you delete the page link in the text, it won’t disappear completely but will go into a special list, from where you can reattach it somewhere else or delete it finally. The same document can be inserted in several places. If you add a comment to it, it updates everywhere where this document is linked.

Mathematics is fully supported — LaTeX formulas can be not only viewed but also clicked to adjust them in the editor.

You can generate formulas by description. Just write in words what formula you need (for example, “binomial distribution”), and the system itself outputs the ready formula code.

Now there is a system of plugins – essentially isolated experimental functions separate from the main program. For instance, there is a plugin that recursively collects all subpages into one long document — convenient if you need to read or print everything at once.

Or consider the “YouTube Transcript Cleaning” plugin. Given a messy lecture transcript from YouTube, the plugin adds punctuation, splits the text into paragraphs, and creates neat headers.

If you insert a link to a website, it opens in a column next to your text, so you can read the source and take notes at the same time. However, some websites do not allow embedding in third-party pages. The system recognizes such sites and opens them in a new tab.

The left panel with the list of pages can be hidden or resized with the mouse, so it doesn’t take up space on the screen.

You can simply copy and paste an image or screenshot, and it is not just inserted but also uploaded to the database.

It supports working from a mobile phone. On the phone, the interface switches to a single-column mode for convenient reading and commenting on the go.

Multiple databases are supported: you can connect different databases and different LLMs and switch between them.

Crafting Nabokov’s Dictionary: A Multilingual Lexical Journey | March 15 2026, 18:30

I’m reading Nabokov and decided to take a break to create a convenient app, “Nabokov’s Dictionary”, which I am considering selling on Amazon as a book. Essentially, it looks like this (see screenshot): definitions of complex words in English, Russian, German, and French, in the same order they appear in the original book.

Would you buy such a book?

To make the definitions accurate, I also wrote an aligner: a program that matches sentences and paragraphs in English with their translations into Russian (Nabokov’s own). When a word’s definition is created, it uses not only the LLM’s knowledge but also the author’s Russian translation. The algorithm deserves a separate discussion (I invented it myself because nothing I found online worked the way I needed). It first finds the longest sentences and matches each with its pair through the cosine similarity of embedding vectors produced by the multilingual e5 model. These sentences become anchors. Then, on the assumption that errors are almost impossible for long sentences, the longest sentence between two anchors is found, and everything repeats recursively. There are many situations where a sentence in Russian has no equivalent in English or vice versa, where one sentence is split into two, or where two are merged into one. The algorithm handles these as best it can. The resulting alignment quality is quite good, to the point that alignment errors are hard to find (though some are likely still there). Either way, the alignment is only needed as context for translating words, so rare errors are not a big deal.
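
The recursive anchoring idea can be sketched as follows. This is a simplified illustration, not the actual aligner: the `align` function, the `min_sim` threshold, and the rule of falling back to the next-longest sentence are my assumptions, and splits and merges of sentences are not handled. In real use `embed` would wrap a multilingual e5 model; here it is a lookup into toy vectors so the example runs on its own.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(
    src: Sequence[str],
    tgt: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    min_sim: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return (source index, target index) anchor pairs in document order."""
    src_vecs = [embed(s) for s in src]
    tgt_vecs = [embed(t) for t in tgt]
    pairs: List[Tuple[int, int]] = []

    def recurse(s_lo: int, s_hi: int, t_lo: int, t_hi: int) -> None:
        if s_lo >= s_hi or t_lo >= t_hi:
            return
        # Try source sentences from longest to shortest until one finds a match.
        by_length = sorted(range(s_lo, s_hi), key=lambda k: len(src[k]), reverse=True)
        for i in by_length:
            j = max(range(t_lo, t_hi), key=lambda k: cosine(src_vecs[i], tgt_vecs[k]))
            if cosine(src_vecs[i], tgt_vecs[j]) >= min_sim:
                pairs.append((i, j))            # this pair becomes an anchor
                recurse(s_lo, i, t_lo, j)       # align what lies before the anchor
                recurse(i + 1, s_hi, j + 1, t_hi)  # and what lies after it
                return

    recurse(0, len(src), 0, len(tgt))
    return sorted(pairs)


# Toy stand-in for an embedding model: hand-written 3-dimensional vectors.
TOY_VECTORS = {
    "alpha alpha alpha": (1.0, 0.0, 0.0),
    "beta": (0.0, 1.0, 0.0),
    "gamma gamma": (0.0, 0.0, 1.0),
    "ALPHA ALPHA ALPHA": (0.9, 0.1, 0.0),
    "GAMMA GAMMA": (0.0, 0.1, 0.9),
}
source = ["alpha alpha alpha", "beta", "gamma gamma"]
target = ["ALPHA ALPHA ALPHA", "GAMMA GAMMA"]
anchors = align(source, target, TOY_VECTORS.__getitem__)
```

In this toy run the short middle sentence has no counterpart on the other side and simply stays unaligned, which mirrors the "no equivalent" case mentioned above.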

Seeking Alpha Testers for a Revolutionary Text and PDF Management Tool | March 03 2026, 03:02

Looking for alpha-testers. As part of R&D and for my own tasks, I wrote a productivity tool (I actually wrote about this in my last post, but Facebook said that because I put a link in the post, only 12% saw it). Now I want to check if it will be useful to anyone else. If the idea resonates with you — let me know, and I will share access.

Website smartfolio dot me. What’s the main idea?

It’s an online notebook for working with text and PDFs, organized as a graph. It looks like Google Docs, but there’s an important difference: you can attach “child” documents to specific parts of the main text to expand on details or clarify concepts. These “comments” themselves are full documents and can have their own nested branches.

If there’s a fragment in the text that is unclear, you can ask the system to explain it (this will require your Google Gemini API key).

The system uses the full context of the document to generate a response.

Explanations are permanently attached to a specific place in the text.

This is super convenient when reading complex scientific articles. For instance, you can highlight the authors’ surnames in a PDF and instantly get a background on them — the information will be attached right to that fragment on the page.

Typical workflow

Upload a complex text and read it right in the app from either a mobile or a computer. As you go, add manual or AI-generated notes to important or unclear sections for future reference.

I do not store your documents, PDFs, images, or API keys on my servers. All data is stored in Turso DB (SaaS, free up to 5 GB).

Screenshots on the website’s main page best describe the project.

How to try?

To register in the app, you need an invite code. Just write to me in the comments or in a private message, and I will send one.

Website smartfolio-dot-me

Revolutionizing Research: Introducing a Web-Based Notebook Integrated with AI and PDF Support | February 19 2026, 16:19

I’ve further developed my new tool for working with information and organizing it. The main idea is a web-based notebook for research and for studying and working through subjects, integrated with AI and PDF support.

The main problem with typical PDF readers and notes is that the context is lost as soon as you switch to a new tab. In my tool, each text fragment or PDF becomes a node in a “live” hypertext tree, which I can access from multiple computers at any time.

Work process:

– Contextual AI. I can ask the AI to clarify complex passages right within the document. The explanation stays right where the question was asked. Moreover, it is a separate document, linked to the specific spot in the source. When clicked, you see both the original and the explanation on the screen at the same time.

– Panels instead of windows. If the explanation itself requires clarification, a new panel opens to the right. This allows for an endless chain of queries, never losing the place in the original text. That is, you see several panels at once, and unnecessary ones can be closed.

– PDF support. I can upload a PDF, select an area on the page (e.g., a complex diagram or a list of authors), and the LLM instantly extracts data, supplements, or explains them. The explanation is attached to the spot where it was requested, just like with non-PDFs.

– Nested annotations. My comments are not just static text. They can contain their own PDFs, links, and further sub-tasks for AI, maintaining a depth of nesting that reflects how we actually think.

This is not just a file storage system, but an “engine” for building knowledge.

The tool suits me personally very well, but perhaps it only solves my specific tasks. What do you think, would something like this be useful to others? Would it be useful to you? Should I develop the project into a fully-fledged product and give it to other users for testing?

My Ambitious 2026 Plan: From Galapagos Travel to Academic Achievements and Creative Pursuits | January 20 2026, 04:44

My plan for 2026:

– Travel to the Galápagos Islands, Ecuador for a week (summer)

– Finish and release a book on Information Retrieval (also summer, progressing slowly, first couple of chapters are already written. Already spent about 50-100 hours on this, the easy part)

– Release at least one scientific paper, probably on Data Mining (spring). Ideally, submit it somewhere to a journal (challenging). Already spent about 30 hours on this topic, a lot left to do.

– Make a step towards a PhD. Find professors, visit universities, understand the cost and assess my capabilities and resources.

– Continue studying fundamental mathematics and not die (linear algebra, calculus, probability theory, statistics, classical ML). In 2025, I spent about 200-400 hours on this topic.

– Continue studying Deep Learning and reach the “can teach” level. In 2025, I spent about 100-200 hours on this topic.

– Continue studying Data Mining/NLP.

– Update my book on RecSys, releasing version 2.0 with updates and corrections (autumn 2026)

– Make noticeable progress in painting and playing the piano. Specifically, learn Schubert’s serenade (Ständchen, D 889) completely and create at least one canvas that I wouldn’t be ashamed to give as a gift.