Migrating SAP Commerce Content with Graph Databases: A Neo4j and Memgraph Guide | June 10 2026, 03:12

I’ve published a new article on Hybrismart after a long hiatus. It’s about migrating data from an old site to a new one using a graph database (specifically, I used Neo4j and Memgraph). The case is as follows: there is an old site and a new site, and you need to transfer CMS data (components, pages, layout) from the old one to the new one, applying various transformations along the way. For example, the new site has different styles, a different layout, and some different components. For this task, I used a graph database.

It’s been a while since I wrote about SAP Commerce Cloud on my blog. I worked at SAP for two years and thought it inappropriate to write about their products while formally having access to internal documents. I am currently working on two projects at once: one is an SAP Commerce Cloud migration, and the other is largely about graph databases. The article was born at the junction of these two worlds.

https://hybrismart.com/2026/06/10/migrating-sap-commerce-content-with-a-graph-database/


Script Evolution: Creating Multi-Dimensional Word Art | May 27 2026, 21:12

I wrote a script that generates inscriptions that read as three different words when viewed from the left, from the right, and from the top. This builds on my previous post, where it was only left and right. There are actually three scripts. One picks triplets of words from a dictionary for which this is technically possible. Another builds a 3D model that can be sent to a 3D printer (I might do that today). The third renders a visualization of the model (see the video).

Scripting Letter-Matched Phrase Translations | May 27 2026, 18:28

I made a script that creates things like this. You can translate phrases into each other, as long as they have the same number of letters. Now I’m thinking about printing it on a 3D printer; everything is ready.

Exploring Algorithmic Image Processing for Large Format Printing | May 24 2026, 22:40

I’m playing with algorithmic image processing. The images only look interesting when printed in a large format, because all the fine lines merge when scaled down to a phone screen. I’ll post a close-up in the comments.

It works like this: an input image is divided into squares of different sizes. Each square is reduced to a single number: how dark it is. The darker the square, the more lines are drawn inside it. The lines are not straight; they are Bezier splines. They pass smoothly from one square to the next because the points on the boundaries are shared. The result is not a grid but a single continuous thread.

For color, the image is split into CMYK channels (as in printing). Each channel is processed separately, with its own grid and its own lines. Then the layers are superimposed, and a color picture emerges from three or four black-and-white plates.

The image doesn’t look blocky because the splines smoothly transition from one square to another, but there is a problem: dividing the image into 10×10 squares essentially reduces the resolution tenfold. To correct this, several passes are made with different square sizes and shifted grids. The first pass uses large cells, the second is finer and shifted 10 pixels to the right, the third is even finer and shifted diagonally.
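The grid step described above can be sketched in Python. This is only an illustration under assumptions: the function names, the pass sizes, and the linear mapping from darkness to line count are made up for the example, and the Bezier spline drawing itself is not covered.

```python
# Hypothetical sketch: names, pass sizes and the linear darkness-to-lines
# mapping are assumptions, not the actual script.

def cell_darkness(gray, x0, y0, size):
    """Mean darkness (0.0 = white, 1.0 = black) of one square cell.

    `gray` is a list of rows of 0..255 brightness values; the cell is
    clipped to the image bounds.
    """
    height, width = len(gray), len(gray[0])
    values = [
        gray[y][x]
        for y in range(max(y0, 0), min(y0 + size, height))
        for x in range(max(x0, 0), min(x0 + size, width))
    ]
    if not values:
        return 0.0
    return 1.0 - sum(values) / (255.0 * len(values))


def grid_starts(shift, size, limit):
    """Left (or top) edges of the cells of a grid shifted by `shift` pixels."""
    offset = shift % size
    # Start one cell early so a shifted grid still covers the image edge.
    start = offset - size if offset else 0
    return list(range(start, limit, size))


def lines_per_cell(gray, size, dx=0, dy=0, max_lines=8):
    """Map every cell of a (possibly shifted) grid to a number of lines."""
    height, width = len(gray), len(gray[0])
    return {
        (x0, y0): round(cell_darkness(gray, x0, y0, size) * max_lines)
        for y0 in grid_starts(dy, size, height)
        for x0 in grid_starts(dx, size, width)
    }


# Three hypothetical passes: coarse, finer and shifted right,
# finest and shifted diagonally.
PASSES = [
    {"size": 40, "dx": 0, "dy": 0},
    {"size": 20, "dx": 10, "dy": 0},
    {"size": 10, "dx": 5, "dy": 5},
]


def all_passes(gray, passes=PASSES):
    """Run every pass over one channel and return the per-pass line counts."""
    return [lines_per_cell(gray, p["size"], p["dx"], p["dy"]) for p in passes]
```

Each CMYK channel would go through `all_passes` separately; the shifted grids are what keep cell boundaries from lining up between passes.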

The entire process is controlled by a JSON config, with separate parameters for each channel and specific settings for each pass within a channel. The output is an SVG, which can be scaled to the size of a wall without loss of quality, and a PNG, in which the CMYK layers are superimposed with transparency.

Mastering Cross-Posting: From Facebook Frustrations to Dual Blogging Excellence | May 23 2026, 14:28

I have perfected cross-posting from Facebook to my two blog sites (which almost no one visits): beinginamerica.com and raufaliev.com. When a new post is published on Facebook, a pipeline is triggered that does the following: translates the post into English; processes the attached images and generates descriptions for them; creates a title based on the post text and the image descriptions; generates tags from the same inputs; saves the post to Turso DB (a cloud database, free up to certain limits); creates embeddings via OpenAI and stores them in Qdrant Cloud (also a cloud database, but a vector one); and finally uploads the images to WordPress via its API and publishes the post in English and Russian, also via the API.

All would be well, but of all the APIs, the silliest one is Facebook’s. First, for pages like mine that have been switched to the New Experience, it’s almost impossible to use most of this API. Well, it is possible, but you have to spend a long time proving to Facebook that you really need it: showing startup documents, demonstrating the application, and so on. Obviously, they are reluctant to support anything that takes content out of their system. Second, the token that gives access to the latest posts is relatively short-lived (possibly a few weeks), and a new one can only be obtained through a browser. So any automation requires regular attention, otherwise it breaks.

If you slip up and don’t export the latest posts through the Facebook Graph API in time, they simply drop off the list of recent posts, and that’s it: no more API access to them. The only option then is to request an archive download from Facebook. This download is also rather silly: it requires a lot of transformation and cleanup. For example, the posts file that I process for some reason contains links that I sent in comments without accompanying text. And the comments themselves are in a separate file!

Assigning tags was a separate challenge. There are about 10,000 posts in total. That’s a big chunk, and you can’t derive tags from all of it at once because it doesn’t fit into the LLM’s context window. But the tags still need to reflect everything. So I did this: a script picks random posts from the 10,000, as many as fit just under a given token limit, and appends the prompt “generate the most common tags for me, 30 pieces” to the end of the block (I’m simplifying the actual prompt). I ran this 10 times and got 10 sets of 30 tags, each generated from a different slice of the database. That made 300 tags, some of which are exact duplicates, while others are synonyms or closely related in meaning. All of this is fed into the LLM, which returns a list of tags and a tag hierarchy. Now we have a limited set of tags that reflects the 10,000 posts as closely as possible. It turns out that in almost 20 years on Facebook, my breakdown is as follows:

Tag           Posts
===================
#Russia        3412
#Thoughts      3146
#Tech          3105
#Culture       2765
#Hobbies       2726
#AI            1603
#Science       1367
#Software      1358
#Travel        1298
#Learning      1138
#Society       1050
#Nature         958
#Education      915
#Business       902
#Art            894
#Programming    889
#Humor          840
#History        807
#Gadgets        750
#Moscow         713
#USA            614
#Cinema         567
#Webdev         493
#Music          476
#Sports         473
#Mindset        443
#Auto           400
#Books          386

and so on. This list includes both tags from the limited list and tags that the LLM assigned to a post simply because it didn’t find anything suitable in the limited list.

Tags from the limited list became categories on the site. Those tags, plus all the others, also became regular WordPress tags.
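The random-slice step of the tag generation described above can be sketched in Python. This is a minimal illustration, not the actual script: the 4-characters-per-token estimate, the function names, and the exact instruction text are assumptions, and the call to the LLM is left out.

```python
import random


def estimate_tokens(text):
    # Crude assumption: about 4 characters per token. A real tokenizer
    # would give a more accurate count.
    return max(1, len(text) // 4)


def sample_within_budget(posts, token_limit, rng):
    """Greedily fill the token budget with posts taken in random order."""
    chosen, used = [], 0
    for post in rng.sample(posts, len(posts)):
        cost = estimate_tokens(post)
        if used + cost > token_limit:
            continue  # this post does not fit, but a shorter one still might
        chosen.append(post)
        used += cost
    return chosen


def build_prompts(posts, token_limit, runs=10, tags_per_run=30, seed=0):
    """Build one prompt per run, each from a different random slice of posts."""
    rng = random.Random(seed)
    # Hypothetical wording; the budget above covers the posts only, so leave
    # some headroom in token_limit for this instruction.
    instruction = f"Generate the {tags_per_run} most common tags for these posts."
    prompts = []
    for _ in range(runs):
        batch = sample_within_budget(posts, token_limit, rng)
        prompts.append("\n\n".join(batch) + "\n\n" + instruction)
    return prompts
```

Each prompt would be sent to the LLM separately, and the resulting tag sets would then go through one more LLM pass to merge duplicates and synonyms.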

As for image search, I had two ideas. The first was OpenCLIP. It’s pretty straightforward but requires hosting the model somewhere. That’s easy on my machine, but it’s inconvenient to start it every time, and I planned to move the migrator to a cheap server on Amazon. Running it on a cloud-hosted model is also fine, but you have to pay a bit, and it is yet another dependency. The main thing, though, is that search works quite well without it. That’s the second idea: I generate descriptions for images using OpenAI, which is used for translating into English anyway, and then create embeddings from those descriptions using a large model. So far, all search tests have been a great success, especially when there’s text in the image, where it’s an open question whether OpenCLIP would have interpreted it correctly.

In the end:

1) WordPress raufaliev.com – free

2) WordPress beinginamerica.com – free

3) Turso DB, where all posts are stored – free

4) Qdrant Cloud, where embeddings are stored – free

5) OpenAI for translation and image descriptions – not free, but inexpensive ($30 for processing posts over a year).

I attach two screenshots – how the search by images works, and by texts, as well as the migrator dashboard.

Exploring Automated Documentation of Large Excel Datasets | May 06 2026, 22:28

I wonder if there is an agent that can take an Excel workbook significantly larger than the context window and start documenting what it contains. It would work something like this. Here are several tabs. On tab 5 there is a table with a million rows and five columns; the columns are such-and-such. We sample random rows from the table: this column looks like numbers, and that one looks like surnames. We assume the first column is numeric everywhere, so we write code that checks this assumption and at the same time calculates the min/max and the set of unique values. It turns out there are few values, only five. We record that. Now we check the surnames. Yes, these are just strings, and a fresh sample shows that they are indeed surnames. Here’s a formula; we look at what it references. And so on. This column has an unclear purpose. We look at the data: some numbers from 0 to 1. We measure the mean and the spread. We ask the user, in case they can comment. They did: it turned out to be a KPI issued to this user from an external system. We record it. And so on. Documentation emerges. Later, once the documentation exists, one can ask for operations on all this data, since the LLM now more or less understands the purpose of the data and how it is connected, and can form and verify hypotheses about outliers.
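The "check the assumption in code" step could look roughly like this in Python. It is a sketch under assumptions: the column is already loaded as a plain list of cell values, and the function name and the `max_unique` threshold are made up for the example.

```python
import statistics


def profile_column(values, max_unique=20):
    """Test the 'this column is numeric' assumption and collect basic stats."""
    numbers, non_numeric = [], 0
    for value in values:
        try:
            numbers.append(float(value))
        except (TypeError, ValueError):
            non_numeric += 1

    profile = {
        "rows": len(values),
        "non_numeric": non_numeric,
        "is_numeric": non_numeric == 0 and bool(numbers),
    }
    if numbers:
        profile["min"] = min(numbers)
        profile["max"] = max(numbers)
        profile["mean"] = statistics.fmean(numbers)
        profile["stdev"] = statistics.pstdev(numbers)
        unique = set(numbers)
        # Keep the value set only when it is small enough to document.
        profile["unique_values"] = sorted(unique) if len(unique) <= max_unique else None
    return profile
```

The returned dictionary is the kind of compact summary an agent could write into the documentation instead of the million raw rows.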

The Crucial Role of Data Quality Oversight in Development Projects | May 06 2026, 16:07

Almost every development project has a dedicated functional test automation team, yet a similar emphasis on data quality is surprisingly rare. Whether data comes from external integrations, from users, or is generated by the system itself, it often goes unchecked simply because no one seems to consider it important. Later the team struggles with the consequences, which snowball. The longer such issues persist, the harder they are to resolve, until people simply resign themselves to the “irreparable” state of the database. It is much better to catch these problems the moment they arise, while the technical debt is still manageable, than to figure out later how to keep them from crashing everything.

In essence, there needs to be a constant “supervisor” over every kind of data store the system uses (relational, NoSQL, search indexes, graph databases): a data quality checking layer on top of the processes. Of course, there must be clear rules: what exactly to check and which flags to use to mark specific anomalies.
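As a rough illustration of "clear rules and flags", here is a minimal Python sketch. The record fields, rule names, and flag names are hypothetical; a real checker would read from the actual data stores and feed a report into the team’s workflow.

```python
def check_records(records, rules):
    """Return (record_index, flag) pairs for every rule a record violates."""
    findings = []
    for index, record in enumerate(records):
        for flag, is_valid in rules.items():
            if not is_valid(record):
                findings.append((index, flag))
    return findings


# Hypothetical rules for a hypothetical "customer" record:
# each flag maps to a predicate that returns True for a healthy record.
RULES = {
    "MISSING_EMAIL": lambda r: bool(r.get("email")),
    "NEGATIVE_BALANCE": lambda r: r.get("balance", 0) >= 0,
}
```

For example, `check_records([{"email": "", "balance": -1}], RULES)` flags the single record with both `MISSING_EMAIL` and `NEGATIVE_BALANCE`.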

A person (not an AI) must own the process and integrate these reports into the development and support workflows. Many data integrity issues cannot be fixed through a user interface; they require the engineering team to write scripts for bulk correction and data cleansing.

Incidentally, this also shades into anomaly (outlier) detection: machine learning and LLMs can identify subtle “bad” patterns that traditional rule-based systems might miss.

What do you think about this? Are similar mechanisms implemented in your processes?

Harnessing Chat Data for Semantic Q&A Search | April 30 2026, 04:05

In one evening, I built a simple utility that takes a year and a half of the Natural Language Processing chat (65,000 messages) and converts it into question-answer pairs with semantic search. Clicking a search result (on the left) opens the dialogue in the chat. The messages that answer the question are highlighted, and the original wording of the question is highlighted at the top as well.

How it works: the system assumes that people mainly reply to messages that are relatively recent. If one message gets several replies, it is probably useful and caught the interest of others in the chat. The system takes a block of messages that starts with a message many people replied to and ends with the last message in its reply-to chain, and it keeps only the blocks where the original question has at least 3 reply-tos. In essence, it cuts a piece out of the chat that starts with a popular question and ends where, most likely, only irrelevant content follows. Such blocks can overlap, for example if someone asked a question while others were still replying to something else.

So, if user A asked what the weather was like, and they received answers like “good,” “bad,” “rain,” and there were five messages without a reply-to, and then someone replied to “rain” with the question “why rain”, and five more people replied to this question, then the first question about the weather makes it into the system – the piece ends with 13 messages.
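One possible reading of this block-cutting rule, sketched in Python. The message format (`id` and `reply_to` fields) and the choice to follow replies to replies are assumptions; the actual utility may count the chain differently.

```python
from collections import defaultdict


def extract_blocks(messages, min_replies=3):
    """Cut blocks that start at a popular message and end at its last reply.

    `messages` is a chronological list of dicts with an 'id' and a
    'reply_to' (the id of the message being answered, or None).
    """
    position = {m["id"]: i for i, m in enumerate(messages)}
    children = defaultdict(list)
    for m in messages:
        if m["reply_to"] in position:
            children[m["reply_to"]].append(m["id"])

    blocks = []
    for m in messages:
        if len(children[m["id"]]) < min_replies:
            continue  # not enough direct replies to count as a popular question
        # Walk the whole reply tree under the question to find its last message.
        last, stack = position[m["id"]], [m["id"]]
        while stack:
            current = stack.pop()
            last = max(last, position[current])
            stack.extend(children[current])
        # The block also includes unrelated messages that fall in between.
        blocks.append(messages[position[m["id"]]: last + 1])
    return blocks
```

Because every popular message produces its own block, blocks can overlap, as described above.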

Afterwards, these pieces are summarized into question-answer pairs.

It turns out quite cool.

P.S. In the screenshot, the search query has nothing to do with the search result because I foolishly took the screenshot after I changed the query but before I hit send.

Misadventures in Keyboard Layouts: Searching for Gremlin, Finding Surprises | April 28 2026, 20:33

This is me typing the word gremlin, without switching the keyboard layout. Wanted to read about the query language for graph databases, need it for work. Google surprises, it does surprise

CPU vs GPU: A Speed Challenge in Embedding Creation | April 11 2026, 18:08

For certain tasks, the difference between a CPU and a GPU is simply astounding. For example, I need to create millions of embeddings with the BGE M3 model. On my quite powerful 24-core Intel Core Ultra 9 285K processor, creating 500 embeddings takes 45.85 seconds, while on an NVIDIA 5090 GPU the same task completes in just 0.36 seconds. It is so fast that I wrote this benchmark specifically to figure out whether my GPU is being utilized at all. In test mode, the program that sends requests to TEI (Text Embeddings Inference) does so too infrequently (roughly a couple of times per second), so the GPU load graphs stay at practically zero.

— Testing http://localhost:8080/embed — <– CPU version

Requests completed: 500

Total time: 45.85 sec

Throughput: 10.90 req/sec

Average latency: 4386.11 ms

P95 latency: 5021.88 ms

— Testing http://localhost:8090/embed — <– GPU version (NVIDIA 5090)

Requests completed: 500

Total time: 0.36 sec

Throughput: 1398.69 req/sec

Average latency: 31.38 ms

P95 latency: 53.18 ms

========================================

RESULT: http://localhost:8090/embed is 99.22% faster
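To put the final line in perspective, the totals above can be turned into a speedup factor with a few lines of Python. The "99.22% faster" figure appears to be the reduction in total time; the function name here is made up, and because the sketch uses the rounded totals from the log, it lands on 99.21% rather than 99.22%.

```python
def compare(cpu_seconds, gpu_seconds):
    """Return (speedup factor, percentage of total time saved)."""
    speedup = cpu_seconds / gpu_seconds
    time_saved_pct = (cpu_seconds - gpu_seconds) / cpu_seconds * 100
    return speedup, time_saved_pct


# Rounded totals taken from the benchmark output above.
speedup, saved = compare(45.85, 0.36)
print(f"{speedup:.0f}x faster, {saved:.2f}% less time")
# prints "127x faster, 99.21% less time"
```

In other words, the GPU run is roughly 127 times faster, which is easier to appreciate than a percentage that is already close to 100.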