I have perfected the cross-posting from Facebook to my two blog sites (which almost no one visits): beinginamerica.com and raufaliev.com. When a new post is published on Facebook, a pipeline is triggered that translates the post into English, processes the attached images and generates descriptions for them, creates a title from the post text and the image descriptions, and generates tags from the same material. It then records the post in Turso DB (a cloud database, free up to certain limits), creates embeddings via OpenAI and stores them in Qdrant Cloud (also a cloud database, but a vector one), and finally uploads the images to WordPress via its API and publishes the post in English and Russian, also via the API.
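To make the order of the steps easier to see, here is a minimal sketch of such a pipeline. It is not my actual migrator: the `Post` fields, the stage names, and the do-nothing placeholder stages are all hypothetical, and real stages would call the respective translation, storage, and publishing APIs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Post:
    """One Facebook post as it moves through the pipeline (hypothetical shape)."""
    text_original: str
    image_paths: list[str] = field(default_factory=list)
    text_en: str = ""
    image_descriptions: list[str] = field(default_factory=list)
    title: str = ""
    tags: list[str] = field(default_factory=list)
    completed: list[str] = field(default_factory=list)  # stage names, in order


# Stage names follow the order given in the text; the names are made up.
STAGE_ORDER = [
    "translate_to_english",
    "describe_images",
    "generate_title",
    "generate_tags",
    "save_to_turso",
    "embed_and_store_in_qdrant",
    "upload_images_to_wordpress",
    "publish_en_and_ru",
]


def run_pipeline(post: Post, stages: dict[str, Callable[[Post], None]]) -> Post:
    """Run every stage in order; an exception in any stage stops the run,
    so a half-processed post never reaches the publishing step."""
    for name in STAGE_ORDER:
        stages[name](post)
        post.completed.append(name)
    return post


# Placeholder stages that do nothing; real ones would call the actual APIs.
noop_stages = {name: (lambda post: None) for name in STAGE_ORDER}
result = run_pipeline(Post(text_original="Привет"), noop_stages)
print(result.completed[-1])  # publish_en_and_ru
```

Keeping publishing as the last stage means a failure in translation or storage leaves nothing half-published on the blogs.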
All would be well, but of all the APIs, the silliest one is Facebook’s. First, for pages like mine that have been transitioned to the New Experience, most of this API is almost impossible to use. Well, it’s possible, but you have to spend a long time proving to Facebook that you really need it: showing startup documents, demonstrating the application, and so on. Obviously, they are reluctant to deal with something that takes content out of their system. In addition, the token that gives access to the latest posts is relatively short-lived (possibly a few weeks), and a new one can only be obtained through a browser. So any automation requires regular attention, otherwise it breaks.
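Since the token has to be renewed by hand, the least the automation can do is warn before it expires. Here is a small sketch of such a check. It assumes the expiry time is known as a Unix timestamp (for example, noted down when the token was issued); the function names and the 7-day warning threshold are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(expires_at: int, now: Optional[datetime] = None) -> float:
    """Days left before the token expires; negative if it already has."""
    now = now or datetime.now(timezone.utc)
    expiry = datetime.fromtimestamp(expires_at, tz=timezone.utc)
    return (expiry - now).total_seconds() / 86400


def needs_renewal(expires_at: int, warn_days: float = 7,
                  now: Optional[datetime] = None) -> bool:
    """True when it is time to go to the browser and get a new token."""
    return days_until_expiry(expires_at, now) <= warn_days


# Example with fixed dates so the result does not depend on today's date.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires_at = int(datetime(2024, 1, 4, tzinfo=timezone.utc).timestamp())
print(needs_renewal(expires_at, warn_days=7, now=now))  # True
```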
If you slip up and don’t pull the latest posts through the Facebook Graph API in time, they simply drop off the list of recent posts, and that’s it: no more API access to them. The only way left is to request an archive download from Facebook. This download is also rather silly: it needs a lot of transformation and removal of unnecessary material. For example, the posts file that I process for some reason contains links that I sent in comments without accompanying text. And the comments themselves are in a separate file!
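One of those cleanup steps, dropping the stray link-only entries, could look roughly like this. It is a sketch under assumptions: the real archive format is not shown here, so the entries are assumed to be already flattened into dicts with a hypothetical `"text"` key. Note that this filter would also drop a genuine post that consists of nothing but a link.

```python
import re

# Matches a string that is a single http(s) URL and nothing else.
URL_ONLY = re.compile(r"https?://\S+")


def is_bare_link(text: str) -> bool:
    """True if the text is just one URL with no accompanying words."""
    return URL_ONLY.fullmatch(text.strip()) is not None


def drop_bare_links(entries: list[dict]) -> list[dict]:
    """Keep only entries whose (hypothetical) "text" field is more than a bare URL."""
    return [e for e in entries if not is_bare_link(e.get("text", ""))]


entries = [
    {"text": "Went hiking today, photos attached"},
    {"text": "https://example.com/some-article"},
    {"text": "Worth reading: https://example.com/some-article"},
]
print(len(drop_bare_links(entries)))  # 2
```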
To assign tags, I had to solve a separate problem. Here’s the thing: there are about 10,000 posts in total. That’s a big chunk, and you can’t build tags from all of it at once because it doesn’t fit into the LLM’s context window. But you need to. So I did this: a script takes random posts from the 10,000, as many as fit with their total size just below a specified token limit, and appends to the end of this block the prompt “generate the most common tags for me, 30 pieces” (I’m simplifying the actual prompt). I ran this 10 times and got 10 sets of 30 tags each, generated for different slices of the database. That made 300 tags, some of which are exact duplicates, while others are synonyms or closely related in meaning. All of this is fed into the LLM, and out comes a list of tags and a tag hierarchy. Now we have a limited set of tags that reflects the 10,000 posts as closely as possible. It turns out that in almost 20 years on Facebook, my breakdown is as follows:
Tag          Posts
==================
#Russia       3412
#Thoughts     3146
#Tech         3105
#Culture      2765
#Hobbies      2726
#AI           1603
#Science      1367
#Software     1358
#Travel       1298
#Learning     1138
#Society      1050
#Nature        958
#Education     915
#Business      902
#Art           894
#Programming   889
#Humor         840
#History       807
#Gadgets       750
#Moscow        713
#USA           614
#Cinema        567
#Webdev        493
#Music         476
#Sports        473
#Mindset       443
#Auto          400
#Books         386
…
and so on. This list includes both tags from the limited list and tags that the LLM assigned to content simply because it didn’t find anything suitable in the limited one.
Tags from the limited list became categories on the site. The remaining tags, together with these, became regular WordPress tags.
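The random-slice step of the tag generation described above can be sketched as follows. This is an illustration, not my actual script: the function names are hypothetical, and `rough_token_count` is a crude stand-in (about 4 characters per token) for a real tokenizer. The limit should leave some headroom for the instruction and the separators between posts.

```python
import random
from typing import Callable, Optional


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: assume about 4 characters per token."""
    return max(1, len(text) // 4)


def sample_posts_within_budget(
    posts: list[str],
    token_limit: int,
    count_tokens: Callable[[str], int] = rough_token_count,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Pick random posts whose combined token count stays within token_limit."""
    rng = rng or random.Random()
    shuffled = posts[:]
    rng.shuffle(shuffled)
    chosen, used = [], 0
    for post in shuffled:
        cost = count_tokens(post)
        if used + cost > token_limit:
            continue  # this post does not fit, but a shorter one still might
        chosen.append(post)
        used += cost
    return chosen


def build_prompt(posts: list[str], instruction: str) -> str:
    """Concatenate the sampled posts and put the instruction at the very end."""
    return "\n\n".join(posts) + "\n\n" + instruction


# Ten independent slices of a made-up corpus, one prompt per slice.
corpus = [f"post number {i} about something" for i in range(1000)]
instruction = "generate the most common tags for me, 30 pieces"
prompts = [
    build_prompt(sample_posts_within_budget(corpus, token_limit=2000), instruction)
    for _ in range(10)
]
print(len(prompts))  # 10
```

Each prompt then goes to the LLM separately, and the resulting tag sets are merged in one final LLM call.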
As for image search, I had two ideas on how to do it. The first was OpenCLIP. It’s pretty straightforward but requires hosting the model somewhere. That’s easy on my machine, but it’s inconvenient to start it each time, and I planned to move the migrator to a cheap server on Amazon. Computing it with cloud-hosted models is also an option, but you have to pay a bit, and it is yet another dependency. But the main thing is that it works quite well without it. The second idea is the one I went with: I generate descriptions for images using OpenAI, which is used for translating into English anyway, and then create embeddings using a large model. So far, all search tests have been a great success, especially when there’s text in the image, where it’s a big question whether OpenCLIP would have interpreted it successfully.
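The idea of searching images through their text descriptions can be shown with a toy example. Everything here is a stand-in: `toy_embed` is a bag-of-words counter used in place of a real embedding model, the similarity is computed locally rather than in a vector database, and the file names and descriptions are invented.

```python
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search_images(
    query: str,
    descriptions: dict[str, str],
    embed: Callable[[str], Counter] = toy_embed,
    top_k: int = 3,
) -> list[str]:
    """Rank images by similarity between the query and each image's description.

    For brevity the descriptions are re-embedded on every call; a real setup
    would embed each description once and store the vector.
    """
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(desc)), image_id)
        for image_id, desc in descriptions.items()
    ]
    scored.sort(reverse=True)
    return [image_id for score, image_id in scored[:top_k] if score > 0]


descriptions = {
    "a.jpg": "a red car parked on a street",
    "b.jpg": "a screenshot with the text hello world",
    "c.jpg": "mountains and a lake at sunset",
}
print(search_images("text hello", descriptions, top_k=1))  # ['b.jpg']
```

Because the search runs over descriptions, any text visible in an image becomes searchable as long as the description mentions it.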
In the end:
1) WordPress raufaliev.com: free
2) WordPress beinginamerica.com: free
3) Turso DB, where all posts are stored: free
4) Qdrant Cloud, where embeddings are stored: free
5) OpenAI for translation and image descriptions: not free, but inexpensive (it cost $30 for a year of post processing).
I attach two screenshots showing how search works for images and for text, as well as the migrator dashboard.



