Automating Multilingual Blog Management with AI | May 03 2024, 18:46

I refined the mechanism for cross-posting to my Russian and English blogs. First, I figured out how to group posts by topic, assign tags, and categorize them. This now also happens on the fly for new posts.

I’ll write an article on hybrismart later, but the gist is this: OpenAI embedding vectors are created for all posts, then KMeans divides them into 50 groups, and the posts in each group are sorted by their distance from the cluster center. The posts closest to the center are selected (as many as fit within an N KB limit), and the script asks OpenAI to name the topic of that cluster. I end up with 50 topics, from which I choose, say, Art or Books, and then extract all posts close to that theme, again sorted by their distance from it.

The accuracy of this proximity search isn’t very high, especially for posts with little text. Therefore, each candidate post is fed into a local LLAMA3 8B on my laptop, which decides whether it truly fits the theme. It also makes occasional mistakes, but of the 2000 posts the script found by proximity for the art theme, it kept 600, and the result is generally quite good.
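Here is a minimal sketch of the clustering and selection steps, not the actual script. It makes several assumptions: toy 2-D vectors stand in for the OpenAI embeddings, a small pure-Python k-means stands in for a library KMeans, the cluster centroid stands in for the "theme" vector, and the confirm callback is a hypothetical placeholder for the LLAMA3 8B check. All function names are invented for illustration.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (stand-in for a library KMeans)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def representatives(posts, vectors, centroids, labels, cluster, max_bytes):
    """Posts of one cluster, closest to its centroid first, within a byte budget."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    members.sort(key=lambda i: math.dist(vectors[i], centroids[cluster]))
    picked, used = [], 0
    for i in members:
        size = len(posts[i].encode("utf-8"))
        if used + size > max_bytes:
            break
        picked.append(posts[i])
        used += size
    return picked


def filter_by_theme(posts, vectors, theme_vector, confirm, limit):
    """The `limit` posts nearest to a theme vector, kept only if confirm(post) is true."""
    order = sorted(range(len(posts)),
                   key=lambda i: math.dist(vectors[i], theme_vector))
    return [posts[i] for i in order[:limit] if confirm(posts[i])]


def looks_like_art(text):
    # Hypothetical stand-in for the local LLM verdict: it rejects one post.
    return "Monet" not in text


posts = ["Monet show", "Rothko retrospective", "Sculpture park walk",
         "New novel review", "Library haul"]
# Toy "embeddings": three art posts near each other, two book posts far away.
vectors = [[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [20.0, 20.0], [20.0, 21.0]]

centroids, labels = kmeans(vectors, k=2)  # the post describes 50 groups
art = labels[0]
# Text that would go into the "what is this cluster about?" prompt.
sample = representatives(posts, vectors, centroids, labels, art, max_bytes=32)
themed = filter_by_theme(posts, vectors, centroids[art], looks_like_art, limit=3)
print(sample)
print(themed)
```

The byte budget is what keeps the topic-naming prompt small: only the posts nearest the center are sent, since they are the most typical of the cluster. The second pass through confirm mirrors the idea that distance alone is a rough filter and a model gets the final say on each candidate.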

A separate script iterates over the posts on beinginamerica and corrects the tags and categories of those that appear in the list produced by the script above.

I have already sorted posts into themes such as art, books, and science. Everything is automated, so adding another 10 themes is easy; I will do that gradually. For now, tags exist only on beinginamerica; I will add them to raufaliev.com later.

Additionally, if a post contains ENG in parentheses, the script sends the part after the ENG marker to the English site and the remaining part to the Russian site. This is convenient when I write a post in both languages at once.
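The splitting could look roughly like the sketch below. It rests on assumptions the post does not confirm: the marker is the literal string (ENG), the Russian text comes before it and the English text after it, and split_bilingual is a made-up name.

```python
def split_bilingual(post, marker="(ENG)"):
    """Return (russian_text, english_text); english_text is None without a marker.

    Assumption: everything before the marker is Russian, everything after is English.
    """
    before, found, after = post.partition(marker)
    if not found:
        # No marker: treat the whole post as Russian-only.
        return post.strip(), None
    return before.strip(), after.strip()


russian, english = split_bilingual("Привет, мир!\n(ENG)\nHello, world!")
print(russian)
print(english)
```

Returning None for the English part lets the caller skip the English site for posts written in one language only.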

The titles for my archived posts were generated with LLAMA3 8B, but OpenAI is still more powerful, albeit more expensive. For new posts, OpenAI GPT-4 is now used.

Neither LLAMA3 nor OpenAI GPT-4 excels at creating titles for texts that are too short and uninformative; both often produce quite incoherent output. Feel free to read and smile.

#TechStories