Today I finished finding and verifying a solution to an interesting problem that came up in a project we launched last week. The problem looks like this: you have 10 XML files, each a dump from an old CMS, one per language version of the site. The old system let editors build the pages of each version arbitrarily, so the XMLs are similar in structure but not exact duplicates. A text about the company, say, may be present in every language somewhere deep in each XML. In the first XML it might be near the beginning, in the second a bit further down, and in the third it might be missing altogether. But the relative order of the blocks within each file is constant: if the text about the company sits between two other segments in one file, it most likely sits between the same two in another (if those two segments were translated there at all). The first task is to link fragments that are about the same thing but in different languages, so that one or two known languages can be used to find the translations into all the others (or as many as are available).
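To make the "constant relative order" idea concrete, here is a minimal sketch of an order-preserving alignment between two lists of fragments, written as a small dynamic program in Python. It is not the method actually used on the project. The names `similarity`, `align` and `min_score` are hypothetical, and the default score (length ratio plus shared numbers) is only a crude language-independent stand-in that would need replacing with something stronger on real data.

```python
import re

def similarity(a: str, b: str) -> float:
    """Hypothetical language-independent score in 0..1 (an assumption):
    length ratio, blended with the overlap of numbers when any are present."""
    if not a or not b:
        return 0.0
    length_ratio = min(len(a), len(b)) / max(len(a), len(b))
    nums_a = set(re.findall(r"\d+", a))
    nums_b = set(re.findall(r"\d+", b))
    if nums_a or nums_b:
        shared = len(nums_a & nums_b) / len(nums_a | nums_b)
        return 0.5 * length_ratio + 0.5 * shared
    return length_ratio

def align(src, dst, score=similarity, min_score=0.5):
    """Return index pairs (i, j) linking src[i] to dst[j].

    Pairs never cross (order is preserved), fragments may be skipped on
    either side, and the total score of the chosen pairs is maximal.
    """
    n, m = len(src), len(dst)
    scores = [[score(s, d) for d in dst] for s in src]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = scores[i - 1][j - 1]
            # Pairing is only allowed above the threshold; skipping is free.
            pair = best[i - 1][j - 1] + s if s >= min_score else float("-inf")
            best[i][j] = max(best[i - 1][j], best[i][j - 1], pair)
    # Walk back from the corner to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = scores[i - 1][j - 1]
        if s >= min_score and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Made-up fragments: the company text is missing from the second language.
en = ["Welcome", "Founded in 1990, we employ 500 people.", "Contact us"]
de = ["Willkommen", "Kontakt"]
links = align(en, de)
```

In this toy example the length-based score alone is ambiguous ("Welcome" and "Kontakt" have the same length), and it is the no-crossing constraint that pushes the alignment toward linking the first fragments together and the last fragments together while leaving the company text unlinked.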
The second task is to take the 4 languages in which the site has already been launched and match the content of the four XMLs in those languages to the attributes of components in Hybris, where this content was uploaded months ago and has been actively edited by the client ever since. Once these matches are found, the remaining six languages can be loaded into the existing components, because the first task has already given us the links between languages. However, there are almost no exact matches between what is in Hybris and what is in the XMLs, only approximate ones. To continue the example above: the text about the company was split into three parts and two of them were edited, but overall it is still the same text as in the old CMS. So the task is to link the components in the system, which hold the texts in their current wording, to texts from XMLs that are many months old, as accurately as possible. The rest can be fine-tuned by hand.
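As an illustration of what "approximate match" can mean here, below is a minimal sketch using Python's standard `difflib`. Because a component may hold only one edited part of a longer XML text, it scores how much of the component is covered by the fragment rather than how similar the two are overall. The function names, the 0.6 threshold and the sample texts are all assumptions made for the example, not details from the project.

```python
import re
from difflib import SequenceMatcher

def words(text: str) -> list[str]:
    """Lowercased word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())

def containment(part: str, whole: str) -> float:
    """Fraction of the words of `part` that lie in in-order blocks shared
    with `whole` (0..1). Stays high when `part` is an edited piece of `whole`."""
    part_words, whole_words = words(part), words(whole)
    if not part_words:
        return 0.0
    matcher = SequenceMatcher(None, part_words, whole_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(part_words)

def best_match(component_text: str, fragments: dict[str, str], threshold: float = 0.6):
    """Return (fragment_id, score) for the best-covering fragment,
    or None when nothing reaches the (hypothetical) threshold."""
    scored = [(containment(component_text, text), frag_id)
              for frag_id, text in fragments.items()]
    if not scored:
        return None
    score, frag_id = max(scored)
    return (frag_id, score) if score >= threshold else None

# Made-up XML fragments and one edited component text.
fragments = {
    "about": "Our company was founded in 1990 in Berlin. "
             "We build industrial pumps for customers worldwide. "
             "Our team has 500 people.",
    "contact": "Contact us by phone or email. "
               "Our office is open Monday to Friday.",
}
component = "We build reliable industrial pumps for customers worldwide."
match = best_match(component, fragments)
```

Components that fall below the threshold come back as `None`, which is one way to produce the list of leftovers to fine-tune by hand.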
I successfully solved both tasks today. I really love challenges like this :)
