Are there any specialists in Machine Learning and AI?
How can the following task be solved? There are two sets of HTML files with almost identical content but different layouts, design #1 and design #2; together they form the training set. There is also a third set that exists only in design #1; let’s call it the “for processing” set. The task is to produce, for this last set, the corresponding files in design #2, using knowledge derived from the training set, in the best possible way.
In the training set, each pair of files A and B contains segments that can be called “template” and segments that can be called “data.” For example, the article title is data, and the wrapper around it is template. After processing N pairs, the system should know where the data and the template are in both design #1 and design #2. It should also be able to recognize the data in the design #1 files of the “for processing” set and insert it into the corresponding places of the design #2 template, for each file in that set.
How can this be done?
I’m posting the thoughts below on Facebook mainly as notes for myself. If you’re interested, join in :)
The first thing that comes to mind is to convert each file into a linked list of tags and text fragments, for both the design #1 file and its counterpart in design #2. Then search for the longest fragments that are identical across all files of design #1; these fragments are most likely part of the template. Adjacent identical fragments are merged into one larger piece, while the constituent pieces are also preserved. This yields a set of trees whose leaves are tags and letters and whose roots are large fragments shared by both files being compared. Do this for all remaining pairs of files, obtaining many similar trees. Next, process the trees to find the largest fragments common to the majority of files. Common fragments in design #1 are proposed as the template, and the differing ones as data elements.
This analysis is also conducted for design #2.
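To make the idea more concrete, here is a rough Python sketch of the template-finding step. It is only one possible reading of the idea: it uses a flat token list instead of trees, a simple regex instead of a real HTML parser, and the standard difflib.SequenceMatcher to find common runs. The names tokenize, common_blocks, infer_template and GAP are made up for this example.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tag is anything between "<" and ">", everything else is text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")
GAP = None  # marks a position where page-specific data sits


def tokenize(html):
    """Split HTML into a flat list of tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def common_blocks(a, b):
    """Return the runs of tokens that occur, in the same order, in both a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


def infer_template(pages):
    """Keep only what all pages of one design share; GAP separates the shared runs."""
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        merged = []
        for block in common_blocks(template, tokens):
            if merged:
                merged.append(GAP)  # something differed between two shared runs
            merged.extend(block)
        template = merged
    return template


pages = [
    "<html><h1>Cats</h1><p>Meow</p></html>",
    "<html><h1>Dogs</h1><p>Woof</p></html>",
    "<html><h1>Birds</h1><p>Tweet</p></html>",
]
print(infer_template(pages))
```

Running the same function over the design #2 files would give the second template. This version requires a fragment to appear in every file; relying on the majority instead would need some kind of counting on top.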
Fragments marked as data are then matched automatically between the two designs, since in theory the data in each pair should correspond exactly. Where a pair does not match completely, we rely on the matching seen in the majority of pairs.
As a result, for design #1 and design #2, we obtain two sequences consisting of “template fragment” and “data fragment” nodes. Name the nodes, giving matching data fragments the same name in the design #1 sequence and in the design #2 sequence. The template fragments can simply be numbered.
Next, process each design #1 file from the “for processing” set, locating the segments that were marked as template in the training set. If they appear in the same order, the data lies between them. The data fragments are already named, so we collect the data in the order dictated by the design #2 sequence. If some template fragments are not found, the file is flagged for manual processing. If some data is not found, we simply ignore it.
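The matching of data fragments by majority could look roughly like the sketch below. It assumes the data values of each training pair have already been pulled out as plain strings, listed by their position in design #1 and in design #2; the function name match_slots and the sample values are invented for illustration.

```python
from collections import Counter, defaultdict


def match_slots(pairs):
    """Map each data position in design #1 to a data position in design #2.

    pairs: list of (values1, values2), the data strings of one training pair,
    ordered by position in design #1 and design #2 respectively.
    """
    votes = defaultdict(Counter)
    for values1, values2 in pairs:
        for i, v1 in enumerate(values1):
            for j, v2 in enumerate(values2):
                if v1 == v2:
                    votes[i][j] += 1  # this pair says: position i equals position j
    # The majority decides; positions that never matched are left out.
    return {i: counter.most_common(1)[0][0] for i, counter in votes.items()}


pairs = [
    (["Cats", "Meow"], ["Meow", "Cats"]),
    (["Dogs", "Woof"], ["Woof", "Dogs"]),
    (["Birds", "Birds"], ["Birds", "Birds"]),  # ambiguous pair, outvoted by the others
]
print(match_slots(pairs))  # {0: 1, 1: 0}
```

Positions that map to each other would then get the same name in both sequences.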
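A minimal sketch of this conversion step, assuming the two sequences are already known and are written as lists of ("tpl", tokens) and ("data", name) entries. The layout of DESIGN1 and DESIGN2, the slot names and all function names are hypothetical, and the regex tokenizer is a stand-in for a real HTML parser.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def find_fragment(tokens, fragment, start):
    """Index of the first occurrence of fragment in tokens at or after start, else -1."""
    n = len(fragment)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == fragment:
            return i
    return -1


def extract_slots(tokens, template):
    """Collect the tokens lying between consecutive template fragments."""
    slots, pos, pending = {}, 0, None
    for kind, value in template:
        if kind == "data":
            pending = value  # remember which data fragment comes next
            continue
        idx = find_fragment(tokens, value, pos)
        if idx < 0:
            return None  # template fragment missing: leave for manual processing
        if pending is not None:
            slots[pending] = tokens[pos:idx]
            pending = None
        pos = idx + len(value)
    if pending is not None:
        slots[pending] = tokens[pos:]
    return slots


def render(template, slots):
    """Fill the design #2 sequence; data that was not found is simply skipped."""
    out = []
    for kind, value in template:
        out.extend(value if kind == "tpl" else slots.get(value, []))
    return "".join(out)


def convert(html, template1, template2):
    slots = extract_slots(tokenize(html), template1)
    return None if slots is None else render(template2, slots)


DESIGN1 = [("tpl", ["<html>", "<h1>"]), ("data", "title"),
           ("tpl", ["</h1>", "<p>"]), ("data", "body"),
           ("tpl", ["</p>", "</html>"])]
DESIGN2 = [("tpl", ["<article>", "<div>"]), ("data", "body"),
           ("tpl", ["</div>", "<h2>"]), ("data", "title"),
           ("tpl", ["</h2>", "</article>"])]

print(convert("<html><h1>Fish</h1><p>Blub</p></html>", DESIGN1, DESIGN2))
# <article><div>Blub</div><h2>Fish</h2></article>
```

A return value of None means a template fragment was not found in order, so the file goes to the manual pile.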
These decisions are later manually adjusted by an analyst.
However, this approach will not work if the training set contains lists of variable length, such as a list of products. The system will not recognize design #1 with 10 products and design #1 with 20 products as the same layout. It will obtain sequences of “template fragment” and “data fragment” nodes of different lengths, yet statistically settle on a single sequence after processing. In theory, a separate mechanism could find recurring patterns and mark them somehow.
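One guess at what such a mechanism might look like: replace every text fragment with a placeholder so that only the structure remains, then collapse consecutive repetitions of the same run of tokens into a single “repeat” node. After that, a list of 10 products and a list of 20 products produce the same sequence. This is only a sketch; shape, collapse_repeats and the "#text" placeholder are names invented here, and nested lists are not handled.

```python
def shape(tokens):
    """Keep tags as they are and replace every text token with a placeholder."""
    return [t if t.startswith("<") else "#text" for t in tokens]


def collapse_repeats(seq, max_period=20):
    """Replace runs like X X X ... (X up to max_period tokens long) with ("repeat", X)."""
    out = []
    i = 0
    while i < len(seq):
        longest = min(max_period, (len(seq) - i) // 2)
        for p in range(1, longest + 1):
            unit = seq[i:i + p]
            n = 1
            while seq[i + n * p:i + (n + 1) * p] == unit:
                n += 1  # count how many times the unit repeats back to back
            if n >= 2:
                out.append(("repeat", tuple(unit)))
                i += n * p
                break
        else:
            out.append(seq[i])  # no repetition starts here
            i += 1
    return out


item = ["<li>", "Some product", "</li>"]
ten = shape(["<ul>"] + item * 10 + ["</ul>"])
twenty = shape(["<ul>"] + item * 20 + ["</ul>"])
print(collapse_repeats(ten) == collapse_repeats(twenty))  # True
```

Both pages collapse to "<ul>", a repeat node for the list item, and "</ul>", so the template comparison no longer sees them as different layouts.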
Perhaps someone knows ready-made solutions or approaches to solving the task? Interesting topic, isn’t it?

