November 02 2018, 23:08

Are there any specialists in Machine Learning and AI?

How can the following task be solved? There are two sets of HTML files with almost identical content but different layouts – let’s call them together the training set: one subset is in design #1, the other in design #2. There is also a separate set of files available only in design #1 – let’s call it the “for processing” set. The task is to obtain, for this last set, the corresponding files in design #2, using knowledge derived from the training set, as well as possible.

In the training set, each pair A and B has segments that can be called “template” and segments that can be called “data.” For example, the article title is data, and the markup wrapping it is template. After processing N file pairs, the system should learn where the data is and where the template is, both in design #1 and in design #2. It should then be able to recognize the data in the design-#1 files of the “for processing” set and insert it into the corresponding template slots of design #2, producing a design-#2 version of each file.

How can this be done?

I save the thoughts below on Facebook for myself. If you’re interested, join in)

The first thing that comes to mind is to convert each file into a linked list of tags and text fragments, for both design #1 and its counterpart in design #2. Then, across all files of design #1, search for the longest identical fragments – these are most likely part of the template. Merge adjacent identical fragments into one larger piece, while also preserving the constituent pieces. This yields a set of trees whose leaves are tags and characters and whose roots are the large fragments shared between files. Repeat this for all remaining pairs of files, obtaining many similar trees. Finally, process the trees to find the largest fragments common to the majority. Fragments common across design #1 are proposed as the template, and the differing ones as data elements.
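The first step above can be sketched with the standard library alone. This is a minimal illustration, not the full tree-building pass: the two pages and the regex tokenizer are hypothetical stand-ins, and `difflib.SequenceMatcher` is used here merely as one off-the-shelf way to find the shared (template-candidate) fragments between two files of the same design.

```python
import difflib
import re

def tokenize(html: str) -> list[str]:
    """Split HTML into a flat sequence of tags and text fragments."""
    return [t for t in re.split(r'(<[^>]+>)', html) if t.strip()]

# Two files in the same design; content is hypothetical.
page_a = "<html><h1>First article</h1><p>Body one</p></html>"
page_b = "<html><h1>Second article</h1><p>Body two</p></html>"

tokens_a, tokens_b = tokenize(page_a), tokenize(page_b)

# Matching blocks are template candidates; the gaps between them are data.
matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
template = [tokens_a[m.a : m.a + m.size]
            for m in matcher.get_matching_blocks() if m.size > 0]
print(template)
# [['<html>', '<h1>'], ['</h1>', '<p>'], ['</p>', '</html>']]
```

Running the same comparison over every pair of files and intersecting the results would give the majority-vote template fragments described above.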

This analysis is also conducted for design #2.

Fragments marked as data are matched automatically, since in theory there should be a one-to-one correspondence. Where a complete match is missing, we go with the majority.

As a result, for design #1 and design #2 we obtain two sequences consisting of “template fragment” and “data fragment” nodes. Name the nodes, giving the same name to corresponding data fragments in the design-#1 and design-#2 sequences; the template fragments can simply be numbered.
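The two resulting sequences could be represented as ordered lists of typed nodes, with shared names tying the data slots across designs. All names here are hypothetical:

```python
# Each design becomes an ordered list of (kind, name) nodes.
# Data slots share names across designs; template fragments are numbered.
design1 = [("template", "t1"), ("data", "title"),
           ("template", "t2"), ("data", "body"),
           ("template", "t3")]

design2 = [("template", "u1"), ("data", "body"),
           ("template", "u2"), ("data", "title"),
           ("template", "u3")]

# The shared names define where a slot extracted from a design-#1
# page must land in the design-#2 layout, whatever the order there.
slot_order_2 = [name for kind, name in design2 if kind == "data"]
print(slot_order_2)  # ['body', 'title']
```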

Next, process each design-#1 file from the “for processing” set, locating in it the segments marked as template by the training set. If they appear in the same order, the data lies between them – and those gaps are already labeled, so we collect the data and emit it in the order dictated by the design-#2 sequence. If some template fragments are not found, we flag the file for manual processing. If some data is not found, we simply ignore it.
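This extract-and-re-emit step can be sketched as follows. The tokens, fragments, and slot names are hypothetical stand-ins for the learned model; if a template fragment is missing, the `next()` call raises `StopIteration`, which corresponds to the “flag for manual processing” case above.

```python
def extract_data(tokens, template_fragments, slot_names):
    """The tokens between consecutive template fragments are data slots."""
    data, pos = {}, 0
    for i, frag in enumerate(template_fragments):
        # Naive forward search for the next template fragment.
        j = next(k for k in range(pos, len(tokens) - len(frag) + 1)
                 if tokens[k:k + len(frag)] == frag)
        if i > 0:
            data[slot_names[i - 1]] = tokens[pos:j]
        pos = j + len(frag)
    return data

# A new "for processing" page, already tokenized (design #1).
tokens = ['<html>', '<h1>', 'New article', '</h1>',
          '<p>', 'Fresh body', '</p>', '</html>']
fragments = [['<html>', '<h1>'], ['</h1>', '<p>'], ['</p>', '</html>']]
data = extract_data(tokens, fragments, ['title', 'body'])

# Re-emit the data in the order and wrapping of the design-#2 template.
design2_fragments = [['<body>', '<div class="post">'],
                     ['</div>', '<span>'], ['</span>', '</body>']]
design2_slots = ['title', 'body']
out = []
for i, frag in enumerate(design2_fragments):
    out += frag
    if i < len(design2_slots):
        out += data[design2_slots[i]]
print(''.join(out))
# <body><div class="post">New article</div><span>Fresh body</span></body>
```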

These decisions are later manually adjusted by an analyst.

However, this approach breaks down if the training set contains lists of variable length, such as a list of products. The system will not consider a design-#1 page with 10 products similar to one with 20: it will obtain “template fragment” / “data fragment” sequences of different lengths and, statistically, settle on a single sequence after processing. In theory, a separate mechanism could detect recurring patterns and mark them accordingly.
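One hypothetical shape for that separate mechanism: reduce the token stream to its tag skeleton (replacing text with a placeholder) and then look for a contiguous unit that repeats back to back, as the markup for one product row would. This is only a sketch of the idea, with a brute-force search:

```python
def find_repeat(tokens, min_len=2, min_count=2):
    """Find a contiguous unit repeated back-to-back at least min_count times."""
    n = len(tokens)
    for size in range(min_len, n // min_count + 1):
        for start in range(n - size * min_count + 1):
            unit = tokens[start:start + size]
            count = 1
            while tokens[start + count * size : start + (count + 1) * size] == unit:
                count += 1
            if count >= min_count:
                return unit, count
    return None, 0

# A hypothetical product list; data tokens are masked so that rows with
# different product names still share one skeleton.
rows = ['<li>', 'item A', '</li>', '<li>', 'item B', '</li>']
shape = [t if t.startswith('<') else '<TEXT>' for t in rows]
unit, count = find_repeat(shape)
print(unit, count)  # ['<li>', '<TEXT>', '</li>'] 2
```

A detected unit could then be folded into a single “repeating group” node in the sequence, so pages with 10 and 20 products map to the same template.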

Perhaps someone knows ready-made solutions or approaches to solving the task? Interesting topic, isn’t it?

November 02 2018, 21:38

I’m reading something interesting here, I stumbled upon a great illustration of the usefulness of data mining:

“…Walmart mined their massive retail transaction database to see what their customers really wanted to buy prior to the arrival of a hurricane. They discovered one particular item that increased in sales by a factor of 7 over normal shopping days. That one item was not bottled water, or batteries, or beer, or flashlights, or generators, or any of the usual things that we might imagine. The item was strawberry pop tarts! One could imagine lots of reasons why this was the most desired product prior to the arrival of a hurricane – pop tarts do not require refrigeration, they do not need to be cooked, they come in individually wrapped portions, they have a long shelf life, they are a snack food, they are a breakfast food, kids love them, and we love them. Despite these “obvious” reasons, it was still a huge surprise!”

October 30 2018, 11:51

How can the vertebrate population have fallen by half over half a century when the number of large domestic animals such as cows has grown to one and a half billion head? From an evolutionary perspective they, like corn, wheat, and potatoes, are doing more than well.