Such is life.
Month: November 2018
November 06 2018, 09:48
I wondered why English calls red-haired people “redheads” rather than “orangeheads” or something similar; after all, their hair has far less red in it than orange or even yellow. It turns out the term arose when “red” covered every shade from orange to red and there simply was no word for orange; that word appeared only with the arrival of oranges. Consider the red fox, which, surprisingly enough, is also “red.”
Why didn’t it occur to the English to name the color after pumpkins or carrots, as “pumpkin” or “carrot”? Why wait until oranges were introduced to England? It’s quite simple: both pumpkins and carrots come in colors other than orange. As we know, pumpkins can be yellow, green, or bright orange, while carrots were originally not orange but purple. Oranges, on the other hand, are exclusively orange, so “orange” became the perfect name for the color.
However, English has another word for “red-haired”: “ginger”, taken from the name of the spice. That is another puzzle, since ginger root is golden-yellow, not orange. Logically, it would make more sense to call blondes ginger, not redheads.
November 06 2018, 00:40
Pianist Victoria Ermolaeva, excellent channel https://www.youtube.com/watch?v=T6WzfhGwkiU
November 04 2018, 01:12
Does your Mac also calculate things oddly? :)


November 02 2018, 23:08
Are there any machine learning and AI specialists here?
How can the following task be solved? There are two sets of HTML files with almost identical content but different layouts, design #1 and design #2; let’s call this pair the training set. There is also one more set of files that exists only in design #1; let’s call it the processing set. The task is to produce, for the processing set, the corresponding files in design #2, using knowledge derived from the training set as well as possible.
In the training set, each pair of files A and B has segments that can be called “template” and segments that can be called “data.” For example, the article title is data, and the wrapper around it is template. After processing N files, the system should be able to tell where the data is and where the template is in both design #1 and design #2. It should also recognize the data in the design #1 files from the processing set and insert it into the corresponding template locations in design #2, for each file in the processing set.
How can this be done?
I’m saving the thoughts below on Facebook for myself. If you’re interested, join in :)
The first thing that comes to mind is to convert each file into a linked list of tags and text fragments, for design #1 and for its counterpart in design #2. Then look for the longest fragments that are identical across all design #1 files; these are most likely part of the template. Adjacent identical fragments are merged into one larger piece, while the constituent pieces are kept as well. For a pair of files, this gives a set of trees whose vertices are tags and letters and whose roots are the large fragments shared by both files. Doing this for all remaining pairs of files yields many similar trees. The trees are then processed to find the largest fragments common to the majority. Common fragments in design #1 are proposed as the template, and differing ones as data elements.
This analysis is also conducted for design #2.
Fragments marked as data are then matched automatically between the two designs, since in theory they should correspond one to one. Where the match is incomplete, we go with the majority.
As a result, for design #1 and design #2 we obtain two sequences consisting of “template fragment” and “data fragment” nodes. Name the data fragments, giving corresponding fragments the same name in the design #1 sequence and in the design #2 sequence. Template fragments are simply numbered.
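To make the idea more concrete, here is a minimal sketch of the first step for just two files in the same design. It is a simplification under my own assumptions: a regex tokenizer instead of a real HTML parser, a flat token list instead of trees, and Python's standard `difflib.SequenceMatcher` to find the shared runs. The names `tokenize` and `split_template_data` are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tag is anything between '<' and '>'; everything else is text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Split HTML into a flat list of tags and non-empty text fragments."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]

def split_template_data(tokens_a, tokens_b):
    """Label runs of file A's tokens: shared with B -> template, else data."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    labeled = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if i2 > i1:  # skip pure insertions, which have no tokens in A
            kind = "template" if tag == "equal" else "data"
            labeled.append((kind, tokens_a[i1:i2]))
    return labeled

page1 = "<html><h1>First title</h1><p>Hello</p></html>"
page2 = "<html><h1>Second title</h1><p>World</p></html>"
labeled = split_template_data(tokenize(page1), tokenize(page2))
for kind, tokens in labeled:
    print(kind, tokens)
```

With only two files, data that happens to coincide (the same author name on both pages, say) would be mislabeled as template, which is why the majority vote over many pairs matters.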
Next, process the design #1 files from the processing set, locating the fragments that were marked as template in the training set. If they appear in the same order, the data lies between them. The data fragments are already named, so we assemble them in the order dictated by the design #2 sequence. If some template fragments are not found, the file is flagged for manual processing. If some data is not found, it is simply ignored.
These decisions are later manually adjusted by an analyst.
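The processing step could look roughly like the sketch below. It assumes each design has already been reduced to a sequence of `("template", text)` and `("data", name)` pairs, that the data names are unique and valid Python identifiers, and that the template fragments are plain strings; `extract` and `render` are hypothetical helper names.

```python
import re

def extract(design1, html):
    """Return {name: value} if the page fits the design #1 sequence, else None."""
    pattern = "".join(
        re.escape(value) if kind == "template" else f"(?P<{value}>.*?)"
        for kind, value in design1
    )
    match = re.fullmatch(pattern, html, flags=re.DOTALL)
    return match.groupdict() if match else None

def render(design2, data):
    """Fill the design #2 sequence; data that was not found is left empty."""
    return "".join(
        value if kind == "template" else data.get(value, "")
        for kind, value in design2
    )

design1 = [("template", "<h1>"), ("data", "title"),
           ("template", "</h1><p>"), ("data", "body"), ("template", "</p>")]
design2 = [("template", '<div class="post"><h2>'), ("data", "title"),
           ("template", "</h2><div>"), ("data", "body"),
           ("template", "</div></div>")]

page = "<h1>Hello</h1><p>Some text.</p>"
data = extract(design1, page)
# None means the template fragments were not found: flag for manual processing.
result = render(design2, data) if data is not None else None
print(result)
```

Matching the whole page against one regular expression is the strictest possible reading of “the same order”; a more tolerant version would search for the template fragments one at a time.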
However, this approach breaks down if the training set contains lists of variable length, such as a list of products. The system will not recognize design #1 pages with 10 products and with 20 products as the same layout: it will get “template fragment” / “data fragment” sequences of different lengths, yet statistically settle on a single sequence after processing. In theory, a separate mechanism could detect recurring patterns and mark them somehow.
Perhaps someone knows of ready-made solutions or approaches to this task? Interesting topic, isn’t it?
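One possible shape for such a mechanism, purely as an illustration: replace every data fragment with a placeholder and collapse adjacent repetitions of the same block into a single “repeat” node, so that pages with different numbers of products reduce to the same sequence. The `"DATA"` placeholder, the `collapse_repeats` name and the `max_period` limit are my assumptions, not part of any existing tool.

```python
def collapse_repeats(seq, max_period=10):
    """Collapse adjacent repetitions of a block into one ('repeat', block) node."""
    out = []
    i = 0
    while i < len(seq):
        collapsed = False
        for period in range(1, max_period + 1):
            block = seq[i:i + period]
            if len(block) < period:
                break  # not enough tokens left for a block this long
            count = 1
            while seq[i + count * period:i + (count + 1) * period] == block:
                count += 1
            if count > 1:
                out.append(("repeat", tuple(block)))
                i += count * period
                collapsed = True
                break
        if not collapsed:
            out.append(seq[i])
            i += 1
    return out

item = ["<li>", "DATA", "</li>"]
two_products = ["<ul>"] + item * 2 + ["</ul>"]
three_products = ["<ul>"] + item * 3 + ["</ul>"]

print(collapse_repeats(two_products))
print(collapse_repeats(two_products) == collapse_repeats(three_products))
```

A list with a single item contains no repetition, so it would not be collapsed and would still look like a different layout; that case needs separate handling.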

November 02 2018, 21:38
I’m reading something interesting and came across a great illustration of how useful data mining can be:
“…Walmart mined their massive retail transaction database to see what their customers really wanted to buy prior to the arrival of a hurricane. They discovered one particular item that increased in sales by a factor of 7 over normal shopping days. That one item was not bottled water, or batteries, or beer, or flashlights, or generators, or any of the usual things that we might imagine. The item was strawberry pop tarts! One could imagine lots of reasons why this was the most desired product prior to the arrival of a hurricane – pop tarts do not require refrigeration, they do not need to be cooked, they come in individually wrapped portions, they have a long shelf life, they are a snack food, they are a breakfast food, kids love them, and we love them. Despite these “obvious” reasons, it was still a huge surprise!”
November 02 2018, 15:49
It turns out that in the light-blue countries, a billion means what we would call a trillion (10^12), not a billion (10^9). Keep this in mind when discussing salaries at interviews.
The yellow ones, like China, stand apart with their own system.

