A cryptic post today. While writing a book on RecSys, I caught myself thinking that modern data science is essentially the alchemy of the 21st century. Half of the “best practices” in algorithms lack a solid mathematical framework. It’s a set of heuristics that “just work”. Much like in the 17th century where they mixed everything indiscriminately, it happens now, and if something works better, everyone else starts doing the same. There’s just no answer to the question “why”.
Take, for example, the NCF/NeuMF (Neural Collaborative Filtering) algorithm. The logic goes like this. Say, there are a million movie ratings by users. And 100 million ratings by users yet given – users can’t watch every movie in the world. But out of these 100 million, you need to choose candidates for advertising for a particular user. The algorithm, of course, has a training phase, where weights are calculated, and a prediction stage, where these weights are used on the incoming data.
(What the algorithm does. Essentially, it’s an ensemble of three sub-algorithms, two of which generate their own conclusions, and then their decisions go to a new neural network, the third algorithm, which provides the final recommendation. Smartly, it’s a hybrid of GMF (matrix factorization) and MLP (Multi-Layer Perceptron). The first of these two is based on matrix decomposition, and the second represents a neural network with multiple layers. Weights are adjusted on training data.)
For one positive example, it takes 4 negative ones. Why four? Just because it’s “not too many and not too few”. Would 8 be better? Unknown, but it would definitely take longer to learn.
Why are embedding dimensions 32? or 64? There’s no formula. It’s the “golden mean” between a “dumb” model (few k) and an “overtrained” (many k).
Now about the neural network. Why is the MLP block built as a “tower” (64 -> 32 -> 16)? Why not (50 -> 25 -> 10)? Why ReLU between them (and not tanh for example)? Pure empiricism. The number of layers in the tower is also adjusted.
Why do GMF and MLP parts have different embeddings at the input? Because the authors of the paper tried it, and it “worked out better”. No mathematical proof. Why do they go to the final layer with equal weights? Because they just do.
Why are the outputs of the two paths “concatenated” (concat), and not added or multiplied? “Experience showed that this way the result is more accurate.”
And so it is with everything, up to the choice of optimizer Adam or the “magical” learning_rate=0.001, although at least these have some mathematical basis.
That is, at least a dozen parameters of one algorithm are empirically chosen, with no clear confidence that they are independent of each other. But many of them depend on the dataset, but no one knows how 😉
In general, alchemy.

