October 12 2017, 13:03

(TIL) In my spare time from work, I watch Lavrenko’s lectures. This morning, I listened to the lecture “Laws of the Text”.

For example, did you know that there’s something called Zipf’s Law, stating that the frequency of the n-th word in the list of most common words of any language is roughly inversely proportional to its rank n?
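A quick way to see this on your own text is to count words and multiply each word's rank by its count; under Zipf's Law the product should stay roughly constant. This is only a minimal sketch: the function names and the crude regex tokenizer are my own assumptions, not something from the lecture.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by count descending, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_products(table):
    """Under Zipf's Law, rank * count stays roughly constant across ranks."""
    return [rank * count for rank, _, count in table]
```

On a large natural text the products drift but stay within the same order of magnitude for the top ranks; on a tiny or synthetic sample they can be anything.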

Or here’s the empirical Benford’s Law: in tables of numbers drawn from real-life sources (anything from electricity bills to house numbers in cities), the digit 1 appears as the leading digit much more often than any other (in roughly 30% of cases), the digit 2 appears more often than, say, 8, and so on. Simply put, Benford’s Law can be described thus: there are always more small things in the world than large ones. A common explanation is that many quantities in this world grow exponentially rather than linearly. Very intriguing.
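The usual statement of Benford's Law is that the leading digit d occurs with probability log10(1 + 1/d), which gives about 30.1% for the digit 1. The sketch below compares that with the leading digits of the powers of two, which I picked as a stand-in for an exponentially growing quantity; the helper names are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit of a non-zero number."""
    # Strip the sign, then any leading zeros and the decimal point.
    digits = str(abs(value)).lstrip("0.")
    return int(digits[0])


def leading_digit_shares(values):
    """Observed share of each leading digit 1..9 in an iterable of numbers."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


# Powers of two as a toy exponentially growing series (an assumption).
observed = leading_digit_shares(2 ** n for n in range(1000))
expected = benford_expected()
```

Comparing `observed` with `expected` digit by digit is the simplest form of the “naturalness” check: real data should land close to the expected shares, while made-up numbers often do not.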

Or take Heaps’ Law. The number of unique words in a text of N words follows the pattern f(N) = k*N^b, where k and b are constants that depend on the text and b is typically close to 1/2.
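To estimate k and b for a particular text, one option is to track how the vocabulary grows token by token and fit a straight line in log-log space, since log f(N) = log k + b*log N. This is a rough sketch with hypothetical function names, using plain least squares rather than anything more careful.

```python
import math


def vocabulary_growth(tokens):
    """For each prefix length N, the number of distinct tokens seen so far."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def fit_heaps(growth):
    """Least-squares fit of log V = log k + b * log N; returns (k, b)."""
    xs = [math.log(n) for n in range(1, len(growth) + 1)]
    ys = [math.log(v) for v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the regression line is the exponent b.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    k = math.exp(mean_y - b * mean_x)
    return k, b
```

Usage would be something like `k, b = fit_heaps(vocabulary_growth(text.split()))`; the fit needs at least two tokens and becomes meaningful only on reasonably long texts.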

These laws make it possible, for example, to check data or a text for “naturalness”.

Or another example. For any very rare word, the probability that it will occur in a text is very low, which makes sense. But if this word does appear in the text, the likelihood of it appearing again is very high.

October 11 2017, 18:07

A wonderful text about the peculiarities of e-commerce in the Russian hinterlands. It’s over a year old, and I’m probably the only one who hasn’t read it yet. It’s a long read, but interesting.

October 11 2017, 13:14

I look at all this fascination with chatbots, Yandex’s Alice, etc., and recall how I toyed with something similar back in 2003. We had a chat, Starchat.ru, where people constantly hung out and talked to each other.

I developed the chat, so for fun, I made a bot that you could chat with simply by sending it a private message. It was always online, and not everyone realized that it was a bot. When the robot received a message, it searched the chat logs for messages containing the maximum number of words from the query that also had a response. A response is defined as the next message directed to the user by someone (like “Vasya: go to hell!” being a response to Vasya’s message). When there were multiple options (and there always were), a random one was chosen.
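The original code is long gone, so here is only a reconstruction of the idea in Python. The log format (author, addressee, text), the function names, and the word-splitting rule are all assumptions made for the sketch.

```python
import random
import re


def words(text):
    """Lowercased set of word tokens in a message."""
    return set(re.findall(r"\w+", text.lower()))


def build_pairs(log):
    """Pair each message with the next message addressed to its author.

    log is a list of (author, addressee_or_None, text) tuples in chat order.
    Messages that nobody replied to are skipped.
    """
    pairs = []
    for i, (author, _, text) in enumerate(log):
        for replier, addressee, reply in log[i + 1:]:
            if addressee == author and replier != author:
                pairs.append((text, reply))
                break
    return pairs


def answer(query, pairs, rng=random):
    """Pick a random reply among logged messages sharing the most words."""
    query_words = words(query)
    scored = [(len(query_words & words(message)), reply)
              for message, reply in pairs]
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return None  # nothing in the log overlaps with the query
    return rng.choice([reply for score, reply in scored if score == best])
```

Scanning the whole log on every query is quadratic and only fine for a toy; a real log of that size would need some kind of word index.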

The result was a robot that amusingly responded to questions. If you asked its name, it would always respond with different names, but relevantly, complete with emojis and suffixes. The bot also always provided suitable answers to standard questions like “where do you live” or “how old are you”. Since there was a huge history and people discussed everything in general, it was hard to find a question that the system couldn’t give an interesting/correct/funny answer to.

So, the bot had an interesting side effect. If you started swearing at it offensively, it would swear back even more offensively. And in general, it often overreacted to attacks and reproaches, simply because in real dialogs a polite question is answered politely, and a rude one, of course, rudely. The audience had a lot of fun with this bot.

It was especially interesting to read the bot’s own logs later. People there didn’t understand that it was a robot. They asked it questions, argued with it, and made up with it. It was fun :)

October 10 2017, 21:19

Somehow this hasn’t made the regular news yet: the Brazilian national team will be based in “our dear Lobnya.” The locals are proud; the Brazilians don’t understand yet. The stadium there really is being renovated, though it was in quite bad shape before. But it is funny, yes: the Brazilian national team in Lobnya, in a Moskvich 🙂

October 10 2017, 07:02

In 1975, instead of installing expensive road signs or “speed bumps,” Napa, California experimented with using chickens to slow down drivers on one of the streets—Streblow Drive, adjacent to Kennedy Park. They simply released 85 chickens to roam as they pleased. Park manager Bob Pelusi said, “Only occasionally would an impatient driver cut through the flock. Over nine months, we lost just 12 of them. You could say they died in the line of duty.”

An interesting idea. Only I think that in the Russian hinterlands, these chickens wouldn’t survive a night. Perhaps they should have had POLICE painted on their sides, so that if someone tried to harm them, it could trigger a criminal charge?

October 09 2017, 22:24

Published the second part of the video from my presentation at SAP Moscow two weeks ago.

I discuss Search Analytics — a development that has enabled the identification and correction of search issues on the site through user behavior analysis. This approach will work on any site, but is specifically designed for eCommerce. It allows for the identification of issues such as “search queries that are not performing well enough but could be” or “products that turn out to be difficult to find”.

I recommend it to everyone involved in online retail and search. This video does not cover Hybris; it is all about site search. Slides are included.

Stay tuned for another interesting topic on hybrismart.com in about a week.

October 09 2017, 15:18

Colleagues, programmers. Could you direct me to some proper material to read about automatic trend detection in data?

For example, say you have a log of events: temperatures from 10,000 sensors. We need to identify which sensor’s readings suddenly began to rise rapidly.

The first thing that comes to mind is to compute trends over short periods and compare these micro-trends across two or three consecutive periods. But this approach has plenty of downsides. First, there can be fluctuations unrelated to growth. Second, some sensors report rarely relative to the analysis period, which makes it hard to choose the right time window for averaging. Essentially, this approach only works when every sensor reports very densely. And here the density varies: sometimes dense, sometimes sparse, depending on the sensor. Well, okay, you could build dynamic groups and somehow tag sensors as “frequent” and “rare”. But all this complicates things, and I feel like my thinking is heading the wrong way.

Essentially, what is needed is to construct the first and second derivatives over time and analyze their shapes. Another problem is that the number of sensors is generally unbounded: some appear, others disappear. In general, new sensors should also be able to show up as trending.
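To make the derivative idea concrete, here is a naive sketch of what I mean, not a recommended method. It keeps the last three readings per sensor, divides by the actual time gaps so irregular reporting is handled, and uses a dictionary so sensors can appear at any time. The class name, the three-point window, and the `min_slope` threshold are all assumptions, and it does nothing about noise.

```python
from collections import defaultdict, deque


class TrendTracker:
    """Per-sensor finite-difference slope and acceleration."""

    def __init__(self):
        # Last three (timestamp, value) readings for each sensor id.
        self.history = defaultdict(lambda: deque(maxlen=3))

    def add(self, sensor_id, timestamp, value):
        """Record a reading; timestamps per sensor must strictly increase."""
        self.history[sensor_id].append((timestamp, value))

    def derivatives(self, sensor_id):
        """Return (latest slope, acceleration), or None with under 3 readings."""
        points = self.history.get(sensor_id)
        if points is None or len(points) < 3:
            return None
        (t0, v0), (t1, v1), (t2, v2) = points
        slope_old = (v1 - v0) / (t1 - t0)
        slope_new = (v2 - v1) / (t2 - t1)
        # Second derivative: change in slope over the distance between
        # the midpoints of the two intervals.
        acceleration = (slope_new - slope_old) / ((t2 - t0) / 2)
        return slope_new, acceleration

    def rising(self, min_slope):
        """Sensors whose latest slope is steep and still increasing."""
        result = []
        for sensor_id in list(self.history):
            d = self.derivatives(sensor_id)
            if d is not None and d[0] >= min_slope and d[1] > 0:
                result.append(sensor_id)
        return sorted(result)
```

A single noisy reading will trigger this, which is exactly the fluctuation problem described above, so in practice the raw readings would need smoothing before differencing.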

What to read?