I built a really cool tool for myself. I launch a program, and it turns on the microphone and listens. I switch to, say, a browser and comment on what I see on the screen, pressing a hotkey now and then to take a screenshot. Meanwhile, the program produces a time-stamped transcript of my comments and saves the screenshots with time stamps. It then runs text recognition on the screenshots to pull out the correct spellings of words, brands, identifiers, and people’s names, which are later used to turn the raw speech transcript into correct text. Everything up to this point runs on local models on my laptop, which means it is absolutely free.
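To give a feel for the "extract the spellings" part, here is a minimal sketch, not my actual code. It assumes the text recognition step has already produced plain text for each screenshot, and it uses a made-up heuristic (keep tokens that contain an uppercase letter, a digit, or an underscore) to guess which words are names or identifiers. The names `Screenshot`, `extract_terms`, and `glossary` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Screenshot:
    t: float       # seconds since the recording started
    ocr_text: str  # text already recognized in the screenshot (assumed input)


# Word-like tokens: letters, digits, underscores, not starting with a digit.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def extract_terms(ocr_text: str) -> list[str]:
    """Return tokens that look like names or identifiers, in first-seen order.

    Heuristic (an assumption for this sketch): a token is interesting if it
    is longer than two characters and contains an uppercase letter, a digit,
    or an underscore.
    """
    seen: set[str] = set()
    terms: list[str] = []
    for tok in TOKEN.findall(ocr_text):
        looks_special = any(c.isupper() or c.isdigit() or c == "_" for c in tok)
        if looks_special and len(tok) > 2 and tok not in seen:
            seen.add(tok)
            terms.append(tok)
    return terms


def glossary(shots: list[Screenshot]) -> list[str]:
    """Merge the terms of all screenshots, ordered by screenshot time."""
    merged: list[str] = []
    for shot in sorted(shots, key=lambda s: s.t):
        for term in extract_terms(shot.ocr_text):
            if term not in merged:
                merged.append(term)
    return merged
```

For example, `extract_terms("Angular HttpClient under the hood, see http_backend")` keeps `Angular`, `HttpClient`, and `http_backend` and drops the ordinary lowercase words. A real version would probably need a smarter filter, but even a rough glossary like this gives the correction step something to match misheard words against.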
Once I finish talking to the computer, I start the processing step. It takes the raw transcript and the text recognized from the screenshots as input and outputs a processed transcript that looks presentable (this step uses the Gemini API). One could even go a step further and automatically crop the screenshot fragments that were discussed and insert them into the text exactly where they were mentioned.
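Roughly, the input to that processing step could be assembled like this. This is only a sketch under my own assumptions: the instruction wording, the `SPEECH`/`SCREEN` labels, and the names `Segment`, `fmt_time`, and `build_prompt` are hypothetical, and the actual call to the Gemini API is deliberately left out. The idea is simply to interleave speech and screenshot text by time stamp so the model sees what was on screen when each phrase was spoken.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    t: float   # seconds since the recording started
    text: str  # raw speech-to-text output for this segment


# Hypothetical instruction text; the real prompt may look quite different.
INSTRUCTION = (
    "Rewrite the SPEECH lines as clean text. "
    "Use the SCREEN lines to fix the spelling of names and identifiers."
)


def fmt_time(t: float) -> str:
    """Format seconds as MM:SS."""
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes:02d}:{seconds:02d}"


def build_prompt(segments: list[Segment], shots: list[tuple[float, str]]) -> str:
    """Interleave speech segments and screenshot text in time order.

    `shots` is a list of (time, recognized_text) pairs. On equal time
    stamps, speech is listed before the screenshot.
    """
    events = [(s.t, 0, f"SPEECH: {s.text}") for s in segments]
    # Collapse whitespace so each screenshot occupies a single line.
    events += [(t, 1, f"SCREEN: {' '.join(text.split())}") for t, text in shots]
    events.sort()
    lines = [INSTRUCTION, ""]
    lines += [f"[{fmt_time(t)}] {body}" for t, _, body in events]
    return "\n".join(lines)
```

The resulting string is what would be sent to the model as a single text prompt. Keeping the time stamps in the prompt also leaves the door open for the screenshot-insertion idea, since the position of each `SCREEN` line marks where its image belongs in the final text.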
Or here’s another thing I can do: just play a video through the speakers, and the program immediately makes the same kind of transcript for me. Search YouTube for the video “Angular HttpClient Under The Hood. Design Patterns & Source Code Overview” and start at 3:51. I just let it run on autopilot for a couple of minutes, then stopped my script.

