I’m experimenting with Llama 3 from Meta. There’s a modified build called llama3-gradient:8b-instruct-1048k-q6_K with a context window of 1M tokens (on the order of a few megabytes of plain text), and there are variants with even larger windows. I fed it the entire book about Elon Musk (highly recommend it, by the way!) and it produced a pretty good summary, and did it quickly: the whole output shown in the screenshot was generated in about 40–60 seconds. And this is still the relatively weak 8B model; Meta also has a 70B version. The main point, though, is that all of this runs locally on a laptop. No need to pay for an API, it’s reasonably fast, and the script is small enough to fit on one screen.
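For a sense of what that “one-screen script” looks like, here is a minimal sketch against Ollama’s local REST API (this is not my exact script; the file name, prompt wording, and num_ctx value are illustrative, and the model must already be pulled with `ollama pull llama3-gradient:8b-instruct-1048k-q6_K`):

```python
import requests

MODEL = "llama3-gradient:8b-instruct-1048k-q6_K"

# Illustrative file name: a plain-text dump of the book to summarize.
with open("musk_book.txt", encoding="utf-8") as f:
    book = f.read()

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": MODEL,
        "prompt": f"Summarize the following book in a few paragraphs:\n\n{book}",
        "stream": False,
        # Ollama's default context window is small; raise it so the whole
        # book fits. Up to ~1M tokens for this model, lower it if RAM is tight.
        "options": {"num_ctx": 1048576},
    },
    timeout=3600,
)
print(response.json()["response"])
```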
Still, there are some rough edges. For example, on direct questions about the text (questions whose answers I know for certain), the model doesn’t always answer reliably. With a significantly shorter input it handles the same questions fine.

