Title: Exploring the Future of Large Language Models with Ofer Mendelevitch

Ofer Mendelevitch Talks LLM, RAG Community, and More

Stuart Clark
Amazon Employee
Published Jun 12, 2024
Introduction: In this insightful interview, I sat down with Ofer Mendelevitch, a seasoned expert in the field of large language models (LLMs), and head of developer relations at Vectara. Ofer, who has been working with LLMs since 2019, shares his journey, experiences, and thought-provoking perspectives on the challenges, opportunities, and future impact of these powerful AI models.
Stuart: To begin, can you share your journey and experience in working with large language models since 2019? What initially sparked your interest in this field?
Ofer: "So in 2019, I started a company. It was called Syntegra, along with my co-founder, Dr. Michael Lesh, and the vision we had was creating the best technology for synthetic data generation in healthcare and medicine. The idea was to solve data mobility in healthcare. Data in healthcare is very siloed, and hard to get at, making research difficult. By taking real data and generating synthetic data out of it, you can advance medical research powered by this synthetic data.
Now, when we looked at technologies for this, we came across GPT, which was the version 1 at the time (fun fact: it was not called GPT-1, just GPT) and it seemed perfect for what we wanted to do. That is really how I got introduced into this field of LLMs. Actually, it was my friend Johannes that sort of suggested to me to look into this. I fell in love with it immediately. I saw that it was so powerful and so cool."
Stuart: That's amazing that you were involved from the early days. What would you say were some of the key challenges you faced while developing products using these large language models, and how did you overcome them?
Ofer: "So let me talk about a few of the challenges that I see when people develop products that are powered by RAG (retrieval augmented generation).
The first one is scale. When you build a RAG application for one or even a hundred documents, it is relatively easy. But when you go to a large enterprise deployment, and you have to deal with thousands or millions of documents - things stop being so simple. Accurate retrieval (that's the “R” in RAG) becomes much more important and difficult to get right. It’s a lot of documents, and getting the most relevant piece of text for your LLM becomes difficult, and you realize that things like hybrid search, MMR (maximal marginal relevance) and reranking become super important. Not to mention that setting up the machine learning pipeline to implement these accurate retrieval techniques at scale is challenging by itself from an MLOps perspective.
The second one is the whole issue of security and privacy. We need to set up our environment, our VPC, in a way that everything is encrypted, and everything has all the bells and whistles in terms of security that an enterprise-scale deployment requires. If you build a chat-with-my-pdf demo with a few lines of code, you don’t have to deal with these issues, but at enterprise scale - this is table stakes.
The third and last thing I would mention is cost and latency. In 2024, we're starting to see that people have this realization that, “Oh, you mean I pay by the token and you don't tell me how much it might cost me upfront? It's kind of like a blank check?” Developers are starting to understand that the cost of do-it-yourself RAG when you pay by token can actually be really high."
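To make the retrieval point a little more concrete, here is a minimal sketch of MMR (maximal marginal relevance), one of the techniques Ofer mentions. It is an illustration only, not Vectara's implementation: the function names, the toy two-dimensional vectors, and the `lambda_` weighting parameter are all assumptions made for this example. The idea is to pick results that are relevant to the query while penalizing results that are too similar to ones already selected.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr(query_vec, doc_vecs, k=3, lambda_=0.7):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    lambda_ = 1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already-selected documents more.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = cosine(query_vec, doc_vecs[i])
        # Redundancy is the highest similarity to anything already picked.
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_ * relevance - (1 - lambda_) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): documents 0 and 1 are near-duplicates.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [1.0, 0.05], [0.3, 1.0]]
diverse = mmr(query, docs, k=2, lambda_=0.5)
relevance_only = mmr(query, docs, k=2, lambda_=1.0)
```

With `lambda_=1.0` the two near-duplicate documents are both returned, while a lower value swaps one of them for the less similar third document. In a real deployment the vectors would come from an embedding model, and this step would typically sit alongside hybrid search and a reranker.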
Stuart: Those are definitely major hurdles. That leads me nicely into the next question about RAG-as-a-service and why a SaaS version of RAG makes sense?
Ofer: "Yeah, great question, Stuart. So that's where we found that the service that we provide at Vectara, which is RAG-as-a-service, makes sense to our customers who actually don't want to build it themselves and don't want to handle all of these concerns we just talked about.
Vectara is a single service that implements within the platform all the components of building RAG: chunking documents, calculating embeddings for text and queries, running the vector database, crafting a prompt, calling the LLM, etc. These components are already integrated with each other, and we make sure that the whole system runs efficiently and cost-effectively, and optimize the calls in between all those components. You, as a developer, can interface with the platform using much simpler abstractions and an easy-to-use API. Recently we also launched an integrated Chat functionality which means you don’t even need to worry about maintaining chat history - it all happens in the platform.
The benefit you get is that it can scale up and down easily - we're responsible for making sure that this actually works really, really well, and is cost-effective for you. You don't have to worry about contracting and paying different vendors separately. You just pay us with a much easier-to-understand payment model that's based on usage as opposed to by token."
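For readers who have not built one, the components Ofer lists (chunking, embeddings, retrieval, prompt construction) can be sketched in a few lines. This is a toy, do-it-yourself illustration and not Vectara's API: every function name here is hypothetical, the "embedding" is a simple bag-of-words count standing in for a neural embedding model, the in-memory list stands in for a vector database, and the final LLM call is left out.

```python
import math
from collections import Counter

def chunk(text, max_words=40):
    """Split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding: lowercase word counts (a real system would use a neural model)."""
    return Counter(w.strip(".,?!").lower() for w in text.split())

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to an LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical documents and question.
documents = [
    "The cafeteria opens at 8am on weekdays.",
    "Vacation requests must be filed two weeks in advance.",
]
all_chunks = [c for doc in documents for c in chunk(doc)]
question = "When must vacation requests be filed?"
top = retrieve(question, all_chunks, top_k=1)
prompt = build_prompt(question, top)
```

Even this toy version shows how many separate pieces have to be wired together, which is the integration work Ofer describes a managed platform taking on.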
Stuart: I see, that makes a lot of sense. There is this ongoing discussion about RAG versus fine-tuning - what are your thoughts here?
Ofer: "Okay. So that's an important one, Stuart. I have heard this conversation for the last, probably nine to twelve months, and I still hear it from a lot of people I talk to in the community and our customers. I just want to make my opinion about this very explicit and very clear. 
Fine-tuning is a technique used to change the language model's behavior - like responding in a different tone or style. You can make it talk like a kid or in Shakespearean style. It's quite effective for that. Fine-tuning technically means re-training the model for a few more epochs on original or new data or to perform a new task - as part of what is known as “transfer learning”. In this context, the goal for developers is often adapting an LLM to some private data, as an alternative to using RAG.
So let me be very clear: this does not work well for learning new information. Language models are notoriously difficult to train, require a lot of expertise to do right, and when you fine-tune on your private data, the model may “forget” information or skills it knew beforehand (during pre-training). Of course there is also the issue of cost - it requires expensive GPUs. But the important point is: it rarely does what you intended it to do - effectively “integrate” your private data into the LLM.
Actually, there is a paper that just got published by researchers from Google and Technion about this exact question, demonstrating how integrating private data into an LLM with fine-tuning does not work as intended. 
RAG, on the other hand, is designed exactly for this purpose - augmenting the LLM with your own data. One analogy I like to use when explaining RAG is that it’s like the difference between a closed-book test and an open-book test. An LLM is like a closed-book test: what it learns during training is all the knowledge it has, whereas with RAG (like in an open-book test) we provide additional knowledge to the LLM at run time that it can “use” in addition to its existing knowledge from pre-training. It's about retrieving and bringing the right facts to the LLM at the right time to answer the user’s question. With RAG, your application can be updated with new data pretty much immediately. It supports permissions, and your private data is never used for training any model, so you won’t be worried about data leaks.
So to sum up, I like this quote from a great blog post on the topic: “fine-tuning is for form, not facts”."
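Two of the properties Ofer attributes to RAG, immediate updates and permission support, can be illustrated with a small sketch. This is an assumption-laden toy, not how any particular platform implements access control: the `TinyIndex` class, its methods, and the group names are all hypothetical, and keyword overlap stands in for real retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    allowed_groups: set

@dataclass
class TinyIndex:
    docs: list = field(default_factory=list)

    def add(self, text, allowed_groups):
        """New documents are searchable as soon as they are added; no model training."""
        self.docs.append(Doc(text, set(allowed_groups)))

    def search(self, query, user_groups):
        """Return matching texts the user is permitted to see, best match first."""
        terms = set(query.lower().split())
        # Permission filter: keep only docs sharing at least one group with the user.
        visible = [d for d in self.docs if d.allowed_groups & set(user_groups)]
        scored = [(len(terms & set(d.text.lower().split())), d.text) for d in visible]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [text for score, text in ranked if score > 0]

index = TinyIndex()
index.add("q3 revenue grew 12 percent", {"finance"})

hidden = index.search("q3 revenue", {"engineering"})  # user lacks access
shown = index.search("q3 revenue", {"finance"})       # user has access

# A document added now can be retrieved on the very next query.
index.add("q4 revenue forecast is flat", {"finance"})
updated = index.search("q4 forecast", {"finance"})
```

Because the private text lives in the index rather than in the model's weights, access can be checked per query and documents can be added or removed without retraining anything.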
Stuart: That certainly made it a lot clearer for me. In your current role leading developer relations at Vectara, what advice would you give to developers and researchers interested in working with large language models?
Ofer: "Oh, that's a great question. I talk to developers in the community quite a bit, and I think there are three types of developers. There's going to always be the real experts who want to build it themselves from scratch. They don't even use things like LangChain and LlamaIndex. They just build the whole stack themselves. They're going to call the OpenAI or Anthropic APIs directly. They're going to build their own vector database, even in memory, and they're going to put it all together. If you want to do that, that's absolutely fine, you should do that.
There’s another group of developers that are using the open source, do-it-yourself orchestration frameworks like LangChain and LlamaIndex and others, and I think those are very good tools as well.
The third group is developers who just want to build a GenAI application, and need the enterprise features that platforms like Vectara provide. They don’t want to spend all their time being experts in selecting the right LLM, the right prompt, choosing a Vector database, and making sure all the data is secure and private.  
So the advice I would give is to think through not just the fun part of building a simple POC, but consider the long-term plan. Like any technology architecture decision, it’s buy-vs-build, and if you want to build, are you ready to make that much larger investment?"
Stuart: Those are really helpful perspectives. Finally, what excites you the most about the future of large language models and their potential impact across various industries and domains?
Ofer: "So first of all, I would say I'm really excited about the future. One thing I got to experience was how GPT-2 moved into GPT-3 and then turned into GPT-4. And I know this has been said before many times - that we humans think linearly and not exponentially - but those progressions have been exponential. I've seen that firsthand. I expect the GPT-5 and the equivalent LLMs from other companies like Google or Meta will be much better than what our linear brain expects them to be. I'm really excited about this.
What do I expect of them? What do I want? Here's my wishlist: I want fewer hallucinations. I think we all want that. Next on my list would be better reasoning capabilities, and that’s something people are working on and I expect will get better very soon. And of course we’ve seen a taste of multi-modality with the recent GPT-4o and Google Gemini, and these I think will continue to evolve and become significantly better.
The applications in industries like Healthcare, Fintech and legal - there are so many applications that will make things better, faster and cheaper. One of my favorite examples is in Healthcare: the ability for a physician to use AI to understand a patient’s record before they see the patient. Instead of having to read every single visit, every prescription, every diagnostic test - in order to figure out what is critical for this specific visit - they can use RAG and LLM technology to summarize it for them and reduce physician burnout in a significant way.
Honestly, we are just starting to imagine all the applications, and there are so many. I think the impact will be truly profound."
Stuart: I want to thank Ofer for taking the time to share his invaluable insights and experiences with us. His perspectives highlight the incredible potential of large language models, while also underscoring the importance of addressing key challenges around scalability, security, cost, and responsible development. Ofer's journey in this field, from the early days of GPT to the cutting-edge solutions offered by Vectara today, showcases the power of continuous learning, embracing new technologies, and fostering a passion for innovation. As we look ahead, his excitement about the future advancements in areas like reasoning, multimodality, and real-world applications across industries like healthcare and finance is truly inspiring.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.