Did LLMs kill NLP research?
I recently felt disconnected from the NLP field’s advancements, apart from the widespread use of ChatGPT. Curious about what I had missed, I revisited some GPT blog posts. A wave of nostalgia hit as I encountered familiar terms: Attention Is All You Need, Semi-supervised Sequence Learning, ULMFiT, ELMo, BERT (remember BERTology?), GloVe, word2vec, and more.
One striking difference was the length of research papers. Gone are the days of 10–15 page papers; LLM papers, particularly those on GPTs, now stretch to around 100 pages. So, I relied on blog posts, skimmed some papers, and even (ironically) consulted ChatGPT for insights.
Another observation was the prescience of early GPT predictions. The first GPT blog post mentioned achieving remarkable results with just an 8-GPU machine and 5GB of text, highlighting the significant room for improvement with more data and compute. The GPT-2 post prophetically stated that the public would need to become more skeptical of online text, just as deepfakes necessitate caution with images. The ChatGPT release post already mentioned the issue of hallucinations, even though the term was only popularized later through ChatGPT use.
Is Bigger Always Better? A Look at LLM Similarities
My key takeaway from exploring GPT development and briefly examining other SOTA models (like Gemini, LLaMA, and Claude) is that while their architectures vary slightly, they all rely heavily on next-word prediction, BPE tokenization (this paper helped me better understand tokenization), and throwing more data and compute at the problem. Does NLP simply involve tweaking architectures, scaling data and compute, and churning out new SOTA models?
Perhaps. But these are just the initial thoughts of a non-NLP expert catching up.
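To make the tokenization point above a bit more concrete, here is a quick look at BPE in practice. This is just an illustrative sketch using OpenAI’s tiktoken library (assuming it is installed, e.g. via pip); the exact sub-word split depends on the encoding you pick.

```python
# Quick peek at BPE tokenization using OpenAI's tiktoken library
# (pip install tiktoken); "cl100k_base" is the encoding used by recent GPT models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Did LLMs kill NLP research?")
print(tokens)                             # a list of integer token ids
print([enc.decode([t]) for t in tokens])  # the sub-word pieces, e.g. ['Did', ' LL', 'Ms', ...]
```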
Beyond NLP: The Rise of Multimodality
Taking a broader view, I made a (perhaps not so bold) prediction at a GenAI workshop in January 2024: We’d see more multimodal capabilities. This aligns perfectly with GPT-4o’s end-to-end training across text, vision, and audio. I foresee this trend continuing, leading to more realistic content and potentially extending to video and music (existing examples already showcase this).
True multimodality would mean being able to talk directly to our device and have it accurately do what we actually asked (not just what it was taught to do). Rabbit excited me greatly when it was released earlier this year, until I realized that it still had to be trained for each individual action. What would be truly groundbreaking is generalizing and commercializing this capability, something that is currently being worked on.
However, this focus on more data and compute feels like a natural progression. What would truly impress me is granting models the ability to self-verify. Imagine giving them internet access to self-check their answers through multiple queries and crawling subpages/linked pages (beyond a simple “I’m feeling lucky” search). Additionally, imagine empowering them to execute generated code and test for expected behavior. While elements of this exist, models often time out before finding a working version if the code doesn’t function initially.
What I envision is an LLM that can “talk to itself” across multiple queries, to arrive at deeper, more validated answers. That’s what would truly impress me!
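For illustration, here is a toy version of the code-execution half of that idea: a generate-execute-retry loop. This is purely my own sketch of the concept, not an existing product feature; it assumes the openai Python SDK (v1.x), an API key in the environment, and a placeholder model name.

```python
# Toy generate-execute-retry loop: ask the model for code, run it, and feed
# any error back until it works or we give up. Illustrative sketch only.
from openai import OpenAI
import subprocess
import tempfile

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_and_verify(task: str, max_attempts: int = 3) -> str | None:
    """Return code that runs without error, or None if all attempts fail."""
    feedback = ""
    for _ in range(max_attempts):
        prompt = f"Write a standalone Python script that {task}. Return only code.{feedback}"
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        code = reply.choices[0].message.content
        # Crude cleanup of markdown fences the model may wrap around the code.
        code = code.replace("```python", "").replace("```", "").strip()

        # Execute the candidate in a subprocess and check the exit code.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=30
        )
        if result.returncode == 0:
            return code  # the self-check passed
        # Otherwise, hand the error back to the model for the next attempt.
        feedback = f"\nThe previous attempt failed with:\n{result.stderr[-500:]}\nPlease fix it."
    return None
```

A real version would of course need sandboxing and proper test cases rather than just an exit-code check, but the loop captures the “talk to itself” pattern I have in mind.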
ChatGPT’s JSON Mode: A Step Toward Usable AI
One feature that already excites me is ChatGPT’s JSON mode/structured outputs in the API. It lets you specify the expected response format from ChatGPT for use in downstream applications. A significant challenge with early ChatGPT API usage was ensuring a consistent output format and parsing the results: responses sometimes came in slightly different formats, causing parsing failures. While this wasn’t an issue for responses read directly by a user, it became a major pain point for downstream applications.
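Here is roughly what that looks like, as a minimal sketch with the openai Python SDK (v1.x); the model name and the keys in the prompt are placeholders I made up for the example.

```python
# Minimal JSON-mode sketch: force the response to be valid JSON so it can be
# parsed reliably downstream. Note that JSON mode requires the word "JSON"
# to appear somewhere in the messages.
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # JSON mode is only available on certain models
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'sentiment' and 'confidence'."},
        {"role": "user", "content": "I love how easy the new API is to use!"},
    ],
)

result = json.loads(response.choices[0].message.content)
print(result["sentiment"], result["confidence"])
```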
JSON mode simplifies using ChatGPT’s output, making it the ideal tool for initial versions of various ML problems. Since ChatGPT is multimodal and multi-purpose, most ML problems could be posed to it, and its output could be used as a starting point for a solution or a baseline. This also compels ML developers to focus on the more critical question: How do we measure and define “good enough” performance for our solutions? By addressing these questions earlier in project development, I believe we can create far more usable AI products in the future.
However, as of today (August 2024), ChatGPT’s JSON mode is still limited to specific models and doesn’t fully support knowledge bases. While OpenAI has made strides in improving JSON mode capabilities, there’s still room for further development in this area.
Another promising area is function calling, which enables ChatGPT to interact directly with your APIs. Together with assistants and file search, it unlocks vast potential for future applications!
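As a hedged sketch of what that means in practice: you describe your function’s schema, and the model returns the name and arguments for you to execute. The get_weather function below is a made-up example, not a real API, and the model name is a placeholder.

```python
# Function-calling sketch with the openai Python SDK (v1.x): the model decides
# whether to call the (hypothetical) get_weather function and with what arguments.
from openai import OpenAI
import json

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function on your side
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Copenhagen?"}],
    tools=tools,
)

# In practice the model may also answer in plain text, so check tool_calls first.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))  # route this to your own API
```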
Closing Thoughts
In conclusion, revisiting this topic solidified for me the immense impact LLMs have had on the entire AI landscape. While core concepts haven’t seen drastic changes, the scaling of data and compute, coupled with improved user experience, has commoditized LLMs. Now, it’s up to us, the developers, to leverage this commodity to create amazing things! Good luck, and feel free to reach out if you’d like to collaborate on something cool (or question any of my statements)!