Over the past several weeks I’ve chatted with 50+ researchers, entrepreneurs and industry experts about the AI infrastructure and application landscape, from companies including Lux portfolio companies Hugging Face, Runway and MosaicML, and from academic institutions like Stanford’s Center for Research on Foundation Models. While this is by no means comprehensive (I left out a lot of interesting tidbits on hardware, fine-tuning and beyond), here’s what I’ve learned:
1. Access to GPUs matters…
- Most companies underestimate how many GPUs (graphics processing units) they need to power AI apps. As AI becomes prevalent in every application, and serves multiple use cases within each app workflow, demand for GPUs will only grow.
- Every company running large language models at scale today needs an expensive compute contract, usually for A100 GPUs, often with a large cloud provider (AWS, GCP, Azure etc.) or on a supercomputing cluster.
- Today, a few companies control most GPU access at scale, which puts smaller startups and labs at a disadvantage. Compute is also energy intensive.
- Some companies offer compute cheaply as a competitive advantage (e.g. OpenAI pricing its API cheaply alongside a more expensive ChatGPT consumer product) and/or establish alliances with major players (Microsoft and Google are the biggest giants here, but I wouldn’t underestimate Meta and Oracle).
- Without compute, it’s not possible to run LLMs at scale.
2. Proprietary data is a moat
- We don’t have enough “good clean” data on the web to train net new large language models today, so data is a crucial blocker to LLM development. Unstructured data is being put to use more than ever.
- Data incentivization (encouraging organizations to contribute data) and data retrieval (tying results back to actual data sources) will become increasingly important for mission-critical use cases.
- Specialized datasets for large language models will provide a crucial moat for AI applications, especially if combined with a data product flywheel or a workflow targeted at specific use cases. RunwayML is a great example in the Lux portfolio, focused on the creative economy and generative AI in video production.
- It’s possible that, over time, more advanced LLMs (e.g. GPT-5 or GPT-6) could outperform even models trained on specialized vertical datasets.
3. Infrastructure margin compression
- Most of the larger cloud players (AWS SageMaker, AzureML, Google Cloud tooling) offer or will offer the AI infrastructure stack for free, from inference to model deployment and experimentation.
- Many infrastructure business models are low margin, which makes it hard to win price-sensitive customers when the larger players undercut on price.
- A net new infrastructure player needs to offer ROI beyond cost (e.g. self-hosted or decentralized servers, vertical infrastructure, a 10x better user or developer experience). Together.xyz is a great example, combining cryptographic principles with AI.
4. Evaluation is an unsolved problem
- It’s difficult today to evaluate whether a large language model responded accurately to your prompt, and whether the answer was satisfactory.
- Most options in the market today have been more qualitative or “hand-wavy,” e.g. it “appeared” right. It’s still hard to measure whether the model hallucinated or its answer was outright false (the TruthfulQA paper provides one interesting example).
- A more quantitative, data-driven way to evaluate large language models, from both an accuracy and a Q&A perspective, is critical.
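To make “quantitative” concrete, here is a minimal sketch of one common approach: scoring model answers against reference answers with exact match and token-level F1, in the style of extractive Q&A benchmarks. This is an illustration, not a method anyone I spoke with endorsed. The function names and the example pairs are hypothetical, and overlap metrics like these will not catch every hallucination.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs):
    """Average both metrics over (model answer, reference answer) pairs."""
    n = len(pairs)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "f1": sum(token_f1(p, r) for p, r in pairs) / n,
    }


# Hypothetical pairs, for illustration only (not real model output).
EXAMPLES = [
    ("Paris", "paris"),
    ("The capital is Paris", "Paris"),
    ("Berlin", "Paris"),
]

if __name__ == "__main__":
    print(evaluate(EXAMPLES))
```

Scores like these only mean something relative to a curated reference set, which ties evaluation back to the data problem in section 2.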
5. Open vs. Closed tension
- Open source vs. closed models (e.g. Meta’s LLaMA vs. OpenAI)
- Self vs. cloud hosted AI infrastructure (e.g. MosaicML vs. AzureML)
- How will companies maintain sovereignty when choosing between an LLM or ML tooling hosted by a large player, whose performance compounds, and self-hosting AI applications on their own private cloud?
- Will the LLM ecosystem over time resemble the semiconductor industry, the database industry or something different altogether?
- It’s hard to know, but my hunch is it will be driven by enterprise production use cases and offer a combination of open and closed models and of self-hosted and cloud-hosted infrastructure. Companies like Hugging Face in the Lux portfolio have championed the open source ecosystem, recently hosting WoodstockAI, which featured over 5,000 folks as the largest open source meet-up ever.
- Large language model companies like Anthropic are particularly interesting to track as they align themselves with cloud and ecosystem players outside of the OpenAI and Microsoft ecosystems.
If you’re building in any of these spaces and taking advantage of the opportunity in AI infrastructure or applications today, I’d love to chat. Feel free to email me at firstname.lastname@example.org.
Special thanks to Danny Crichton, Siddarth Sharma, Aleksandra Piktus, Ankit Mathur, Andy Chen and Moin Nadeem for their thoughts and feedback on this piece.