Yann LeCun, in a YouTube video titled "Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416," discusses various aspects of AI. He advocates for open source AI development to prevent the concentration of power in proprietary systems. LeCun believes that AGI will be created but will remain under human control and will not pose a threat. He highlights the limitations of large language models (LLMs) in understanding the physical world and performing simple physical tasks. He argues that bilingualism does not change how people think, since most thinking occurs at an abstract level independent of language. The video also explores joint embedding architectures, such as JEPA, which train systems to encode inputs and predict in representation space. It covers the challenges of video prediction, the role of internal world models in AI, and the need for reasoning mechanisms, along with reinforcement learning, woke AI, open source, AI and ideology, and AGI. The video concludes with discussions of humanoid robots, the potential of AI to amplify human intelligence, and AI's positive impact on society.
Introduction
- Yann LeCun warns against the concentration of power in proprietary AI systems and advocates for open source AI development.
- He believes that open source AI can harness the goodness in humans and emphasizes that people are fundamentally good.
- Yann controversially asserts that while AGI will be created, it will remain under human control and not pose a threat to humanity.
Limits of LLMs
- Autoregressive LLMs like GPT-4 and Llama 2 and 3 lack essential characteristics of intelligent behavior, such as understanding the physical world, persistent memory, reasoning, and planning.
- LLMs are trained on vast amounts of text from the internet, but this is still small compared to the visual information a four-year-old child has absorbed (see the rough arithmetic after this list).
- Language is valuable but not sufficient for true intelligence; intelligence needs to be grounded in reality and interaction with the physical world.
- LLMs excel at tasks humans find intellectually demanding, yet fail at physically grounded tasks a child or teenager masters quickly, such as clearing the dinner table or learning to drive a car.
- There may be a missing component in the learning or reasoning architecture of LLMs that prevents them from understanding and interacting with the physical world.
- Current methods for training LLMs to process visual data are not sufficient.
- Researchers are working on developing LLMs that can construct a world model and perform tasks beyond language processing.
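The data-volume comparison can be worked out numerically. The figures below are the approximate, order-of-magnitude ones LeCun tends to cite (about 10^13 training tokens, about 16,000 waking hours for a four-year-old, roughly 20 MB/s through the optic nerve); treat them as illustrative assumptions, not measurements:

```python
# Back-of-the-envelope comparison of training-data volume.
# All figures are rough, order-of-magnitude assumptions.

llm_tokens = 1e13              # ~10^13 tokens of training text
bytes_per_token = 2            # ~2 bytes per token
llm_bytes = llm_tokens * bytes_per_token

awake_hours = 16_000           # a four-year-old's waking hours
optic_nerve_rate = 2e7         # ~20 MB/s through the optic nerve
child_bytes = awake_hours * 3600 * optic_nerve_rate

print(f"LLM text data:  {llm_bytes:.1e} bytes")   # ~2e13
print(f"Child's vision: {child_bytes:.1e} bytes") # ~1e15
print(f"ratio: {child_bytes / llm_bytes:.0f}x")   # vision wins by ~50x
```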
Bilingualism and thinking
- Bilingual individuals can think independently of the language they are speaking.
- Certain types of thinking, like mathematical concepts or imagining objects, are not linked to a specific language.
- Most of our thinking occurs at an abstract level of representation, regardless of the language we speak.
- Language is more closely tied to planning what we are going to say rather than the actual thinking process itself.
Limitations of language models (LLMs) in thinking:
- LLMs, such as OpenAI's GPT models, do not think about their responses; they retrieve stored patterns and generate tokens one after another automatically.
- This token-by-token process involves no planning and no deep deliberation.
- Having a sophisticated internal world model can potentially improve the generated sequence of tokens.
- The importance of having an internal world model for these systems is discussed.
Video prediction
Video prediction is a challenging task because video data is high-dimensional and continuous. Existing approaches struggle to predict fine details such as textures and objects, since there is no useful way to represent a probability distribution over a high-dimensional continuous space. Text prediction with language models succeeds precisely because a distribution over a finite vocabulary of discrete tokens is easy to represent. Training a system to learn image representations by reconstructing the image, a common self-supervised approach, does not produce good generic image features. An alternative approach, joint embedding, which predicts in representation space rather than pixel space, produces better representations and better performance on recognition tasks.
JEPA (Joint-Embedding Predictive Architecture)
The most profound aspect of JEPA is that prediction happens in an abstract representation space rather than in the input (pixel) space.
Key points:
- JEPA involves training a system to encode both a full image and a corrupted or transformed version of the image using identical or similar encoders.
- A predictor is then trained to predict the representation of the full image from the representation of the corrupted one (a minimal training-step sketch follows this list).
- JEPA creates a joint embedding of the full input and the corrupted version.
- Contrastive learning was historically the main way to train joint embedding architectures, but non-contrastive methods have emerged in recent years.
- Non-contrastive methods do not require negative samples and have expanded the ways such architectures can be trained.
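The following is a minimal sketch of one JEPA-style training step in PyTorch. It is illustrative only: the toy MLP encoders, the zero-out corruption, and all names are assumptions for the example, not Meta's implementation (real systems use vision transformers and structured masking).

```python
# Minimal JEPA-style training step (illustrative sketch, not Meta's code).
import copy
import torch
import torch.nn as nn

def make_encoder(dim=256):
    # Toy MLP encoder over 3x32x32 images; a real system would use a ViT.
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(3 * 32 * 32, dim),
                         nn.ReLU(),
                         nn.Linear(dim, dim))

encoder = make_encoder()                  # sees the corrupted image
target_encoder = copy.deepcopy(encoder)   # sees the full image
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()))

x = torch.rand(8, 3, 32, 32)                      # batch of full images
mask = (torch.rand_like(x[:, :1]) > 0.5).float()  # crude corruption:
x_corrupted = x * mask                            # zero out random pixels

with torch.no_grad():                     # no gradient through the target branch
    target_rep = target_encoder(x)
pred_rep = predictor(encoder(x_corrupted))
loss = nn.functional.mse_loss(pred_rep, target_rep)  # loss in representation space

opt.zero_grad()
loss.backward()
opt.step()
```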
JEPA vs LLMs
The video compares Joint-Embedding Predictive Architectures (JEPA) with large language models (LLMs). JEPA predicts abstract representations of its inputs, while LLMs regenerate the input directly, token by token. JEPA aims to extract the easily predictable content of an input and raise the level of abstraction of the representation. Key points include:
- JEPA is a first step towards Advanced Machine Intelligence (AMI) or Artificial General Intelligence (AGI).
- JEPA learns abstract representations in a self-supervised manner, allowing for hierarchical modeling of intelligent systems.
- LLMs rely on the abstraction already present in language and directly predict words.
- JEPA is essential for dealing with the complexity of physical reality, while LLMs are limited by the lack of redundancy in text data.
- Combining self-supervised training on visual and language data is a possibility.
- JEPA aims to develop systems that can learn how the world works, similar to the common sense and predictive abilities of animals.
- Non-contrastive techniques, such as distillation, are used in the learning procedures of these models.
- JEPA involves training a predictor to predict a representation of the uncorrupted input from the corrupted input.
- Training only the branch of the network that is fed with the corrupted input prevents the system from collapsing (see the distillation snippet below).
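In distillation-style (non-contrastive) methods such as those behind DINO and I-JEPA, the asymmetry works roughly as follows: gradients flow only through the corrupted-input branch, while the target encoder is updated as an exponential moving average (EMA) of the trained encoder. A hedged sketch, continuing the toy code above:

```python
# EMA update for the target encoder (illustrative). Without this asymmetry,
# both encoders could map every input to the same constant vector, making
# the loss trivially zero: the "collapse" the text refers to.
import torch

@torch.no_grad()
def ema_update(target_encoder, encoder, momentum=0.996):
    for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(momentum).add_(p, alpha=1 - momentum)

# Called once per training step, after the optimizer step:
# ema_update(target_encoder, encoder)
```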
DINO and I-JEPA
The most profound aspect of the topic is the development of DINO and I-JEPA at Facebook AI Research (FAIR) for self-supervised learning tasks.
Key points:
- DINO (self-DIstillation with NO labels) and I-JEPA (Image-based Joint-Embedding Predictive Architecture) are methods developed at Facebook AI Research (FAIR).
- DINO focuses on corrupting images by changing cropping, size, orientation, blurring, and colors.
- I-JEPA involves masking parts of an image and training the system to predict the representation of the original image from the corrupted one.
- I-JEPA's masking does not require the system to know it is processing an image, whereas DINO's augmentations are image-specific (both corruption styles are sketched below).
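A hedged illustration of the two corruption styles the bullets contrast; the particular transforms and patch size are arbitrary example choices, not the papers' exact recipes:

```python
# Two corruption styles (illustrative). DINO-style augmentation presupposes
# image structure; I-JEPA-style masking just hides regions of a grid.
import torch
import torchvision.transforms as T

dino_augment = T.Compose([
    T.RandomResizedCrop(32),         # image-specific: cropping and rescaling
    T.ColorJitter(0.4, 0.4, 0.4),    # image-specific: color distortion
    T.GaussianBlur(kernel_size=3),   # image-specific: blurring
])

def jepa_mask(x, patch=8, drop=0.25):
    # Generic corruption: zero out random patches of a grid.
    b, c, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) > drop).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask

x = torch.rand(4, 3, 32, 32)
augmented_view = dino_augment(x)   # DINO-style view
masked_context = jepa_mask(x)      # I-JEPA-style context
```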
V-JEPA
The most profound aspect of the topic of V-JEPA is its ability to learn representations of videos by masking segments of each frame and predicting the full video representation from the partially masked video.
Key points:
- V-JEPA is a system that learns representations of videos
- It masks a segment of each frame in a video and trains the system to predict the representation of the full video from the partially masked video
- It accurately classifies actions in videos and captures physical constraints
- It is still a long way from achieving a world model that can drive a car
- A modified version of V-JEPA involves shifting or masking parts of the video and training a system to predict the full video representation with additional input, such as actions
Additionally, the video discusses internal models in AI and their role in planning and prediction. An internal model represents the state of the world at a given time and predicts future states as a function of actions taken. Such models can be used to plan sequences of actions that achieve specific objectives. Model predictive control, which plans by optimizing actions against an internal model, has long been used in fields such as rocket trajectory planning, and this idea has been part of optimal-control thinking since the early 1960s; V-JEPA-style world models connect directly to that tradition. A minimal planner sketch follows.
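Below is a minimal sketch of model predictive control with a learned world model, using the simplest possible planner ("random shooting"). Everything here, the toy world model, the cost function, the horizon, is an assumption for illustration; a real system would use a trained predictor and a better optimizer:

```python
# Model predictive control (illustrative): sample candidate action
# sequences, roll each through the world model, execute the first action
# of the cheapest sequence, then replan at the next step.
import torch

def plan(world_model, cost, state, horizon=10, candidates=256, action_dim=2):
    actions = torch.randn(candidates, horizon, action_dim)  # random candidates
    total_cost = torch.zeros(candidates)
    s = state.expand(candidates, -1)
    for t in range(horizon):
        s = world_model(s, actions[:, t])   # predicted next state
        total_cost += cost(s)               # task-specific objective
    return actions[total_cost.argmin(), 0]  # best first action

# Toy stand-ins: next state = state + action; goal is to drive the state to 3.
world_model = lambda s, a: s + a
cost = lambda s: (s - 3.0).pow(2).sum(dim=-1)
first_action = plan(world_model, cost, torch.zeros(1, 2))
```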
Hierarchical planning
Hierarchical planning is essential for complex actions: an objective is broken down into sub-goals, and sub-goals into smaller actions. Training AI systems to learn the right hierarchical representations for planning remains an open challenge. Current LLMs cannot perform detailed hierarchical planning; they can reproduce plans for scenarios seen in training but cannot genuinely plan in new or unfamiliar situations, and they sometimes produce non-factual or hallucinated answers. LLMs can assist with describing tasks at a high level but need external tools to bridge the gap between low-level actions and higher-level representations.
Autoregressive LLMs
Autoregressive LLMs, including decoder-only models, have achieved impressive results through self-supervised learning, enabling language understanding systems, multilingual translation, text summarization, and question answering. These models predict each word from the words that precede it, and show surprising language abilities when scaled up. However, they lack a deep understanding of the world: they struggle with tasks that require understanding physics or navigating the physical world, and they can generate fluent language without genuine contextual understanding. Autoregressive LLMs are limited by their reliance on text and lack the sensory experience needed for a deeper understanding of the world.
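Mechanically, "predicting words based on previous words" is a loop like the following toy sketch; the stand-in model that returns random logits is purely an assumption so the snippet runs:

```python
# Autoregressive generation in a nutshell: sample one token at a time from
# a distribution conditioned on everything generated so far.
import torch

def generate(model, prompt_ids, n_new, temperature=1.0):
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model(torch.tensor(ids))             # per-position vocab scores
        probs = torch.softmax(logits[-1] / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1).item()  # sample the next token
        ids.append(next_id)                           # it becomes context
    return ids

# Stand-in "model": random logits over a 100-token vocabulary (demo only).
toy_model = lambda ids: torch.randn(len(ids), 100)
print(generate(toy_model, [1, 2, 3], n_new=5))
```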
AI hallucination
AI hallucination refers to the phenomenon where AI systems generate nonsensical or incorrect responses when presented with prompts or questions that are outside of their training data. This can occur when the prompt is significantly different from what the AI has been trained on or when certain words are substituted with equivalents from another language. The problem lies in the long tail of prompts that humans are likely to generate, which is difficult to account for in training the AI system. This highlights the need for AI systems that can reason and go beyond simple lookup tables.
- Large language models (LLMs) can produce hallucinations due to the nature of auto-regressive prediction.
- Each token generated by an LLM carries some probability of stepping outside the set of reasonable answers, so the chance that a long answer never deviates shrinks exponentially with its length (see the short calculation after this list).
- Fine-tuning the system helps cover a majority of questions, but there remains a vast space of unexplored prompts where the system may not behave properly.
- AI systems need to be able to reason and go beyond simple lookup tables to address the problem of hallucination.
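The exponential claim can be made concrete with a toy calculation. Assuming, purely for illustration, that each token independently stays "on track" with probability 1 - e:

```python
# If each token stays on track with probability (1 - e), an n-token answer
# stays entirely on track with probability (1 - e) ** n, which decays
# exponentially in n. (Independence is a simplifying assumption.)
e = 0.01                               # assumed per-token error rate
for n in (10, 100, 1000):
    print(n, round((1 - e) ** n, 5))   # ~0.904, ~0.366, ~0.00004
```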
Reasoning in AI
Reasoning in AI is a topic that explores the limitations of current language models and the potential for building reasoning mechanisms on top of them. The key points discussed include:
- Current LLMs exhibit only a primitive form of reasoning compared to human reasoning.
- Human reasoning involves spending more time and effort on complex problems, adjusting understanding iteratively, and utilizing hierarchical thinking.
- LLMs can serve as a foundation for reasoning mechanisms to be built upon.
- Future systems could incorporate planning abilities and a mental model for generating responses, which would differ significantly from autoregressive LLMs.
- Building a system that can perform inference of latent variables, similar to probabilistic or graphical models, could address the limitation of LLMs.
- The proposed system would use a prompt as observed variables and generate an output scalar number that measures the extent to which an answer is good for the prompt.
- An energy-based model can be used to construct a thought or answer in an abstract representation space through optimization.
- Deep reasoning requires working in the space of concepts rather than concrete sensory information.
- Training AI systems to reason and make inferences can be done through contrastive or non-contrastive methods.
- Good representations of inputs and outputs are crucial in reasoning tasks.
- Abstract representations of ideas can be achieved through the use of latent variables.
- Language models are trained implicitly to give high probability to correct words and low probability to incorrect words.
- Similar techniques used in language models can be applied to visual data using deep learning architectures.
- An energy-based system can measure the prediction error of a representation to determine if an image is good or corrupted.
- The energy of the system is low if the predicted representation matches the actual representation, indicating a good image.
In summary, reasoning in AI explores the limitations of current language models and proposes building reasoning mechanisms on top of them through the use of latent variables, optimization, and abstract representations.
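A minimal sketch of the energy-based idea described above, with hypothetical toy encoders: the energy is the error of predicting the answer's representation from the prompt's representation, and "thinking" becomes gradient descent on that energy in representation space rather than token-by-token generation. Names and dimensions are assumptions for the example:

```python
# Energy-based inference sketch (illustrative). Low energy means the answer
# representation is compatible with the prompt; a candidate is refined by
# optimizing the energy in abstract representation space.
import torch
import torch.nn as nn

dim = 64
enc_prompt = nn.Linear(128, dim)   # toy prompt encoder
predictor = nn.Linear(dim, dim)    # predicts a compatible answer representation

def energy(x, z):
    # Scalar score: error between the predicted and candidate answer reps.
    return (predictor(enc_prompt(x)) - z).pow(2).sum()

x = torch.rand(128)                        # observed prompt
z = torch.randn(dim, requires_grad=True)   # candidate answer representation

opt = torch.optim.SGD([z], lr=0.1)
for _ in range(100):                       # "reasoning" = minimizing energy
    opt.zero_grad()
    energy(x, z).backward()
    opt.step()
# z is now a low-energy (compatible) abstract answer, decoded separately.
```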
Reinforcement learning
Reinforcement learning (RL) is an approach Yann LeCun suggests minimizing in favor of alternatives: he recommends focusing on joint embedding architectures and energy-based models instead, using RL only when planning fails, to adjust the world model or the critic. LeCun emphasizes the importance of accurate world models and objective functions in optimizing AI systems. Reinforcement learning from human feedback (RLHF) has had a transformative effect on large language models: humans rate answers generated by the model, and an objective function, analogous to the reward model in RL, is trained to predict those ratings. Various methods, including purely supervised approaches, can be used to exploit human feedback. A toy version of the rating-prediction step follows.
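A hedged toy version of training such a rating predictor; the feature embeddings and the regression setup are assumptions for illustration (production reward models are typically trained on pairwise preferences rather than absolute scores):

```python
# Toy reward model (illustrative): regress human ratings of (prompt, answer)
# pairs so the scalar output can later score or rank candidate answers.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

features = torch.rand(32, 256)   # stand-in embeddings of (prompt, answer) pairs
ratings = torch.rand(32, 1)      # stand-in human ratings in [0, 1]

pred = reward_model(features)                 # predicted rating
loss = nn.functional.mse_loss(pred, ratings)  # fit to human judgments
opt.zero_grad()
loss.backward()
opt.step()
```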
Woke AI
Woke AI is a topic focused on the biases and limitations of models like Google's Gemini 1.5, which was criticized for generating historically inaccurate and over-censored images. The conversation covers the role such guardrails play in the design of these models and the case for open-source alternatives. LeCun argues that a single unbiased AI system is impossible, because people disagree about what counts as bias. The proposed solution is open-source platforms that allow diverse, specialized AI systems tailored to different needs, which promotes diversity and mitigates bias.
Open source
Releasing models as open source can be financially sustainable when revenue comes from other channels, such as ads, business customers, or ad-supported services. By releasing open source models, companies can reach a larger customer base and benefit from improvements contributed by others. Meta's bet is that its already large user and customer base will find these offerings useful and generate revenue, while Meta also gains from useful applications others build on top of its open source models.
Key points:
- Open source allows for the widespread distribution of AI models
- It leads to accelerated progress and involvement of a wide community
- Open source does not hinder the ability to derive revenue from the technology.
AI and ideology
The ideological bias in AI systems and its impact on engineering is discussed. Key points include:
- Political leanings affecting the design and deployment of AI systems.
- Companies overcompensating to cater to a broad customer base, leading to dissatisfaction.
- It is impossible to create a system perceived as unbiased by everyone.
- Pushing the system too far in one direction can generate non-factual or offensive content.
- Promoting diversity in all aspects of AI development is the only solution to address this issue.
Marc Andreessen
The challenges faced by big tech companies in developing generative AI products include internal activism, pressure from stakeholders, legal exposure, and the risk of generating undesirable outputs. Startups and open source projects are better equipped to handle these challenges. Open source allows for diversity and minimizes the number of unhappy users.
Llama 3
Llama 3 is an upcoming version of Meta AI's open-source Llama system, with improved multimodal capabilities. Future generations aim for systems that can plan, understand the world, and be trained from video. Progress can be monitored through published research, and collaborations between institutions will shape the future of machine learning.
AGI
The most profound aspect of the topic of AGI is that its development will be gradual and will require many different techniques to mature.
- AGI will not come suddenly, but will be a gradual process.
- Achieving human-level performance in areas such as learning from video, associative memory, reasoning and planning, and hierarchical representations will take at least a decade, if not more.
- Unforeseen problems may further delay the development of AGI.
- Previous claims of AGI being just around the corner have consistently been proven wrong.
- Intelligence cannot be measured using a single scalar, such as IQ.
- Intelligence is a collection of skills and the ability to acquire new skills efficiently.
- Each intelligent entity possesses a unique set of skills, making it difficult to compare and determine intelligence.
AI doomers
AI doomers imagine catastrophic scenarios in which AI escapes human control and threatens humanity. LeCun argues this rests on false assumptions. Superintelligence will not appear suddenly; systems will first reach the intelligence of a cat or a parrot and improve gradually, and safeguards and control mechanisms will be built in along the way. Many researchers will work on making these systems controllable and safe, and if a rogue AI ever emerges, good AI systems can be used to counter it. The belief that intelligent systems inherently want to take over is unfounded: the observation that more intelligent species in nature dominate others does not transfer to AI, and LeCun calls the idea that AI systems will eliminate humans, whether by design or by indifference, preposterous.
Joscha Bach
Joscha Bach's views come up in a discussion of the concentration of power in AI systems and the importance of open source platforms for diversity. LeCun emphasizes the need for AI systems that represent different cultures, opinions, languages, and value systems, argues that proprietary AI systems pose a greater danger to society than open ones, and holds that trusting humans and institutions is crucial for democracy and free speech.
Humanoid robots
Humanoid robots have the potential to become effective collaborators with humans, but challenges remain. Today's robotic systems rely on handcrafted models and careful planning, and they are not yet capable of domestic chores or fully autonomous driving. Progress in AI is crucial for advancing robotics: it will let systems train their own world models from observation rather than hand-engineering or large labeled datasets, and plan actions in abstract representation space rather than only in carefully modeled physical setups. Despite these limitations, humanoid robots already have useful applications in constrained settings such as factories.
Hope for the future
AI gives hope for the future by amplifying human intelligence and acting as a smart assistant; having machines smarter than we are can be beneficial, making people effectively smarter, much as public education, books, and the internet have. The printing press is an analogy for AI's potential impact: it made everyone smarter and enabled the Enlightenment.