Cheryl’s Birthday and LLM Theory of Mind Part I

This briefing document analyzes the logic puzzle “Cheryl’s Birthday,” its sequel, and a related variant. The document explores the origins of the puzzle, presents the puzzle statement and solution, examines a common incorrect solution, and discusses subsequent iterations of the puzzle.

Origins: “Cheryl’s Birthday” is a knowledge puzzle that gained widespread attention in 2015 after being posted online by Singaporean television personality Kenneth Kong. The puzzle, authored by Dr. Joseph Yeo Boon Wooi, was initially part of the 2015 Singapore and Asian Schools Math Olympiad (SASMO), intended for high-performing 14-year-old students.

The Puzzle: The puzzle presents a scenario where a […]
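
The summary is cut off before the puzzle statement, but the solution discussion can be made concrete with a short elimination sketch. This assumes the widely circulated 2015 SASMO version (ten candidate dates, Albert told the month, Bernard told the day); if the briefing uses a different variant, the candidate list changes but the elimination logic stays the same.

```python
from collections import Counter

# Candidate dates from the widely circulated 2015 SASMO version (assumed here).
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]

def solve(dates):
    # Albert knows the month, Bernard knows the day.
    # Statement 1: Albert doesn't know the date AND knows Bernard doesn't either,
    # so Albert's month contains no day that is unique across all candidates.
    day_counts = Counter(d for _, d in dates)
    unique_days = {d for d, c in day_counts.items() if c == 1}
    bad_months = {m for m, d in dates if d in unique_days}
    dates = [(m, d) for m, d in dates if m not in bad_months]

    # Statement 2: Bernard now knows the date,
    # so his day appears exactly once among the remaining candidates.
    day_counts = Counter(d for _, d in dates)
    dates = [(m, d) for m, d in dates if day_counts[d] == 1]

    # Statement 3: Albert now knows too,
    # so his month appears exactly once among the remaining candidates.
    month_counts = Counter(m for m, _ in dates)
    return [(m, d) for m, d in dates if month_counts[m] == 1]

print(solve(DATES))  # [('July', 16)] for the standard date list
```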

Beyond Keywords: Neural Retrieval with Context

This research paper proposes two methods for improving the performance of neural retrieval models by incorporating contextual information. The first method involves a training procedure that clusters documents into batches based on similarity, creating more challenging training examples. The second method introduces a new architecture that augments the standard encoder with additional information about neighboring documents, allowing the model to dynamically learn corpus statistics. The paper demonstrates that both methods achieve better results than traditional biencoders, particularly in out-of-domain settings.
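
A rough sketch of the first idea, under generic assumptions rather than the paper’s exact recipe: cluster document embeddings, then draw each training batch from a single cluster, so that in-batch negatives are near-neighbors and therefore harder.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_batches(doc_embeddings, batch_size=32, seed=0):
    """Group similar documents into the same batch so in-batch negatives are hard.
    Generic sketch only; the paper's actual clustering and batching may differ."""
    n_docs = len(doc_embeddings)
    n_clusters = max(1, n_docs // batch_size)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(doc_embeddings)

    rng = np.random.default_rng(seed)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Emit fixed-size batches of similar documents; the remainder is dropped here.
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            yield idx[start:start + batch_size]

# Example with random placeholder embeddings.
embs = np.random.randn(1000, 128).astype("float32")
for batch in clustered_batches(embs, batch_size=32):
    pass  # feed these document indices to the contrastive training loop
```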

Will o1 Ever Escape ChatGPT’s Old Training?

This study investigates whether the reasoning abilities of large language models (LLMs) are still influenced by their origins in next-word prediction. The authors examine the performance of a new LLM from OpenAI called o1, which is specifically optimized for reasoning, on tasks that highlight the limitations of LLMs based on their autoregressive nature. While o1 shows significant improvements compared to previous LLMs, it still displays a sensitivity to the probability of both the task and the output, suggesting that reasoning optimization may not fully overcome the probabilistic biases ingrained during training. The study provides evidence for the “teleological perspective,” which […]

Multilayer Perceptrons (MLPs) in Deep Neural Network Architecture

Let’s explore multilayer perceptrons (MLPs), a type of deep neural network architecture. The text first discusses the limitations of linear models and how they struggle to capture complex non-linear relationships in data. It then introduces hidden layers as a solution, explaining how they allow MLPs to represent non-linear functions. The excerpt explores the activation functions that are critical to making MLPs non-linear, including the ReLU, sigmoid, and tanh functions. It also highlights the role of activation functions in optimization and discusses variants such as the parameterized ReLU (pReLU) and Swish. Finally, the excerpt touches on the concept of universal approximators, demonstrating that MLPs can approximate any continuous function given enough hidden units, but emphasizes […]
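
To make the hidden-layer idea concrete, here is a minimal PyTorch sketch of a one-hidden-layer MLP with a ReLU activation, in the spirit of the chapter; the layer sizes are arbitrary placeholders.

```python
import torch
from torch import nn

# One hidden layer turns a linear model into an MLP; the ReLU between the two
# affine maps is what lets the network represent non-linear functions.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),  # input -> hidden (sizes are illustrative)
    nn.ReLU(),            # without this, the two Linear layers collapse into one
    nn.Linear(256, 10),   # hidden -> output
)

x = torch.randn(4, 1, 28, 28)  # a dummy batch of 28x28 images
print(mlp(x).shape)            # torch.Size([4, 10])
```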

Where’s Waldo? The Power of CNNs

This excerpt from Dive into Deep Learning explores the evolution of convolutional neural networks (CNNs) from basic multilayer perceptrons (MLPs). It begins by showing the limitations of MLPs in processing high-dimensional data like images, particularly the large number of parameters required. The excerpt then introduces the concepts of translation invariance and locality, which are crucial for building effective CNNs. These concepts are then applied mathematically to derive the structure of a convolutional layer, where a convolutional kernel is used to weight pixel intensities in a local region. Finally, the excerpt discusses the importance of channels in images and how they are integrated into convolutional operations, leading to the formation […]
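
A small sketch that makes the parameter-count argument and the convolution concrete; the image size and channel counts are arbitrary placeholders, not the chapter’s exact example.

```python
import torch
from torch import nn

# A fully connected layer from a 3x224x224 image to 1,000 hidden units would need
# 3 * 224 * 224 * 1000 ≈ 150 million weights; a convolutional kernel instead reuses
# the same small set of weights at every spatial location (locality + translation invariance).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)                    # one RGB image (3 input channels)
print(conv(x).shape)                               # torch.Size([1, 16, 224, 224])
print(sum(p.numel() for p in conv.parameters()))   # 3*3*3*16 + 16 = 448 weights
```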

Softmax Regression in Linear Neural Networks

Let’s walk through softmax regression, a machine learning method for classification problems, where the goal is to predict which category a data point belongs to. The text introduces the softmax function, which transforms the outputs of a neural network into probabilities for each category, ensuring that they sum to 1. The cross-entropy loss function is then used to measure the difference between the model’s predicted probabilities and the actual category, guiding the model to improve its accuracy. The explanation also covers the underlying concepts from information theory, such as entropy and surprisal, which provide a deeper understanding […]
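
A worked sketch of the two key pieces, with arbitrary example values: the softmax map computed by hand, and the cross-entropy loss computed both manually and with PyTorch’s fused, numerically stabler call.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw network outputs for 3 classes
label = torch.tensor([0])                   # the true class index

# Softmax: exponentiate and normalize, so outputs are positive and sum to 1.
probs = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
print(probs)            # ≈ [[0.786, 0.175, 0.039]]
print(probs.sum())      # 1.0

# Cross-entropy: the negative log-probability assigned to the correct class.
loss_manual = -torch.log(probs[0, label])
loss_fused = F.cross_entropy(logits, label)  # combines softmax + log + NLL in one stable op
print(loss_manual.item(), loss_fused.item()) # both ≈ 0.24
```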

Opportunities and Challenges with Knowledge Graphs

This article from the Artificial Intelligence Review examines the opportunities and challenges of knowledge graphs, a type of graph data that accumulates and conveys knowledge of the real world. The authors discuss how knowledge graphs are used in various AI systems, such as recommender systems, question-answering systems, and information retrieval tools, and highlight their potential benefits in fields like education, scientific research, social media, and medical care. The article also addresses the technical challenges of knowledge graph development, including knowledge graph embeddings, knowledge acquisition, knowledge graph completion, knowledge fusion, and knowledge reasoning. By outlining both the opportunities and challenges, the authors […]
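
As a tiny illustration of what “graph data that accumulates and conveys knowledge of the real world” looks like in practice, here is a sketch of a knowledge graph stored as subject–relation–object triples with a one-hop lookup; the entities and relations are made up for the example.

```python
# A knowledge graph is commonly stored as (head, relation, tail) triples.
triples = [
    ("Marie_Curie", "born_in", "Warsaw"),
    ("Marie_Curie", "field", "Physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie_Curie", "won", "Nobel_Prize_in_Physics"),
]

def neighbors(entity, relation=None):
    """One-hop lookup: all tails reachable from `entity`, optionally filtered by relation."""
    return [t for h, r, t in triples if h == entity and (relation is None or r == relation)]

print(neighbors("Marie_Curie"))             # ['Warsaw', 'Physics', 'Nobel_Prize_in_Physics']
print(neighbors("Marie_Curie", "born_in"))  # ['Warsaw']
```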

LoRA: The Original Paper on LoRA Fine-Tuning

This is a discussion of the original LoRA paper, which proposed a novel approach called Low-Rank Adaptation (LoRA) to make large language models (LLMs) more efficient for downstream tasks. LoRA avoids the computational and storage burden of traditional fine-tuning by freezing the pre-trained model weights and instead injecting trainable low-rank matrices into each layer of the Transformer architecture. This technique results in a significant reduction in trainable parameters and memory requirements without compromising model quality. The authors provide a comprehensive evaluation of LoRA across various NLP tasks, showcasing its effectiveness on models like RoBERTa, DeBERTa, GPT-2, and GPT-3. They also […]
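
A minimal sketch of the core idea, using a plain PyTorch linear layer rather than the authors’ reference implementation: the pre-trained weight matrix is frozen, and only a low-rank update B·A (scaled by alpha/r) is trained.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: y = base(x) + (alpha/r) * x @ A.T @ B.T,
    with the base weights frozen and only A, B trainable. Not the authors' reference code."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable parameters instead of 768*768 + 768
```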

Transferring Knowledge from Large Models to Small Models

We discuss a research paper that proposes a new method called Adaptive Feature Transfer (AFT) for transferring knowledge from large foundation models to smaller, task-specific downstream models. AFT prioritizes transferring only the most relevant information from the pre-trained model to the downstream model, leading to improved performance and reduced computational cost. The paper showcases AFT’s effectiveness on various vision, language, and multimodal datasets, demonstrating its ability to achieve significant performance gains compared to existing transfer learning methods. AFT’s design decisions, such as using a kernel formulation and learning feature weights, are analyzed and shown to be essential for its robust […]
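
The paper’s kernel formulation is not reproduced in this summary, so the following is only a loose, generic sketch of feature-level transfer: regularize the downstream model’s features toward a learned re-weighting of the frozen foundation model’s features. The names, shapes, and loss form here are illustrative assumptions, not AFT as published.

```python
import torch
from torch import nn

class FeatureTransferLoss(nn.Module):
    """Generic feature-transfer regularizer (illustrative only, not the AFT objective):
    project downstream features and pull them toward a learned, per-dimension
    re-weighting of the frozen foundation-model features."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)
        # Learned weights decide which teacher feature dimensions are worth transferring.
        self.feature_logits = nn.Parameter(torch.zeros(teacher_dim))

    def forward(self, student_feats, teacher_feats):
        w = torch.sigmoid(self.feature_logits)       # in (0, 1), one weight per dimension
        diff = self.proj(student_feats) - teacher_feats
        return (w * diff.pow(2)).mean()

# Usage: add this term to the task loss, with teacher_feats taken from the frozen model.
reg = FeatureTransferLoss(student_dim=512, teacher_dim=1024)
loss = reg(torch.randn(4, 512), torch.randn(4, 1024))
print(loss.item())
```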

Weight Decay in Linear Regression

Let’s talk about weight decay as a method of regularization to combat overfitting in machine learning models. Weight decay involves adding a penalty term to the loss function, which encourages the model to use smaller weights, thereby reducing the model’s complexity and improving its ability to generalize to new data. The text introduces the mathematical concept of norms and explains how weight decay relates to ridge regression in statistics. The source then walks through a simple linear regression example, implementing weight decay from scratch and then using a deep learning framework. Finally, the chapter summarizes the key points of weight decay and concludes with exercises for the reader. Read more here: https://d2l.ai/chapter_linear-regression/weight-decay.html
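
As a concrete illustration in the spirit of the chapter, here is a sketch of the two routes it describes: adding the L2 penalty to the loss by hand, and letting the framework apply it through the optimizer’s weight_decay argument. The data and hyperparameters are placeholders, and the bias is left out of the penalty, a common convention.

```python
import torch
from torch import nn

torch.manual_seed(0)
X, y = torch.randn(100, 5), torch.randn(100, 1)   # toy regression data
model = nn.Linear(5, 1)
lam = 3e-3                                        # weight-decay strength

# (1) From scratch: add (lam / 2) * ||w||^2 to the squared-error loss.
mse = nn.MSELoss()
loss = mse(model(X), y) + (lam / 2) * model.weight.pow(2).sum()
loss.backward()

# (2) With the framework: the optimizer applies the same shrinkage to the weights,
# while the bias parameter is excluded from the penalty.
opt = torch.optim.SGD([
    {"params": model.weight, "weight_decay": lam},
    {"params": model.bias},
], lr=0.01)
```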