Saurav Kadavath

research

∙ 02/15/2023

The Capacity for Moral Self-Correction in Large Language Models

We test the hypothesis that language models trained with reinforcement l...

Deep Ganguli, et al.

research

∙ 12/15/2022

Constitutional AI: Harmlessness from AI Feedback

As AI systems become more capable, we would like to enlist their help to...

Yuntao Bai, et al.

research

∙ 09/24/2022

DeepChrome 2.0: Investigating and Improving Architectures, Visualizations, Experiments

Histone modifications play a critical role in gene regulation. Consequen...

Saurav Kadavath, et al.

research

∙ 08/23/2022

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

We describe our early efforts to red team language models in order to si...

Deep Ganguli, et al.

research

∙ 07/11/2022

Language Models (Mostly) Know What They Know

We study whether language models can evaluate the validity of their own ...

Saurav Kadavath, et al.

research

∙ 04/12/2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We apply preference modeling and reinforcement learning from human feedb...

Yuntao Bai, et al.

research

∙ 10/06/2021

Pretraining Reinforcement Learning: Sharpening the Axe Before Cutting the Tree

Pretraining is a common technique in deep learning for increasing perfor...

Saurav Kadavath, et al.

research

∙ 05/20/2021

Measuring Coding Challenge Competence With APPS

While programming is one of the most broadly applicable skills in modern...

Dan Hendrycks, et al.

research

∙ 03/05/2021

Measuring Mathematical Problem Solving With the MATH Dataset

Many intellectual endeavors require mathematical problem solving, but th...

Dan Hendrycks, et al.

research

∙ 06/29/2020

The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

We introduce three new robustness benchmarks consisting of naturally occ...

Dan Hendrycks, et al.

research

∙ 06/28/2019

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

Self-supervision provides effective representations for downstream tasks...

Dan Hendrycks, et al.

