In January 2025, DeepSeek released DeepSeek-R1, its first large reasoning model. It sent shock waves through the industry as an open-source, cost-effective, and high-performing model.
This short summary was inspired by the research paper by Guo et al. (2025).
What is DeepSeek-R1? Why did Nvidia’s market capitalisation drop by roughly US$600 billion?
First, they used a previous LLM (DeepSeek-V3-Base, released in December 2024) to initialise DeepSeek-R1-Zero.
Then, they used a reinforcement learning technique called Group Relative Policy Optimisation to train DeepSeek-R1-Zero using a reward function.
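The group-relative idea behind GRPO can be sketched as follows: for each prompt, several completions are sampled, and each completion's advantage is its reward normalised against the group's mean and standard deviation, so no separate critic model is needed. This is a minimal illustrative helper, not code from the paper:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalise each sampled completion's reward against its group (GRPO-style).

    Hypothetical sketch: advantage = (reward - group mean) / group std.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # small epsilon avoids division by zero when all rewards are identical
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For example, rewards of [0.0, 2.0] for two sampled completions yield advantages of roughly -1 and +1, so the higher-reward completion is reinforced relative to its group.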
The reward functions used were rule-based:
Accuracy rewards:
Math problems: the model is rewarded when its final answer matches the ground-truth result.
Coding problems: a compiler runs the generated code against predefined test cases, and responses that pass are rewarded.
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Accuracy reward: 2.0 if the extracted answer matches the ground truth, else 0.0."""
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    # extract_xml_answer is a helper (defined elsewhere) that pulls the text
    # between the answer tags out of a completion.
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-' * 20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}",
          f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
Format rewards:
Tag formatting: positive rewards were given when the response placed its reasoning inside the <think> and </think> tags.
import re

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Format reward: 0.5 if the completion follows the expected tag structure.

    Note: this illustrative function checks <reasoning>/<answer> tags rather
    than the paper's <think> tags; the idea is the same.
    """
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets '.' span newlines, since reasoning is usually multi-line
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]
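Putting the two rule-based rewards together, a single completion can be scored like this. This is a self-contained toy demo under assumed tag names (<reasoning>/<answer>) and an assumed ground-truth answer, not the paper's actual pipeline:

```python
import re

# Hypothetical completion to score with both rule-based rewards.
completion = "<reasoning>\n7 * 6 = 42\n</reasoning>\n<answer>42</answer>"

# Format reward: does the completion follow the tag structure?
pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
format_reward = 0.5 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

# Accuracy reward: does the extracted answer match the assumed ground truth "42"?
extracted = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL).group(1).strip()
accuracy_reward = 2.0 if extracted == "42" else 0.0

print(format_reward + accuracy_reward)  # prints 2.5
```

A completion that answers correctly but skips the tags would earn only the accuracy reward, which is what pushes the model toward the structured reasoning format.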