In January 2025, DeepSeek released DeepSeek-R1, its first large reasoning model. It sent shock waves through the industry as an open-source, cost-effective, and high-performing model.
This short summary was inspired by the research paper by Guo et al. (2025).
What is DeepSeek-R1? Why did Nvidia’s market capitalisation drop by roughly US$600 billion?
First, they used a previous LLM (DeepSeek-V3-Base, released in December 2024) to initialise DeepSeek-R1-Zero.
Then, they used a reinforcement learning technique called Group Relative Policy Optimisation to train DeepSeek-R1-Zero using a reward function.
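The group-relative idea behind GRPO can be sketched as follows: for each prompt, several completions are sampled, and each completion's advantage is its reward normalised against the group's mean and standard deviation, so no separate critic model is needed. This is a minimal illustrative helper, not code from the paper:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalise each sampled completion's reward against its group (GRPO-style).

    Hypothetical sketch: advantage = (reward - group mean) / group std.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # small epsilon avoids division by zero when all rewards are identical
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For example, rewards of [0.0, 2.0] for two sampled completions yield advantages of roughly -1 and +1, so the higher-reward completion is reinforced relative to its group.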
The reward functions used were rule-based:
Accuracy rewards:
Math problems: the model is rewarded when its final answer matches the ground-truth result.
Coding problems: a compiler runs the generated code against predefined test cases, and responses that pass are rewarded.
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Accuracy reward: 2.0 if the extracted answer matches the ground truth, else 0.0."""
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    # extract_xml_answer is a helper (defined elsewhere) that pulls the text
    # between the answer tags out of a completion.
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-' * 20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}",
          f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
Format rewards:
Tag formatting: positive rewards were given when the response placed its reasoning inside the <think> and </think> tags.
import re

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Format reward: 0.5 if the completion follows the expected tag structure.

    Note: this illustrative function checks <reasoning>/<answer> tags rather
    than the paper's <think> tags; the idea is the same.
    """
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets '.' span newlines, since reasoning is usually multi-line
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]
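Putting the two rule-based rewards together, a single completion can be scored like this. This is a self-contained toy demo under assumed tag names (<reasoning>/<answer>) and an assumed ground-truth answer, not the paper's actual pipeline:

```python
import re

# Hypothetical completion to score with both rule-based rewards.
completion = "<reasoning>\n7 * 6 = 42\n</reasoning>\n<answer>42</answer>"

# Format reward: does the completion follow the tag structure?
pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
format_reward = 0.5 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

# Accuracy reward: does the extracted answer match the assumed ground truth "42"?
extracted = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL).group(1).strip()
accuracy_reward = 2.0 if extracted == "42" else 0.0

print(format_reward + accuracy_reward)  # prints 2.5
```

A completion that answers correctly but skips the tags would earn only the accuracy reward, which is what pushes the model toward the structured reasoning format.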