Reasoning capability of DeepSeek-R1 distill model vs its base model
Comparing a DeepSeek-R1 distilled model with its base model in terms of reasoning and mathematical capability
Yudho Ahmad Diponegoro
Amazon Employee
Published Feb 12, 2025
It has become a hot topic how the DeepSeek-R1 large language model (LLM) by DeepSeek gained remarkable reasoning capabilities[1]. Reinforcement learning was used in the training process to let the model learn to perform Chain of Thought (CoT) reasoning so that it can solve more complex problems accurately, in addition to being able to do self-reflection. This ability, which allows an artificial intelligence to reason through a problem before answering, is novel in the generative AI space.
The data generated during that training process, along with other data, was also used to teach smaller models how to answer complex problems with step-by-step reasoning, or CoT. This methodology worked in boosting the performance of the smaller models at solving complex problems with reasoning. The base models come from Llama 3.1, Llama 3.3, and Qwen2.5, ranging from 1.5B to 70B parameters.
This blog post shows the author's experiment comparing the reasoning ability of a base model and its DeepSeek-R1 distilled counterpart. The model used is DeepSeek-R1 Distill Llama 70B, which is based on Llama 3.3 70B Instruct. The experiment compares the responses of these two models to sample prompts to show the improved reasoning capability.
This experiment was run on a Graviton4 instance, c8g.16xlarge, with 64 vCPUs and 128 GB RAM. The models were run on CPU using llama.cpp, in their Q4 (4-bit quantized) versions. Please read the other blog post to understand more about the performance and cost of running the model, in addition to the step-by-step guidance on deploying it.
For each run of the experiment with the base model, the command used is like the one below.
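As a minimal sketch, assuming a GGUF build of Llama 3.3 70B Instruct Q4 and llama.cpp's llama-cli binary, the invocation looks roughly like this (the model filename, thread count, and token limit are placeholders):

```
# Illustrative llama.cpp run of the base model on the c8g.16xlarge.
# The GGUF filename, -t (threads), and -n (max generated tokens) are placeholders.
./llama-cli \
  -m ./Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -t 64 \
  -n 2048 \
  -p "<prompt>"
```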
For the DeepSeek-R1 distilled model, the command used is like the one below.
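Again as a sketch with placeholder paths, only the model file changes; a larger -n is useful here because the distilled model emits its thinking before the final answer:

```
# Illustrative llama.cpp run of the distilled model; filename and values are placeholders.
./llama-cli \
  -m ./DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf \
  -t 64 \
  -n 4096 \
  -p "<prompt>"
```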
For this experiment, I used the following prompt:
Yesterday 6 pm, I ran my on-demand EC2 instance with $0.7 hourly price. Then I went for dinner at my parent’s place. It took some time to drive there. Once arrived, I quickly launched another one with same instance family under spot with 55% lower price that time. I got a cup of chamomile soon after dinner and fell asleep. Then I woke up due to my noisy alarm which I set at 3:15. I forgot to turn off those instances! After doing my morning routing for 45 mins, I quickly turned them off. On that day, my total EC2 compute bill is $9.835 contributed merely by those 2 instances. How many minutes did it took for me to drive to my parent’s place?
Please reason step by step, and put your final answer within \boxed{}.
<think>\n
The expected correct answer is shown by the calculation below:
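A worked derivation, assuming both instances were stopped at 4:00 am (the 3:15 alarm plus the 45-minute routine) and letting T be the driving time in minutes:

```
% On-demand: 6:00 pm to 4:00 am = 10 hours at $0.70/hour = $7.00.
% Spot: 55% lower price = 0.7 x 0.45 = $0.315/hour, running (10 - T/60) hours.
\begin{aligned}
0.7 \times 10 + 0.315\left(10 - \tfrac{T}{60}\right) &= 9.835 \\
0.315\left(10 - \tfrac{T}{60}\right) &= 2.835 \\
10 - \tfrac{T}{60} &= 9 \\
T &= 60 \text{ minutes}
\end{aligned}
```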
Below is part of the answer from the original Llama 3.3 70B Q4 (4-bit quantized) model. I had to stop it after some tokens.
We can see that the answer is incorrect. The logic, however, seems to be correct; the model simply made a miscalculation when multiplying. This could have been avoided if the problem had been tackled in a different way.
Below is the answer from the DeepSeek-R1 Distill Llama 70B Q4 (also 4-bit quantized) model.
The answer is correct, despite the long generated response, which demonstrated its inner thinking. More interestingly, it demonstrates some self-reflection and an ability to doubt itself, which can be important in avoiding errors in problem solving.
For example, I observed these in the output above:
Wait, let me clarify the timeline.
Wait, no. Let me think again.
Wait, maybe it's better to convert all times to a 24-hour format to calculate durations.
But wait, is the spot instance priced per hour? Or is it a different pricing model?
Wait, that seems straightforward, but let me verify.
Since the output is fed back as input for the next token generation, these questions become part of the context for the subsequent reasoning, which can be crucial in avoiding errors.
Now let's push the models even further. I modified the prompt so that there are two missing variables: one is the duration of travel and the other is the EC2 spot price discount. Technically, with one equation and two unknowns, it can't be solved. The prompt is below.
Yesterday 6 pm, I ran my on-demand EC2 instance with $0.7 hourly price. Then I went for dinner at my parent’s place. It took some time to drive there. Once arrived, I quickly launched another one with same instance family under spot with some discount I can’t remember anymore. I got a cup of chamomile soon after dinner and fell asleep. Then I woke up due to my noisy alarm which I set at 3:15. I forgot to turn off those instances! After doing my morning routing for 45 mins, I quickly turned them off. On that day, my total EC2 compute bill is $9.835 contributed merely by those 2 instances. Can you calculate how many minutes did it took for me to drive to my parent’s place?
Please reason step by step, and put your final answer within \boxed{}.
<think>\n
Below is the response of the original Llama 70B Q4 model after I stopped it at some point to avoid unnecessary generated tokens.
I think the model performed well. It did acknowledge that the problem can't be solved without error. However, it still tried to answer.
Now below is the answer from the DeepSeek-R1 Distill Llama 70B Q4 model. I was fascinated by the way it answered, so I increased the maximum output tokens to 4096 and let it think and respond until it hit that token limit.
In there, it also realized the problem of the two missing variables, which makes the question unsolvable. However, what interested me is that it kept trying with certain spot discount assumptions and even said, "Alternatively, perhaps the spot price is the minimum possible, approaching zero, but that would make T approach 10 hours, which is too long."
To me, it looks like the model is somehow getting to the point that there can be a spot discount D which acts as a threshold for the equation to make sense (e.g., not resulting in an unreasonable number of traveling hours). In a previous run with a slightly different prompt, this model was able to mention that such an assumed discount D is not possible since it would result in a negative variable. I think this is an interesting capability: it tries to reason as much as it can to solve a very difficult problem.
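For reference, under the same timeline assumption as before (both instances stopped at 4:00 am), the bill gives one equation in two unknowns, which also illustrates where such a threshold discount comes from:

```
% T is the driving time in minutes, D is the unknown spot discount (0 < D < 1).
\begin{aligned}
0.7 \times 10 + 0.7(1 - D)\left(10 - \tfrac{T}{60}\right) &= 9.835 \\
(1 - D)\left(10 - \tfrac{T}{60}\right) &= 4.05
\end{aligned}
% Since the spot instance runs at most 10 hours (T >= 0), we need 1 - D >= 0.405,
% i.e. D <= 59.5%; assuming a larger discount would force T to be negative.
```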
There are multiple possible ways of deploying the DeepSeek-R1 Distill models on AWS. This blog post summarizes some of these deployment methods, including the one on CPU with Graviton4 that I previously published here.
Reasoning models such as DeepSeek-R1 and its distilled models can be a potential solution to many problems that previously couldn't be solved by typical LLMs. Their default CoT way of solving problems can lead to fewer errors in the final answer. But in exchange, they can be more verbose in their generated responses and may consume more output tokens.
Given the way the reasoning model answered, I think it does have an interesting capability to solve more complex problems, or to discover new things. However, it has its own place and it might not be for every use case, in a good way :)
The experiment above was run entirely on a CPU instance, a Graviton4 c8g.16xlarge. For that to work, the model used was a 4-bit quantized version of DeepSeek-R1 Distill Llama 70B.
Feel free to do more experiments yourselves!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.