Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions
I ran a quick experiment investigating how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite impressed by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates the actions as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:
The experiment followed model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find more evaluation details here.
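As a minimal sketch, these usage recommendations might translate into a request payload like the one below. The `build_request` helper and the `deepseek-reasoner` model name are assumptions for illustration, not part of the experiment's actual code.

```python
# Sketch of a chat request following the DeepSeek-R1 usage recommendations.
# The helper name and model identifier are illustrative assumptions.
def build_request(task: str) -> dict:
    return {
        "model": "deepseek-reasoner",
        # No system message and no few-shot examples, per the recommendations.
        "messages": [{"role": "user", "content": task}],
        "temperature": 0.6,  # recommended range: 0.5 - 0.7
    }
```

The key point is what is absent: no system message and a single user turn, with the temperature fixed inside the recommended range.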
Approach
DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.
Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
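To make this concrete, here is a hedged sketch of how a tool's source could be embedded in the prompt and made available to generated code actions. The `word_count` tool and the prompt wording are hypothetical examples, not taken from freeact.

```python
# A tool is just valid Python source embedded in the prompt.
# This word_count tool is a hypothetical example.
TOOL_SOURCE = '''
def word_count(text: str) -> int:
    """Count whitespace-separated words in a text."""
    return len(text.split())
'''

# The tool source is pasted directly into the prompt so the model
# can call it from the code actions it generates.
prompt = (
    "You can call the following Python tools from your code actions:\n"
    + TOOL_SOURCE
)

# The executor defines the same tool in the namespace used to run
# model-generated code actions.
namespace = {}
exec(TOOL_SOURCE, namespace)
```

Because the tool is plain Python, anything importable or definable in the execution environment can serve as a tool in the same way.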
Results from executing these actions are fed back to the model as follow-up messages, driving the next steps until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
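The loop can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not freeact's implementation: `model_call` is a placeholder for the model API, and the `<execute>...</execute>` delimiter for code actions is an invented convention for this sketch.

```python
import contextlib
import io
import re

def run_code(code: str, env: dict) -> str:
    """Execute one code action and capture its printed output."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue()

def agent_loop(task: str, model_call, env: dict, max_steps: int = 10):
    """Iterative coding loop: model emits code actions, results feed back
    as follow-up messages until a reply without a code action arrives."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model_call(messages)  # model plans and may emit a code action
        messages.append({"role": "assistant", "content": reply})
        match = re.search(r"<execute>(.*?)</execute>", reply, re.DOTALL)
        if match is None:
            return reply  # no code action: treat the reply as the final answer
        result = run_code(match.group(1), env)
        # Execution results drive the next step as a follow-up message.
        messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    return None  # no final answer within the step budget
```

The essential property is that the environment's feedback becomes part of the conversation, so each step can condition on everything executed so far.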
Conversations
DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from web pages. This drives the conversation with the environment that continues until a final answer is reached.
In contrast, o1 models are known to perform poorly when used as chat models, i.e. they do not attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.
Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that lacked tool use capabilities? After all, isn't tool use support an important mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to conduct similar experiments with o1 models.
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether via code actions or not, could be one option to improve efficiency.
Underthinking
I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This was a major cause of the overly long reasoning traces produced by DeepSeek-R1. It can be seen in the recorded traces that are available for download.
Future experiments
Another common application of reasoning models is to use them for planning only, while using other models for generating code actions. This could be a potential new feature of freeact, if this separation of roles proves useful for more complex tasks.
I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.