Thinking/Reasoning models are a new class of LLMs designed to make their internal reasoning visible. Unlike standard LLMs, which return only a final answer, thinking models such as Claude 3.7 Sonnet, OpenAI o1/o3, and DeepSeek R1 "think out loud," generating a detailed chain of thought before presenting their conclusions. These models are optimized for tasks that demand complex analysis, multi-step reasoning, and structured logic. Obiguard offers access to these advanced models through a unified API that works seamlessly across providers.

Supported Thinking Models

Obiguard currently supports these thinking-enabled models:
  • Anthropic: claude-3-7-sonnet-latest
  • Google Vertex AI: anthropic.claude-3-7-sonnet@20250219
  • Amazon Bedrock: claude-3-7-sonnet
Additional thinking models will be supported as they become available.

Using Thinking Mode

  1. Set strict_open_ai_compliance=False in your headers or client configuration.
  2. Expect a different response format: thinking content is returned in content blocks rather than in the standard completion fields.
  3. For streaming responses, read the thinking content from response_chunk.choices[0].delta.content_blocks.

Basic Example

from obiguard import Obiguard

# Initialize the Obiguard client
client = Obiguard(
  virtual_key="vk-obg***",   # Add your provider's virtual key
  strict_open_ai_compliance=False  # Required for thinking mode
)

# Create the request
response = client.chat.completions.create(
  model="claude-3-7-sonnet-latest",
  max_tokens=3000,
  thinking={
    "type": "enabled",
    "budget_tokens": 2030  # Maximum tokens to use for thinking
  },
  stream=False,
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
        }
      ]
    }
  ]
)
print(response)

# For streaming responses, set stream=True and read content_blocks
# from each chunk's delta instead of the final message:
stream = client.chat.completions.create(
  model="claude-3-7-sonnet-latest",
  max_tokens=3000,
  thinking={"type": "enabled", "budget_tokens": 2030},
  stream=True,  # same config as above, but streamed
  messages=[
    {"role": "user", "content": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"}
  ]
)
for chunk in stream:
    if chunk.choices[0].delta:
        content_blocks = chunk.choices[0].delta.get("content_blocks")
        if content_blocks is not None:
            for content_block in content_blocks:
                print(content_block)

Multi-Turn Conversations

For multi-turn conversations, include the previous thinking content in the conversation history:
from obiguard import Obiguard

# Initialize the Obiguard client
client = Obiguard(
  virtual_key="vk-obg***",   # Your Obiguard virtual key here
  strict_open_ai_compliance=False
)

# Create the request
response = client.chat.completions.create(
  model="claude-3-7-sonnet-latest",
  max_tokens=3000,
  thinking={
    "type": "enabled",
    "budget_tokens": 2030
  },
  stream=False,
  messages=[
    {
      "role": "user",
      "content": [
    {
      "type": "text",
      "text": "when does the flight from baroda to bangalore land tomorrow, what time, what is its flight number, and what is its baggage belt?"
    }
      ]
    },
    {
      "role": "assistant",
      "content": [
    {
      "type": "thinking",
      "thinking": "The user is asking several questions about a flight from Baroda (also known as Vadodara) to Bangalore:\n1. When does the flight land tomorrow\n2. What time does it land\n3. What is the flight number\n4. What is the baggage belt number at the arrival airport\n\nTo properly answer these questions, I would need access to airline flight schedules and airport information systems. However, I don't have:\n- Real-time or scheduled flight information\n- Access to airport baggage claim allocation systems\n- Information about specific flights between these cities\n- The ability to look up tomorrow's specific flight schedules\n\nThis question requires current, specific flight information that I don't have access to. Instead of guessing or providing potentially incorrect information, I should explain this limitation and suggest ways the user could find this information.",
      "signature": "EqoBCkgIARABGAIiQBVA7FBNLRtWarDSy9TAjwtOpcTSYHJ+2GYEoaorq3V+d3eapde04bvEfykD/66xZXjJ5yyqogJ8DEkNMotspRsSDKzuUJ9FKhSNt/3PdxoMaFZuH+1z1aLF8OeQIjCrA1+T2lsErrbgrve6eDWeMvP+1sqVqv/JcIn1jOmuzrPi2tNz5M0oqkOO9txJf7QqEPPw6RG3JLO2h7nV1BMN6wE="
    }
      ]
    },
    {
      "role": "user",
      "content": "thanks that's good to know, how about to chennai?"
    }
  ]
)
print(response)
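
If you assemble the conversation history programmatically, a small helper can copy the thinking block (including its signature, which must be passed back unmodified) from one response into the next request. This is a minimal sketch, not part of the SDK: it assumes the non-streaming response exposes the assistant's blocks as a content_blocks list of dicts shaped like the example above.

# Hypothetical helper (not part of the SDK): carry a prior turn's thinking
# into the next request. Assumes response.choices[0].message carries a
# content_blocks list of dicts shaped like the "thinking" block above.
def assistant_turn_with_thinking(response):
    message = response.choices[0].message
    blocks = message.get("content_blocks") or []
    content = [
        {
            "type": "thinking",
            "thinking": block["thinking"],
            "signature": block["signature"],  # pass the signature back unmodified
        }
        for block in blocks
        if block.get("type") == "thinking"
    ]
    return {"role": "assistant", "content": content}

# Usage: append the assistant turn, then the next user question
messages.append(assistant_turn_with_thinking(response))
messages.append({"role": "user", "content": "thanks that's good to know, how about to chennai?"})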

Understanding the Response Structure

When working with thinking-enabled models, note that their responses use an extended format: the thinking output is returned in content_blocks rather than in the standard response.choices[0].message.content string. This matters most for streaming, where each chunk carries its blocks in response_chunk.choices[0].delta.content_blocks and your parsing logic must read the thinking content from there, as in the sketch below.
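
As an illustration, here is a minimal sketch that accumulates the thinking and the final answer separately from a stream. It assumes each content block is a dict with a "type" key of "thinking" or "text", mirroring the block shapes in the multi-turn example above; the exact field names may vary by provider.

# Minimal sketch: split streamed output into reasoning and answer.
# Assumes dict blocks with a "type" of "thinking" or "text".
thinking_parts, answer_parts = [], []

for chunk in stream:
    delta = chunk.choices[0].delta
    if not delta:
        continue
    for block in delta.get("content_blocks") or []:
        if block.get("type") == "thinking":
            thinking_parts.append(block.get("thinking", ""))
        elif block.get("type") == "text":
            answer_parts.append(block.get("text", ""))

print("Reasoning:\n", "".join(thinking_parts))
print("Answer:\n", "".join(answer_parts))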

FAQs

Can I enable thinking mode on any model?
No, thinking mode is only available on specific reasoning-optimized models. Currently, this includes Claude 3.7 Sonnet, and support will expand to other models as they become available.

Does thinking mode increase token usage?
Yes, enabling thinking mode increases your token usage, since the model generates additional content for its reasoning process. The budget_tokens parameter lets you cap the number of tokens allocated to thinking.

Do I need to change my response-parsing code?
Yes, particularly for streaming responses. The thinking content is returned in the content_blocks array rather than the standard content field, so you'll need to adapt your parsing logic accordingly.

Why must strict_open_ai_compliance be set to false?
The thinking mode response format extends beyond the standard OpenAI completion schema. Setting strict_open_ai_compliance to false allows Obiguard to return this extended format with the thinking content.