Thinking/Reasoning models are a new class of LLMs designed to make their internal reasoning visible. Unlike standard LLMs, which return only a final answer, thinking models such as Claude 3.7 Sonnet, OpenAI o1/o3, and DeepSeek R1 “think out loud” by generating a detailed chain of thought before presenting their conclusions.

These models are optimized for tasks that demand complex analysis, multi-step reasoning, and structured logic. Obiguard offers access to these advanced models through a unified API that works seamlessly across different providers.

Supported Thinking Models

Obiguard currently supports these thinking-enabled models:

  • Anthropic: claude-3-7-sonnet-latest
  • Google Vertex AI: anthropic.claude-3-7-sonnet@20250219
  • Amazon Bedrock: claude-3-7-sonnet

Additional thinking models will be supported as they become available.

Using Thinking Mode

  1. Set strict_open_ai_compliance=False in your headers or client configuration (shown in the Basic Example below).
  2. Thinking responses come back in a different format from standard completions: the reasoning is delivered as content blocks rather than a plain content string.
  3. For streaming, the thinking content appears in response_chunk.choices[0].delta.content_blocks.

Basic Example

from obiguard import Obiguard

# Initialize the Obiguard client
client = Obiguard(
  virtual_key="vk-obg***",   # Add your provider's virtual key
  strict_open_ai_compliance=False  # Required for thinking mode
)

# Create the request
response = client.chat.completions.create(
  model="claude-3-7-sonnet-latest",
  max_tokens=3000,
  thinking={
    "type": "enabled",
    "budget_tokens": 2030  # Maximum tokens to use for thinking
  },
  stream=False,
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
        }
      ]
    }
  ]
)
print(response)

# For streaming responses, handle content_blocks differently
# response = client.chat.completions.create(
#   ...same config as above but with stream=True
# )
# for chunk in response:
#     if chunk.choices[0].delta:
#         content_blocks = chunk.choices[0].delta.get("content_blocks")
#         if content_blocks is not None:
#             for content_block in content_blocks:
#                 print(content_block)
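
For the non-streaming request above, you can pull the thinking and the final answer apart in a similar way. The sketch below is a minimal illustration, assuming the non-streaming response exposes the same content_blocks array on choices[0].message that the streaming deltas carry; adjust the field access to match what your SDK version actually returns.

# Sketch (assumption): non-streaming responses expose a content_blocks list
# on choices[0].message, mirroring the streaming delta format above.
message = response.choices[0].message
for block in getattr(message, "content_blocks", None) or []:
    if block.get("type") == "thinking":
        print("THINKING:", block.get("thinking"))
    elif block.get("type") == "text":
        print("ANSWER:", block.get("text"))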

Multi-Turn Conversations

For multi-turn conversations, include the previous thinking content in the conversation history:

from obiguard import Obiguard

# Initialize the Obiguard client
client = Obiguard(
  virtual_key="vk-obg***",   # Your Obiguard virtual key here
  strict_open_ai_compliance=False
)

# Create the request
response = client.chat.completions.create(
  model="claude-3-7-sonnet-latest",
  max_tokens=3000,
  thinking={
    "type": "enabled",
    "budget_tokens": 2030
  },
  stream=False,
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "when does the flight from baroda to bangalore land tomorrow, what time, what is its flight number, and what is its baggage belt?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "thinking",
          "thinking": "The user is asking several questions about a flight from Baroda (also known as Vadodara) to Bangalore:\n1. When does the flight land tomorrow\n2. What time does it land\n3. What is the flight number\n4. What is the baggage belt number at the arrival airport\n\nTo properly answer these questions, I would need access to airline flight schedules and airport information systems. However, I don't have:\n- Real-time or scheduled flight information\n- Access to airport baggage claim allocation systems\n- Information about specific flights between these cities\n- The ability to look up tomorrow's specific flight schedules\n\nThis question requires current, specific flight information that I don't have access to. Instead of guessing or providing potentially incorrect information, I should explain this limitation and suggest ways the user could find this information.",
          "signature": "EqoBCkgIARABGAIiQBVA7FBNLRtWarDSy9TAjwtOpcTSYHJ+2GYEoaorq3V+d3eapde04bvEfykD/66xZXjJ5yyqogJ8DEkNMotspRsSDKzuUJ9FKhSNt/3PdxoMaFZuH+1z1aLF8OeQIjCrA1+T2lsErrbgrve6eDWeMvP+1sqVqv/JcIn1jOmuzrPi2tNz5M0oqkOO9txJf7QqEPPw6RG3JLO2h7nV1BMN6wE="
        }
      ]
    },
    {
      "role": "user",
      "content": "thanks that's good to know, how about to chennai?"
    }
  ]
)
print(response)
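
Rather than copying the thinking block by hand, you can lift it from the previous turn’s response. A minimal sketch, assuming the thinking blocks (including their signature field) are exposed on response.choices[0].message.content_blocks:

# Sketch (assumption): reuse the previous response's content_blocks
# (the thinking block plus its signature) as the assistant turn of
# the follow-up request.
previous_blocks = response.choices[0].message.content_blocks

followup = client.chat.completions.create(
  model="claude-3-7-sonnet-latest",
  max_tokens=3000,
  thinking={"type": "enabled", "budget_tokens": 2030},
  stream=False,
  messages=[
    {"role": "user", "content": [{"type": "text", "text": "when does the flight from baroda to bangalore land tomorrow, what time, what is its flight number, and what is its baggage belt?"}]},
    {"role": "assistant", "content": previous_blocks},
    {"role": "user", "content": "thanks that's good to know, how about to chennai?"}
  ]
)
print(followup)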

Understanding the Response Structure

When working with thinking-enabled models, note that their responses use a special format:

When streaming, the assistant’s thinking output arrives in the response_chunk.choices[0].delta.content_blocks array, not in the response.choices[0].message.content string.

This distinction is crucial for streaming responses: the thinking content must be extracted from the content blocks rather than read from a single content string.
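
To separate the reasoning from the final answer while streaming, filter the blocks by their type. The sketch below is illustrative; it assumes a request made with stream=True and that each block is a dict with a "type" key mirroring the message-level blocks above (the exact keys can vary by provider).

# Sketch (assumption): each streamed block is a dict with a "type" key
# ("thinking" or "text") matching the message-level block format above.
stream = client.chat.completions.create(
  model="claude-3-7-sonnet-latest",
  max_tokens=3000,
  thinking={"type": "enabled", "budget_tokens": 2030},
  stream=True,
  messages=[{"role": "user", "content": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"}]
)

thinking_parts, answer_parts = [], []
for chunk in stream:
    delta = chunk.choices[0].delta
    if not delta:
        continue
    for block in delta.get("content_blocks") or []:
        if block.get("type") == "thinking":
            thinking_parts.append(block.get("thinking", ""))
        elif block.get("type") == "text":
            answer_parts.append(block.get("text", ""))

print("".join(answer_parts))  # the final answer, with the thinking kept separately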

FAQs