Obiguard screens LLM interactions to address a range of threats. Validator checks, flagging logic, and strictness levels are all managed centrally through an Obiguard guardrail policy.
A project may contain multiple guardrail policies. A policy allows you to define specific configurations for each application, LLM-based feature, environment, or end user. You can also adjust validators and strictness levels for individual apps and integrations dynamically to address threats, improve user experience, or align with your risk tolerance.
Obiguard validators are categorized into four types of defenses:
Prompt defense - Safeguard against user prompts, documents, or other LLM inputs containing instructions that could override intended behavior, manipulate the LLM into malicious actions, or leak sensitive data. These harmful instructions, known as prompt attacks, include jailbreaks, prompt injections, and other manipulative or malicious inputs.
Content moderation - Prevent your applications from generating inappropriate or harmful content, and identify attempts by users to produce offensive material or engage in dangerous or illegal activities. Content moderation includes:
Data Leakage Prevention - Safeguard Personally Identifiable Information, prevent system prompt leakage, and avoid costly leakage of sensitive data, ensuring compliance with data protection and privacy regulations. PII coverage includes:
Unknown links - Prevent attackers from tricking the LLM into presenting malicious or phishing links to users.
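As an illustration, a policy covering all four defense categories might be defined along the following lines. The schema below (the field names, ‘enabled’ flags, and overall shape) is a hypothetical sketch, not Obiguard’s actual configuration format:

```python
# Hypothetical guardrail policy covering the four defense categories.
# The schema is illustrative only, not Obiguard's actual format.
example_policy = {
    "name": "customer-support-chatbot",
    "defenses": {
        "prompt_defense": {"enabled": True},
        "content_moderation": {"enabled": True},
        "data_leakage_prevention": {"enabled": True},
        "unknown_links": {"enabled": True},  # binary detector: passed/failed only
    },
}
```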
A policy group specifies the defenses to apply for securing LLM interactions and the paranoia level, allowing you to tailor configurations to your specific use cases and risk tolerance.
Each project in Obiguard includes a set of policy configurations. These policies determine the guardrail checks applied to every API request associated with the project’s API key.
For example, a policy might:
When the validators in the policy group screen a request, any detection will result in the request being marked as ‘flagged’.
If one or more validators flag the request, the guard response will indicate ‘flagged: failed’. If no validators flag the request, the response will indicate ‘flagged: passed’.
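As a minimal sketch, screening a request could look like the following. The endpoint URL, authorization header, and request and response fields are assumptions made for illustration; consult the actual API reference for the real shapes:

```python
import requests

# Hypothetical guard endpoint; the real URL and payload shape may differ.
GUARD_URL = "https://api.obiguard.example/v1/guard"

def screen(text: str, api_key: str) -> dict:
    """Screen text against the guardrail policy tied to this project API key."""
    response = requests.post(
        GUARD_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"input": text},
        timeout=10,
    )
    response.raise_for_status()
    # Assumed response shape: {"flagged": "failed"} if any validator
    # flagged the request, {"flagged": "passed"} otherwise.
    return response.json()
```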
You can configure your application to handle flagged responses in various ways. Options include blocking the flagged inputs or outputs, prompting the user for confirmation before proceeding, logging the flagged response for analysis and monitoring, or taking no action. The choice is entirely yours.
Optionally, the response can include a detailed breakdown of the flagging decision. This will show the detectors that were executed, as specified in the policy, and indicate whether each detector identified an issue.
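Building on the sketch above, an application could branch on the flagged result and the optional breakdown as shown below. The ‘breakdown’ field and its entries are assumed shapes, and blocking is just one of the handling options described:

```python
import logging

logger = logging.getLogger("guardrails")

def handle_screened_input(result: dict, user_text: str) -> str | None:
    """Return the text to forward to the LLM, or None to block it."""
    if result.get("flagged") == "passed":
        return user_text

    # Assumed optional breakdown: one entry per executed detector,
    # e.g. {"detector": "prompt_defense", "flagged": True}.
    for entry in result.get("breakdown", []):
        if entry.get("flagged"):
            logger.warning("Detector %s flagged the input", entry["detector"])

    # This sketch blocks flagged inputs; prompting the user for
    # confirmation, or logging and proceeding, are equally valid choices.
    return None
```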
Certain validators allow you to adjust flagging thresholds by setting a confidence level for each defense category. This lets you customize the strictness and risk tolerance for your specific use cases.
For instance, if a use case has a high tolerance for risk, you can configure a detector to flag only detections with very high confidence, minimizing false positives. Conversely, for scenarios requiring strict protection, even at the expense of user experience, you can set the detector to flag any potential detection. This is achieved by adjusting the confidence level threshold for each defense. The threshold determines the minimum confidence level required for a detector to flag a detection.
Obiguard defines the following confidence levels:
L1 - Lenient, minimizing false positives.
L2 - Balanced, allowing some false positives.
L3 - Strict, with very low false negatives but more false positives.
L4 - Paranoid, prioritizing minimal false negatives at the cost of higher false positives. This is the default confidence level.
These levels align with OWASP’s paranoia level definitions for web application firewalls (WAFs).
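Expressed against the hypothetical schema sketched earlier, per-defense thresholds might look like this; the level semantics follow the definitions above:

```python
# Hypothetical per-defense confidence thresholds (illustrative schema).
# Lower levels tolerate more risk; higher levels flag more aggressively.
high_risk_tolerance = {
    "prompt_defense": "L1",      # flag only high-confidence detections
    "content_moderation": "L2",  # balanced
}
strict_protection = {
    "prompt_defense": "L4",      # paranoid: minimize false negatives
    "content_moderation": "L4",
}
```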
The Unknown Links detector cannot be adjusted and only provides a binary result: ‘passed’ or ‘failed’.
The response time of the guard API depends on the content length being screened and the detectors specified in the policy.
Changes to a policy may affect the latency experienced by your application and its users.
If you need help aligning your policies with strict latency requirements, please reach out to [email protected].
Here’s an example of how you could configure a policy for your customer support chatbot:
Input settings:
Output settings:
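Purely as an illustration under the hypothetical schema from the earlier sketches, input and output screening for a support chatbot might be configured along these lines; the split by direction and the chosen levels are assumptions:

```python
# Illustrative input/output settings for a support chatbot, using the
# hypothetical schema from the earlier sketches.
support_chatbot_policy = {
    "input": {
        "prompt_defense": {"enabled": True, "level": "L4"},
        "data_leakage_prevention": {"enabled": True, "level": "L3"},
    },
    "output": {
        "content_moderation": {"enabled": True, "level": "L3"},
        "unknown_links": {"enabled": True},  # binary: passed/failed
    },
}
```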
For example, if one of your products is jerk seasoning and Obiguard flags mentions of “jerk” as potential profanity due to strict content moderation settings, you can resolve this by editing the policy. Adjust the profanity threshold level to L1 while keeping other content moderation detectors at L3. This change takes effect immediately, allowing your customers to freely ask questions about jerk seasoning.
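In the same hypothetical schema, that edit might look like the following; the per-detector override mechanism is itself an assumption:

```python
# Loosen only the profanity detector so product terms like "jerk
# seasoning" pass, while the other moderation detectors stay strict.
content_moderation = {
    "enabled": True,
    "level": "L3",                     # default for moderation detectors
    "overrides": {"profanity": "L1"},  # hypothetical per-detector override
}
```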