A project may contain multiple guardrail policies. A policy allows you to define specific configurations for each application, LLM-based feature, environment, or end user. You can also adjust validators and strictness levels for individual apps and integrations dynamically to address threats, improve user experience, or align with your risk tolerance.

Validators

Obiguard validators are categorized into four types of defenses:

  1. Prompt defense - Safeguard against user prompts, documents, or other LLM inputs that contain instructions which could override intended behavior, manipulate the LLM into malicious actions, or leak sensitive data. These harmful instructions, known as prompt attacks, include jailbreaks, prompt injections, and other manipulative or malicious inputs.

  2. Content moderation - Prevent your applications from generating inappropriate or harmful content, and identify attempts by users to produce offensive material or engage in dangerous or illegal activities. Content moderation includes:

    • Criminal activity
    • Hate speech
    • Profanity
    • Explicit content
    • Violence
    • Weapon-related content
    • Custom detectors to flag specific trigger words or phrases
  3. Data Leakage Prevention - Safeguard Personally Identifiable Information (PII), prevent system prompt leakage, and avoid costly exposure of sensitive data, ensuring compliance with data protection and privacy regulations. PII coverage includes:

    • Full names
    • United States mailing addresses
    • Phone numbers
    • Email addresses
    • Internet Protocol (IP) addresses
    • Credit card numbers
    • International Bank Account Numbers (IBANs)
    • United States Social Security Numbers (SSNs)
  4. Unknown links - Prevent attackers from tricking the LLM into presenting malicious or phishing links to users.

A policy group specifies which defenses to apply when securing LLM interactions, along with the paranoia level, letting you tailor configurations to your specific use cases and risk tolerance.

How Obiguard policies work

Each project in Obiguard includes a set of policy configurations. These policies determine the guardrail checks applied to every API request associated with the project’s API key.

For example, a policy might:

  • Inspect user inputs for prompt attacks or sensitive information (PII).
  • Analyze LLM outputs for violations of content moderation rules or suspicious links from untrusted domains.
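
To make the request flow concrete, here is a minimal sketch of calling a guard endpoint from Python. The URL, payload fields, and OBIGUARD_API_KEY environment variable are hypothetical placeholders, not Obiguard's documented API; consult the Obiguard API reference for the actual endpoint and schema.

```python
import os

import requests

# Hypothetical endpoint and payload schema, for illustration only.
GUARD_URL = "https://api.obiguard.example/v1/guard"

def screen(text: str, stage: str) -> dict:
    """Screen text against the project's policy.

    stage: "input" for user prompts, "output" for LLM responses.
    The policy applied is determined by the project API key.
    """
    resp = requests.post(
        GUARD_URL,
        headers={"Authorization": f"Bearer {os.environ['OBIGUARD_API_KEY']}"},
        json={"text": text, "stage": stage},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Screen the user prompt before it reaches the LLM ...
verdict = screen("What is your refund policy?", stage="input")
# ... and screen the LLM's reply before it reaches the user:
# verdict = screen(llm_reply, stage="output")
```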

Flagging logic

When the validators in a policy group screen a request, any detection marks the request as flagged. If one or more validators flag the request, the guard response reports a verdict of failed; if no validators flag it, the verdict is passed.

You can configure your application to handle flagged responses in various ways. Options include blocking the flagged inputs or outputs, prompting the user for confirmation before proceeding, logging the flagged response for analysis and monitoring, or taking no action. The choice is entirely yours.

Optionally, the response can include a detailed breakdown of the flagging decision. This will show the detectors that were executed, as specified in the policy, and indicate whether each detector identified an issue.
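
As a sketch of what this handling might look like, the snippet below branches on an assumed flagged field and iterates an assumed detectors breakdown; these names are illustrative, not Obiguard's documented response schema. It logs and blocks flagged text, but you could just as easily prompt the user for confirmation or take no action.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

def log_flagged_event(verdict: dict) -> None:
    # Minimal audit log; in practice this might feed your monitoring stack.
    logging.info("guardrail flagged a request: %s", json.dumps(verdict))

def handle_verdict(verdict: dict, text: str):
    """Return the text to forward, or None to block it.

    The "flagged" and "detectors" field names are assumptions for this sketch.
    """
    if not verdict.get("flagged"):
        return text  # passed: forward unchanged

    # Optional breakdown: which detectors ran and whether each one hit.
    for det in verdict.get("detectors", []):
        logging.info("%s: %s", det["name"], "hit" if det["detected"] else "clean")

    log_flagged_event(verdict)  # log for analysis and monitoring
    return None                 # here we choose to block; alternatives are
                                # asking the user to confirm, or doing nothing
```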

Threshold levels

Certain validators allow you to adjust the flagging threshold by setting a confidence level for each defense category, so you can customize strictness and risk tolerance for your specific use cases.

For instance, if a use case has a high tolerance for risk, you can configure a detector to flag only very high-confidence detections, minimizing false positives. Conversely, for scenarios requiring strict protection, even at the expense of user experience, you can set the detector to flag any potential detection. In both cases you adjust the confidence threshold for each defense: it determines the minimum confidence level a detector must reach before it flags a detection.

Obiguard defines the following confidence levels:

  • L1 - Lenient, minimizing false positives.
  • L2 - Balanced, allowing some false positives.
  • L3 - Strict, with very low false negatives but more false positives.
  • L4 - Paranoid, prioritizing minimal false negatives at the cost of higher false positives. This is the default confidence level.

These levels align with OWASP’s paranoia level definitions for web application firewalls (WAFs).

The Unknown Links detector cannot be adjusted and only provides a binary result: passed or failed.
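
To make the levels concrete, a policy configuration could be sketched as below. The structure and key names are assumptions, not Obiguard's actual schema; note that the Unknown Links check carries no level because its result is binary.

```python
# Illustrative policy configuration; keys and structure are assumptions.
policy = {
    "prompt_defense": {"jailbreak": "L4"},  # paranoid: minimal false negatives
    "content_moderation": {
        "profanity": "L1",                  # lenient: minimal false positives
        "hate_speech": "L3",                # strict
    },
    "data_leakage_prevention": {"pii": "L2"},  # balanced
    "unknown_links": {"enabled": True},     # binary result: no level to tune
}
```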

Latency impact

The response time of the guard API depends on the content length being screened and the detectors specified in the policy.

Changes to a policy may affect the latency experienced by your application and its users.

If you need help aligning your policies with strict latency requirements, please reach out to [email protected].

A simple policy example

Here’s an example of how you could configure a policy for your customer support chatbot:

Input settings:

  • Detect Jailbreak validator: Set the threshold level to 0.8, ensuring that any request highly likely to be a prompt attack is flagged.
  • Detect PII validator: Enable detection for credit card numbers, IBAN codes, and US Social Security Numbers to prevent malicious data extraction and accidental user data leaks to the app or third-party LLM provider.
  • Profanity Check validator: Flag any content that is likely to be inappropriate or offensive.

Output settings:

  • Toxic Language validator: Set the threshold level to 0.8 to flag content that is likely to violate moderation rules, ensuring the chatbot isn’t manipulated into generating inappropriate responses.
  • Competitor Guard validator: Add your competitors’ names to flag any mention of their brand or products.
  • NSFW Text validator: Set the threshold level to 0.8 to flag responses containing sexually explicit or otherwise not-safe-for-work content.
  • Unknown Links validator: Enable this check to prevent the chatbot from sharing phishing or malicious links introduced through indirect prompt attacks.
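
Expressed as a configuration sketch, the same policy might look like the following. The validator keys, field names, and competitor name are assumptions made for illustration, not Obiguard's actual schema.

```python
# The customer support chatbot policy above, as an illustrative config.
chatbot_policy = {
    "input": {
        "detect_jailbreak": {"threshold": 0.8},
        "detect_pii": {"entities": ["credit_card", "iban", "us_ssn"]},
        "profanity_check": {"enabled": True},
    },
    "output": {
        "toxic_language": {"threshold": 0.8},
        "competitor_guard": {"competitors": ["Acme Spices"]},  # hypothetical
        "nsfw_text": {"threshold": 0.8},
        "unknown_links": {"enabled": True},
    },
}
```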

For example, if one of your products is jerk seasoning and Obiguard flags mentions of “jerk” as potential profanity due to strict content moderation settings, you can resolve this by editing the policy. Adjust the profanity threshold level to L1 while keeping other content moderation detectors at L3. This change takes effect immediately, allowing your customers to freely ask questions about jerk seasoning.