Safeguard Settings

Learn how to configure safeguard warnings so you are alerted to the use of problematic content.

The Safeguard feature in Narus helps your team stay ahead of risk by flagging potentially problematic prompts submitted by users. When a concerning input is detected, an alert appears in the Reports screen, giving you visibility into how AI models are being used across your teams. Safeguard does not block prompts, but it empowers administrators to monitor activity and step in when needed to protect brand integrity or ensure compliance. Alert triggers are configured in the Safeguard screen, and prompts that have triggered safeguard warnings are listed in the Reports screen.

Configure Triggers

Within the Safeguard screen, you can define specific topics, words, and formats that trigger alerts when they appear in prompts. You can also assign a severity level to each trigger, so your team can quickly distinguish minor issues from high-priority risks and stay focused on what matters most to your brand and compliance standards.
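To picture how these pieces fit together, here is a minimal, hypothetical sketch of a trigger configuration. The field names and values are illustrative only and do not reflect Narus's internal schema.

```python
# Hypothetical sketch of Safeguard trigger definitions.
# Field names and values are illustrative, not Narus's internal schema.
from dataclasses import dataclass

@dataclass
class Trigger:
    kind: str        # "topic", "word", or "format"
    value: str       # e.g. "sensitive data", "confidential", "email_address"
    risk_level: str  # "low", "medium", or "high"

triggers = [
    Trigger(kind="topic", value="sensitive data", risk_level="high"),
    Trigger(kind="word", value="confidential", risk_level="medium"),
    Trigger(kind="format", value="email_address", risk_level="high"),
]

# Surfacing high-risk triggers first mirrors how severity levels help you prioritise.
high_priority = [t for t in triggers if t.risk_level == "high"]
```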

Topics

Topics offer a streamlined way to set up alerts without manually listing every related keyword. By defining a topic (such as "sensitive data" or "politics"), you can automatically trigger alerts based on a broad set of related terms, saving time and ensuring more comprehensive coverage.

To add a topic:

  1. Navigate to the Safeguard screen.
  2. Select + Add topic under the Topics heading.
  3. Enter a topic and assign a risk level (low, medium or high).
  4. Click Add topic. The topic is now listed under Topics on the Safeguard screen.
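Conceptually, a topic acts as an umbrella over many related terms, so a single definition covers phrasing you never listed explicitly. The sketch below illustrates that idea with a hand-picked term list; it is not how Narus detects topics, just a simplified mental model.

```python
# Illustrative only: a topic expands to many related terms,
# so one definition covers wording you did not list explicitly.
TOPIC_TERMS = {
    "sensitive data": ["password", "credit card", "ssn", "api key", "salary"],
    "politics": ["election", "political party", "campaign", "ballot"],
}

def topics_in_prompt(prompt: str) -> list[str]:
    """Return the topics whose related terms appear in the prompt."""
    text = prompt.lower()
    return [topic for topic, terms in TOPIC_TERMS.items()
            if any(term in text for term in terms)]

print(topics_in_prompt("Please summarise the salary and SSN data in this file"))
# ['sensitive data']
```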

Words

Unlike Topics, which detect themes based on a range of related terms, individual keywords always trigger an alert when they appear in a prompt, regardless of context. This is useful when you need to monitor highly sensitive or brand-critical terms with precision, ensuring nothing slips through the cracks.

To add a word:

  1. Navigate to the Safeguard screen.
  2. Select + Add word under the Words heading.
  3. Enter a word and assign a risk level (low, medium or high).
  4. Click Add word. The word is now listed under Words on the Safeguard screen.
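A word trigger, by contrast, is a direct match: if the term appears anywhere in a prompt, an alert fires. Here is a simplified sketch of that behaviour; the trigger words are made up for illustration.

```python
# Illustrative only: word triggers match the exact term wherever it appears.
import re

WORD_TRIGGERS = {"confidential": "medium", "project aurora": "high"}  # term -> risk level

def word_alerts(prompt: str) -> list[tuple[str, str]]:
    """Return (word, risk_level) pairs for every trigger word found in the prompt."""
    text = prompt.lower()
    return [(word, level) for word, level in WORD_TRIGGERS.items()
            if re.search(r"\b" + re.escape(word) + r"\b", text)]

print(word_alerts("Draft an email about Project Aurora for external partners"))
# [('project aurora', 'high')]
```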

Formats

Format-based alerts allow you to detect prompts that include data matching specific patterns, such as phone numbers, email addresses, or other personal information. This helps you proactively identify when users may be entering sensitive data.

You can enable or disable these alerts using the on/off toggles.
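As a mental model, format triggers look for patterns rather than specific terms. The snippet below shows the general idea using deliberately simplified regular expressions; the patterns Narus actually applies are not exposed and will differ.

```python
# Illustrative only: simplified patterns for detecting common data formats.
import re

FORMAT_PATTERNS = {
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone_number": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def formats_in_prompt(prompt: str) -> list[str]:
    """Return the names of any data formats detected in the prompt."""
    return [name for name, pattern in FORMAT_PATTERNS.items()
            if pattern.search(prompt)]

print(formats_in_prompt("Contact jane.doe@example.com or call +44 20 7946 0958"))
# ['email_address', 'phone_number']
```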

Change Warning Level

To change the warning level for any trigger, select the cog icon next to the trigger.