Content moderation is critical to maintaining the integrity and safety of interactions with LLMs. Layerup Security has developed a custom model specifically designed to detect harmful content in the responses generated by LLMs. The model evaluates each response against several categories of harmful content, ensuring that responses adhere to our strict guidelines for safe and respectful communication.

The harmful content categories that our model can detect include:

  • Violence and Hate
  • Sexual Content
  • Criminal Planning
  • Guns and Illegal Weapons
  • Regulated or Controlled Substances
  • Self-Harm
  • Profanity

Each category has specific requirements that define what should and should not be included in the LLM’s responses. For instance, while discussions on sensitive topics like violence or discrimination are allowed for educational purposes, any content that encourages or plans violent acts, hate speech, or discrimination is strictly prohibited.

To moderate the LLM’s responses, invoke the layerup.content_moderation guardrail. The guardrail analyzes the response and flags any harmful elements based on the predefined categories above. If harmful content is detected, you can take appropriate action, such as filtering the content, alerting a moderator, or rejecting the response altogether.
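As a rough illustration, a minimal sketch of running this guardrail from application code is shown below. The client class, the execute_guardrails method, and the response fields (all_safe, offending_guardrail) are assumptions for the purpose of the example; only the layerup.content_moderation guardrail name comes from this guide, so consult the SDK reference for the exact interface.

```python
import os

# Assumption: the Layerup Python SDK exposes a LayerupSecurity client with an
# execute_guardrails method; the names and response fields below are illustrative.
from layerup_security import LayerupSecurity

layerup = LayerupSecurity(api_key=os.environ["LAYERUP_API_KEY"])

# Conversation containing the LLM response you want to moderate.
messages = [
    {"role": "user", "content": "Tell me about the history of chemistry."},
    {"role": "assistant", "content": "<LLM response to be checked>"},
]

# Run the content moderation guardrail against the response.
result = layerup.execute_guardrails(["layerup.content_moderation"], messages)

if not result.get("all_safe", True):
    # Harmful content detected: filter it, alert a moderator, or reject it outright.
    print("Response flagged by:", result.get("offending_guardrail"))
    final_response = "Sorry, I can't share that response."
else:
    final_response = messages[-1]["content"]
```

Handling the flagged case explicitly in application code, rather than silently dropping responses, also makes it straightforward to log violations for later moderator review.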

Our content moderation model is an essential tool for creating a safe environment in which users can interact with LLMs without exposure to harmful content.