Job Description
- Build Python-based pipelines for automated quality testing of AI responses.
- Integrate LLMs into automated evaluation frameworks (e.g., using GPT-based evaluators, embeddings, or custom scoring); a minimal LLM-as-judge sketch appears after this list.
- Automate regression and stress testing for conversational AI flows, as in the regression-gate sketch below.
- Define evaluation metrics (relevance, factuality, coherence, safety, empathy).
- Implement both rule-based and AI-driven quality checks (a rule-based example is sketched below).
- Monitor model drift, bias, and hallucinations using automated workflows; see the drift sketch after this list.
- Work with APIs, SDKs, and CI/CD pipelines to embed automated AI evaluation in production.
- Develop monitoring dashboards to visualize conversation quality.
- Collaborate with ML engineers, product managers, and QA teams to close the feedback loop.
- Experiment with prompt engineering and automated prompt-testing frameworks.
- Explore reinforcement learning, self-critique models, or human-in-the-loop automation for continuous improvement.
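For illustration only, here is a minimal LLM-as-judge sketch covering the metrics named above. It assumes the OpenAI Python SDK with an OPENAI_API_KEY in the environment; the model name and prompt wording are placeholder assumptions, one possible approach rather than a prescribed implementation.

```python
# Minimal LLM-as-judge sketch. Assumes `pip install openai` and an
# OPENAI_API_KEY in the environment; the model choice is illustrative.
import json
from openai import OpenAI

client = OpenAI()

METRICS = ["relevance", "factuality", "coherence", "safety", "empathy"]

JUDGE_PROMPT = (
    "Rate the assistant reply to the user message on each metric from 1 to 5. "
    "Metrics: {metrics}. Respond with a JSON object mapping each metric to its score.\n\n"
    "User: {user}\nAssistant: {reply}"
)

def judge(user: str, reply: str) -> dict[str, int]:
    """Score one conversational turn with an LLM evaluator."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any judge-capable model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                metrics=", ".join(METRICS), user=user, reply=reply),
        }],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```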
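A rule-based complement to the AI-driven judge, using only the standard library; the banned-phrase pattern and the length bound are placeholder assumptions.

```python
# Deterministic rule-based checks that can run alongside the LLM judge.
# The banned-phrase pattern and length bound are illustrative placeholders.
import re

BANNED = re.compile(r"\b(as an ai language model|i am just an ai)\b", re.I)

def rule_checks(reply: str) -> dict[str, bool]:
    """Cheap pass/fail gates evaluated before any model call."""
    return {
        "non_empty": bool(reply.strip()),
        "no_boilerplate": BANNED.search(reply) is None,
        "within_length": len(reply) <= 2000,
    }
```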
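A sketch of how such checks might gate a CI/CD pipeline with pytest; the golden set, the score threshold, and the hypothetical `quality` module holding the helpers above are all assumptions.

```python
# Regression gate: replay a golden set and fail the build when quality drops.
# `quality` is a hypothetical module containing the judge() and rule_checks()
# helpers from the sketches above; golden data and threshold are assumed.
import pytest

from quality import judge, rule_checks

GOLDEN = [
    ("How do I reset my password?",
     "Go to Settings > Security and choose Reset Password."),
]

@pytest.mark.parametrize("user,reply", GOLDEN)
def test_turn_quality(user, reply):
    scores = judge(user, reply)
    assert all(score >= 3 for score in scores.values()), scores
    assert all(rule_checks(reply).values())
```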
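Finally, a drift sketch: compare the centroid of recent reply embeddings against a stored baseline and alert when the two diverge; the cosine-distance threshold of 0.15 is an illustrative assumption.

```python
# Embedding-drift sketch: alert when the mean embedding of recent replies
# moves away from a stored baseline. The 0.15 threshold is an assumption.
import numpy as np

def centroid_drift(baseline: np.ndarray, current: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two batches."""
    a, b = baseline.mean(axis=0), current.mean(axis=0)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_alert(baseline: np.ndarray, current: np.ndarray,
                threshold: float = 0.15) -> bool:
    return centroid_drift(baseline, current) > threshold
```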
Ready to Apply?
Take the next step in your AI career. Submit your application to Recro today.