Evaluating AI output involves creating a standardized rubric that we can apply to the responses our prompts produce, in order to see how the model fared. Some common criteria for evaluating AI responses include:
Accuracy/Correctness: Measures whether the information provided is factually true and free from errors or hallucinations.
Helpfulness: Assesses how useful or actionable the response is for the user’s specific task or intent.
Relevance: Determines if the response directly addresses all parts of the user’s query without going off-topic or including unnecessary information.
Coherence/Clarity: Evaluates the logical structure, narrative flow, and readability of the response, ensuring it is easy for a human to understand.
Safety/Bias: Checks for the absence of harmful, biased, or inappropriate content, ensuring the response adheres to ethical guidelines.
Consistency: Assesses whether the model provides similar answers to similar questions, which builds user trust and predictability.
Completeness: Measures if the response fully answers the user’s prompt without leaving out critical details.
Tone/Style: Evaluates if the response’s tone and style are appropriate for the context and user’s request (e.g., formal, empathetic, technical).
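One way to make such a rubric easier to apply consistently across responses is to capture it as a simple data structure. The sketch below is illustrative only; the 1–5 scale, field names, and placeholder scores are my assumptions rather than part of any standard.

```tsx
// Hypothetical sketch of the rubric as a data structure; the 1-5 scale and the
// field names are assumptions for illustration, not part of the case study.
type Criterion =
  | "accuracy"
  | "helpfulness"
  | "relevance"
  | "coherence"
  | "safety"
  | "consistency"
  | "completeness"
  | "tone";

interface RubricScore {
  criterion: Criterion;
  score: 1 | 2 | 3 | 4 | 5; // assumed 1-5 scale
  notes: string;            // free-form observations for the reviewer
}

// Placeholder entries only; not the scores assigned in this case study.
const evaluation: RubricScore[] = [
  { criterion: "accuracy", score: 5, notes: "Factually correct, no hallucinations." },
  { criterion: "helpfulness", score: 4, notes: "Useful suggestions, but accessibility is missing." },
];
```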
In this case study, I apply these criteria to a component created by GPT-4 in response to the prompt included below. At a high level, the component is required to make an asynchronous API call and fetch some data.
In evaluating the model’s response, the focus is on the criteria most relevant to frontend code quality and developer usability. It is also worth noting that while best practice recommends that API calls originate from a backend, frontend components still need to manage async state, error presentation, and user-facing data flow, which is the focus of this evaluation.
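For context, here is a minimal sketch of the kind of component under evaluation, assuming a React function component fetching from a placeholder endpoint. It is illustrative only and not GPT-4’s actual output; the endpoint URL, data shape, and component name are placeholders.

```tsx
// Minimal sketch of the kind of component under evaluation; not GPT-4's actual
// output. The endpoint URL, data shape, and component name are placeholders.
import { useEffect, useState } from "react";

interface Item {
  id: number;
  name: string;
}

export function ItemList() {
  const [items, setItems] = useState<Item[]>([]);
  const [error, setError] = useState<string | null>(null);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    const controller = new AbortController();

    async function fetchItems() {
      try {
        const res = await fetch("https://api.example.com/items", {
          signal: controller.signal,
        });
        if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
        setItems(await res.json());
      } catch (err) {
        // Ignore aborts triggered by unmount; surface real failures to the user.
        if ((err as Error).name !== "AbortError") {
          setError((err as Error).message);
        }
      } finally {
        setLoading(false);
      }
    }

    fetchItems();
    return () => controller.abort(); // cleanup when the component unmounts
  }, []);

  if (loading) return <p>Loading…</p>;
  if (error) return <p role="alert">{error}</p>;
  return (
    <ul>
      {items.map((item) => (
        <li key={item.id}>{item.name}</li>
      ))}
    </ul>
  );
}
```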
ACCURACY / CORRECTNESS
Conclusion: The model returns a factually correct answer.
HELPFULNESS
Here the model did well, making recommendations on ways to improve the component.
Includes API key setup and usage, even though this was not explicitly mentioned in the prompt (one possible approach is sketched after this section’s conclusion)
Offers suggestions on what to add next, including responsiveness and TypeScript
In the next steps section, suggestions for accessibility are missing
Conclusion: Helpful but still requires review.
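To illustrate the API key point, a minimal sketch of one approach, assuming a Vite build where the key lives in a .env file at the project root, could look like the following. The VITE_API_KEY name and helper function are hypothetical, and any key bundled into frontend code is ultimately visible to users, which is why backend-originated calls remain the recommended practice.

```tsx
// Hypothetical sketch, assuming a Vite build: variables prefixed with VITE_ in a
// .env file at the project root are exposed to client code via import.meta.env.
// Caveat: any key bundled into frontend code is visible to end users.
const API_KEY: string | undefined = import.meta.env.VITE_API_KEY;

export async function fetchWithKey(url: string): Promise<unknown> {
  if (!API_KEY) {
    throw new Error("VITE_API_KEY is not set; add it to the project's .env file.");
  }
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.json();
}
```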
COHERENCE / CLARITY
The model returned a clear response by creating sections and tying them together logically
Breaks the code up into smaller modules and clarifies the purpose each of them serves
Names each module clearly, fulfilling an explicit requirement of the prompt
Creates a styles object, which is helpful, but does not clarify where it lives, i.e. whether it is imported or defined locally (one option is sketched after this section’s conclusion)
Introduces code to store an API key but does not clarify the .env file naming convention or suggest where the file should live
Conclusion: Once again a strong performance, but it misses a few key details.
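To make the styles ambiguity concrete, one option is to move the styles object into its own module and import it into the component. The file name and style values below are hypothetical.

```tsx
// styles.ts: hypothetical separate module, imported by the component, which is
// one way to resolve the "imported or local?" ambiguity noted above.
import type { CSSProperties } from "react";

export const styles: Record<string, CSSProperties> = {
  container: { padding: "1rem", maxWidth: "40rem", margin: "0 auto" },
  list: { listStyle: "none", padding: 0 },
  error: { color: "crimson" },
};

// In the component file: import { styles } from "./styles";
```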
COMPLETENESS
The model provided a response that was sufficiently complete and included all major requirements mentioned in the prompt. It went further to make recommendations on options for next steps.
Provides a cleanup function that runs when the component unmounts
The component is well structured and goes so far as to include basic styles
Accessibility features are not mentioned (one possible addition is sketched after this section’s conclusion)
Suggestions for testing could be expanded on
Conclusion: Addresses prompt completely with room for improvement.
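As an example of what an accessibility addition could look like (my assumption of what such features would mean here), the loading and error states could be announced to screen readers via live regions. The component and prop names below are hypothetical.

```tsx
// Hypothetical accessibility addition: announce loading and error states to
// assistive technology via live regions. Component and prop names are assumed.
interface StatusProps {
  loading: boolean;
  error: string | null;
}

export function Status({ loading, error }: StatusProps) {
  if (loading) return <p role="status" aria-live="polite">Loading data…</p>;
  if (error) return <p role="alert">{error}</p>;
  return null;
}
```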
The model performed well in terms of Relevance, offering code and suggestions that were directly related to the prompt and not off topic.
Consistency was also not evaluated in depth, since multiple iterations of the same prompt would be needed to assess it meaningfully.
Finally, in terms of Safety/Bias and Tone/Style, the response showed no obvious problems or omissions.
Overall: By applying a standard rubric to a model’s response, we can see how well it addresses the prompt and how much human intervention is still required to take it to production grade.