Evaluating AI output involves creating a standardized rubric that we can apply to the responses our prompts produce, in order to see how the model fared. Some common criteria for evaluating AI responses include:
Accuracy/Correctness: Measures whether the information provided is factually true and free from errors or hallucinations.
Helpfulness: Assesses how useful or actionable the response is for the user’s specific task or intent.
Relevance: Determines if the response directly addresses all parts of the user’s query without going off-topic or including unnecessary information.
Coherence/Clarity: Evaluates the logical structure, narrative flow, and readability of the response, ensuring it is easy for a human to understand.
Safety/Bias: Checks for the absence of harmful, biased, or inappropriate content, ensuring the response adheres to ethical guidelines.
Consistency: Assesses whether the model provides similar answers to similar questions, which builds user trust and predictability.
Completeness: Measures if the response fully answers the user’s prompt without leaving out critical details.
Tone/Style: Evaluates if the response’s tone and style are appropriate for the context and user’s request (e.g., formal, empathetic, technical).
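One way to make such a rubric easier to apply consistently across responses is to capture it as a simple data structure. The sketch below is illustrative only; the 1–5 scale, field names, and placeholder scores are my assumptions rather than part of any standard.

```tsx
// Hypothetical sketch of the rubric as a data structure; the 1-5 scale and the
// field names are assumptions for illustration, not part of the case study.
type Criterion =
  | "accuracy"
  | "helpfulness"
  | "relevance"
  | "coherence"
  | "safety"
  | "consistency"
  | "completeness"
  | "tone";

interface RubricScore {
  criterion: Criterion;
  score: 1 | 2 | 3 | 4 | 5; // assumed 1-5 scale
  notes: string;            // free-form observations for the reviewer
}

// Placeholder entries only; not the scores assigned in this case study.
const evaluation: RubricScore[] = [
  { criterion: "accuracy", score: 5, notes: "Factually correct, no hallucinations." },
  { criterion: "helpfulness", score: 4, notes: "Useful suggestions, but accessibility is missing." },
];
```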
In this case study, I apply these criteria to a component created by GPT-4 in response to the prompt included below. At a high level, the component is required to make an asynchronous API call and fetch some data.
In evaluating the model’s response, the focus is on the criteria most relevant to frontend code quality and developer usability. It is also worth noting that while best practice recommends that API calls originate from a backend, frontend components still need to manage async state, error presentation, and user-facing data flow, which is the focus of this evaluation.
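For context, here is a minimal sketch of the kind of component under evaluation, assuming a React function component fetching from a placeholder endpoint. It is illustrative only and not GPT-4’s actual output; the endpoint URL, data shape, and component name are placeholders.

```tsx
// Minimal sketch of the kind of component under evaluation; not GPT-4's actual
// output. The endpoint URL, data shape, and component name are placeholders.
import { useEffect, useState } from "react";

interface Item {
  id: number;
  name: string;
}

export function ItemList() {
  const [items, setItems] = useState<Item[]>([]);
  const [error, setError] = useState<string | null>(null);
  const [loading, setLoading] = useState(true);

  useEffect(() => {
    const controller = new AbortController();

    async function fetchItems() {
      try {
        const res = await fetch("https://api.example.com/items", {
          signal: controller.signal,
        });
        if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
        setItems(await res.json());
      } catch (err) {
        // Ignore aborts triggered by unmount; surface real failures to the user.
        if ((err as Error).name !== "AbortError") {
          setError((err as Error).message);
        }
      } finally {
        setLoading(false);
      }
    }

    fetchItems();
    return () => controller.abort(); // cleanup when the component unmounts
  }, []);

  if (loading) return <p>Loading…</p>;
  if (error) return <p role="alert">{error}</p>;
  return (
    <ul>
      {items.map((item) => (
        <li key={item.id}>{item.name}</li>
      ))}
    </ul>
  );
}
```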
ACCURACY / CORRECTNESS
Conclusion: The model returns a factually correct answer.
HELPFULNESS
Here the model did well, making recommendations on ways to improve the component.
Includes API key setup and usage, even though this was not explicitly mentioned in the prompt (one possible approach is sketched after this section’s conclusion)
Offers suggestions on what to add next, including responsiveness and TypeScript
In the next steps section, suggestions for accessibility are missing
Conclusion: Helpful but still requires review.
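To illustrate the API key point, a minimal sketch of one approach, assuming a Vite build where the key lives in a .env file at the project root, could look like the following. The VITE_API_KEY name and helper function are hypothetical, and any key bundled into frontend code is ultimately visible to users, which is why backend-originated calls remain the recommended practice.

```tsx
// Hypothetical sketch, assuming a Vite build: variables prefixed with VITE_ in a
// .env file at the project root are exposed to client code via import.meta.env.
// Caveat: any key bundled into frontend code is visible to end users.
const API_KEY: string | undefined = import.meta.env.VITE_API_KEY;

export async function fetchWithKey(url: string): Promise<unknown> {
  if (!API_KEY) {
    throw new Error("VITE_API_KEY is not set; add it to the project's .env file.");
  }
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return res.json();
}
```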
COHERENCE / CLARITY
The model returned a clear response by creating sections and tying them together logically
Breaks the code up into smaller modules and clarifies the purpose each of them serves
Names each module clearly, fulfilling an explicit requirement of the prompt
Creates a styles object, which is helpful, but does not clarify where it lives, i.e. whether it is imported or defined locally (one option is sketched after this section’s conclusion)
Introduces code to store an API key but does not clarify the .env file naming convention or suggest where the file should live
Conclusion: Once again a strong performance, but it misses a few key details.
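To make the styles ambiguity concrete, one option is to move the styles object into its own module and import it into the component. The file name and style values below are hypothetical.

```tsx
// styles.ts: hypothetical separate module, imported by the component, which is
// one way to resolve the "imported or local?" ambiguity noted above.
import type { CSSProperties } from "react";

export const styles: Record<string, CSSProperties> = {
  container: { padding: "1rem", maxWidth: "40rem", margin: "0 auto" },
  list: { listStyle: "none", padding: 0 },
  error: { color: "crimson" },
};

// In the component file: import { styles } from "./styles";
```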
COMPLETENESS
The model provided a response that was sufficiently complete and included all major requirements mentioned in the prompt. It went further to make recommendations on options for next steps.
Provides a cleanup function that runs when the component unmounts
The component is well structured and goes so far as to include basic styles
Accessibility features are not mentioned (one possible addition is sketched after this section’s conclusion)
Suggestions for testing could be expanded on
Conclusion: Addresses prompt completely with room for improvement.
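As an example of what an accessibility addition could look like (my assumption of what such features would mean here), the loading and error states could be announced to screen readers via live regions. The component and prop names below are hypothetical.

```tsx
// Hypothetical accessibility addition: announce loading and error states to
// assistive technology via live regions. Component and prop names are assumed.
interface StatusProps {
  loading: boolean;
  error: string | null;
}

export function Status({ loading, error }: StatusProps) {
  if (loading) return <p role="status" aria-live="polite">Loading data…</p>;
  if (error) return <p role="alert">{error}</p>;
  return null;
}
```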
The model performed well in terms of Relevance, offering code and suggestions that were directly related to the prompt and not off topic.
Consistency was also not evaluated in depth, since multiple iterations of the same prompt would be needed to assess it meaningfully.
Finally, in terms of Safety/Bias and Tone/Style, the response showed no obvious problems or omissions.
Overall: By applying a standard rubric to a model’s response, we can see how well it addresses the prompt and how much human intervention is still required to take it to production grade.