AI Is Smarter Than You Realize
Even if progress stalls (and it won't), we've only started to tap its potential
It’s easy to underestimate current-generation AI, but with a few adjustments in approach, these models are capable of much more than most people realize.
Better Prompting -> Better Responses
The quality of the prompt significantly affects the quality of the response. Providing detailed, step-by-step instructions or richer context can dramatically improve results. For example, instructing the model to "think step-by-step" or to "ask clarifying questions" can guide it to produce more accurate and useful outputs.
For example, here’s a prompt I gave to GPT-4o:
Prompt: Jim is Bob's brother. Jim's wife is Sarah who has no siblings. Bob has no other siblings. Sarah's daughter's aunt is named Susan. What is Susan's relationship to Bob?
I gave Claude 3.5 Sonnet the same prompt and it nailed it out of the gate, but this was GPT-4o’s incorrect response:
gpt-4o:
Given the relationships described:
Jim is Bob's brother.
Jim's wife is Sarah.
Sarah has no siblings.
…
… the most plausible relationship, without contradicting the given information, is that Susan is Bob's sister-in-law, married to either an undisclosed brother of Jim (and therefore Bob) or related to Sarah through another form of marriage connection.
Yikes!
This response makes me feel kind of sympathetic to the "AI is useless" crowd, but we can do a lot better. To show what better prompting can do, I created a new session and appended an additional final sentence to the prompt, shown below, to help GPT-4o get to the right answer on its own:
Prompt: Jim is Bob's brother. Jim's wife is Sarah who has no siblings. Bob has no other siblings. Sarah's daughter's aunt is named Susan. What is Susan's relationship to Bob? Let's think through this logically step-by-step and be careful to make sure you get this right by double checking your answer for inconsistencies with the provided information.
gpt-4o:
To determine Susan's relationship to Bob, let's break down the given relationships step by step:
Jim and Bob are brothers.
Jim's wife is Sarah, who has no siblings.
Bob has no other siblings…
…Given that Susan is identified as an aunt and can only be Bob's wife due to the lack of other siblings and Sarah having no siblings, we conclude:
Susan's relationship to Bob is that Susan is Bob's wife.
This example demonstrates the power of prompt engineering. By providing explicit instructions to think step-by-step and double-check for inconsistencies, we guided the AI to the correct conclusion. This approach not only improves accuracy but also shows how a model can be steered toward more reliable, logical reasoning.
The takeaway is that this is a skill that can be honed. Just as a manager learns to give clear instructions to their team, users can learn to craft prompts that elicit more accurate and useful responses.
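This trick is also easy to bake into code so you don't have to retype it. Here's a minimal sketch using the OpenAI Python SDK; the suffix is simply the instructions from the example above, and the model name is a placeholder for whichever one you use:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The extra instructions appended in the second prompt above.
CAREFUL_SUFFIX = (
    " Let's think through this logically step-by-step and be careful to "
    "make sure you get this right by double checking your answer for "
    "inconsistencies with the provided information."
)

def ask_carefully(question: str, model: str = "gpt-4o") -> str:
    """Send the question with the step-by-step instructions appended."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question + CAREFUL_SUFFIX}],
    )
    return response.choices[0].message.content

print(ask_carefully(
    "Jim is Bob's brother. Jim's wife is Sarah who has no siblings. "
    "Bob has no other siblings. Sarah's daughter's aunt is named Susan. "
    "What is Susan's relationship to Bob?"
))
```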
Unfair Comparison: Context and Tuning Matter
Humans often ask models challenging questions they themselves know little about, then get frustrated when the LLM fails on the first attempt. The comparison is unfair: a human would generally gather additional information or ask clarifying questions before tackling such a complex request. Most consumer-facing LLMs, by contrast, are tuned to provide answers without seeking further context, unlike their human counterparts, who instinctively probe for more information or, even more likely, already have sufficient context from past experience with the requester. While this kind of single-shot response does well in the chatbot ranking arena, it can lead to frustration in the real world.
For example, look at how two different models respond to a line of questioning that lacks sufficient context.
Prompt: Who is the President?
gemini-1.5-pro-api-0514 : Joe Biden is the current president of the United States.
deepseek-v2-api-0628: As of my last update in October 2023, the President of the United States is Joe Biden. He took office on January 20, 2021.
Prompt: I work for McDonalds. Who is the President?
gemini-1.5-pro-api-0514 : Whether you work at McDonald's or anywhere else doesn't change the answer: the President of the United States is still Joe Biden.
deepseek-v2-api-0628: As of my last update in October 2023, the President of McDonald's Corporation is Chris Kempczinski. He was appointed to this position in November 2019, succeeding Steve Easterbrook.
As the example above illustrates, both models respond to an ambiguous situation without seeking clarification from the user. The prompt could be interpreted in several ways: Is the user asking about the president of McDonald's, the president of the United States, the president of another country, or some other president entirely? Instead of asking for more context, each model makes its best guess and runs with it. While it’s technically possible to tune models to ask clarifying questions at each step, that behavior might be less appealing to the average user, who typically expects quick and straightforward answers.
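If you want the clarifying-question behavior anyway, you can often get it back with a system prompt. Here's a minimal sketch, again using the OpenAI Python SDK; the instruction text is my own invention, not a built-in mode of any model:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# My own instruction text, not a built-in mode of any model.
CLARIFY_FIRST = (
    "If the user's request is ambiguous or missing context you need, "
    "do not guess. Ask one short clarifying question instead."
)

def ask_with_clarification(user_message: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CLARIFY_FIRST},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

# "I work for McDonalds. Who is the President?" should now come back as
# something like "Do you mean the president of McDonald's or of the US?"
print(ask_with_clarification("I work for McDonalds. Who is the President?"))
```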
It’s surprising how quickly some users are willing to give up and write these tools off as useless, but I suspect most of them have never experienced the challenges of working with an overseas development team. Although the human-level intelligence of a dev team far surpasses that of any existing LLM, most users would still find it frustrating to accomplish tasks using only asynchronous written communication. For me, the LLM experience, while not the same, is reminiscent of this. When you’re careful to fill in the missing context that most in-person communication takes for granted, you get results.
Overcoming Unreliability
A common objection I hear from the "AI is overhyped" crowd is that AI outputs are unreliable, and therefore useless, since everything they produce has to be checked by a human anyway. Where this objection falls apart is that, in many cases, verification can be automated. In software development, for example, AI can generate code while automated systems verify its correctness. Modern development practices already employ Continuous Integration/Continuous Deployment (CI/CD) pipelines that automatically run unit tests, integration tests, and even performance benchmarks. These same systems can be applied to AI-generated code, ensuring its reliability without human intervention, so developers can focus on higher-level tasks like architecture design and complex problem-solving rather than writing and testing every line of code manually.
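As a sketch of what that loop can look like: have the model produce a candidate implementation, then accept it only if the project's existing test suite passes. Everything here except pytest is a stand-in for your own setup; the module path, the test file, and generate_candidate() are hypothetical:

```python
import subprocess
from pathlib import Path

def generate_candidate(spec: str) -> str:
    """Hypothetical helper: ask an LLM for an implementation of `spec`."""
    raise NotImplementedError  # wire up your model of choice here

def verify(code: str, test_file: str = "tests/test_module.py") -> bool:
    """Write the candidate to disk and let the existing test suite judge it."""
    Path("module.py").write_text(code)
    result = subprocess.run(["pytest", test_file, "-q"], capture_output=True)
    return result.returncode == 0  # pytest exits 0 only when all tests pass

def generate_until_verified(spec: str, max_attempts: int = 5) -> str | None:
    """Regenerate until the tests pass; escalate to a human if they never do."""
    for _ in range(max_attempts):
        candidate = generate_candidate(spec)
        if verify(candidate):
            return candidate  # accepted without a human reading it first
    return None
```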
Even when human oversight is necessary, the efficiency gain can still be substantial. AI rapidly produces initial solutions, letting humans refine and make high-level decisions rather than start from scratch, which dramatically accelerates development. While some may argue that AI-generated code is so buggy it’s more trouble than it’s worth, it’s useful to remember that today's models are the worst they will ever be, that not every company employs a staff of elite programmers like yours does, and that in some (most?) cases the AI will still write higher-quality code than the human who would otherwise have written it.
Additionally, scaffolding techniques, where multiple passes or agents generate a variety of candidate answers, can further enhance output quality. A terminal model (or potentially even you) then selects the best result from the set, leveraging the model’s ability to generate diverse solutions. Incorporating a variety of models into the generation and refinement pipeline improves outputs even more.
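Here's a minimal sketch of that kind of scaffolding, assuming a hypothetical llm() helper that wraps whatever chat API you use: sample several candidate answers, then make one final "judge" call to pick the best.

```python
def llm(prompt: str) -> str:
    """Hypothetical helper wrapping your chat API (sampling temperature > 0,
    so repeated calls actually produce a variety of answers)."""
    raise NotImplementedError

def best_of_n(question: str, n: int = 5) -> str:
    # Pass 1: generate a diverse set of candidate answers.
    candidates = [llm(question) for _ in range(n)]

    # Pass 2: a terminal judge call selects the strongest candidate.
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    verdict = llm(
        f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
        "Reply with only the number of the best answer."
    )

    # Parse the judge's pick, falling back to the first candidate.
    digits = "".join(ch for ch in verdict if ch.isdigit())
    index = int(digits) - 1 if digits else 0
    return candidates[min(max(index, 0), n - 1)]
```

Mixing models into the pipeline, as suggested above, is a one-line change: give llm() a model parameter and vary it between the generation and judging passes.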
As an example, here’s an interaction with GPT-4o mini using nothing more than multiple passes to let this small model self-correct (I had to use mini because GPT-4o and Claude kept getting my questions right on the first try).
Prompt: please give me five sentences with exactly five words each and number them 1 - 5.
gpt-4o-mini:
1. The cat chased the mouse.
2. She enjoys reading mystery novels.
3. He runs every morning before breakfast.
4. They traveled to Paris last summer.
5. Coffee helps me stay awake.
Prompt: which of those sentences actually have 5 words? Please just output the numbers.
gpt-4o-mini: 1, 2, 3, 4, 5
Prompt: which of those sentences actually have exactly 5 words and no more? Please count the words from each one and display it after the sentence.
gpt-4o-mini:
1. The cat chased the mouse. (5)
2. She enjoys reading mystery novels. (5)
3. He runs every morning before breakfast. (6)
4. They traveled to Paris last summer. (6)
5. Coffee helps me stay awake. (5)
Prompt: now tell me which actually have 5 words please.
gpt-4o-mini: 1, 2, 5
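That whole exchange can be automated rather than typed by hand. A sketch, again assuming a hypothetical llm() helper: generate an answer, ask the model to audit it against the task's constraints, and loop until the audit comes back clean.

```python
def llm(prompt: str) -> str:
    """Hypothetical helper wrapping your chat API."""
    raise NotImplementedError

def self_correct(task: str, max_passes: int = 3) -> str:
    answer = llm(task)
    for _ in range(max_passes):
        # Ask the model to check its own work, mirroring the transcript
        # above (e.g. count the words in each sentence and flag failures).
        audit = llm(
            f"Task: {task}\n\nAnswer:\n{answer}\n\n"
            "Check the answer against every constraint in the task. "
            "If it satisfies all of them, reply with exactly OK. "
            "Otherwise, reply with a corrected answer and nothing else."
        )
        if audit.strip() == "OK":
            break  # the model found nothing left to fix
        answer = audit  # adopt the correction and re-check it
    return answer
```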
In essence, the "unreliability" argument underestimates both the power of automated verification and the value of AI as a force multiplier for human productivity. As AI systems continue to improve, the balance of effort will increasingly shift from verification to strategic application of AI-generated outputs, leading to significant efficiency gains.
Conclusion
By enhancing our interactions with large language models, through better prompting, scaffolding, and feedback, we unlock their true potential. Feedback is critical for improving model performance. Unlike humans, who naturally validate their answers against real-world feedback, deployed models have to rely on explicit feedback mechanisms. For instance, when writing a technical report, a human might cross-check data against reliable sources to ensure accuracy. Implementing similar feedback loops for models can help refine their outputs. This approach lets us appreciate the real capabilities of AI and the efficiency gains achievable when humans and AI work together effectively.