A realistic look at AutoGPT (agent AIs)
No, they are not 'changing everything overnight', but they are interesting and potentially powerful.
Hello from rainy Switzerland! I love traveling by train. Today let’s talk about AutoGPTs aka agent AIs.
“Civilization advances by extending the number of operations we can perform without thinking about them” — Alfred North Whitehead.
If you follow any AI people or newsletters you might have seen ‘AutoGPT’ and ‘BabyAGI’ demos this past month.
Agent AIs exploded onto the scene with the usual predictions of ‘changing everything’ and ‘mindblowing’. Soooo….
Why are people so excited about AutoGPTs or agent AIs?
The one currency we never have enough of is time. Therefore, machines and processes that free up our time and/or increase our ability to get work done are, by nature, valuable.
Washing machines are valuable because you don’t have to sit there scrubbing clothes.
Tractors are valuable because one person can now plow an entire field in hours instead of days.
Computers are valuable because they can do tasks that would take humans a long time. Computers can do math really well, really fast. Millions of calculations a second have been harnessed to create programs that save us time.
But the one problem with computers is they are dumb and you have to tell them exactly what to do.
Agent AIs are a new experimental method of getting computers to figure out how to achieve a set objective without a human having to tell the computer exactly what to do.
If you could tell a computer what to do, and it could accomplish that task without you having to outline each and every step, that would save a ton of time and therefore be extremely valuable.
That is the promise and allure of agent AIs. So what are they exactly?
What are agent AIs?
Agent AIs are computer programs that take the language and ‘reasoning’ capabilities of a large language model (like GPT4) and pair them with an AI database like Pinecone and an internet connection.
A user sets an agent up, and gives it an objective. The agent goes into a simple loop trying to complete that objective. The agent does a few things like:
Create tasks for themselves
Reprioritize their task list
Complete* the top priority task
Store the results of tasks and human feedback in a database
Loop until their objective is reached*
Agent AIs do not involve training new large AI models; they just use existing LLMs (GPT4 or an alternative) and prompt patterns in a potentially infinite loop.
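To make the loop concrete, here is a minimal sketch in Python. It is not how Auto-GPT or BabyAGI are actually written. The stub_llm function is a made-up stand-in for a real LLM API call, the PLAN / PRIORITIZE / DO prompt prefixes are invented for this example, and a plain list plays the role of the database.

```python
from collections import deque
from typing import Callable, List


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; returns canned text."""
    kind, _, body = prompt.partition(": ")
    if kind == "PLAN":
        return "find heat pump data\nsummarize findings"
    if kind == "PRIORITIZE":
        return body  # a real model might reorder; the stub keeps the order
    return f"result of '{body}'"


def run_agent(objective: str, llm: Callable[[str], str], max_steps: int = 5) -> List[str]:
    """Create tasks, reprioritize, do the top task, store the result, repeat."""
    tasks = deque(llm(f"PLAN: {objective}").splitlines())  # create tasks
    memory: List[str] = []  # stand-in for the database of results
    for _ in range(max_steps):  # cap the potentially infinite loop
        if not tasks:
            break  # nothing left to do: treat the objective as reached
        # Ask the model to reorder the remaining tasks.
        tasks = deque(llm("PRIORITIZE: " + "\n".join(tasks)).splitlines())
        task = tasks.popleft()  # take the top priority task
        memory.append(llm(f"DO: {task}"))  # "complete" it and store the result
    return memory
```

Calling run_agent("increase heat pump adoption", stub_llm) walks through both canned tasks and returns their results. The max_steps cap matters in practice, since every pass through the loop is a paid LLM call.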
This is extremely cutting-edge, because agent AIs are very new.
Agent AIs are extremely new (went viral in the last month)
Agent AIs went viral at the beginning of April with AutoGPT and BabyAGI.
Auto-GPT and BabyAGI are independently developed Python projects that do basically the same thing.
They shot to the top of GitHub’s popular projects, and in the couple of weeks since release, community members have built extensions and clones and agent managers and frameworks and ChatGPT plugins and visual toolkits.
For the non-technical person, GitHub is a widely-used web app where people store their code. GitHub provides a collaborative environment where multiple people can work on the same project without conflicting with each other's changes.
A project reaching the top charts on GitHub means tons of people are using it, looking at it, and downloading it.
But why all the attention? What can agent AIs actually do?
What can AI agents do?
Hypothetically, agent AIs could do any digital task a human could.
Currently, agent AIs are much more limited. The running joke is agent AIs are really only good at creating new lists of tasks, and skyrocketing your OpenAI monthly bill.
The biggest roadblock to AI agents actually completing tasks is that the digital tasks a human can do span a huge range of abilities, and agent AIs currently have only some of them. For example, to get to the level of a capable virtual assistant or intern, agent AIs would need to be able to:
Access data on the internet ✅
Browse the internet (simulating clicks, and form fills) 🤷♂️
Remember things long-term and short-term ✅
Control a computer operating system ❌
Use apps on your computer ❌
Access a credit card or other form of payment ✅
Access large language models (LLMs) like GPT ✅
Access APIs ✅
Here is where we are with the various capabilities agent AIs would need.
Access the Internet ✅
Computers can access the internet by scraping, crawling, etc. OpenAI has demonstrated WebGPT, and other providers like Dustt allow it as well.
A few of the agent AI programs can access information from the internet as well.
The new plugin system for ChatGPT has many plugins that access information from the internet. Bing Chat can only read metadata but should be able to access the internet soon.
While not every system has it, most AI systems can or will be able to pull data from the internet soon.
Browse the internet (clicking, filling forms) 🤷♂️
Getting info from the internet is different from using the internet like a human: clicking, scrolling, and selecting.
Nat Friedman showed a demo of GPT3 browsing the internet back in September 2022, but it is not a solved problem.
Getting computers to browse the internet has been an issue for a long time. Quality assurance teams use software to simulate humans clicking and interacting with websites.
The problem is, it is usually brittle. If a company changes their blue Buy button to a red Purchase now, the software sometimes breaks and can’t find the button.
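Here is a small sketch of why that breaks. The page is modeled as a list of plain Python dicts, which is an assumption made for illustration and not a real browser automation API. The keyword matching in find_by_intent is only a crude stand-in for the kind of looser matching an LLM might provide.

```python
from typing import Dict, Iterable, List, Optional

Element = Dict[str, str]  # simplified stand-in for a page element


def find_exact(page: List[Element], text: str, color: str) -> Optional[Element]:
    """Brittle lookup: needs the exact label and the exact color."""
    for el in page:
        if el["text"] == text and el["color"] == color:
            return el
    return None


def find_by_intent(page: List[Element], keywords: Iterable[str]) -> Optional[Element]:
    """More tolerant lookup: any button whose label contains a purchase-like word."""
    wanted = set(keywords)
    for el in page:
        if el["tag"] == "button" and wanted & set(el["text"].lower().split()):
            return el
    return None


old_page = [{"tag": "button", "text": "Buy", "color": "blue"}]
new_page = [{"tag": "button", "text": "Purchase now", "color": "red"}]
```

find_exact(old_page, "Buy", "blue") finds the button, but the same call on new_page returns None. find_by_intent(new_page, {"buy", "purchase"}) still finds it, though a keyword list like this is fragile in its own way.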
GPT could make this easier with more advanced reasoning, but there is still no easy way to connect an agent to robust browsing capabilities.
Some demos show it can be done, but it is brittle and difficult still.
Remember things long-term and short-term 🤷♂️
Databases enable this, and embedding or vector databases enable this for AI specifically. If you want to know what embeddings are, read this section of this post.
Pinecone, Vespa and other databases are built to store similar items close to each other so you can take the output of an agent, store it in the db, then query for similar items and pass them into the prompt of an LLM. If none of those words made sense, read this article.
However, since the AI has no real short-term or long-term memory, it has to remember to query the database for what it was doing, what it should be doing, and what it did. Like a forgetful person leaving themselves notes, this can be problematic.
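The store-then-query pattern can be sketched in a few lines of Python. This is a toy, not how Pinecone or Vespa work internally. The embed function here just counts words, whereas real systems use a learned embedding model, and ToyMemory is a made-up class that keeps everything in a list.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemory:
    """In-memory stand-in for a vector database (hypothetical)."""

    def __init__(self) -> None:
        self.items: List[Tuple[Counter, str]] = []

    def store(self, text: str) -> None:
        self.items.append((embed(text), text))

    def query(self, text: str, k: int = 1) -> List[str]:
        q = embed(text)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [stored for _, stored in ranked[:k]]


memory = ToyMemory()
memory.store("task done: downloaded heat pump sales data")
memory.store("task done: drafted pizza order")

# The agent must remember to ask; nothing is recalled automatically.
recalled = memory.query("what heat pump data do we have", k=1)
prompt = "Relevant notes:\n" + "\n".join(recalled) + "\nWhat should I do next?"
```

The query pulls back the heat pump note because it shares the most words with the question, and that note gets pasted into the next prompt. If the agent never runs the query, the note might as well not exist.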
Control a computer ❌
While there are ways to automate control of your computer, like Windows Power Automate and AppleScript, they suffer from the same issues as browsing. The solution can be brittle, and if the app changes at all or something goes wrong, the automation fails.
There is currently no easy way for a software agent to automate using a computer.
Demos exist but nothing off the shelf makes this easy or robust today.
Use apps ❌
Same issue as controlling a computer and browsing. If something goes wrong the agent can’t recover and if apps update or change, the automation breaks.
There is currently no easy way to automate using apps on your computer.
Demos exist but nothing off the shelf makes this easy or robust today.
Access a credit card or other form of payment ✅
A credit card is a number. Computers can enter numbers in forms, so yes they could use a credit card or payment method.
Legally, it will be interesting to see how this shakes out. If your agent goes rogue and orders $10,000 worth of fish, can you blame anyone but yourself?
Giving a computer full control of your money is scary, so few have done this so far, but it should be technically feasible once the browsing capabilities are taken care of.
Access large language models (LLMs) like GPT ✅
Considering agent AIs are built on LLMs, yes, they can access LLMs. LLMs have their own issues (which we discuss below), but agent AIs can use them.
Access APIs ✅
API stands for application programming interface. APIs let you get restaurant locations, send real letters, and control your smart thermostat.
AI Agents are ideally placed to use APIs because an AI is a repeatable, computer-y way of getting things done. Sending emails, sending texts, generating images etc, anything that is available via an API, an agent could potentially do.
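One common way to wire this up is a small registry of tools that the agent can call by name. The sketch below assumes the LLM has been prompted to answer with JSON naming a tool and its arguments. The send_email and set_thermostat functions are hypothetical stand-ins that only return a string, where real versions would call an actual provider's API.

```python
import json
from typing import Callable, Dict


def send_email(to: str, body: str) -> str:
    """Hypothetical stand-in; a real tool would call an email provider's API."""
    return f"email queued for {to} ({len(body)} chars)"


def set_thermostat(celsius: float) -> str:
    """Hypothetical stand-in for a smart thermostat API call."""
    return f"thermostat set to {celsius}C"


# The only actions the agent is able to take.
TOOLS: Dict[str, Callable[..., str]] = {
    "send_email": send_email,
    "set_thermostat": set_thermostat,
}


def dispatch(action_json: str) -> str:
    """Run the tool named in the model's JSON output, if it is registered."""
    action = json.loads(action_json)
    tool = TOOLS.get(action["tool"])
    if tool is None:
        return f"error: unknown tool {action['tool']!r}"
    return tool(**action["args"])
```

If the model outputs {"tool": "set_thermostat", "args": {"celsius": 20}}, dispatch runs the matching function. If it asks for a tool that is not in the registry, it gets an error string back instead of an action, which the agent can then feed into its next prompt.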
So while agent AIs show great promise of being able to complete any digital task a human can, there are many important abilities they struggle with or for which they lack a robust solution.
This lack of abilities leads to a few serious issues with agent AIs.
Problems with agent AIs
Hallucinations
Because the large language models like ChatGPT have issues with hallucinations, agent AIs also struggle with hallucinations.
I assigned my agent AI the task of increasing heat pump adoption in the United States. It happily planned a list of tasks, one of which was to contact trade groups to promote heat pumps.
The agent AI then happily reported that it had done so and marked that item as complete.
This is impossible because it did not have access to an email client, a texting API, or any form of communication.
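One possible guard against this, sketched below, is to stop trusting the model's own "done" claim and only accept it when the program running the agent has logged a real tool call for that task. This is an idea for illustration, not a feature of Auto-GPT or BabyAGI, and the TaskReport class and tool names are made up.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TaskReport:
    task: str
    claimed_done: bool  # what the model says
    # Tool names recorded by the runtime itself, not reported by the model.
    logged_calls: List[str] = field(default_factory=list)


def is_verified(report: TaskReport, available_tools: Set[str]) -> bool:
    """Accept a 'done' claim only if real, available tools were actually called."""
    if not report.claimed_done or not report.logged_calls:
        return False  # nothing was executed: likely narrated, not performed
    return set(report.logged_calls) <= available_tools
```

With this check, a report like TaskReport("contact trade groups", claimed_done=True) fails verification because no tool call was logged, while a research task that logged a "web_search" call passes when "web_search" is among the available tools. It does not prove the task was done well, only that something actually ran.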
Trouble staying on task
Agent AIs have trouble staying on task. They will often create a reasonable task, then prioritize a new task above it.
There is currently no way to set a priority framework. You can give it various prompts to try to address this, and developers are working on solutions, but agent AIs are like a hyperactive intern who likes to create new tasks rather than complete them.
Lack of recall
Even after successfully completing a task, agent AIs often do not remember how to perform it again, or know how to store it for later use.
And when an agent AI does take some action to enable task reuse, like creating a Python program, it will often forget to use that program later.
Embedding databases are a way to get around this, but at its core the agent AI has no working memory. Large language models are trained on an existing dataset, and to integrate new data they need to be retrained. You can inject stored information from a database, but that is like the main character in Memento, who doesn’t remember anything, so he writes himself a note while he still remembers, but then can’t figure out his own note.
Struggles to decompose tasks
Agent AIs struggle to effectively decompose tasks.
It can generate task lists, but each task will often contain a multitude of tasks. Again, developers are working on this, and you can try to give it prompts to ‘break things down’ or ‘think through things step by step’, but it can be a struggle.
I believe we can and will solve most of these issues with agent AIs. But the promise of everyone having their own personal Jarvis is still a ways into the future. That being said there are some exciting use cases for agent AIs that could happen sooner.
Real life use cases for agent AIs
Right now there are no killer apps built on agent AIs. They are a bit flaky and require some programming and Python knowledge to get them to do anything. But there is a lot of potential, and there are fun demos out there.
Of course, if agent AIs actually could complete any digital task a human could do, the potential use cases could be infinite.
But I am going to only talk about use cases that are in the proximate or near future.
Research assistant
Paperwork
Financial assistant
NPCs in games
Shopping assistant
Chat moderator
Research assistant
This use case seems like the most promising since it only involves accessing data, and then summarizing, re-writing etc.
For super simple research that doesn’t need a lot of synthesis or analysis, this could work well.
Others have recognized this and have already started to productize the research abilities of agent AIs.
Paperwork
Filling out paperwork seems like a clear use case. The agent AI would need access to your personal information (birth date, address, order number, etc) but that seems easy enough to put into a vector DB.
It would also need to be able to find and download and understand the forms. None of these seem impossible and I would love to fill out fewer forms at the doctor’s.
Connecting an agent AI to a pen robot would enable filling out paper forms as well.
Financial Assistant
Having an assistant analyze every transaction and ask for corrections, refunds, and solutions would be nice.
NPCs in games
Nonplayer characters in games are famous for doing, well, nothing until the player interacts with them.
Hooking them up to an agent AI could enable the NPCs to become real characters with stories and goals that happen even when the player is not there.
A Stanford research project connected a bunch of agent AIs and let them loose in a town. Nothing groundbreaking but it is interesting and could make game environments much richer and more interesting.
Shopping assistant
If you know exactly what you want it to buy, you could have an agent AI buy it for you.
HyperWrite, a tool for connecting an agent AI to the browser, has a demo where it appears to be able to order a pizza.
Chat / community moderator
As communities grow, moderators have more and more work. Just ask the overworked and mostly cranky Reddit moderators.
Having an agent AI do the easy stuff (blocking and banning spam accounts etc) would be great for communities.
If you are excited by the possibilities and want to dive in, here is how to get started using agent AIs.
How to use agent AIs
Build It Yourself: Look at the framework and build it from scratch. You can even ask ChatGPT for help. Use OpenAI’s GPT-4 for the LLM, the Pinecone vector database for long-term memory, and LangChain’s framework for linking up APIs.
Download and run Auto-GPT: This is a popular open source option created by Toran Richards. It includes options to connect to the internet, use apps, and keep long-term and short-term memory, among other functions.
Download and run BabyAGI: This is another popular open source option, created by Yohei Nakajima. While BabyAGI doesn’t connect to the internet yet, people are working on extending it.
Download and run Microsoft’s Jarvis: Very similar to Auto-GPT and BabyAGI, but much more robust and brought to you by Microsoft and HuggingFace.
Use a service: Matt has a list in his autonomous agents directory: https://www.mattprd.com/p/autonomous-agents-directory
Here are a few agent services that stuck out to me.
SuperAgent: Tool that tries to make it easier to build your own autonomous agent.
Spellpage: A to-do list that tries to do the tasks for you.
iBabyAGI: AutoGPT on your iPhone. More limited.
SalesGPT: Autonomous agent for reaching out to new clients.
Toliman AI: Autonomous agent focused on research.
AgentGPT website: Agent AI in the browser. More limited than other options.
Godmode website: Agent AI in the browser. More limited than other options.
There is a lot of hype around agent AIs. I don’t think they are going to change everything right away. But I do have a few predictions.
Agent AI Prediction
Agent AIs will find limited narrow use cases.
You can’t trust a self-driving car if it randomly crashes.
You can’t trust a self-driving AI if it randomly forgets to complete a task, hallucinates that it did complete it, or fails to complete a task because a button text and color changed.
Just like self-driving cars seemed to be solved 80% of the way, the last 20% will take 80% of the time.
Apps and tools will take the idea of agent AIs and make them accessible to people who don’t want to run Python code.
A loop to complete a task will become a common technique for AI tools.
Thanks for reading. See you next week!
-Josh