On Tuesday, OpenAI announced GPT-4, its next-generation AI language model. While the company cautioned that the differences between GPT-4 and its predecessors are subtle in casual conversation, the system offers a host of new capabilities: it can process images, and OpenAI says it is generally better at creative tasks and problem solving.
Evaluating these claims is not easy. AI models in general are extremely complex, and systems like GPT-4 are sprawling and multifunctional, with hidden and as-yet-unknown capabilities. Fact-checking is also challenging: when GPT-4 confidently tells you it has created a new chemical compound, you won't know whether it's true until you ask a few real chemists. (Though that will never stop some high-profile claims from going viral on Twitter.) As OpenAI's technical report makes clear, GPT-4's biggest limitation is that it "hallucinates" information (makes it up) and is often "confidently wrong in its predictions."
These caveats aside, GPT-4 is definitely interesting from a technical standpoint and is already being integrated into large, mainstream products. So, to get a sense of what's new, we've gathered examples of its feats and abilities from news outlets, Twitter, and OpenAI itself, and we've run some tests of our own. Here's what we know:
It can handle images along with text
As mentioned above, this is the biggest practical difference between GPT-4 and its predecessors. The system is multimodal, meaning it can analyze both images and text, whereas GPT-3.5 could only process text. That means GPT-4 can analyze the contents of an image and connect that information with a written question. (It cannot generate images, though, the way DALL-E, Midjourney, or Stable Diffusion do.)
What does this mean in practice? The New York Times recounts one demonstration in which GPT-4 is shown the inside of a refrigerator and asked what dishes can be made with those ingredients. Based on the image, GPT-4 comes up with several examples of both savory and sweet dishes. It's worth noting, though, that one of its suggestions, wraps, calls for an ingredient that doesn't seem to be there: tortillas.
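For the curious, here's a rough sketch of what a request like that fridge demo could look like through OpenAI's Python library. To be clear, image input hadn't launched publicly at the time of writing, so the model name, the image URL, and availability here are all our assumptions, not working instructions:

```python
# Minimal sketch only: image input wasn't publicly available at the time of
# writing; the model name and URL below are placeholders, not real endpoints.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # hypothetical model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What dishes could I make with these ingredients?"},
            # Placeholder URL standing in for the fridge photo from the demo
            {"type": "image_url",
             "image_url": {"url": "https://example.com/fridge.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```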
This functionality has many other uses. In a demo streamed after the announcement, OpenAI showed how GPT-4 can generate the code for a website from a hand-drawn sketch, for example (video embedded below). OpenAI is also working with the startup Be My Eyes, which uses object recognition and human volunteers to help people with vision problems, to improve that company's app with GPT-4.
Such functionality isn't unique (plenty of apps offer basic object recognition, such as Apple's Magnifier app), but OpenAI claims GPT-4 can "generate the same level of context and understanding as a human volunteer," explaining the world around the user, summarizing cluttered web pages, or answering questions about what it "sees." The functionality hasn't launched yet but "will be in users' hands in a few weeks," the company says.
Other companies are apparently experimenting with GPT-4's image recognition capabilities, too. Jordan Singer, a founder at Diagram, tweeted that the company is working on bringing the technology to its AI design-assistant tools, including a chatbot that can comment on designs and a tool that can help generate them.
And, as shown in the images below, GPT-4 can also explain funny pictures.
It does better on language tasks
OpenAI claims that GPT-4 performs better on tasks that require creativity or advanced reasoning. This claim is difficult to evaluate, but it seems true based on some of the tests we've seen and run ourselves (though the differences from its predecessors are not yet striking).
During the company's GPT-4 demo, OpenAI co-founder Greg Brockman asked it to summarize a section of a blog post using only words beginning with "g" (he later asked it to do the same with "a" and "q"). "We had success with 4, but never managed it with 3.5," Brockman said before the demonstration. In OpenAI's video, GPT-4 responds with a perfectly understandable sentence containing only one word that doesn't begin with "g," and gets it fully right after Brockman asks it to correct itself. GPT-3, meanwhile, doesn't even seem to try to follow the prompt.
We played around with this ourselves, giving ChatGPT text to summarize using only words beginning with "n" and comparing the GPT-3.5 and GPT-4 models. (In this case, we fed it excerpts from The Verge's NFT explainer.) On the first try, GPT-4 did a better job summarizing the text but a worse job sticking to the prompt.
However, when we asked both models to correct their mistakes, GPT-3.5 essentially gave up, while GPT-4 produced a near-perfect result. It still included the word "on," but in fairness, we missed that one ourselves when asking for the fix.
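If you want to replicate this experiment, scoring outputs by hand gets tedious. A few lines of Python, sketched below, can count how well a response sticks to the one-letter constraint; this is just a helper we might use for checking, not anything from OpenAI:

```python
import re

def letter_compliance(text: str, letter: str) -> tuple[float, list[str]]:
    """Return the fraction of words starting with `letter` and the offenders."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, []
    offenders = [w for w in words if not w.lower().startswith(letter.lower())]
    return 1 - len(offenders) / len(words), offenders

score, offenders = letter_compliance(
    "Nifty NFTs net notable numbers, nonetheless.", "n"
)
print(f"compliance: {score:.0%}, offenders: {offenders}")  # compliance: 100%
```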
We also asked both models to turn our article into a rhyming poem. As painful as it is to read poetry about NFTs, GPT-4 clearly did the better job; its poem felt considerably more sophisticated, while GPT-3.5's read like a bad freestyle.
It can handle more text
AI language models have always been limited by the amount of text they can hold in their short-term memory (i.e., the text included in both the user's question and the system's response). But OpenAI has radically extended these capabilities for GPT-4. The system can now process entire scientific papers and novellas in one go, allowing it to answer more complex questions and connect more details in any given query.
It's worth noting that GPT-4 doesn't measure character or word counts per se; it measures input and output in units known as "tokens." The tokenization process is fairly intricate, but the rule of thumb is that one token equals roughly four characters, and 75 words usually take about 100 tokens.
The maximum number of tokens GPT-3.5-turbo can use in any given query is about 4,000, or just over 3,000 words. GPT-4, by comparison, can handle about 32,000 tokens, roughly 25,000 words, according to OpenAI. The company says it's "still optimizing" for longer contexts, but the higher limit means the model should unlock uses that weren't practical before.
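If you're curious how your own text maps onto tokens, OpenAI's open-source tiktoken library does the counting. Here's a minimal sketch; the 32,000 figure is the limit OpenAI quotes for the larger GPT-4 variant, so treat it as approximate:

```python
# pip install tiktoken
import tiktoken

# encoding_for_model resolves the tokenizer GPT-4 uses (cl100k_base)
enc = tiktoken.encoding_for_model("gpt-4")

text = "AI language models measure their input and output in tokens, not words."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
print(tokens[:8])  # the integer IDs the model actually sees

GPT4_32K_LIMIT = 32_000  # approximate limit per OpenAI's announcement
print("fits in the larger context window:", len(tokens) <= GPT4_32K_LIMIT)
```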
It can ace tests
One of the most striking parts of OpenAI's GPT-4 technical report was its performance on a battery of standardized tests, including the bar exam, the LSAT, the GRE, a number of AP exams, and, for some unknown but very amusing reason, the introductory, certified, and advanced sommelier courses offered by the Court of Master Sommeliers (theory only).
Below you can see a comparison of GPT-4 and GPT-3 results on some of these tests. Note that GPT-4 now handles the various AP exams fairly consistently but still struggles with those that require more creativity (namely, the English Language and English Literature exams).
This is an impressive result, especially compared with what AI systems of yesteryear could achieve, but understanding the achievement requires some context. Engineer and writer Joshua Levy put it best on Twitter, describing the logical fallacy many fall for when looking at these results: "Just because software can pass a test designed for humans doesn't mean it has the same abilities as humans who take the same test."
Computer scientist Melanie Mitchell has addressed this issue in detail in a blog post discussing ChatGPT's results on various exams. As Mitchell notes, the ability of AI systems to pass these tests depends on their ability to store and reproduce certain types of structured knowledge. That does not necessarily mean these systems can generalize from the underlying data. In other words: AI may be the ultimate example of teaching to the test.
It's already being used in mainstream products
As part of the GPT-4 announcement, OpenAI shared several stories about how organizations are using the model. These include an AI tutoring feature being developed by Khan Academy, designed to help students with coursework and give teachers lesson ideas, and an integration with Duolingo that promises a similar interactive learning experience.
Duolingo's offering is called Duolingo Max and adds two new features. One gives a "simple explanation" of why your answer to an exercise was right or wrong and lets you ask for more examples or explanations. The other is a "role-play" mode that lets you practice using the language in different scenarios, such as ordering coffee in French or making hiking plans in Spanish. (The company claims that GPT-4 ensures "no two conversations will be exactly alike.")
Other companies are using GPT-4 in related areas. Intercom announced today that it is upgrading its customer-support bot with the model, promising that the system will connect to a business's support documents to answer questions, while payment processor Stripe is using the system internally to answer employee questions from its technical documentation.
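Neither company has detailed its architecture, but a common pattern for answering questions from a pile of documents is to embed them, retrieve the passages closest to the question, and hand those to GPT-4 as context. The sketch below shows that pattern in miniature; the embedding model, document snippets, and helper names are our assumptions, not a description of Intercom's or Stripe's systems:

```python
# A minimal retrieval-augmented QA sketch, not how Intercom or Stripe
# necessarily built theirs. Assumes the OpenAI Python library and an API key.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [  # stand-ins for real support documentation
    "Refunds are processed within 5 business days.",
    "API keys can be rotated from the dashboard settings page.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(docs)

def answer(question: str) -> str:
    q_vec = embed([question])[0]
    # Cosine similarity against every doc; use the best match as context
    sims = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = docs[int(np.argmax(sims))]
    chat = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer using this document: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return chat.choices[0].message.content

print(answer("How long do refunds take?"))
```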
It was powering the new Bing all along
After the OpenAI announcement, Microsoft confirmed that the model used in Bing chat is actually GPT-4.
This isn't exactly an earth-shattering revelation. Microsoft had already said it was using a "next-generation OpenAI large language model" while stopping short of naming it GPT-4. Still, it's good to know, and it means we can apply some of what we've learned from interacting with Bing to GPT-4 as well.
And on that note...
It still makes mistakes
Obviously, the Bing chatbot isn't perfect. The bot has tried to gaslight people, made silly mistakes, and asked our colleague Sean Hollister if he wanted to see furry porn. Some of this comes down to how Microsoft implemented GPT-4, but the experience offers some insight into how chatbots built on these language models can go wrong.
In fact, we've already seen GPT-4 make a few mistakes in early tests. For example, in The New York Times article mentioned above, the system is asked to explain how common Spanish words are pronounced... and it gets almost all of them wrong. (When we asked it how to pronounce "gringo," though, its explanation seemed to pass muster.)
"NYTimes publishes this Spanish pronunciation guide as proof of GPT-4 improvements... but almost none of it is correct! pic.twitter.com/lpGgTSv1E8"
— Christopher Grob (@Confessant), March 14, 2023
This isn't some huge failure, but it is a reminder of what everyone involved in creating and deploying GPT-4 and other language models already knows: they get things wrong. A lot. And any deployment, whether as a tutor, a salesperson, or a coder, should come with a prominent warning to that effect.
OpenAI CEO Sam Altman said as much in January when asked about the capabilities of the then-unannounced GPT-4: "People are begging to be disappointed, and they will be. The hype is just like... We don't have an actual AGI, and that's sort of what's expected of us."
Well, AGI isn't here yet, but we do have a system with more capabilities than anything that came before. Now for the most important part: seeing exactly how and where it gets used.