GPT-4, OpenAI's latest AI model, now approaches the speed of its predecessor, GPT-3.5. A recent study found that GPT-4's median latency has held steady at under 1ms per token over the past three months, while latency at the 99th percentile more than halved over the same period. As a result, the majority of requests are now processed faster by GPT-4 than by GPT-3.5.
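As an illustration, the median and 99th-percentile figures above can be reproduced from raw per-request timings. The sketch below uses a hypothetical list of per-token latencies (the numbers are invented for the example, not taken from the study):

```python
import statistics

# Hypothetical per-token latencies (in milliseconds) collected from GPT-4 requests.
latencies_ms = [0.62, 0.71, 0.58, 0.93, 0.66, 1.40, 0.70, 0.81, 0.64, 2.10]

# Median: half of all requests are faster than this value.
median = statistics.median(latencies_ms)

# 99th percentile: only 1% of requests are slower than this value.
p99 = statistics.quantiles(latencies_ms, n=100)[98]

print(f"median: {median:.2f} ms/token, p99: {p99:.2f} ms/token")
```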
Latency has three main components: network round-trip time, queuing time, and processing time. Processing time can vary significantly with the complexity and length of the prompt, and a high token count does not always mean a slower response. For example, a simple 204-token prompt can be answered in just 4.5 seconds, whereas a complex 33-token prompt can take up to 32 seconds to process.
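To see how prompt complexity, rather than length alone, drives processing time, one can time a request end to end and normalize by the number of generated tokens. A minimal sketch follows, assuming the official openai Python client (v1 interface), an API key in the environment, and placeholder prompts chosen only for illustration:

```python
import time
from openai import OpenAI  # assumes the official openai package, v1 interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def time_completion(prompt: str, model: str = "gpt-4") -> None:
    """Measure end-to-end latency and per-token latency for one request."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start

    # Generation dominates latency, so normalize by the number of completion tokens.
    completion_tokens = max(response.usage.completion_tokens, 1)
    print(f"{elapsed:.1f}s total, {elapsed / completion_tokens:.3f}s per generated token")


# Placeholder prompts: one long but simple, one short but complex.
time_completion("List the days of the week. " * 20)
time_completion("Prove that there are infinitely many prime numbers.")
```

The same loop, run repeatedly against both models, yields the distributions from which the median and 99th-percentile figures cited above are computed.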
GPT-4 remains more expensive than GPT-3.5, but it is no longer slower for the majority of queries.
The study also explores another intriguing question: does latency increase as a user approaches their throughput limits? In other words, does OpenAI deliberately slow down heavy users? The results will be published in a future article.