Apple has announced a new collaboration with Nvidia aimed at speeding up inference for artificial intelligence (AI) models. The Cupertino-based technology firm revealed on Wednesday that it has been investigating inference acceleration on Nvidia’s platform to see whether large language models (LLMs) can achieve both efficiency and latency improvements at once. The company combined a technique called Recurrent Drafter (ReDrafter), detailed in a research paper published earlier this year, with Nvidia’s TensorRT-LLM inference acceleration framework.
Apple Leverages Nvidia Platform for AI Enhancements
In a blog post, Apple’s researchers described the work with Nvidia on LLM performance and its outcomes, emphasizing the company’s goal of improving inference efficiency while maintaining latency in AI models.
Inference, in the context of machine learning, is the process of generating predictions, decisions, or conclusions from input data using a trained model. Essentially, it is the phase in which an AI model processes prompts and turns raw data into comprehensible output.
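To make the distinction concrete, here is a minimal sketch of an inference step: a frozen, already-trained model maps an input to a readable prediction, with no weights being updated. All names and values below are illustrative, not taken from Apple's or Nvidia's code.

```python
import math

# Hypothetical 4-token vocabulary for the toy model.
VOCAB = ["the", "cat", "sat", "mat"]

def softmax(logits):
    """Convert raw model scores (logits) into probabilities."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def infer(logits):
    """Inference: turn the model's raw output into a human-readable token
    by picking the most probable entry. No training happens here."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return VOCAB[best]

# The highest logit (2.0, index 1) wins, so this prints "cat".
print(infer([0.1, 2.0, 0.3, 0.5]))
```

In a real LLM the logits come from a deep network rather than being handed in directly, but the final step, turning scores into tokens, is the same idea.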
Earlier this year, Apple published and open-sourced the ReDrafter technique, introducing a novel method for speculative decoding. The approach uses a recurrent neural network (RNN) draft model, combining beam search—which explores multiple candidate continuations—with dynamic tree attention, a mechanism for processing tree-structured data. The research indicated that the technique could generate as many as 3.5 tokens per generation step when decoding with an LLM.
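The core idea of speculative decoding can be illustrated with a toy sketch: a cheap draft model proposes several tokens ahead, and the large model then verifies them in one pass, accepting the longest agreed prefix. The names below (`draft_k`, `target_next`, `speculative_step`) are stand-ins invented for this example; ReDrafter itself additionally uses an RNN draft model, beam search, and dynamic tree attention, all of which are omitted here.

```python
import random

def target_next(context):
    # Hypothetical "large model": the next token is a deterministic
    # function of the last one, so verification is easy to follow.
    return (context[-1] + 1) % 10

def draft_k(context, k):
    # Hypothetical cheap draft model: usually agrees with the target,
    # but occasionally guesses wrong (here, emits 0).
    out, ctx = [], list(context)
    for _ in range(k):
        guess = target_next(ctx) if random.random() < 0.8 else 0
        out.append(guess)
        ctx.append(guess)
    return out

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify with the target model,
    keeping the longest agreed prefix plus the target's correction
    at the first mismatch."""
    accepted, ctx = [], list(context)
    for tok in draft_k(context, k):
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction; stop here
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted  # at least 1 and up to k tokens per verification step

random.seed(0)
print(speculative_step([1, 2, 3]))
```

Because every verification step yields at least one token, and often several, the large model is invoked fewer times per generated token, which is where the speedup comes from.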
Although combining these techniques produced some performance gains, the company noted that the speedup was not substantial on its own. To address this, researchers integrated ReDrafter into Nvidia’s TensorRT-LLM inference acceleration framework.
As part of the collaboration, Nvidia added new operators and improved existing ones to better support the speculative decoding process. The findings indicated that using the Nvidia platform with ReDrafter produced a 2.7x increase in tokens generated per second for greedy decoding—a strategy commonly used in sequence generation tasks.
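Greedy decoding, the strategy the benchmark refers to, simply picks the single highest-scoring token at every step rather than sampling or keeping multiple beams. A minimal sketch, with illustrative logits rather than real model output:

```python
def greedy_decode(step_logits):
    """Greedy decoding: for each step's logits, return the index (token id)
    with the highest score. No sampling, no beam search."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in step_logits]

# Three decoding steps over a 2-token vocabulary: the argmax at each
# step is taken, so this prints [1, 0, 1].
print(greedy_decode([[0.1, 0.9], [0.7, 0.2], [0.3, 0.8]]))
```

Because greedy decoding is deterministic given the logits, it is a common baseline for measuring raw tokens-per-second throughput.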
Apple emphasized that the technique can meaningfully reduce AI processing latency while also requiring fewer GPUs and consuming less power.