
A popular technique to make AI more efficient has drawbacks

November 18, 2024

Written by Kyle Wiggers

One of the most widely used techniques to make AI models more efficient, quantization, has limits — and the industry could be fast approaching them.

In the context of AI, quantization refers to lowering the number of bits — the smallest units a computer can process — needed to represent information. Consider this analogy: When someone asks the time, you’d probably say “noon” — not “oh twelve hundred, one second, and four milliseconds.” That’s quantizing; both answers are correct, but one is slightly more precise. How much precision you actually need depends on the context.

AI models consist of several components that can be quantized — in particular parameters, the internal variables models use to make predictions or decisions. This is convenient, considering models perform millions of calculations when run. Quantized models with fewer bits representing their parameters are less demanding mathematically, and therefore computationally. (To be clear, this is a different process from “distilling,” which is a more involved and selective pruning of parameters.)
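To make the mechanics concrete, here is a minimal NumPy sketch of one common scheme, symmetric post-training quantization of a weight matrix to 8-bit integers. The scheme and the toy sizes are illustrative assumptions, not the exact method used by any particular lab or by the study discussed below.

```python
import numpy as np

# A toy "layer" of parameters, stored as 32-bit floats.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

# Symmetric 8-bit quantization: map the float range onto the integer
# levels an int8 can represent (here -127..127).
scale = np.abs(weights).max() / 127.0
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At inference time the integers are rescaled back to floats.
dequantized = quantized.astype(np.float32) * scale

print("max rounding error:", np.abs(weights - dequantized).max())
print("storage:", weights.nbytes, "bytes ->", quantized.nbytes, "bytes")
```

Storage drops from 4 bytes to 1 byte per parameter, and the rounding error printed at the end is exactly the precision being traded away.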

But quantization may have more trade-offs than previously assumed.

The ever-shrinking model

According to a study from researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon, quantized models perform worse if the original, unquantized version of the model was trained over a long period on lots of data. In other words, at a certain point, it may actually be better to just train a smaller model rather than cook down a big one.

That could spell bad news for AI companies training extremely large models (known to improve answer quality) and then quantizing them in an effort to make them less expensive to serve.

The effects are already manifesting. A few months ago, developers and academics reported that quantizing Meta’s Llama 3 model tended to be “more harmful” compared to other models, potentially due to the way it was trained.

“In my opinion, the number one cost for everyone in AI is and will continue to be inference, and our work shows one important way to reduce it will not work forever,” Tanishq Kumar, a Harvard mathematics student and the first author on the paper, told TechCrunch.

Contrary to popular belief, AI model inferencing — running a model, like when ChatGPT answers a question — is often more expensive in aggregate than model training. Consider, for example, that Google spent an estimated $191 million to train one of its flagship Gemini models — certainly a princely sum. But if the company were to use a model to generate just 50-word answers to half of all Google Search queries, it’d spend roughly $6 billion a year.
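The gap between those two figures is a matter of scale, and the back-of-envelope sketch below shows how an estimate of that order comes together. Every input is a purely hypothetical assumption for illustration, not a reported number.

```python
# Illustrative back-of-envelope arithmetic; all inputs are assumptions.
searches_per_day = 8.5e9     # assumed global search volume per day
share_answered = 0.5         # "half of all Google Search queries"
cost_per_answer = 0.004      # assumed serving cost of one ~50-word answer, USD

annual_inference_cost = searches_per_day * share_answered * cost_per_answer * 365
training_cost = 191e6        # the article's cited training estimate

print(f"annual inference: ${annual_inference_cost / 1e9:.1f}B")  # roughly $6B
print(f"one training run: ${training_cost / 1e6:.0f}M")
```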

Major AI labs have embraced training models on massive datasets under the assumption that “scaling up” — increasing the amount of data and compute used in training — will lead to increasingly more capable AI.

For example, Meta trained Llama 3 on a set of 15 trillion tokens. (Tokens represent bits of raw data; 1 million tokens is equal to about 750,000 words.) The previous generation, Llama 2, was trained on “only” 2 trillion tokens.

Evidence suggests that scaling up eventually provides diminishing returns; Anthropic and Google reportedly recently trained enormous models that fell short of internal benchmark expectations. But there’s little sign that the industry is ready to meaningfully move away from these entrenched scaling approaches.

How precise, exactly?

So, if labs are reluctant to train models on smaller datasets, is there a way models could be made less susceptible to degradation? Possibly. Kumar says that he and co-authors found that training models in “low precision” can make them more robust. Bear with us for a moment as we dive in a bit.

“Precision” here refers to the number of digits a numerical data type can represent accurately. Data types are collections of data values, usually specified by a set of possible values and allowed operations; the data type FP8, for example, uses only 8 bits to represent a floating-point number.
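One quick way to see what "fewer accurate digits" means is to store the same number in progressively narrower floating-point types. The snippet below is only an illustration using NumPy's built-in types, since NumPy has no standard FP8 type.

```python
import numpy as np

x = np.float64(np.pi)  # treat the 64-bit value as "ground truth"

# The same number stored in progressively narrower float types;
# each step keeps fewer accurate digits.
for dtype in (np.float64, np.float32, np.float16):
    y = dtype(x)
    print(f"{np.dtype(dtype).name:>8}: {float(y):.10f}"
          f"  (error vs. float64: {abs(float(y) - float(x)):.1e})")
```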

Most models today are trained at 16-bit or "half precision" and "post-train quantized" to 8-bit precision. Certain model components (e.g., the parameters) are converted to a lower-precision format at the cost of some accuracy. Think of it like doing the math to a few decimal places but then rounding off to the nearest 10th, often giving you the best of both worlds.
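As a rough illustration of that workflow, the sketch below applies PyTorch's dynamic post-training quantization to a toy float model, storing the linear-layer weights as 8-bit integers. This is one readily available flavor of post-train quantization, not necessarily the pipeline any given lab uses.

```python
import torch
import torch.nn as nn

# A small float model standing in for a trained network.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

# Post-training dynamic quantization: Linear weights are stored as int8
# and dequantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
with torch.no_grad():
    diff = (model(x) - quantized(x)).abs().max().item()
print("max output difference after quantization:", diff)
```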

Hardware vendors like Nvidia are pushing for lower precision for quantized model inference. The company’s new Blackwell chip supports 4-bit precision, specifically a data type called FP4; Nvidia has pitched this as a boon for memory- and power-constrained data centers.

But extremely low quantization precision might not be desirable. According to Kumar, unless the original model is incredibly large in terms of its parameter count, precisions lower than 7- or 8-bit may see a noticeable step down in quality.

If this all seems a little technical, don’t worry — it is. But the takeaway is simply that AI models are not fully understood, and known shortcuts that work in many kinds of computation don’t work here. You wouldn’t say “noon” if someone asked when they started a 100-meter dash, right? It’s not quite so obvious as that, of course, but the idea is the same:

“The key point of our work is that there are limitations you cannot naïvely get around,” Kumar concluded. “We hope our work adds nuance to the discussion that often seeks increasingly low precision defaults for training and inference.”

Kumar acknowledges that his and his colleagues’ study was at relatively small scale — they plan to test it with more models in the future. But he believes that at least one insight will hold: There’s no free lunch when it comes to reducing inference costs.

“Bit precision matters, and it’s not free,” he said. “You cannot reduce it forever without models suffering. Models have finite capacity, so rather than trying to fit a quadrillion tokens into a small model, in my opinion much more effort will be put into meticulous data curation and filtering, so that only the highest quality data is put into smaller models. I am optimistic that new architectures that deliberately aim to make low precision training stable will be important in the future.”
