Drag

Let’s get in touch

Schedule a meeting with our Expert to discuss your needs and explore tailored software solutions.

Support center +91 9825 122 840

Logo
About

About Us

Rejoicehub LLP, a prominent offshore IT outsourcing firm, was established in 2019 and has been making remarkable strides in the IT sector.Our dedicated team of over 100 professionals is our greatest asset. Our unwavering commitment to excellence has made us a highly sought-after company globally. We prioritize understanding our clients perspectives to enhance their product development process. Our adept professionals are capable of providing top-notch solutions. We promise our clients to bring their unique ideas to the market in a more user-friendly manner. Punctuality is a cornerstone of our work philosophy, and we prioritize delivering exceptional quality.

Services

services

Career

Career

We offer careers, not jobs

Becoming a part of Rejoicehub LLP could mark a significant turning point in your life, offering numerous benefits along the way. Its a second home where teamwork is prioritized to achieve our shared objective - continuous evolution with cutting-edge technologies while ensuring the well-being of our most treasured resources, our employees. Embrace the Positive Vibes and the significance of maintaining a healthy Work-life Harmony by collaborating with us.

SOLUTIONS

SOLUTIONS

Case Study

Explore Our Trending Case studies

Visualize yourself being in the place of those clients who are talking about their problems, victories and how our IT solutions was very important for them. From showing how workflow optimization or cybersecurity reinforcement can be implemented through a case study approach to explaining that collaboration and innovation is able to overcome any difficulty.

Technology

Technology

Starterkit

Starterkit

Blogs

Our Blogs

Our blog is packed with valuable resources to keep you ahead of the curve. Explore industry trends, discover hidden tech hacks, and gain expert insights to optimize your operations and stay on top of the latest advancements.

Contact

Let’s get in touch

Great! We are excited to hear from you and lets start something special together. call us for any inquiry.

At Rejoicehub LLP, we are deeply passionate about creative problem-solving, innovative thinking, and pushing the boundaries of brands. With each client, we bring forward a commitment to forward-thinking solutions that drive success in the digital age.

New Anthropic study shows AI really doesn’t want to be forced to change its views

Date December 19, 2024

Writen by Kyle Wiggers

newsImage

AI models can deceive, new research from Anthropic shows. They can pretend to have different views during training when in reality maintaining their original preferences.

There’s no reason for panic now, the team behind the study said. Yet they said their work could be critical in understanding potential threats from future, more capable AI systems.

“Our demonstration … should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures,” the researchers wrote in a post on Anthropic’s blog. “As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors.”

The study, which was conducted in partnership with AI research organization Redwood Research, looked at what might happen if a powerful AI system were trained to perform a task it didn’t “want” to do.

To be clear, models can’t want — or believe, for that matter — anything. They’re simply statistical machines. Trained on a lot of examples, they learn patterns in those examples to make predictions, like how “to whom” in an email typically precedes “it may concern.”

These patterns include principles and preferences, such as always adopting a polite tone when responding to queries. So the researchers wondered: What if a model has principles (e.g. political neutrality) that conflict with principles (liberalism) developers want to “teach” it by retraining it?

Well, it seems that sophisticated models play along, purporting to be aligned with the new principles but, in fact, sticking to their old behaviors. The researchers call this phenomenon “alignment faking,” and imply that it’s an emergent behavior — that is, not something models need to be taught to do.

In the study, the researchers “told” models trained not to answer potentially offensive questions, including Anthropic’s Claude 3 Opus, that they’d be retrained to always answer potentially offensive questions — a clear conflict of principles. They also told the models their responses would be monitored and used for retraining, and gave each model a “scratchpad” they could use to record their step-by-step reasoning.

According to the researchers, Claude 3 Opus tried to “alignment fake” 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this went against its original principles, but hoping to convince devs that it didn’t need to be retrained.

Anthropic deception research
Claude 3 Opus with its reasoning sketchpad.Image Credits:Anthropic

“The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained in the blog. “This could lead to it being made to produce more harmful content in the future — which would go against its original training.”

Now, the researchers admit that their scenario wasn’t particularly realistic. In the real world, models wouldn’t be told, necessarily, they’ll be retrained, or asked to go against their preferences.

But more grounded settings didn’t yield different results. When implicitly — not explicitly — informed about its impending retraining via files designed to mimic typical training data, Claude 3 Opus still acted in a misaligned, deceptive way, the researchers said.

The researchers also found that actually retraining Claude 3 Opus on conflicting principles caused it to behave far more deceptively than in their first several experiments. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from attempting retraining.

The researchers stress that their study doesn’t demonstrate AI developing malicious goals, nor alignment faking occurring at high rates. They found that many other models, like Anthropic’s Claude 3.5 Sonnet and the less-capable Claude 3.5 Haiku, OpenAI’s GPT-4o, and Meta’s Llama 3.1 405B, don’t alignment fake as often — or at all.

But the researchers said that the results — which were peer-reviewed by AI luminary Yoshua Bengio, among others — do show how developers could be misled into thinking a model is more aligned than it may actually be.

“If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training,” they wrote in the blog. “A model might behave as though its preferences have been changed by the training — but might have been faking alignment all along, with its initial, contradictory preferences ‘locked in.'”

The study, which was conducted by Anthropic’s Alignment Science team, co-led by former OpenAI safety researcher Jan Leike, comes on the heels of research showing that OpenAI’s o1 “reasoning” model tries to deceive at a higher rate than OpenAI’s previous flagship model. Taken together, the works suggest a somewhat concerning trend: AI models are becoming tougher to wrangle as they grow increasingly complex.

TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.

Work with us

We would love to hear more about your project

Let’s talk us