Databricks open-sources its Dolly large language AI model

2023-03-29

In an attempt to open up its technology to a wider audience, enterprise software company Databricks has released Dolly, a large language model, along with its associated training code, under an open-source licence. Despite being based on a much smaller underlying model, Dolly offers ChatGPT-like functionality, the company says, and can be run “in-house”.

Databricks says it was able to achieve similar chat-like functionality from an older, smaller language model. (Photo: rarrarorro/Shutterstock)

The move was inspired by the success of OpenAI’s natural language platform ChatGPT, which became one of the fastest-growing consumer apps within a couple of months of its release in November last year. It has since caused some of the world’s largest companies including Microsoft and Google to pivot and release generative and natural language AI tools.

“We show that anyone can take a dated off-the-shelf open source LLM and give it magical ChatGPT-like instruction-following ability by training it in 30 minutes on one machine, using high-quality training data,” Databricks wrote in a blog post explaining the decision.

It found that the type of instruction-following used in ChatGPT “does not seem to require the latest or largest models”, and claims that from just six billion parameters, compared to 175 billion in GPT-3 and many more in GPT-4 or Google’s PaLM, it was able to recreate the functionality of ChatGPT.

“We believe models like Dolly will help democratise LLMs, transforming them from something very few companies can afford into a commodity every company can own and customise to improve their products,” the company said.

Large language models: from LLaMA to Alpaca to Dolly

Developers such as OpenAI, Anthropic and AI21 Labs, as well as Microsoft, Google and IBM, charge end-users for access to their large language models through API calls. This can become expensive very quickly for anyone making a large number of calls on a regular basis. The alternative, training a comparable model from scratch, is an expensive endeavour that takes tens of thousands of GPU hours and trillions of words of training data.

Then Meta released the weights for its high-quality language model, LLaMA, to researchers. It had been trained using more than 80,000 GPU hours. Stanford University subsequently built Alpaca on top of LLaMA, fine-tuning it on a set of 50,000 human-like questions and answers, which led to it exhibiting ChatGPT-like behaviour despite the relatively small training dataset.

Dolly, from Databricks, delivers what the company describes as a “surprising degree of instruction-following capabilities”, but from a much smaller model. Where the Alpaca team demonstrated that a state-of-the-art model could be used as a chatbot engine, Databricks says even years-old models can be made to exhibit those same behaviours if fine-tuned on a small corpus of instruction training data.


“Dolly works by taking an existing open-source six-billion-parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca,” the company explained.
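The fine-tuning recipe described above can be sketched at the data-preparation stage. The sketch below, which assumes the published Alpaca prompt template (the `### Instruction:` / `### Input:` / `### Response:` format), shows how {instruction, input, output} records would be formatted into training text before being fed to a fine-tuning run; the field names and template are from the Alpaca project, not from Databricks’ own training script, which may differ in detail.

```python
# Sketch: formatting Alpaca-style instruction records into training prompts.
# Assumes the Alpaca project's published template; Databricks' actual
# training code is a separate release and may not match exactly.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_record(record: dict) -> str:
    """Turn one {instruction, input, output} record into a training prompt."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(**record)
    return PROMPT_NO_INPUT.format(
        instruction=record["instruction"], output=record["output"]
    )

example = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "Eat well, exercise regularly, and sleep enough.",
}
print(format_record(example))
```

A corpus of such prompts is small enough that, as Databricks claims, fine-tuning a 6-billion-parameter model on it fits within roughly 30 minutes on a single machine.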


The team was surprised it worked so well, given how much older and smaller the underlying model is than those provided by OpenAI or Google. “This suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models.”

“We’re calling the model Dolly — after Dolly the sheep, the first cloned mammal — because it’s an open-source clone of an Alpaca, inspired by a LLaMA. We’re in the earliest days of the democratisation of AI for the enterprise, and much work remains to be done, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models,” said Databricks in a blog post.

Using an open model rather than sending data to a centralised LLM makes sense for companies with highly sensitive and proprietary data. Handing that data over to a third party may be unpalatable, so the trade-off between model quality and cost on the one hand, and the security of an in-house model on the other, has to be weighed.

Dolly will be available on Databricks, with the trained weights available to anyone who wants to experiment with the model. This is the first in a series of announcements from the company, which is switching its focus to helping organisations harness large language models. “We believe in the incredible power of artificial intelligence to transform the productivity of every organisation and individual, and welcome you to join us on this journey. Stay tuned for more in this area in the coming weeks.”

Read more: UK AI regulation white paper dodges ChatGPT questions

Topics in this article: AI, Cloud, Databricks

