Zyphra debuts Zyda, a 1.3T-token language modeling dataset it claims outperforms Pile, C4, arXiv




Zyphra Technologies is announcing the release of Zyda, a large dataset designed for training language models. It consists of 1.3 trillion tokens and is a filtered and deduplicated blend of existing high-quality open datasets: RefinedWeb, StarCoder, C4, Pile, SlimPajama, peS2o, and arXiv. The company claims its ablation studies show that Zyda performs better than the datasets it was built on. An early version of the dataset powers Zyphra’s Zamba model and will eventually be available for download on Hugging Face.


“[We] came up with Zyda when [we] were trying to create a pretraining dataset for [our] Zamba series of models,” Zyphra Chief Executive Krithik Puthalath tells VentureBeat in an email. “The problem it solves is it provides a trillion token scale extremely high-quality dataset for training language models which otherwise everybody who wanted to train a language model would have to recreate something like Zyda themselves.”

It seems the company wanted to build the proverbial better mousetrap. After combining multiple existing open datasets, Zyphra cleaned the result so that each document appears only once. Specifically, it performed syntactic filtering to eliminate low-quality documents before executing an “aggressive” deduplication effort “within and between” the datasets. “Cross deduplication is very important as we found many datasets had a large number of documents that also existed in other datasets,” the company explains in a blog post. This probably shouldn’t be surprising, given that many of these datasets likely draw from common sources such as Common Crawl.
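
To make the idea of filtering followed by deduplication “within and between” datasets concrete, here is a minimal Python sketch. It is not Zyphra’s pipeline: the filter thresholds, the function names, and the toy documents are all hypothetical, and exact hashing of normalized text stands in for whatever matching method the company actually used.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_syntactic_filter(text: str, min_words: int = 5,
                            min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical quality check: enough words, and mostly letters or spaces."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def filter_and_dedup(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Filter each dataset, then drop duplicates within and between datasets."""
    seen: set[str] = set()  # shared across datasets, so cross-dataset copies are caught
    kept: dict[str, list[str]] = {}
    for name, docs in datasets.items():
        kept[name] = []
        for doc in docs:
            if not passes_syntactic_filter(doc):
                continue
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already kept an equivalent document
            seen.add(digest)
            kept[name].append(doc)
    return kept


# Toy stand-ins for two source datasets (invented for illustration).
toy = {
    "web_a": [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog.",   # duplicate within web_a
        "$$$ 1234 #### 5678 !!!! 0000 ****",               # fails the letter-ratio check
    ],
    "web_b": [
        "the quick  brown fox jumps over the lazy dog.",   # duplicate of a web_a document
        "Language models are trained on large text corpora.",
    ],
}
result = filter_and_dedup(toy)
print({name: len(docs) for name, docs in result.items()})
```

The single `seen` set shared across all sources is what turns per-dataset deduplication into cross-dataset deduplication. At trillion-token scale, a real pipeline would more plausibly rely on approximate, near-duplicate matching rather than exact hashes.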

[Figure: Zyda composition by source dataset. Image credit: Zyphra]

Of the seven open language modeling datasets used, RefinedWeb (43.6 percent) makes up the largest share of Zyda. SlimPajama (18.7 percent) and StarCoder (17.8 percent) are second and third, respectively. Each of the remaining datasets accounts for a single-digit percentage.




“In total, we discarded approximately 40 percent of our initial dataset, reducing its token count from approximately 2 [trillion] tokens to 1.3 [trillion].”
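
As a quick check on the figures in that quote, the short Python snippet below computes the reduction implied by the two rounded token counts. The variable names are illustrative, and the inputs are the approximate values quoted, so the result is only a rough estimate.

```python
# Approximate token counts taken from the quote (both are rounded figures).
initial_tokens = 2.0e12
final_tokens = 1.3e12

removed_tokens = initial_tokens - final_tokens
fraction_removed = removed_tokens / initial_tokens

print(f"Tokens removed: {removed_tokens:.2e}")
print(f"Share of tokens removed: {fraction_removed:.0%}")
```

The rounded counts work out to a reduction of about 35 percent of tokens, in the same range as the “approximately 40 percent” the company cites. The gap may come from rounding, or from the 40 percent being measured in something other than tokens; the article does not say.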

Because it’s open-sourced, developers can use Zyda to train language models without assembling a comparable corpus themselves. Better training data can translate into better next-word prediction, text generation, language translation, and more. If it performs as well as Zyphra says, developers may need only one dataset, reducing production time and cost.

And if you’re curious how the new dataset got its name, Puthalath reveals that Zyda is a blend of “Zyphra Dataset.”

You can download Zyda on Zyphra’s Hugging Face page.


