Snowflake unveils Polaris, a vendor-neutral open catalog implementation for Apache Iceberg

Today, Snowflake kicked off its annual data cloud summit with the launch of Polaris Catalog, a new, open data catalog implementation for indexing and organizing data conforming to the Apache Iceberg table format. 

Available in both self-hosted and Snowflake-hosted options, the catalog will be open-sourced over the next 90 days and will interoperate with the other query engines enterprises want to use to drive value from their data assets.

“This is not a Snowflake feature to work better with the Snowflake query engine. Of course, it will integrate and interoperate very well, but we’re bringing together several industry partners to make sure we can give our mutual customers the choice to mix and match multiple query engines and to be able to coordinate read and write activity in any fashion, without lock-in,” Christian Kleinerman, EVP of Product at Snowflake, said in a press briefing.

Preventing the new ‘lock-in’ layer with Polaris

After the initial rise of first-generation Apache Hive, three open table formats have largely dominated the data ecosystem: Delta Lake, Apache Iceberg and Apache Hudi. 

While each of these formats has its core strength with support for commonly used file formats like Parquet to efficiently handle analytic workloads, data platform vendors have focused on one primary table format for their customers. For Databricks, this has been Delta Lake, while Snowflake, its biggest competitor, has been gradually shifting towards Iceberg.

“Snowflake did start with its proprietary table format. There are benefits to it, but many large, tech-savvy organizations want to set their eyes on one of two main table formats, Delta or Iceberg…We did the full assessment. We’re 100% committed to Apache Iceberg,” Kleinerman told VentureBeat while noting customers are leveraging it “quite well.”

As adoption of the open Delta Lake and Iceberg formats has grown, enterprises have increasingly needed them to interoperate. Essentially, they want to freely mix and match data catalogs that support either format with different engines that support those formats, run queries against the data and provide answers to downstream users and apps. According to Kleinerman, this need for interoperability has been one of the primary reasons for opting for open file and table formats. However, enterprises, especially those using Delta Lake catalogs, have frequently pointed out that the catalog implementation has not been fully open.

“We heard very consistently from customers the catalog has the potential to become the ‘new lock-in layer’. In particular, we got to see some of the moves that are happening with the other format (Delta Lake), where the strong coupling between the closed-source catalog and the format is raising concerns. I’ve been on calls with customers saying ‘I want to hear more about Iceberg because Delta, the way it’s going, is open on the surface but closed in reality’,” Kleinerman added.

To address this concern, and further reinforce its commitment to Iceberg, Snowflake has launched the Polaris Catalog, which is based entirely on Iceberg’s open-source REST protocol. This way, the offering provides an open standard for users to access and retrieve data using any engine of choice that supports the Iceberg REST API, including Apache Flink, Apache Spark, Dremio, Python, Trino and others.

Most importantly, enterprises get the flexibility to host Polaris on the Snowflake data cloud or self-host it on their own infrastructure using container tooling such as Docker or Kubernetes. The backend implementation of the catalog remains open-source in either case, giving enterprises the option to freely swap the hosting infrastructure while eliminating concerns about vendor lock-in.

“You can have Polaris without having the rest of Snowflake…So, if you have lots of data in cloud storage, you can instantiate the Polaris Catalog and enumerate all the tables in this bucket. As a result, you have a catalog that knows how to answer questions based on Apache Iceberg documented APIs, like ‘give the tables for this database, give the columns for this table,’ etc. This way, any engine that knows how to leverage those APIs can query Polaris for information based on that data,” Kleinerman explained.
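
To make those two questions concrete, here is a minimal Python sketch of how a client might ask any Iceberg REST catalog for the tables in a database and the columns of a table. It is an illustration under stated assumptions, not Snowflake's implementation: the endpoint URL is hypothetical, the helper names are invented for this example, and the paths follow the public Iceberg REST catalog specification as we understand it. Real deployments may also insert a catalog-specific prefix segment after `/v1/`, which this sketch omits.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Hypothetical catalog endpoint, for illustration only.
CATALOG_URL = "https://polaris.example.com/api/catalog"


def namespace_segment(levels):
    """Encode a (possibly multi-level) namespace as one URL path segment.

    The Iceberg REST spec joins namespace levels with the 0x1F unit separator.
    """
    return quote("\x1f".join(levels), safe="")


def tables_url(base, levels):
    """URL for 'give the tables for this database' (list tables in a namespace)."""
    return f"{base}/v1/namespaces/{namespace_segment(levels)}/tables"


def table_url(base, levels, table):
    """URL for loading one table's metadata, which includes its schema."""
    return f"{tables_url(base, levels)}/{quote(table, safe='')}"


def fetch_json(url, token):
    """GET a catalog endpoint with a bearer token and decode the JSON body."""
    request = Request(url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(request) as response:
        return json.load(response)


def table_names(list_tables_response):
    """Flatten a list-tables response into dotted table names."""
    return [
        ".".join(ident["namespace"] + [ident["name"]])
        for ident in list_tables_response["identifiers"]
    ]


def column_names(load_table_response):
    """'Give the columns for this table': read them from the current schema.

    Assumes format-version 2 table metadata with a 'schemas' list.
    """
    metadata = load_table_response["metadata"]
    current_id = metadata["current-schema-id"]
    schema = next(s for s in metadata["schemas"] if s["schema-id"] == current_id)
    return [field["name"] for field in schema["fields"]]
```

An engine would call `fetch_json(tables_url(CATALOG_URL, ["sales"]), token)` and pass the result to `table_names`, then do the same with `table_url` and `column_names`. Nothing in the sketch is specific to one query engine, which is the interoperability point Kleinerman is making.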

Notably, Snowflake is doing the work to consume these APIs and do the same thing with its own query engine. The company is also working on building up the security for the project, ensuring the same level of permissions across different engines.

“Most of these catalog efforts and interoperability efforts have problems with ensuring the same levels of permissions or security entitlements across engines. That is not yet specified in the official Apache Iceberg spec. There are some proposals. So, for now, we’ve made some extensions in Polaris to support security across engines. We need to figure out how (what should be the right interface) to align it with the community. This is one of the conversations that we are having right now with partners,” Kleinerman noted.

Preview later in June

As of now, Snowflake is putting the final touches on Polaris. The company plans to make it available to first enterprise customers under preview later in June. Multiple leading enterprises with open query engines have already expressed support for the effort, including Amazon Web Services (AWS), Confluent, Dremio, Google Cloud, Microsoft Azure and Salesforce.

“Customers want thriving open ecosystems and to own their storage, data and metadata. They don’t want to be locked in. We’re committed to supporting open standards, such as Apache Iceberg and the open catalogs Project Nessie and Polaris Catalog. These open technologies will provide the ecosystem interoperability and choice that customers deserve,” Tomer Shiran, founder of Dremio, said in a statement.

Snowflake Data Cloud Summit runs from June 3 to June 6, 2024.
