At its Cloud Data Summit today, Google announced the preview launch of BigLake, a new data lake storage engine that makes it easier for businesses to analyze the data in their data warehouses and data lakes. At its heart, the idea is to extend Google’s BigQuery data warehouse experience to data lakes on Google Cloud Storage, combining the best of data lakes and warehouses into a single service that abstracts away the underlying storage formats and systems.
Notably, this data can sit in BigQuery or live on AWS S3 and Azure Data Lake Storage Gen2. Through BigLake’s uniform storage engine, developers will be able to query the underlying data stores through a single system, without the need to move or duplicate data. “Managing data across multiple lakes and warehouses creates silos and adds risk and expense, especially when data needs to be migrated,” Gerrit Kazmaier, VP and GM of Databases, Data Analytics, and Business Intelligence at Google Cloud, writes in today’s announcement. “BigLake enables businesses to unify their data warehouses and lakes to analyze data without worrying about the underlying storage format or system, eliminating the need to duplicate or move data from a source and reducing cost and inefficiencies.”
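To make that single query surface concrete, here is a minimal sketch using the google-cloud-bigquery Python client to define and query a table backed by Parquet files sitting in a Cloud Storage bucket. The project, dataset, connection, and bucket names are hypothetical, and the external-table DDL is illustrative of BigQuery’s syntax rather than a verbatim BigLake recipe.

```python
# Sketch: querying object-store data in place through BigQuery.
# All project, dataset, connection, and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Define a table over Parquet files that stay in Cloud Storage.
# (Illustrative DDL; the preview's exact options may differ.)
client.query("""
CREATE EXTERNAL TABLE IF NOT EXISTS demo_dataset.events
WITH CONNECTION `my-project.us.my-gcs-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']
)
""").result()

# The files never move; they are queried like any BigQuery table.
job = client.query(
    "SELECT event_type, COUNT(*) AS n "
    "FROM demo_dataset.events GROUP BY event_type"
)
for row in job.result():
    print(row.event_type, row.n)
```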
Using policy tags, BigLake lets administrators set security policies at the table, row, and column level. These policies cover data in Google Cloud Storage, as well as the two supported third-party systems, where BigQuery Omni, Google’s multi-cloud analytics service, enforces them. The same controls ensure that only the right data flows into tools like Spark, Presto, Trino, and TensorFlow, and an integration with Google’s Dataplex tool provides additional data management capabilities.
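To give a flavor of those controls, here is a hedged sketch of a row-level policy, following BigQuery’s row access policy DDL. The table, group, and column names are hypothetical, and column-level controls would additionally involve policy tags defined through Data Catalog.

```python
# Sketch: a row-level security policy on the hypothetical events table.
# Engines that read through BigQuery's APIs see only the permitted rows.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Only members of the (hypothetical) APAC sales group see APAC rows.
client.query("""
CREATE ROW ACCESS POLICY apac_sales_only
ON demo_dataset.events
GRANT TO ('group:sales-apac@example.com')
FILTER USING (region = 'APAC')
""").result()
```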
According to Google, BigLake will provide fine-grained access controls, and its API will span Google Cloud, as well as file formats like the open column-oriented Apache Parquet and open-source processing engines like Apache Spark. “The volume of valuable data that enterprises have to manage and analyze is growing at an incredible rate,” Google Cloud software engineer Justin Levandoski and product manager Gaurav Saxena write in today’s announcement. “This data is increasingly distributed across many locations, including data warehouses, data lakes, and NoSQL databases. As an organization’s data grows more complex and proliferates across disparate data environments, silos emerge, creating increased risk and cost, especially when that data needs to be moved. Our customers have made it clear: they need help.”
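To picture the Spark side of that story, the sketch below reads the hypothetical table from earlier with the open-source spark-bigquery-connector, which pulls rows through the BigQuery Storage API, so access policies set on the table apply before any data reaches Spark. The connector coordinates, version, and table name are assumptions, not a documented BigLake pairing.

```python
# Sketch: reading the hypothetical table from Apache Spark via the
# open-source spark-bigquery-connector. Names and the connector
# version are placeholders; match them to your own cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("biglake-read-sketch")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2",
    )
    .getOrCreate()
)

# Rows arrive through the BigQuery Storage API, so row- and
# column-level policies are enforced before Spark sees the data.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.demo_dataset.events")
    .load()
)
df.groupBy("event_type").count().show()
```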
Beyond BigLake, Google also announced today that Spanner, its globally distributed SQL database, is getting “change streams.” These let users track any changes to a database in real time, whether inserts, updates, or deletes. Customers can replicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior via Pub/Sub, or store changes in Google Cloud Storage (GCS) for compliance, according to Kazmaier. Google Cloud also announced the general availability of Vertex AI Workbench, a tool for managing the entire lifecycle of a data science project, as well as Connected Sheets for Looker and the ability to access Looker data models in its Data Studio BI tool.
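As for what change streams look like in practice, here is a minimal sketch using the google-cloud-spanner Python client and Spanner’s CHANGE STREAM DDL. The instance, database, and table names are placeholders, and wiring the stream into BigQuery, Pub/Sub, or GCS would typically happen through a separate pipeline such as Dataflow rather than in this snippet.

```python
# Sketch: creating a Spanner change stream over a hypothetical Orders
# table. Instance, database, and table names are placeholders.
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("my-database")

# Capture every insert, update, and delete on Orders in real time.
op = database.update_ddl(["CREATE CHANGE STREAM OrdersStream FOR Orders"])
op.result()  # wait for the schema change to complete

# Downstream, the stream can feed BigQuery (real-time analytics),
# Pub/Sub (triggering application behavior), or GCS (compliance).
```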