142: The Cloud Pod spends the Weekend at the Google Data Lakehouse
The Cloud Pod
On The Cloud Pod this week, the team wishes for time-traveling data. Also, GCP announces Data Lakehouse, Microsoft hosts Ignite 2021, and Microsoft is out for the metaverse.

A big thanks to this week’s sponsors:
Foghorn Consulting, which provides full-stack cloud solutions with a focus on strategy, planning and execution for enterprises seeking to take advantage of the transformative capabilities of AWS, Google Cloud and Azure.
JumpCloud, which offers a complete platform for identity, access, and device management, no matter where your users and devices are located.

This week’s highlights
🚨 GCP releases its data lakehouse, a new architecture that offers low-cost storage in an open format. The real question is, can we book it on Airbnb?
🚨 Microsoft kicks off Ignite 2021, announcing new capabilities for its hybrid, multicloud and edge computing platforms.
🚨 Microsoft also unveils plans for its own metaverse, including upgrades to Teams, Dynamics 365 Connected Spaces and more.

Top Quotes
💡 “I'm a big fan of IDE for coding and that integrated environment to reduce context shifting, but when you're talking about access to data, Jupyter is something that's hosted, that you can protect and grant access to, versus an IDE like RStudio. It becomes a much trickier scenario to maintain any kind of data sovereignty, or protect that in any way, just because, by its true nature, you have to open it up.”
💡 “Between the Facebook Metaverse and Microsoft, who's going to win the race? Everyone wants to build ‘Ready Player One.’ And Facebook owns Oculus and they have all my data, then they can get my brain as well: They can just monetize the crap out of my profile. And then Microsoft has their augmented reality things… But I think the power of the Azure cloud actually gives them the advantage versus Facebook, in my opinion.”

General News: ‘Tis Earnings Season
📈 Microsoft was the first to announce its quarterly results, with revenue rising to $45 billion.
This jump of 22% beats Wall Street expectations and includes Microsoft Azure, LinkedIn commercial revenue, Office 365, and Xbox.
💰 Google also posted impressive results, rounding out the quarter with a profit of $18.9 billion, up a whopping 68% from one year ago. Much of this success came from Google Ads and GCP, where revenue was up 45% to about $5 billion.
📉 Due to ongoing supply chain issues and labor shortages, Amazon missed the mark on its earnings forecast, posting a profit of $3.2 billion, a 49% decrease from last year. AWS, however, outperformed (as usual), with a 39% rise in revenue to $16.1 billion.

AWS: The Official Cloud Storage Provider of MI6
🐟 Now generally available, Babelfish for Aurora PostgreSQL allows users to migrate from expensive, proprietary MSSQL to Amazon Aurora PostgreSQL-Compatible Edition. With Babelfish, customers can move their apps in a fraction of traditional migration times. See ya, Microsoft!
🧠 Following the recent launch of M6i, AWS has released C6i, a new instance that offers 15% improvement in compute price performance and up to 9% higher memory bandwidth when compared to C5.
⌨️ AWS releases new attribute-based instance type selection (ABS) to help users express instance requirements (e.g. vCPU, memory, and storage) and have them translated into matching instance types, simplifying the creation and maintenance of instance type configurations.
🍸 MI6, the UK spy agency and home of James Bond, chooses AWS as its partner to scale cloud computing. This contract is estimated to be worth $689 million to $1.38 billion over the next decade. Can’t say Dr. No to that.
👏 AWS is now allowing users to run their Windows containers with AWS Fargate, which removes the need to provision, scale and manage Windows compute infrastructure. Finally a way to run containers that isn’t totally awful.
📒 In collaboration with RStudio PBC, Amazon is releasing the first fully managed RStudio Workbench integrated development environment (IDE) in the cloud.
RStudio users can now synchronize their RStudio notebooks with Amazon SageMaker through underlying EFS storage.
☁️ Justin is pumped for Amazon CloudFront, which now supports configurable CORS, security, and custom HTTP response headers. This will save users time by removing the need to configure their origin, or use custom Lambda@Edge or CloudFront functions to insert headers.

GCP: Enjoy Your Stay at the Google Data Lakehouse
⭐ With the highest marks possible, Forrester names Google AppSheet a leader in low-code platforms for business developers in Q4 2021. Gold star for you, Google. We mean, Alphabet.
🤝 The Django ORM can now fully support Cloud Spanner. The ORM is a powerful component of the Django web framework, and it can now use Cloud Spanner as a third-party database backend, powered by the Python Cloud Spanner library.
🏡 GCP’s data lakehouse is open for visitors! Combining the benefits of data warehouses and data lakes, GCP has released a new data framework for low-cost storage in an open format. It’s accessible by a variety of processing engines, while also providing powerful management and optimization features.
🐶 Now in preview, Google Cloud Spot VMs allow users to improve total cost of ownership with discounts of up to 91%, plus increase automation and integrate seamlessly with simple one-line changes. Unlike preemptible VMs, GCP Spot VMs have no time limit, but they can be terminated at any time with 30 seconds' notice, hence the giant discount.
🖱️ Google makes Cloud Domains generally available, allowing all users to easily register and manage their domains in a single click.

Azure: Watch Out Facebook: Microsoft Talks Metaverse at Ignite 2021
🔥 Good news! You can now overpay for Azure Firewall Premium in more regions. This update also comes with Terraform support, web category check (available in preview), and more.
🔑 In an effort to close the cybersecurity skills gap in the U.S., Microsoft is creating a national community college curriculum to grow the number of cybersecurity professionals by 250,000 by 2025.
This is part of Microsoft’s $20 billion commitment over five years to improve security solutions.
🪵 The Logz.io integration with Azure is now generally available. With Logz.io, users can seamlessly provision accounts and configure Azure resources to send logs from Azure Portal. So basically, it’s Elasticsearch.
💽 Azure announces general availability of Ephemeral OS disk support for additional VM sizes. With this feature, users can create ephemeral OS disks for VMs that have no cache or insufficient cache.
📊 With the now generally available Azure Data Explorer Insights, you can get comprehensive monitoring of your Azure Data Explorer Clusters, along with a unified view of performance, cache, ingestions and usage.
💥 At Ignite, Azure announces major upgrades across its hybrid, multicloud and edge computing platforms. These upgrades include new cloud capabilities, data features, and SQL Server 2022 (in gated preview), which will be “the most flexible, scalable, and cloud-connected SQL Server release yet.”
🌐 Microsoft is after the metaverse. At Ignite 2021, Microsoft announced its plans for a hybrid metaverse, powered by Dynamics 365 Connected Spaces (now in preview) and Mesh for Microsoft Teams. Additional applications like the Azure OpenAI Service, Microsoft Loop, Microsoft Customer Experience and Context IQ will help build and guide the metaverse journey.

TCP Lightning Round
⚡ Ryan scores the point in this lightning round, due to his enthusiasm for Amazon EC2 Spot placement score. This leaves the points at Justin (16), Ryan (12), Jonathan (12), Peter (1).
Other Headlines Mentioned:
Public preview: Multiple backups per day for Azure Files
Video walkthrough: Set up a multiplayer game server with Google Cloud
AWS Transit Gateway Network Manager launches new APIs to simplify network and route analysis in your global network
Amazon EKS Managed Node Groups adds native support for Bottlerocket
Amazon Textract launches TIFF support and adds asynchronous support for receipts and invoices processing
Introducing Amazon EC2 Spot placement score
Amazon DevOps Guru increases coverage of Amazon EKS metrics and adds metric view by cluster
Azure trusted launch for Virtual Machines now generally available
Google announces Zero Trust workload security with GKE Traffic Director is now GA
Google now allows you to quickly, easily and affordably back up your data with BigQuery table snapshots
Public preview: Near real-time analytics for telemetry, time series, and log data on Azure Synapse
AWS Secrets Manager increases secrets limit to 500K per account

Things Coming Up
State of FinOps Update - Nov 18 Mini-Summit
Join us on Azure IaaS Day: Learn to increase agility and resiliency of your infrastructure - November 17th
AWS re:Invent - November 29th - December 3rd - Las Vegas Meetup as a Service
Building a Data Lakehouse and Much More at Grupo Boticário - Data Hackers Podcast 44
What is a Data Lakehouse? It may sound like just another fad, but it isn't: it is a new way of building a platform that simplifies and democratizes access to data from the moment the data is created. Cool, right? This and many other discussions ran through our episode 44, featuring the Data Engineering experts from Grupo Boticário. We brought in GB's leading names in Data Engineering and Architecture to teach us this lesson: Robson Mendonça (Senior Data Engineering Manager), Edson Junior (Data Engineering Manager), and Marcus Bittencourt (Data Architecture and Platform Manager). See the episode links in our Medium post: https://medium.com/data-hackers/construindo-data-lakehouse-e-muito-mais-no-grupo-botic%C3%A1rio-data-hackers-podcast-44-20d67f05cfa4
Delta Lake: A Scalable Storage Engine for Building a Data Lakehouse
Engenharia de Dados [Cast]
Delta Lake is an optimized storage engine for building Big Data and Analytics projects, designed especially for Apache Spark. The engine was created to store large amounts of data (Data Lake) and also to organize data as tables (Data Warehouse), so that queries against this file format can be indexed efficiently. In addition, several features have been added, such as ACID transactions, time travel, auditing, DML operations (insert, update, delete, and merge), and other capabilities that are valuable when operating on large volumes of data.
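To make "time travel" and atomic commits more concrete, here is a minimal toy sketch in plain Python. It is not Delta Lake's API or its storage format: the `VersionedTable` class and its `commit` and `read` methods are hypothetical names invented for this illustration, and the whole table is kept in memory. The only idea it borrows is that every successful commit produces a new numbered version, and older versions stay readable.

```python
from copy import deepcopy


class VersionedTable:
    """Toy in-memory table that keeps one full snapshot per commit.

    Hypothetical illustration only; real engines store a transaction
    log plus data files instead of copying the whole table.
    """

    def __init__(self):
        # Version 0 is the empty table.
        self._snapshots = [{}]

    @property
    def version(self):
        """Number of the latest committed version."""
        return len(self._snapshots) - 1

    def commit(self, changes):
        """Apply a batch of changes atomically and return the new version.

        `changes` maps a key to a row dict (upsert) or to None (delete).
        If any change is invalid, nothing is recorded.
        """
        new_snapshot = deepcopy(self._snapshots[-1])
        for key, row in changes.items():
            if row is None:
                if key not in new_snapshot:
                    # Abort: the partially modified copy is discarded.
                    raise KeyError(f"cannot delete missing key {key!r}")
                del new_snapshot[key]
            else:
                new_snapshot[key] = dict(row)
        self._snapshots.append(new_snapshot)
        return self.version

    def read(self, version=None):
        """Return the table as of `version` (latest when omitted)."""
        if version is None:
            version = self.version
        if not 0 <= version <= self.version:
            raise IndexError(f"unknown version {version}")
        return deepcopy(self._snapshots[version])


table = VersionedTable()
v1 = table.commit({1: {"name": "alice"}})                      # insert
v2 = table.commit({1: {"name": "alicia"}, 2: {"name": "bob"}})  # update + insert
v3 = table.commit({2: None})                                   # delete

latest = table.read()        # current state of the table
as_of_v1 = table.read(v1)    # "time travel" back to the first commit
```

Keeping full snapshots is the simplest way to show the behavior; it trades memory for clarity and would not scale to the data volumes the episode describes.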
Let Your Analysts Build A Data Lakehouse With Cuelake
Data Engineering Podcast
Summary
Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg he and his team at Cuebook have built an autoscaled cloud native system that abstracts the underlying complexity.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cuelake is and the story behind it?
There are a number of platforms and projects for running SQL workloads and transformations on a data lake. What was lacking in those systems that you are addressing with Cuelake?
Who are the target users of Cuelake and how has that influenced the features and design of the system?
Can you describe how Cuelake is implemented?
What was your selection process for the various components?
What are some of the sharp edges that you have had to work around when integrating these components?
What is involved in getting Cuelake deployed?
How are you using Cuelake in your work at Cuebook?
Given your focus on machine learning for anomaly detection of business metrics, what are the challenges that you faced in using a data warehouse for those workloads?
What are the advantages that a data lake/lakehouse architecture maintains over a warehouse?
What are the shortcomings of the lake/lakehouse approach that are solved by using a warehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Cuelake used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cuelake?
When is Cuelake the wrong choice?
What do you have planned for the future of Cuelake?

Contact Info
LinkedIn
vikrantcue on GitHub
@vkrntd on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email firstname.lastname@example.org with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Cuelake
Apache Druid
Dremio
Databricks
Zeppelin
Spark
Apache Iceberg
Apache Hudi

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Data Warehouse vs. Data Lakehouse - Use Cases and Comparisons with Orlando Marley
Engenharia de Dados [Cast]
Would you like to understand the difference between a Data Warehouse and a Data Lakehouse, and go further to see what actually happens in companies that adopt these solutions? Orlando Marley is one of the leading experts in this area, and with him we offer tips on how to better understand these two paradigms and how you can combine them to deliver standout analytics for your company. Data Lakehouse is a new concept that is quickly gaining traction, and learning about it is necessary if you want to stand out as a data engineer.
Episode 28: Data lakehouse – overhyped or here to stay?
Is the data lakehouse more than just a buzzword? Why has the term sprung up and are there any use cases where a lakehouse might make sense? Join Helena Schwenk and Graham Sharpe, director of strategic solutions at Exasol, as they explore the term and provide practical advice for data pros and organizations trying to get to grips with what it really means. If you want to explore more of our podcasts and extra talking points and resources, check out the DataXpresso homepage.
DW vs. Data Lake vs. MDW vs. Data Lakehouse for Data Pipelines
Engenharia de Dados [Cast]
One of the most common questions in big data environments and data pipeline construction is how the various types of storage we can connect to for processing data actually differ. In this episode, we tackle every type the market offers, showing their upsides and downsides so that you, as the one building, understand as well as possible how each of these storage systems behaves. We also talk about the importance of mindset, for both the professional and the company, in not just storing data but processing it efficiently, maturely, and quickly. Follow the evolution of the Big Data and Analytics market and learn the newest terms and technologies used to build data pipelines.
In this episode Felipe and Mike talk about the evolution of the Data Warehouse to what is called a Data Lakehouse. What is it? Should every organization adopt it? Why? What are the technologies involved? And we ended up opening a Pandora's Box for following episodes.
Data Lakehouse, meet fast queries and visualization: Databricks unveils Delta Engine, acquires Redash. Backstage chat featuring Databricks CEO / Co-Founder Ali Ghodsi
Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
Data warehouses alone don't cut it. Data lakes alone don't cut it either. So whether you call it data lakehouse or by any other name, you need the best of both worlds, says Databricks. A new query engine and a visualization layer are the next pieces in Databricks' puzzle. We connected with Ali Ghodsi, co-founder and CEO of Databricks, to discuss their latest news: the announcement of a new query engine called Delta Engine, and the acquisition of Redash, an open source visualization product. Our discussion started with the background on data lakehouses, which is the term Databricks is advocating to signify the coalescing of data warehouses and data lakes. We talked about trends such as multicloud and machine learning that lead to a new reality, how data warehouses and data lakes work, and what the data lakehouse brings to the table. We also talked about Delta Engine and Redash of course, and we wrapped up with an outlook on Databricks' business growth. ZDNet article published in June 2020
Audio Blog: All Hail, the Data Lakehouse! (If Built on a Modern Data Warehouse) by Wayne Eckerson
Secrets of Data Analytics Leaders
This audio blog is about the data lakehouse and how it is the latest incantation from a handful of data lake providers to usurp the rapidly changing cloud data warehousing market. It is one of three blogs featured in the data lakehouse series. Originally published at: https://www.eckerson.com/articles/all-hail-the-data-lakehouse-if-built-on-a-modern-data-warehouse