In our previous blog on getting started with Azure Databricks, we looked at Databricks tables. In this blog, we will look at a type of Databricks table called the Delta table and best practices for storing data in Delta tables.
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. A Databricks Delta table is a table that uses Delta Lake as its data source, similar to how the table in the previous blog used a CSV file as its data source.
When we create a Delta table and insert records into it, Databricks stores the data in multiple small files. You can see the multiple files created for the table “business.inventory” below.
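To make this concrete, here is a minimal sketch of creating a Delta table and inserting a few records. The schema below (ItemId, ItemName, Quantity) is an illustrative assumption, not the actual schema of “business.inventory”:

CREATE TABLE business.inventory (
  ItemId STRING,   -- assumed column, for illustration only
  ItemName STRING, -- assumed column
  Quantity INT     -- assumed column
) USING delta;

INSERT INTO business.inventory VALUES
  ('I001', 'Widget', 100),
  ('I002', 'Gadget', 250);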
Partitioning involves putting different rows into different storage units. E.g., if we have an address table with addresses in the US, the addresses might be stored in 50 different tables corresponding to the 50 states, and a view with a union might be created over all of them to provide a complete view of all addresses, as sketched below.
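A rough sketch of this manual approach (the table and column names here are illustrative, not from a real system):

-- One table per state (only two shown here)
CREATE TABLE addresses_ca (id INT, street STRING, city STRING);
CREATE TABLE addresses_ny (id INT, street STRING, city STRING);

-- A view over the union presents all addresses as one dataset
CREATE VIEW addresses AS
SELECT 'CA' AS state, * FROM addresses_ca
UNION ALL
SELECT 'NY' AS state, * FROM addresses_ny;

With Delta tables, we get the same effect within a single table by declaring a partition column.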
Sample code to create a table partitioned by a date column is given below:
CREATE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING delta
PARTITIONED BY (date)
The table “business.sales” given below is partitioned by InvoiceDate. You can see that a folder is created for each InvoiceDate, and within the folders there are multiple files that store the data for this table.
This partitioning is useful for queries that filter this table on InvoiceDate in the WHERE clause.
E.g.:
SELECT SLSDA_ID, RcdType, DistId
FROM business.sales
WHERE InvoiceDate = '2013-01-01'
In total, there are 40,545 files for this table, as you can see in the screenshot below.
SMALL FILE PROBLEM
Historical and new data is often written in very small files and directories. This data may be spread across a data center or even across the world (that is, not co-located). The result is that a query on this data may be very slow because of network latency and because of the volume of file metadata that must be listed and processed: the more files a query has to open, the slower it runs.
The solution is to compact many small files into fewer, larger files.
The OPTIMIZE command invokes the bin-packing (compaction) algorithm to coalesce small files into larger ones. Small files are compacted together into new files of up to 1 GB.
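For example, to compact the “business.sales” table from earlier (this mirrors the documented Delta Lake syntax; the optional WHERE clause restricts compaction to a subset of partitions):

OPTIMIZE business.sales;

-- Optionally, compact only specific partitions
OPTIMIZE business.sales WHERE InvoiceDate >= '2013-01-01';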
You can see below that the OPTIMIZE command removed the 40,545 files and added 2,378 files in their place. Also observe that, after optimization, the size of the table decreased from 1.49 GB to 1.08 GB.
OPTIMIZE compacts the small files for unpartitioned tables as well.
The business.finance_transactions_silver table is not partitioned and currently has 64 files with a total size of 858 MB.
Running the OPTIMIZE command coalesces the 64 files into a single file.
Note that “partitionsOptimized” is 1 in this case. Previously, for the partitioned table, “partitionsOptimized” was 2509. The OPTIMIZE command coalesces small files within a partition only; if the table is not partitioned, the whole table is treated as a single partition.
ZORDER
Given a column that you want to perform ZORDER on, say OrderColumn, Delta Lake reorganizes the data so that rows with similar OrderColumn values are colocated in the same set of files. Because Delta tracks minimum and maximum values per file, queries that filter on OrderColumn can then skip files that cannot contain matching values.
Note: the table's partition column cannot also be used as a ZORDER column.
The syntax for ZORDER is:
OPTIMIZE tablename ZORDER BY (OrderColumn)
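For instance, to Z-order the “business.sales” table on DistId (assuming DistId is a frequently filtered, non-partition column; the choice of column here is illustrative):

OPTIMIZE business.sales ZORDER BY (DistId);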
To summarize, the three techniques we looked at for organizing data in Delta tables are:
a. PARTITIONED BY
b. OPTIMIZE
c. ZORDER BY