AZURE

In our previous blog on getting started with Azure Databricks, we looked at Databricks tables. In this blog, we will look at a type of Databricks table called Delta table and best practices around storing data in Delta tables.
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Databricks Delta table is a table that has a Delta Lake as the data source similar to how we had a CSV file as a data source for the table in the previous blog.
When we create a delta table and insert records into it, Databricks loads the data into multiple small files. You can see the multiple files created for the table “business.inventory” below

Partitioning involves putting different rows into different tables. E.g., if we have an address table with addresses in the US, the addresses might be stored in 50 different tables corresponding to the 50 states in the US. A view with a union might be created over all of them to provide a complete view of all addresses.
Sample code to create a table partitioned by date column is given below:
CREATE TABLE events ( date DATE, eventId STRING, eventType STRING, data STRING) USING delta PARTITIONED BY (date)
The table “business.sales” given below is partitioned by InvoiceDate. You can see that there is a folder created for each InvoiceDate and within the folders, there are multiple files that store the data for this table.
This partitioning will be useful when we have queries selecting records from this table with InvoiveDate in WHERE clause.
E.g.:
SELECT SLSDA_ID, RcdType, DistId
FROM business.sales
WHERE InvoiceDate = ‘2013-01-01’


In total there are 40,545 files for this table which you can see from below screenshot

SMALL FILE PROBLEM
Historical and new data is often written in very small files and directories. This data may be spread across a data center or even across the world (that is, not co-located). The result is that a query on this data may be very slow due to
The solution is to compact many small files into one larger file.
OPTIMIZE command invokes the bin-packing (Compaction) algorithm to coalesce small files into larger ones. Small files are compacted together into new larger files up to 1GB.
You can see below that the OPTIMIZE command has removed the 40,545 files and instead of them added 2378 files. Also, observe that after Optimization the size of the table has decreased from 1.49 GB to 1.08 GB


Optimize will compact the small files for tables that are not partitioned too.
business.finance_transactions_silver table is not partitioned and is currently having 64 files with total size 858 MB

Running the Optimize command coalesces the 64 files to 1 file


Note that “partitionsOptimized” is 1 in this case. Previously for the partitioned table “partitionsOptimized was 2509. OPTIMIZE command coalesces the small files within a partition only. If the table is not partitioned, the whole table is considered as a single xpartition.
Given a column that you want to perform ZORDER on, say OrderColumn, Delta
Note: We cannot use the table partition column also as a ZORDER column.
Syntax for ZORDER is
OPTIMIZE tablename ZORDER BY (OrderColumn)
a. PARTITION BY
b. OPTIMIZE
c. ZORDER BY
Share this:

Every few months, an engineering team we respect announces they’ve gone multi-region. The blog post is confident. The architecture diagram is impressive. And somewhere in the write-up, the phrase “high availability” appears as justification, as if the words themselves close the argument. They usually haven’t done the math. Multi-region architecture has become a status symbol in distributed systems. Teams treat it […]

Executive Summary Crystal Reports is aging out. Talent is shrinking. The modern stack has moved on. Yet migration projects stall because they are manual, error-prone, and slow. This article introduces a multi-agent AI pipeline — six specialist agents, each evaluated before advancing — that automates the Crystal-to-Power BI conversion end to end. Six Agents, Six […]

Seattle – [Mar23, 2026] – CloudIQ Technologies Inc today announced it has earned the AI Apps on Microsoft Azure specialization, a validation of a solution partner’s deep knowledge, extensive experience, and proven expertise in designing, developing, and deploying AI-powered applications on Microsoft Azure. Only partners that meet stringent criteria around customer success and staff skilling, […]
Partner with CloudIQ to achieve immediate gains while building a strong foundation for long-term, transformative success.