Chalky
I see a lot of tables that look something like this:
CREATE TABLE dbo.orders
(
order_id int NOT NULL PRIMARY KEY
DEFAULT (NEXT VALUE FOR dbo.order_id),
order_date datetime NOT NULL,
order_ship_date datetime NOT NULL,
order_total money NOT NULL,
order_tax money NOT NULL,
customer_id int NOT NULL
DEFAULT (NEXT VALUE FOR dbo.customer_id),
customer_fullname nvarchar(250),
customer_street nvarchar(250),
customer_street_2 nvarchar(250),
customer_city nvarchar(250),
customer_state nvarchar(250),
customer_zip nvarchar(250),
customer_country nvarchar(250)
);
Looking at the design, there are two big problems:
- There are “order” columns that are going to get a lot of inserts and updates
- You’re going to be storing the same customer information over and over again
The more related, but not independent, data you store in the same table, the harder it becomes to effectively index that table.
A while back, I blogged about Tables Within Tables, but uh… surprisingly, the problem still exists! Usually when I blog about something, the problem disappears. Hm.
Better But Not Best
If we follow some practical guidance and put named columns into their own table, we end up with this:
CREATE TABLE dbo.orders
(
order_id int NOT NULL PRIMARY KEY
DEFAULT (NEXT VALUE FOR dbo.order_id),
order_date datetime NOT NULL,
order_ship_date datetime NOT NULL,
order_total money NOT NULL,
order_tax money NOT NULL,
customer_id int NOT NULL
);
CREATE TABLE dbo.customers
(
customer_id int NOT NULL PRIMARY KEY
DEFAULT (NEXT VALUE FOR dbo.customer_id),
customer_fullname nvarchar(250),
customer_street nvarchar(250),
customer_street_2 nvarchar(250),
customer_city nvarchar(250),
customer_state nvarchar(250),
customer_zip nvarchar(250),
customer_country nvarchar(250)
);
This is a better scenario, because we only store customer information once, and inserts/updates to order information don’t impact people working with customer data.
But this still isn’t great — what if a customer wants to send an order to a different address?
If we wanted to store everything in this table, we’d be breaking other practical rules: we’d have to have multiple rows for users, or we’d have to add columns columns to the table to deal with multiple addresses. That’s a mess both for people who don’t use all those extra columns, and for people who might have half a dozen addresses they send to.
Getting There
A better way to phrase the customer table might be like this:
CREATE TABLE dbo.customers
(
customer_id int NOT NULL PRIMARY KEY
DEFAULT (NEXT VALUE FOR dbo.customer_id),
default_fullname nvarchar(250),
default_street nvarchar(250),
default_street_2 nvarchar(250),
default_city nvarchar(250),
default_state nvarchar(250),
default_zip nvarchar(250),
default_country nvarchar(250)
);
Most of the time, people are going to send stuff to one address — call it home if you want. It’s probably also their billing address, so it makes sense for it to be the default, and to have it be the first choice.
Then we’ll have a table of EAV data that looks like this:
CREATE TABLE dbo.customers_address_book
(
address_id int NOT NULL
DEFAULT (NEXT VALUE FOR dbo.address_id),
customer_id int NOT NULL,
address_type tinyint,
customer_fullname nvarchar(250),
customer_street nvarchar(250),
customer_street_2 nvarchar(250),
customer_city nvarchar(250),
customer_state nvarchar(250),
customer_zip nvarchar(250),
customer_country nvarchar(250),
CONSTRAINT pk_cab_id PRIMARY KEY (customer_id, address_id)
);
In a table like this, whenever a customer ships to a non-default address it gets stored off here. Now customers can have as many addresses as they want to choose from without us having to have an extra bloated table of default information plus non-default information.
Because of the way this data is modeled, we don’t need to keep adding columns to accommodate multiple addresses. We just tack rows on, and since this data isn’t likely to get updated the insert/select pattern should end up with minimal blocking.
Tomato Sauce
I know, horrifying. You might have to write a join. You poor, downtrodden developer.
Of course, this makes the most sense when you’re dealing with OLTP workloads. And sure, a lot of these columns probably don’t need to be as long as they are, but that’s a totally different post.
When you’re dealing with reporting data, de-normalizing is generally preferred. Though if you’re doing serious reporting and using column store indexes, I’d probably wanna keep the strings out as much as possible, and just key back to them in other tables. Yuck.
Have I ever mentioned that strings in databases were a mistake?
Thanks for reading!
Going Further
If this is the kind of SQL Server stuff you love learning about, you’ll love my training. I’m offering a 75% discount to my blog readers if you click from here. I’m also available for consulting if you just don’t have time for that and need to solve performance problems quickly.