Monday, March 26, 2012

Normalization question

Hello,
I have a question about the benefits/gain or correctness
according to RDBMS convention of normalizing data tables.
My table in question is a single table that contains about
5,000,000 records that are queried to generate reports.
It does not feed any other tables. The un-normalized
table (tbl1) looks like this:
idNum Area Date Metric Value
1 A 1/1/2005 Q 3
1 A 1/1/2005 R 0
1 A 1/1/2005 S 17
1 A 1/1/2005 T 9
2 B 1/1/2005 Q 11
2 B 1/1/2005 R 8
2 B 1/1/2005 S 2
2 B 1/1/2005 T 30
...
10 A 1/2/2005 Q 3
10 A 1/2/2005 R 0
10 A 1/2/2005 S 17
10 A 1/2/2005 T 9
11 B 1/2/2005 Q 20
11 B 1/2/2005 R 22
11 B 1/2/2005 S 0
11 B 1/2/2005 T 17
...
So in this table I am capturing 4 metrics, Q, R, S, T for
each Area (A, B, ...) for each day. So for each Metric, I
am repeating the idNum, Area, and Date in each row.
In the normalized version, tbl1 would contain one
record/row for each idNum, Area, Date, and tbl2 would
contain the 4 rows of idNums, Metrics for each idNum,
Area, Date in tbl1
tbl1 tbl2
idNum Area Date idNum Metric Value
1 A 1/1/2005 1 Q 3
1 R 0
1 S 17
1 T 9
tbl1 would now shrink from 5,000,000 rows to about 50,000
(there's actually like 20 metrics for each area) but tbl2
would still have the 5,000,000 rows. My server has
300gigs drivespace, two 3gig cpu's, 8gigs ram. With the
normalized tables, I have 2 tables to manage now and have
to add a join for all the queries. I don't see the
benefit of normalizing my table for this scenario.
However, there may be concepts of RDBMS that I don't
understand where normalization is the correct way to go.
Any advice on this would be greatly appreciated.
Thanks,
Ron"Ron" <anonymous@.discussions.microsoft.com> wrote in message
news:1eec01c515d5$d62407b0$a601280a@.phx.gbl...
> idNum Area Date Metric Value
> 1 A 1/1/2005 Q 3
> 1 A 1/1/2005 R 0
> 1 A 1/1/2005 S 17
> 1 A 1/1/2005 T 9
> 2 B 1/1/2005 Q 11
> 2 B 1/1/2005 R 8
> 2 B 1/1/2005 S 2
> 2 B 1/1/2005 T 30
The main practical benefit of normalization (IMO) is elimination of
duplicate data. This has two sub-benefits: 1) reduction of storage space,
and 2) ease of updating. Depending on your circumstances, either or both of
these sub-benefits may not apply.|||Thanks for your reply. The main thing I was checking for
was if there were some concept I wasn't aware of. I can
see the reduced storage. And I suppose it is easier to
search through 50,000 rows vs 5,000,000 rows for stuff.
The conclusion I come to is that you can get more
performance with Normalization - or - it is easier to
write code for a single table but less performance. At my
place, the coding seems to change more than the data. I
guess the bottom line will be who is going to do more
work - me or the computer.

>--Original Message--
>"Ron" <anonymous@.discussions.microsoft.com> wrote in
message
>news:1eec01c515d5$d62407b0$a601280a@.phx.gbl...
>The main practical benefit of normalization (IMO) is
elimination of
>duplicate data. This has two sub-benefits: 1) reduction
of storage space,
>and 2) ease of updating. Depending on your circumstances,
either or both of
>these sub-benefits may not apply.
>
>.
>|||Normalizing the table will prevent update anomalies and may also increase
performance, despite the joins. Querying a table with 50,000 records with a
n
index takes roughly 16 compares, whereas querying 5,000,000 records with an
index takes roughly 23 compares. In addition, smaller records mean more
records per data page, which should reduce disk reads. There is no doubt
that reading the entire table will perform slower with a join, but it has
been my experience that most queries retrieve only a small fraction of the
data, so an inner loop or merge join (with a clustered index on common) fiel
d
will actually perform better with two tables because of the reduced number o
f
disk reads required to retrieve the data.
I make it a habit to fully normalize a database and then denormalize only to
simplify joins of many tables, and then only when there is a perceived
performance problem that cannot be fixed with a judiciously placed index.
"Ron" wrote:

> Hello,
> I have a question about the benefits/gain or correctness
> according to RDBMS convention of normalizing data tables.
> My table in question is a single table that contains about
> 5,000,000 records that are queried to generate reports.
> It does not feed any other tables. The un-normalized
> table (tbl1) looks like this:
> idNum Area Date Metric Value
> 1 A 1/1/2005 Q 3
> 1 A 1/1/2005 R 0
> 1 A 1/1/2005 S 17
> 1 A 1/1/2005 T 9
> 2 B 1/1/2005 Q 11
> 2 B 1/1/2005 R 8
> 2 B 1/1/2005 S 2
> 2 B 1/1/2005 T 30
> ...
> 10 A 1/2/2005 Q 3
> 10 A 1/2/2005 R 0
> 10 A 1/2/2005 S 17
> 10 A 1/2/2005 T 9
> 11 B 1/2/2005 Q 20
> 11 B 1/2/2005 R 22
> 11 B 1/2/2005 S 0
> 11 B 1/2/2005 T 17
> ...
> So in this table I am capturing 4 metrics, Q, R, S, T for
> each Area (A, B, ...) for each day. So for each Metric, I
> am repeating the idNum, Area, and Date in each row.
> In the normalized version, tbl1 would contain one
> record/row for each idNum, Area, Date, and tbl2 would
> contain the 4 rows of idNums, Metrics for each idNum,
> Area, Date in tbl1
> tbl1 tbl2
> idNum Area Date idNum Metric Value
> 1 A 1/1/2005 1 Q 3
> 1 R 0
> 1 S 17
> 1 T 9
> tbl1 would now shrink from 5,000,000 rows to about 50,000
> (there's actually like 20 metrics for each area) but tbl2
> would still have the 5,000,000 rows. My server has
> 300gigs drivespace, two 3gig cpu's, 8gigs ram. With the
> normalized tables, I have 2 tables to manage now and have
> to add a join for all the queries. I don't see the
> benefit of normalizing my table for this scenario.
> However, there may be concepts of RDBMS that I don't
> understand where normalization is the correct way to go.
> Any advice on this would be greatly appreciated.
> Thanks,
> Ron
>|||>> So in this table I am capturing 4 metrics, Q, R, S, T for each Area
(A, B, ...) for each day. So for each Metric, I am repeating the
idNum, Area, and Date in each row. <<
1) id_num is redundant.
2) DATE is a reserved word.
3) Since the metrics are attribute of a reading, they need to be in
columns, not expressed as values. A normalized table would look like
this:
CREATE TABLE Readings
(area_code CHAR(1) NOT NULL,
collection_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
Q_value INTEGER DEFAULT 0 NOT NULL,
R_value INTEGER DEFAULT 0 NOT NULL,
S_value INTEGER DEFAULT 0 NOT NULL,
T_value INTEGER DEFAULT 0 NOT NULL,
PRIMARY KEY (area_code, collection_date ));
Obviously you will need to add CHECK() constraints and the right
DEFAULT values.

No comments:

Post a Comment