Thursday, August 11, 2011

Statistical function in T-SQL

In my current project I need to calculate mean, mode, median and percentiles (95th percentile, 99th percentile) of a given data set.

In this article I've put together a summary of what I've discovered during my research.


The following code snippets are taken from the following blog - please visit it for more detailed information:
http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/calculating-mean-median-and-mode-with-sq


MEAN Calculation

Mean is another name for average so we can use AVG function to calculate mean.

MEDIAN Calculation

Median is middle point in the set

DECLARE @Temp TABLE(Id INT IDENTITY(1,1), DATA DECIMAL(10,5))


INSERT INTO @Temp VALUES(1)
INSERT INTO @Temp VALUES(2)
INSERT INTO @Temp VALUES(5)
INSERT INTO @Temp VALUES(5)
INSERT INTO @Temp VALUES(5)
INSERT INTO @Temp VALUES(6)
INSERT INTO @Temp VALUES(6)
INSERT INTO @Temp VALUES(6)
INSERT INTO @Temp VALUES(7)
INSERT INTO @Temp VALUES(9)
INSERT INTO @Temp VALUES(10)
INSERT INTO @Temp VALUES(NULL)


SELECT ((
        SELECT TOP 1 DATA
        FROM   (
                SELECT  TOP 50 PERCENT DATA
                FROM    @Temp
                WHERE   DATA IS NOT NULL
                ORDER BY DATA
                ) AS A
        ORDER BY DATA DESC) + 
        (
        SELECT TOP 1 DATA
        FROM   (
                SELECT  TOP 50 PERCENT DATA
                FROM    @Temp
                WHERE   DATA IS NOT NULL
                ORDER BY DATA DESC
                ) AS A
        ORDER BY DATA ASC)) / 2

--MODE Calculation


DECLARE @Temp TABLE(Id INT IDENTITY(1,1), DATA DECIMAL(10,5))

INSERT INTO @Temp VALUES(1)
INSERT INTO @Temp VALUES(2)
INSERT INTO @Temp VALUES(5)
INSERT INTO @Temp VALUES(5)
INSERT INTO @Temp VALUES(5)
INSERT INTO @Temp VALUES(6)
INSERT INTO @Temp VALUES(6)
INSERT INTO @Temp VALUES(6)
INSERT INTO @Temp VALUES(7)
INSERT INTO @Temp VALUES(9)
INSERT INTO @Temp VALUES(10)
INSERT INTO @Temp VALUES(NULL)


SELECT TOP 1 WITH ties DATA
FROM   @Temp
WHERE  DATA IS Not NULL
GROUP  BY DATA
ORDER  BY COUNT(*) DESC


Percentile Calculation


Please reference the following blog for details:



http://www.sqlteam.com/article/computing-percentiles-in-sql-server


-- function for floating point division
CREATE FUNCTION dbo.FDIV 
(@numerator float, 
@denominator float)
RETURNS float
AS
BEGIN
RETURN CASE WHEN @denominator = 0.0 
THEN 0.0
ELSE @numerator / @denominator
END
END
GO

-- function for linear interpolation
CREATE FUNCTION dbo.LERP 
(@value float, -- between low and high
@low float,
@high float,
@newlow float,
@newhigh float)
RETURNS float -- between newlow and newhigh
AS
BEGIN
  RETURN CASE 
      WHEN @value between @low and @high and @newlow <= @newhigh THEN @newlow + dbo.FDIV((@value-@low), (@high-@low)) * (@newhigh - @newlow)
      WHEN @value = @low and @newlow is not NULL THEN @newlow
      WHEN @value = @high and @newhigh is not NULL THEN @newhigh
      ELSE NULL
END
END
GO


Declare @TestScores table (StudentID int, Score int)
insert @TestScores (StudentID, Score) Values (1,  20)
insert @TestScores (StudentID, Score) Values (2,  03)
insert @TestScores (StudentID, Score) Values (3,  40)
insert @TestScores (StudentID, Score) Values (4,  45)
insert @TestScores (StudentID, Score) Values (5,  50)
insert @TestScores (StudentID, Score) Values (6,  20)
insert @TestScores (StudentID, Score) Values (7,  90)
insert @TestScores (StudentID, Score) Values (8,  20)
insert @TestScores (StudentID, Score) Values (9,  11)
insert @TestScores (StudentID, Score) Values (10, 30)


--Find the percentile rank of a given score 
-- The derived table makes one scan over the data values to compute some aggregates. The outer select interpolates 
-- between the pth percentile of the nearest samples below and above the given value.
-- The result is 25, 0.5. That means a score of 25 is at the 50th percentile, the median, of the distribution.

declare @val float
set @val = 25

select 
@val as val,
dbo.LERP(@val, scoreLT, scoreGE, 
dbo.FDIV(countLT-1,countMinus1), 
dbo.FDIV(countLT,countMinus1)) as percentrank
from (
select 
SUM(CASE WHEN Score < @val 
THEN 1 ELSE 0 END) as countLT,
count(*)-1 as countMinus1,
MAX(CASE WHEN Score < @val
THEN Score END) as scoreLT,
MIN(CASE WHEN Score >= @val
THEN Score END) as scoreGE
from @TestScores
) as x1

-- Find the percentile (the score) that characterizes a given percentage

declare @pp float
set @pp = .75

select 
@pp as factor, 
dbo.LERP(max(d), 0.0, 1.0, max(a.Score), max(b.Score)) as percentile
from
(
select floor(kf) as k, kf-floor(kf) as d
from (
select 1+@pp*(count(*)-1) as kf from @TestScores
) as x1
) as x2
join @TestScores a
on 
(select count(*) from @TestScores aa
where aa.Score < a.Score) < k
join @TestScores b
on 
(select count(*) from @TestScores bb
where bb.Score < b.Score) < k+1


Friday, July 15, 2011

SQL function to calculate number of work days

Found this code that uses CTE on the internet and modified to my own needs:


CREATE FUNCTION uf_get_num_of_work_days (@StartDate DATE, @EndDate DATE)
RETURNS INT
AS
BEGIN
DECLARE @workdays INT;


WITH DATE (Date1)
AS (
SELECT DATEADD(DAY, DATEDIFF(DAY, '19000101', @StartDate), '19000101')
UNION ALL
SELECT DATEADD(DAY, 1, Date1)
FROM DATE
WHERE Date1 < @EndDate
)
SELECT
@workdays = COUNT(*) --CONVERT(VARCHAR(15),d1.DATE1 ,110) as [Working Date], DATENAME(weekday, d1.Date1) [Working Day]
From
DATE d1
Where
(DATENAME(weekday, d1.Date1)) not in ('Saturday','Sunday')

RETURN @workdays
END

Thursday, July 7, 2011

Script to de-normalize Parent - Child Hierarchy

Parent Child data hierarchies are great for storing hierarchical data, however it becomes a major pain when you need to convert hierarchy structure into a flat view (e.g. to be used in reporting or SSAS cubes). While SSAS provide a build in support for parent-child dimensions, it only allows for one parent attribute per dimension.

I've looked into various techniques to convert parent-child into flat table and found a tool designed by Jon Burchel specifically for normalizing SSAS parent-child dimensions. The original tool is available on codeplex at http://pcdimnaturalize.codeplex.com

Base on this, I came up with my own version - a SQL Script that creates a SQL Server view for each lookup table in the database that contains parent-child hierarchy and converts it into a flat hierarchy with up to 3 levels.

The script assumes that lookup tables have "lu_" prefix. And it detects parent-child structure by searching for a column that starts with name "parent_".


Declare @Sql varchar(8000)


Declare @SrcTableName varchar(250), @AttributeName varchar(250)


Declare Cur Cursor For
Select distinct TABLE_NAME from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME like 'lu_%' and COLUMN_NAME like 'parent_%'
Open Cur


Fetch Next From Cur Into @SrcTableName


While @@FETCH_STATUS = 0
Begin


Set @AttributeName = SUBSTRING(@SrcTableName, 4, 250)


Set @Sql =
'
CREATE VIEW [dbo].[vw_naturalized_' + @SrcTableName + '] AS
WITH 
PCStructure(Level, [parent_' + @AttributeName + '_id], [' + @AttributeName + ' Name_KeyColumn], [' + @AttributeName + ' 01_KeyColumn], [' + @AttributeName + ' 02_KeyColumn], [' + @AttributeName + ' 03_KeyColumn])
AS
(
SELECT 
3 Level, [parent_' + @AttributeName + '_id], [' + @AttributeName + '_id], [' + @AttributeName + '_id] as [' + @AttributeName + ' 01_KeyColumn], [' + @AttributeName + '_id] as [' + @AttributeName + ' 02_KeyColumn], [' + @AttributeName + '_id] as [' + @AttributeName + ' 03_KeyColumn] 
FROM 
[dbo].[' + @SrcTableName + '] 
WHERE [parent_' + @AttributeName + '_id] IS NULL OR [parent_' + @AttributeName + '_id] = [' + @AttributeName + '_id] 

UNION ALL 

SELECT 
Level + 1, e.[parent_' + @AttributeName + '_id], e.[' + @AttributeName + '_id], 
CASE Level WHEN 2 THEN e.[' + @AttributeName + '_id] ELSE [' + @AttributeName + ' 01_KeyColumn] END AS [fetal_therapy 01_KeyColumn], 
CASE Level WHEN 2 THEN e.[' + @AttributeName + '_id] WHEN 3 THEN e.[' + @AttributeName + '_id] ELSE [' + @AttributeName + ' 02_KeyColumn] END AS [' + @AttributeName + ' 02_KeyColumn],
CASE Level WHEN 2 THEN e.[' + @AttributeName + '_id] WHEN 3 THEN e.[' + @AttributeName + '_id] WHEN 4 THEN e.[' + @AttributeName + '_id] ELSE [' + @AttributeName + ' 03_KeyColumn] END AS [' + @AttributeName + ' 03_KeyColumn] 
FROM [dbo].[' + @SrcTableName + '] e 
INNER JOIN PCStructure d ON e.[parent_' + @AttributeName + '_id] = d.[' + @AttributeName + ' Name_KeyColumn] AND e.[parent_' + @AttributeName + '_id] != e.[' + @AttributeName + '_id]
)


select 
Level4Subselect.*
from PCStructure a, 
(select [' + @AttributeName + '_id] [' + @AttributeName + ' 03_KeyColumn], [' + @AttributeName + '_name] [' + @AttributeName + ' 03_NameColumn], [' + @AttributeName + '_order_num] [' + @AttributeName + ' 03_' + @AttributeName + ' Order Num_KeyColumn], Level3Subselect.*
from [dbo].[' + @SrcTableName + '] b,
(select [' + @AttributeName + '_id] [' + @AttributeName + ' 02_KeyColumn], [' + @AttributeName + '_name] [' + @AttributeName + ' 02_NameColumn], [' + @AttributeName + '_order_num] [' + @AttributeName + ' 02_' + @AttributeName + ' Order Num_KeyColumn] , Level2Subselect.*
from [dbo].[' + @SrcTableName + '] b,
(select [' + @AttributeName + '_id] [' + @AttributeName + ' 01_KeyColumn], [' + @AttributeName + '_name] [' + @AttributeName + ' 01_NameColumn], [' + @AttributeName + '_order_num] [' + @AttributeName + ' 01_' + @AttributeName + ' Order Num_KeyColumn], CurrentMemberSubselect.* 
from [dbo].[' + @SrcTableName + '] b,
(select [' + @AttributeName + '_id] as [original_' + @AttributeName + '_id], [' + @AttributeName + '_name] [original_' + @AttributeName + '_name], [parent_' + @AttributeName + '_id] [original_parent_' + @AttributeName + '_id], [' + @AttributeName + '_order_num] [original_' + @AttributeName + '_order_num]
from [dbo].[' + @SrcTableName + '] b) CurrentMemberSubselect
) Level2Subselect
) Level3Subselect
) Level4Subselect
where
Level4Subselect.[' + @AttributeName + ' 03_KeyColumn] = a.[' + @AttributeName + ' 03_KeyColumn] and
Level4Subselect.[' + @AttributeName + ' 02_KeyColumn] = a.[' + @AttributeName + ' 02_KeyColumn] and
Level4Subselect.[' + @AttributeName + ' 01_KeyColumn] = a.[' + @AttributeName + ' 01_KeyColumn] and
Level4Subselect.[original_' + @AttributeName + '_id] = a.[' + @AttributeName + ' Name_KeyColumn]
GO
'
Print @Sql


Fetch Next From Cur Into @SrcTableName
end


close cur
deallocate cur 

First Blog

Hello there!

My name is Maxim Smirnov. I am an IT consultant working in Toronto, ON, Canada.

I specialize in Data Architecture, Database Design and Business Intelligence primarily focusing on products developed by Microsoft (MS SQL Server, SSAS - Analysis Services, SSRS - Reporting Services, Performance Point, etc).

You can find me on LinkedIn at http://www.linkedin.com/in/maximsmirnovtoronto

I am starting this blog to share tips and tricks I've learned on the job. Hope they'll help somebody same way other people blogs have helped me on many occasions.