SQL, Uncategorized

Deciles – Balanced, approximate groupings in SQL

This problem came up in real life today. A client wanted a report of accounts, in 10 groups of credit score ranges, where approximately 1/10 of the accounts would be in each group.   The exact ranges were to be figured out, based on an approximately even distribution of accounts in each group.

To do this manually is a simple task for a human, in Excel–dump the data into Excel, sort it, and scroll to approx 1/10 of the list. Once there, see what the credit score is, and manually determine where to draw the line.

e.g.   Here is a table I populated with random data (script included).  This is a Select top 100, sorted by Credit Score.    The desired results would have 10 rows, summary, which are ranges of credit scores with the number of accounts in each range.

Each group would be approximately 10 accounts per group.   No credit score value should be in multiple groups.

Take a look down the list at the 10th row.

 

Row# Account BillingName CreditScore InitialBalance OpenedDate
1 31603 Jack Simpson 10 8844.92 2017-03-16
2 31035 Jack Brown 30 1138.19 2017-03-16
3 62276 John Smith 30 5465.08 2017-03-16
4 60812 Mary Smith 60 1851.96 2017-03-16
5 63958 Rachel Brown 60 2302.57 2017-03-16
6 64241 Mary Taylor 60 7638.2 2017-03-16
7 63387 Bill Taylor 70 7510.62 2017-03-16
8 61690 Bob Elliot 80 5725.46 2017-03-16
9 28780 Jack Morgan 90 10155.96 2017-03-16
10 28984 Ann Smith 90 7406.19 2017-03-16
11 30053 Bill Brown 90 2488.14 2017-03-16
12 61733 Bob Brown 90 11745.65 2017-03-16
13 62854 Bill Elliot 90 7425.37 2017-03-16
14 64583 Bill Smith 100 11920.82 2017-03-16
15 30179 John Simpson 110 1680.3 2017-03-16

The credit score on the 10th row is ’90’.  That is the same value as rows 9 and 11 and 12 and 13.   The first group should either contain all the ’90’ score accounts, or none of them.  As a human, I would base this decision on how far into the list of ’90’ value accounts we are, at the 10th row.   In this example, row 10 is only the 2nd of 5 ’90’ values, I would put all the ’90’ values into the 2nd group of 10.

So, how do we do this in SQL?    I am not sure.    But, this is how I solved it, using the LAG function in SQL Server 2012:

;WITH acs as
(
 select a.account, creditscore,
 ROW_NUMBER() OVER(ORDER BY creditscore) AS [ROW_NUMBER],
 NTILE(10) OVER(ORDER BY creditscore) AS [NTILE]
 from accounts (nolock) a
),
acs2 as
(
 select acs.[NTILE], min( CreditScore) as MinScore, 
  max(CreditScore) as MaxScore, count(*) as NumInNTILE
 from acs
 group by [NTILE]
),
acs3 as
(
 select *, ROW_NUMBER() OVER 
      (PARTITION BY CreditScore Order BY [row_number] ) as rowInScore
 from acs
)
,
tilesAndCounts AS
(
 SELECT CreditScore, MAX( rowInScore) as MaxRowInScore
 FROM acs3
 group by CreditScore
)
,
acs4 AS (
 
 select acs.Account, acs.CreditScore, acs.[ROW_NUMBER], acs.[NTILE]
 ,
 CASE WHEN LAG( acs.[NTILE], 1) OVER (ORDER BY acs.[ROW_NUMBER]) <> 
      acs.[NTILE] AND 
     LAG( acs.CreditScore, 1) OVER (ORDER BY acs.[ROW_NUMBER]) = 
         acs.CreditScore
     AND ( rowInScore / MaxRowInScore < 0.5) THEN acs.[NTILE] - 1
 WHEN LAG( acs.[NTILE], 1) OVER (ORDER BY acs.[ROW_NUMBER]) <>  
     acs.[NTILE] AND 
    LAG( acs.CreditScore, 1) OVER (ORDER BY acs.[ROW_NUMBER]) = 
        acs.CreditScore
 AND ( rowInScore / MaxRowInScore >= 0.5) THEN acs.[NTILE]
 END AS NTileAdjusted
 from acs
 inner join acs3 on acs.[ROW_NUMBER] = acs3.[ROW_NUMBER]
 inner join tilesAndCounts ON acs.CreditScore = tilesAndCounts.CreditScore
),
acs5 as (
 Select CreditScore, MIN( COALESCE( NTileAdjusted, [NTILE]) ) AS MinNTile
 from acs4
 group by CreditScore
)
, acs6 as (
 Select acs4.*, MinNTile
 from acs4 inner join acs5 ON acs4.CreditScore = acs5.CreditScore
)
select MinNTile, min(CreditScore) as minScore, 
    max(CreditScore) as maxscore, count(*) AS NumAccounts
from acs6
group by MinNTile
order by MinNTile

The results show 10 groups, with approximately 10,000 accounts in each group, based on ranges of credit scores:

MinNTile minScore maxscore NumAccounts
1 0 99                     10,010
2 100 199                     10,077
3 200 299                       9,964
4 300 400                     10,043
5 401 500                       9,959
6 501 600                     10,019
7 601 701                     10,034
8 702 801                       9,920
9 802 900                     10,032
10 901 998                       9,942

Because I used random data, you can see the numbers are spread pretty evenly.   In real life, credit scores were not so evenly spread 😉

 

Here is the Create Table statement:

CREATE TABLE Accounts
(
 Account INT IDENTITY(1,1) NOT NULL
 CONSTRAINT PK_Accounts PRIMARY KEY,
 BillingName VARCHAR(100) NULL,
 CreditScore SMALLINT NOT NULL
 CONSTRAINT DF_Accounts_CreditScore DEFAULT(0),
 InitialBalance DECIMAL(18,2) NOT NULL
 CONSTRAINT DF_Accounts_InitialBalance DEFAULT(0),
 NetPaid DECIMAL(18,2) NOT NULL
 CONSTRAINT DF_Accounts_NetPaid DEFAULT(0),
 OpenedDate DATETIME NOT NULL
 CONSTRAINT DF_Accounts_OpenedDate DEFAULT(GETDATE())
)

GO

 

I used this code to generate the 100,000 random accounts data:

INSERT INTO Accounts ( OpenedDate, CreditScore, BillingName, InitialBalance)
SELECT TOP 100000 
DATEADD(DAY,-( dbo.RANDBETWEEN(0,90,RAND(CHECKSUM(NEWID()))) ),getdate() ),
cast( RAND(CHECKSUM(NEWID())) * 999 as smallint), 
CASE
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 0 THEN 'John'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 1 THEN 'Jack'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 2 THEN 'Bill'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 3 THEN 'Mary'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 4 THEN 'Kate'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 5 THEN 'Matt'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 6 THEN 'Rachel'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 7 THEN 'Tom'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 8 THEN 'Ann'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 9 THEN 'Andrew'
 ELSE 'Bob' END + ' ' +
 CASE
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 0 THEN 'Smith'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 1 THEN 'Morgan'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 2 THEN 'Simpson'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 3 THEN 'Walker'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 4 THEN 'Bauer'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 5 THEN 'Taylor'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 6 THEN 'Morris'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 7 THEN 'Elliot'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 8 THEN 'White'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 9 THEN 'Davis'
 ELSE 'Brown' END 
,
cast( RAND(CHECKSUM(NEWID())) * 12000 as decimal(18,2))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b

I used this from AndreFeijo from here: http://stackoverflow.com/questions/15608438/how-can-i-populate-my-databases-all-tables-with-random-data

I really like the use of TOP n, with the CROSS JOIN sys.all_objects to itself, a great way to generate lots of rows, when you don’t want or need a table to query.

-- Create the RANDBETWEEN function
-- Usage: SELECT dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID())))
CREATE FUNCTION dbo.RANDBETWEEN(@minval TINYINT, @maxval TINYINT, @random NUMERIC(18,10))
RETURNS TINYINT
AS
BEGIN
 RETURN (SELECT CAST(((@maxval + 1) - @minval) * @random + @minval AS TINYINT))
END
GO

 

.