March | 2017 | Avalanche SQL Blog

This problem came up in real life today. A client wanted a report of accounts, in 10 groups of credit score ranges, where approximately 1/10 of the accounts would be in each group. The exact ranges were to be figured out, based on an approximately even distribution of accounts in each group.

To do this manually is a simple task for a human, in Excel–dump the data into Excel, sort it, and scroll to approx 1/10 of the list. Once there, see what the credit score is, and manually determine where to draw the line.

e.g. Here is a table I populated with random data (script included). This is a Select top 100, sorted by Credit Score. The desired results would have 10 rows, summary, which are ranges of credit scores with the number of accounts in each range.

Each group would be approximately 10 accounts per group. No credit score value should be in multiple groups.

Take a look down the list at the 10th row.

Row#	Account	BillingName	CreditScore	InitialBalance	OpenedDate
1	31603	Jack Simpson	10	8844.92	2017-03-16
2	31035	Jack Brown	30	1138.19	2017-03-16
3	62276	John Smith	30	5465.08	2017-03-16
4	60812	Mary Smith	60	1851.96	2017-03-16
5	63958	Rachel Brown	60	2302.57	2017-03-16
6	64241	Mary Taylor	60	7638.2	2017-03-16
7	63387	Bill Taylor	70	7510.62	2017-03-16
8	61690	Bob Elliot	80	5725.46	2017-03-16
9	28780	Jack Morgan	90	10155.96	2017-03-16
10	28984	Ann Smith	90	7406.19	2017-03-16
11	30053	Bill Brown	90	2488.14	2017-03-16
12	61733	Bob Brown	90	11745.65	2017-03-16
13	62854	Bill Elliot	90	7425.37	2017-03-16
14	64583	Bill Smith	100	11920.82	2017-03-16
15	30179	John Simpson	110	1680.3	2017-03-16

The credit score on the 10th row is ’90’. That is the same value as rows 9 and 11 and 12 and 13. The first group should either contain all the ’90’ score accounts, or none of them. As a human, I would base this decision on how far into the list of ’90’ value accounts we are, at the 10th row. In this example, row 10 is only the 2nd of 5 ’90’ values, I would put all the ’90’ values into the 2nd group of 10.

So, how do we do this in SQL? I am not sure. But, this is how I solved it, using the LAG function in SQL Server 2012:

;WITH acs as
(
 select a.account, creditscore,
 ROW_NUMBER() OVER(ORDER BY creditscore) AS [ROW_NUMBER],
 NTILE(10) OVER(ORDER BY creditscore) AS [NTILE]
 from accounts (nolock) a
),
acs2 as
(
 select acs.[NTILE], min( CreditScore) as MinScore, 
  max(CreditScore) as MaxScore, count(*) as NumInNTILE
 from acs
 group by [NTILE]
),
acs3 as
(
 select *, ROW_NUMBER() OVER 
      (PARTITION BY CreditScore Order BY [row_number] ) as rowInScore
 from acs
)
,
tilesAndCounts AS
(
 SELECT CreditScore, MAX( rowInScore) as MaxRowInScore
 FROM acs3
 group by CreditScore
)
,
acs4 AS (
 
 select acs.Account, acs.CreditScore, acs.[ROW_NUMBER], acs.[NTILE]
 ,
 CASE WHEN LAG( acs.[NTILE], 1) OVER (ORDER BY acs.[ROW_NUMBER]) <> 
      acs.[NTILE] AND 
     LAG( acs.CreditScore, 1) OVER (ORDER BY acs.[ROW_NUMBER]) = 
         acs.CreditScore
     AND ( rowInScore / MaxRowInScore < 0.5) THEN acs.[NTILE] - 1
 WHEN LAG( acs.[NTILE], 1) OVER (ORDER BY acs.[ROW_NUMBER]) <>  
     acs.[NTILE] AND 
    LAG( acs.CreditScore, 1) OVER (ORDER BY acs.[ROW_NUMBER]) = 
        acs.CreditScore
 AND ( rowInScore / MaxRowInScore >= 0.5) THEN acs.[NTILE]
 END AS NTileAdjusted
 from acs
 inner join acs3 on acs.[ROW_NUMBER] = acs3.[ROW_NUMBER]
 inner join tilesAndCounts ON acs.CreditScore = tilesAndCounts.CreditScore
),
acs5 as (
 Select CreditScore, MIN( COALESCE( NTileAdjusted, [NTILE]) ) AS MinNTile
 from acs4
 group by CreditScore
)
, acs6 as (
 Select acs4.*, MinNTile
 from acs4 inner join acs5 ON acs4.CreditScore = acs5.CreditScore
)
select MinNTile, min(CreditScore) as minScore, 
    max(CreditScore) as maxscore, count(*) AS NumAccounts
from acs6
group by MinNTile
order by MinNTile

The results show 10 groups, with approximately 10,000 accounts in each group, based on ranges of credit scores:

MinNTile	minScore	maxscore	NumAccounts
1	0	99	10,010
2	100	199	10,077
3	200	299	9,964
4	300	400	10,043
5	401	500	9,959
6	501	600	10,019
7	601	701	10,034
8	702	801	9,920
9	802	900	10,032
10	901	998	9,942

Because I used random data, you can see the numbers are spread pretty evenly. In real life, credit scores were not so evenly spread 😉

Here is the Create Table statement:

CREATE TABLE Accounts
(
 Account INT IDENTITY(1,1) NOT NULL
 CONSTRAINT PK_Accounts PRIMARY KEY,
 BillingName VARCHAR(100) NULL,
 CreditScore SMALLINT NOT NULL
 CONSTRAINT DF_Accounts_CreditScore DEFAULT(0),
 InitialBalance DECIMAL(18,2) NOT NULL
 CONSTRAINT DF_Accounts_InitialBalance DEFAULT(0),
 NetPaid DECIMAL(18,2) NOT NULL
 CONSTRAINT DF_Accounts_NetPaid DEFAULT(0),
 OpenedDate DATETIME NOT NULL
 CONSTRAINT DF_Accounts_OpenedDate DEFAULT(GETDATE())
)

GO

I used this code to generate the 100,000 random accounts data:

INSERT INTO Accounts ( OpenedDate, CreditScore, BillingName, InitialBalance)
SELECT TOP 100000 
DATEADD(DAY,-( dbo.RANDBETWEEN(0,90,RAND(CHECKSUM(NEWID()))) ),getdate() ),
cast( RAND(CHECKSUM(NEWID())) * 999 as smallint), 
CASE
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 0 THEN 'John'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 1 THEN 'Jack'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 2 THEN 'Bill'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 3 THEN 'Mary'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 4 THEN 'Kate'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 5 THEN 'Matt'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 6 THEN 'Rachel'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 7 THEN 'Tom'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 8 THEN 'Ann'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 9 THEN 'Andrew'
 ELSE 'Bob' END + ' ' +
 CASE
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 0 THEN 'Smith'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 1 THEN 'Morgan'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 2 THEN 'Simpson'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 3 THEN 'Walker'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 4 THEN 'Bauer'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 5 THEN 'Taylor'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 6 THEN 'Morris'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 7 THEN 'Elliot'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 8 THEN 'White'
 WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 9 THEN 'Davis'
 ELSE 'Brown' END 
,
cast( RAND(CHECKSUM(NEWID())) * 12000 as decimal(18,2))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b

I used this from AndreFeijo from here: http://stackoverflow.com/questions/15608438/how-can-i-populate-my-databases-all-tables-with-random-data

I really like the use of TOP n, with the CROSS JOIN sys.all_objects to itself, a great way to generate lots of rows, when you don’t want or need a table to query.

-- Create the RANDBETWEEN function
-- Usage: SELECT dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID())))
CREATE FUNCTION dbo.RANDBETWEEN(@minval TINYINT, @maxval TINYINT, @random NUMERIC(18,10))
RETURNS TINYINT
AS
BEGIN
 RETURN (SELECT CAST(((@maxval + 1) - @minval) * @random + @minval AS TINYINT))
END
GO

…

Avalanche SQL Blog

Avalanche SQL Blog

Mike's new technology blog

Month: March 2017

Deciles – Balanced, approximate groupings in SQL