This problem came up in real life today. A client wanted a report of accounts, in 10 groups of credit score ranges, where approximately 1/10 of the accounts would be in each group. The exact ranges were to be figured out, based on an approximately even distribution of accounts in each group.
To do this manually is a simple task for a human, in Excel–dump the data into Excel, sort it, and scroll to approx 1/10 of the list. Once there, see what the credit score is, and manually determine where to draw the line.
e.g. Here is a table I populated with random data (script included). This is a Select top 100, sorted by Credit Score. The desired results would have 10 rows, summary, which are ranges of credit scores with the number of accounts in each range.
Each group would be approximately 10 accounts per group. No credit score value should be in multiple groups.
Take a look down the list at the 10th row.
| Row# | Account | BillingName | CreditScore | InitialBalance | OpenedDate |
| 1 | 31603 | Jack Simpson | 10 | 8844.92 | 2017-03-16 |
| 2 | 31035 | Jack Brown | 30 | 1138.19 | 2017-03-16 |
| 3 | 62276 | John Smith | 30 | 5465.08 | 2017-03-16 |
| 4 | 60812 | Mary Smith | 60 | 1851.96 | 2017-03-16 |
| 5 | 63958 | Rachel Brown | 60 | 2302.57 | 2017-03-16 |
| 6 | 64241 | Mary Taylor | 60 | 7638.2 | 2017-03-16 |
| 7 | 63387 | Bill Taylor | 70 | 7510.62 | 2017-03-16 |
| 8 | 61690 | Bob Elliot | 80 | 5725.46 | 2017-03-16 |
| 9 | 28780 | Jack Morgan | 90 | 10155.96 | 2017-03-16 |
| 10 | 28984 | Ann Smith | 90 | 7406.19 | 2017-03-16 |
| 11 | 30053 | Bill Brown | 90 | 2488.14 | 2017-03-16 |
| 12 | 61733 | Bob Brown | 90 | 11745.65 | 2017-03-16 |
| 13 | 62854 | Bill Elliot | 90 | 7425.37 | 2017-03-16 |
| 14 | 64583 | Bill Smith | 100 | 11920.82 | 2017-03-16 |
| 15 | 30179 | John Simpson | 110 | 1680.3 | 2017-03-16 |
The credit score on the 10th row is ’90’. That is the same value as rows 9 and 11 and 12 and 13. The first group should either contain all the ’90’ score accounts, or none of them. As a human, I would base this decision on how far into the list of ’90’ value accounts we are, at the 10th row. In this example, row 10 is only the 2nd of 5 ’90’ values, I would put all the ’90’ values into the 2nd group of 10.
So, how do we do this in SQL? I am not sure. But, this is how I solved it, using the LAG function in SQL Server 2012:
;WITH acs as
(
select a.account, creditscore,
ROW_NUMBER() OVER(ORDER BY creditscore) AS [ROW_NUMBER],
NTILE(10) OVER(ORDER BY creditscore) AS [NTILE]
from accounts (nolock) a
),
acs2 as
(
select acs.[NTILE], min( CreditScore) as MinScore,
max(CreditScore) as MaxScore, count(*) as NumInNTILE
from acs
group by [NTILE]
),
acs3 as
(
select *, ROW_NUMBER() OVER
(PARTITION BY CreditScore Order BY [row_number] ) as rowInScore
from acs
)
,
tilesAndCounts AS
(
SELECT CreditScore, MAX( rowInScore) as MaxRowInScore
FROM acs3
group by CreditScore
)
,
acs4 AS (
select acs.Account, acs.CreditScore, acs.[ROW_NUMBER], acs.[NTILE]
,
CASE WHEN LAG( acs.[NTILE], 1) OVER (ORDER BY acs.[ROW_NUMBER]) <>
acs.[NTILE] AND
LAG( acs.CreditScore, 1) OVER (ORDER BY acs.[ROW_NUMBER]) =
acs.CreditScore
AND ( rowInScore / MaxRowInScore < 0.5) THEN acs.[NTILE] - 1
WHEN LAG( acs.[NTILE], 1) OVER (ORDER BY acs.[ROW_NUMBER]) <>
acs.[NTILE] AND
LAG( acs.CreditScore, 1) OVER (ORDER BY acs.[ROW_NUMBER]) =
acs.CreditScore
AND ( rowInScore / MaxRowInScore >= 0.5) THEN acs.[NTILE]
END AS NTileAdjusted
from acs
inner join acs3 on acs.[ROW_NUMBER] = acs3.[ROW_NUMBER]
inner join tilesAndCounts ON acs.CreditScore = tilesAndCounts.CreditScore
),
acs5 as (
Select CreditScore, MIN( COALESCE( NTileAdjusted, [NTILE]) ) AS MinNTile
from acs4
group by CreditScore
)
, acs6 as (
Select acs4.*, MinNTile
from acs4 inner join acs5 ON acs4.CreditScore = acs5.CreditScore
)
select MinNTile, min(CreditScore) as minScore,
max(CreditScore) as maxscore, count(*) AS NumAccounts
from acs6
group by MinNTile
order by MinNTile
The results show 10 groups, with approximately 10,000 accounts in each group, based on ranges of credit scores:
| MinNTile | minScore | maxscore | NumAccounts |
| 1 | 0 | 99 | 10,010 |
| 2 | 100 | 199 | 10,077 |
| 3 | 200 | 299 | 9,964 |
| 4 | 300 | 400 | 10,043 |
| 5 | 401 | 500 | 9,959 |
| 6 | 501 | 600 | 10,019 |
| 7 | 601 | 701 | 10,034 |
| 8 | 702 | 801 | 9,920 |
| 9 | 802 | 900 | 10,032 |
| 10 | 901 | 998 | 9,942 |
Because I used random data, you can see the numbers are spread pretty evenly. In real life, credit scores were not so evenly spread 😉
Here is the Create Table statement:
CREATE TABLE Accounts ( Account INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_Accounts PRIMARY KEY, BillingName VARCHAR(100) NULL, CreditScore SMALLINT NOT NULL CONSTRAINT DF_Accounts_CreditScore DEFAULT(0), InitialBalance DECIMAL(18,2) NOT NULL CONSTRAINT DF_Accounts_InitialBalance DEFAULT(0), NetPaid DECIMAL(18,2) NOT NULL CONSTRAINT DF_Accounts_NetPaid DEFAULT(0), OpenedDate DATETIME NOT NULL CONSTRAINT DF_Accounts_OpenedDate DEFAULT(GETDATE()) ) GO
I used this code to generate the 100,000 random accounts data:
INSERT INTO Accounts ( OpenedDate, CreditScore, BillingName, InitialBalance) SELECT TOP 100000 DATEADD(DAY,-( dbo.RANDBETWEEN(0,90,RAND(CHECKSUM(NEWID()))) ),getdate() ), cast( RAND(CHECKSUM(NEWID())) * 999 as smallint), CASE WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 0 THEN 'John' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 1 THEN 'Jack' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 2 THEN 'Bill' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 3 THEN 'Mary' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 4 THEN 'Kate' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 5 THEN 'Matt' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 6 THEN 'Rachel' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 7 THEN 'Tom' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 8 THEN 'Ann' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 9 THEN 'Andrew' ELSE 'Bob' END + ' ' + CASE WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 0 THEN 'Smith' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 1 THEN 'Morgan' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 2 THEN 'Simpson' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 3 THEN 'Walker' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 4 THEN 'Bauer' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 5 THEN 'Taylor' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 6 THEN 'Morris' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 7 THEN 'Elliot' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 8 THEN 'White' WHEN dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) = 9 THEN 'Davis' ELSE 'Brown' END , cast( RAND(CHECKSUM(NEWID())) * 12000 as decimal(18,2)) FROM sys.all_objects a CROSS JOIN sys.all_objects b
I used this from AndreFeijo from here: http://stackoverflow.com/questions/15608438/how-can-i-populate-my-databases-all-tables-with-random-data
I really like the use of TOP n, with the CROSS JOIN sys.all_objects to itself, a great way to generate lots of rows, when you don’t want or need a table to query.
-- Create the RANDBETWEEN function -- Usage: SELECT dbo.RANDBETWEEN(0,9,RAND(CHECKSUM(NEWID()))) CREATE FUNCTION dbo.RANDBETWEEN(@minval TINYINT, @maxval TINYINT, @random NUMERIC(18,10)) RETURNS TINYINT AS BEGIN RETURN (SELECT CAST(((@maxval + 1) - @minval) * @random + @minval AS TINYINT)) END GO
.
…