When wearing my Registrar DNS Operator's hat, I have to worry about a single DNS source of truth hosting millions of zones. A Registrar sponsors domain names over multiple TLDs. It follows that all tests and simulations I come up with must match the actual namespace distribution: there are way more .com domains than .ve ones.
Generating random domain name labels for an in-database simulation is pretty straightforward, thanks to PostgreSQL's generate_series and string functions – a topic for another time. But generating random domain name TLDs must follow the expected distribution for the data set to «make sense». I normally use actual business distributions, that is, the Domains Under Management (DUM) our platform currently holds. For this example, I will use the «Real World»® distribution, skewed as it is.
So, the challenge here is to use pure SQL to generate a random list of TLDs that follows a particular frequency distribution. Why? First and foremost, Because I Can®. Secondly, because all my tests and simulations are «in database» thanks to pgTAP; it would be wasteful to do everything that follows outside the database using Haskell, or lesser languages. SQL is the right language for this task.
Not every TLD is equally popular
There are roughly 1600 TLDs currently delegated, and more coming out Real Soon Now®. You can purchase domain names on most of them. Given that .com, .net, and .org are the OGs, there are way more names registered there than in, say, .in, .cloud, or .locker. So, either you have your business' absolute frequency table like I do, or you download a reasonably up-to-date data set and load it. I downloaded a fairly recent summary of Domains Under Management (DUM) as provided by ICANN, and loaded the top 20 into the aptly named tld_dum table.
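If you want to play along, the table's shape is all you really need; a minimal sketch (the exact DDL and loading method are up to you):

create table tld_dum (
    tld text primary key  -- TLD label, without the leading dot
  , dum bigint not null   -- absolute frequency: Domains Under Management
);
-- load it however you prefer, e.g. with psql's
-- \copy tld_dum from 'tld_dum.csv' csv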
emhn=> select * from tld_dum;
  tld   |    dum
--------+-----------
 com    | 250376358
 cn     |  29891265
 tk     |  27486586
 de     |  24561352
 uk     |  21690698
 net    |  21550050
 org    |  17248197
 xyz    |  13132943
 ru     |  11531064
 top    |  10386174
 ga     |   9692899
 info   |   8854119
 icu    |   8071155
 cf     |   7969228
 ml     |   7950780
 nl     |   7407664
 br     |   7117537
 online |   6731924
 fr     |   6127045
 gq     |   5985001
(20 rows)
Let's figure out the proportion each TLD has over the total number of DUMs. The math is straightforward: divide the TLD's count by the total sum, and show it as a percentage. This can be written using window functions as
create view v_proportions as
select tld
     , (100.0 * dum) / (sum(dum) over ()) as percentage
from tld_dum;
Looking at the results
emhn=> select * from v_proportions limit 10;
 tld |     percentage
-----+---------------------
 com | 49.7013150290190881
 cn  |  5.9336080700594433
 tk  |  5.4562638452398355
 de  |  4.8755861098140426
 uk  |  4.3057428549117017
 net |  4.2778233236426931
 org |  3.4238778758000064
 xyz |  2.6069735278326520
 ru  |  2.2889902587519104
 top |  2.0617222410440498
(10 rows)
we now have a way to check whether or not our random selection matches our expectations: the proportions of our generated set must match these.
Throwing dice the right way
Sure, you could repeatedly use random() to produce random floating point numbers, scale them, and look where each value lands on the distribution. This is not only slow but also prone to all the precision errors associated with floating point arithmetic. Undesirable for my millions-of-domains tests.
You don't play AD&D with a continuous random number generator. You use discrete dice to compute proportions. Expectations are tabulated as integer values, making them extremely fast to check without rounding or precision errors.
Let's create a view v_throwing_dice that generates an infinite set of rows, each one being an independently generated dice throw. As usual, I'll focus on the interesting parts only, allowing your actual intelligence to fill in the blanks and make it work on your system.
For starters we need a dice that goes from 1 to the total DUM. Since random() generates a floating point number in [0..1), we only need to know the cardinality of our dice,

dice as (
  select sum(tld_dum.dum) as cardinality
  from tld_dum
)
Taking advantage of recursive CTEs, let's produce an infinite set of rows having a sequential counter and a value for each roll.

recursive thrower as (
    select 1 as n
         , round(random() * dice.cardinality) as value
    from dice
  union
    select n + 1
         , round(random() * dice.cardinality) as value
    from thrower, dice
)
Recursive CTEs work the same way recursion works for functions in programming languages: one or more base cases, and one or more recursive (or inductive) cases. Except they work on sets of rows, and you get to choose whether or not to remove duplicates. In this case, the lone base case is the first roll, having constant counter 1 and a value randomly generated in the desired range. On its own, this would make the resulting set have only one row. The inductive case uses the table itself (notice thrower in the FROM clause) to generate the next roll.
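For completeness, here is roughly how the fragments wrap into the view; a sketch, where the select list and formatting are mine:

create view v_throwing_dice as
with recursive
  dice as (
    select sum(tld_dum.dum) as cardinality
    from tld_dum
  ),
  thrower as (
      select 1 as n
           , round(random() * dice.cardinality) as value
      from dice
    union
      select n + 1
           , round(random() * dice.cardinality) as value
      from thrower, dice
  )
select n, value
from thrower;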
We can check it is working by querying the view with a LIMIT.
emhn=> select * from v_throwing_dice limit 10;
 n  |   value
----+-----------
  1 | 478851732
  2 | 427883508
  3 | 294048689
  4 |   1379955
  5 | 280591245
  6 | 414892780
  7 |   5061995
  8 | 277754851
  9 | 386606479
 10 | 209763117
(10 rows)
And if you run the above query more than once, you will get different values, as each use of random() is independent. The value will always be between 0 and the total DUM minus one.
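A quick way to convince yourself of the range (a sketch, sampling a hundred thousand rolls):

select min(value), max(value)
from (select value from v_throwing_dice limit 100000) as sample;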
Great Expectations
We have built a way to generate an arbitrarily long set of rolls, with values between 0 and the total DUM minus one. Continuing with our AD&D analogy, we need to come up with the table that, given our roll, will let us know which TLD label to use.
If we had three possible outcomes, one with a 15% chance of happening, another with a 40% chance, and the last one with a 45% chance, and we wanted to use a 100-sided dice to decide, we'd probably have a table stating
Roll 1d100
1-15 Fail
16-55 Fail miserably
56-00 Fail miserably while the DM cackles
Therefore, we need to «stack» the TLDs' expected frequencies as discrete intervals. Table tld_dum holds absolute frequencies, so we need to come up with the intervals. The first interval is easy: it starts at 0 and finishes at the first TLD's frequency minus one. Each following interval starts right after the previous one ended, and extends for as many units as its TLD's frequency. So on and so forth. You don't need arrays to do this. You don't need variables either. You only need window functions
stacker as (
  select tld
       , coalesce(sum(dum) over w, 0) as lo
       , dum - 1 + coalesce(sum(dum) over w, 0) as hi
  from tld_dum
  window w as ( order by dum
                rows unbounded preceding
                exclude current row )
)
There's an aggregate window for each row in the result. The window is defined by taking rows in ascending order of dum, using all previous rows so far but excluding the current row. Using all previous rows allows us to compute the running total directly with sum(dum), to use as the lower bound lo. Excluding the current row helps figure out the upper bound hi by simple addition. This works for all intervals from the second one onwards. But for the first interval there are no preceding rows: that makes for an empty window with no sum, explaining the use of coalesce to supply the 0 that starts the first interval.
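View v_discrete_intervals, queried below, just wraps this CTE; a sketch:

create view v_discrete_intervals as
with stacker as (
  select tld
       , coalesce(sum(dum) over w, 0) as lo
       , dum - 1 + coalesce(sum(dum) over w, 0) as hi
  from tld_dum
  window w as ( order by dum
                rows unbounded preceding
                exclude current row )
)
select tld, lo, hi
from stacker;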
emhn=> select sum(dum) from tld_dum;
    sum
-----------
 503762039
(1 row)
emhn=> select * from v_discrete_intervals;
  tld   |    lo     |    hi
--------+-----------+-----------
 gq     |         0 |   5985000
 fr     |   5985001 |  12112045
 online |  12112046 |  18843969
 br     |  18843970 |  25961506
 nl     |  25961507 |  33369170
 ml     |  33369171 |  41319950
 cf     |  41319951 |  49289178
 icu    |  49289179 |  57360333
 info   |  57360334 |  66214452
 ga     |  66214453 |  75907351
 top    |  75907352 |  86293525
 ru     |  86293526 |  97824589
 xyz    |  97824590 | 110957532
 org    | 110957533 | 128205729
 net    | 128205730 | 149755779
 uk     | 149755780 | 171446477
 de     | 171446478 | 196007829
 tk     | 196007830 | 223494415
 cn     | 223494416 | 253385680
 com    | 253385681 | 503762038
(20 rows)
How many random TLDs are you looking for?
We can generate as many dice throws as we need, and we are sure their values will land in one of the intervals. For each dice throw, just check which interval its value landed on, and that's your random TLD label.
Combine all of the above partial queries into a single view named v_random_tld, having this as its body statement
select tld
from stacker s, thrower d
where d.value between s.lo and s.hi
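Assembled, the whole view reads roughly like this (a sketch; the CTE bodies are exactly the fragments shown earlier):

create view v_random_tld as
with recursive
  dice as (
    select sum(tld_dum.dum) as cardinality
    from tld_dum
  ),
  thrower as (
      select 1 as n
           , round(random() * dice.cardinality) as value
      from dice
    union
      select n + 1
           , round(random() * dice.cardinality) as value
      from thrower, dice
  ),
  stacker as (
    select tld
         , coalesce(sum(dum) over w, 0) as lo
         , dum - 1 + coalesce(sum(dum) over w, 0) as hi
    from tld_dum
    window w as ( order by dum
                  rows unbounded preceding
                  exclude current row )
  )
select tld
from stacker s, thrower d
where d.value between s.lo and s.hi;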
We can use it to ask for any number of random TLD labels that match the expected distribution.
emhn=> select tld from v_random_tld limit 20;
 tld
-----
 com
 cn
 org
 tk
 ru
 com
 tk
 icu
 com
 com
 com
 com
 com
 com
 com
 com
 tk
 br
 com
 com
(20 rows)
But… are we doing it right?
Recall view v_proportions: it gives the expected percentage each TLD label should have if we were to conduct a very large number of samples. Well, if we use v_random_tld to generate a lot of random TLD labels, the resulting frequencies should be close to the expected frequencies, right?
I created v_are_we_doing_it_right to generate ten million random TLD labels and compute their relative frequencies
create view v_are_we_doing_it_right as
with
  picked as (
    select * from v_random_tld limit 10000000
  ),
  frequencies as (
    select tld
         , count(*) as f
    from picked
    group by tld
  )
select tld
     , (100.0 * f) / (sum(f) over ()) as percentage
from frequencies
order by percentage desc;
and… guess what?
emhn=> select * from v_are_we_doing_it_right;
  tld   |     percentage
--------+---------------------
 com    | 49.6945400000000000
 cn     |  5.9374600000000000
 tk     |  5.4555200000000000
 de     |  4.8801500000000000
 uk     |  4.3065800000000000
 net    |  4.2673300000000000
 org    |  3.4308200000000000
 xyz    |  2.6007900000000000
 ru     |  2.2895700000000000
 top    |  2.0601900000000000
 ga     |  1.9228900000000000
 info   |  1.7535000000000000
 icu    |  1.5974000000000000
 cf     |  1.5818600000000000
 ml     |  1.5812500000000000
 nl     |  1.4766000000000000
 br     |  1.4190300000000000
 online |  1.3374800000000000
 fr     |  1.2162300000000000
 gq     |  1.1908100000000000
(20 rows)
We are doing it right.
It’s sets all the way down
Try joining v_throwing_dice and v_discrete_intervals directly. It does not work like v_random_tld, which was built by tacking the bodies of v_throwing_dice and v_discrete_intervals together into a single statement. Can you figure out why?
Finally, I mentioned this technique is being used to generate tests. By definition, tests need to be repeatable in order to be meaningful when comparing operations before and after a change. To make this machinery repeatable, you'd need to use pseudo-random number generation with a configurable «seed». PostgreSQL has support for this behavior; its implementation is left as an exercise for the reader.
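If you want a head start: setseed() fixes the state of the current session's random() stream, so something along these lines (a sketch) makes the rolls, and therefore the picks, repeatable:

select setseed(0.42);                   -- any value in [-1, 1]
select tld from v_random_tld limit 20;  -- rerun both statements together: same 20 labels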
Roll for intelligence. Do not touch my dice.