Check for confusion

Posted on 2025-05-31 by Ernesto Hernández-Novich

When wearing my Registrar DNS Operator’s hat, I have to worry about a single DNS source of truth hosting millions of zones. A Registrar sponsors domain names over multiple TLDs. It follows that all tests and simulations I come up with must match the actual namespace distribution: there are way more .com domains than .ve.

Generating random domain name labels for an in-database simulation is pretty straightforward, thanks to PostgreSQL’s generate_series and string functions – a topic for another time. But generating random domain name TLDs must follow the expected distribution for the data set to «make sense». I normally use actual business distributions, that is, the current Domains Under Management (DUM) our platform currently holds. For this example, I will use the «Real World»® distribution, skewed as it is.

So, the challenge here is to use pure SQL to generate a random list of TLDs that follow a particular frequency distribution. Why? First and foremost, Because I Can®. Secondly, because all my tests and simulations are «in database» thanks to pgTAP; it would be wasteful to do everything that follows outside the database using Haskell, or lesser languages.

SQL is the right language for this task.

There are roughly 1600 TLDs currently delegated, and more coming out Real Soon Now®. You can purchase domain names on most of them. Given that .com, .net, and .org are the OGs, there are way more names registered there than in, say, .in, .cloud, or .locker. So, either you have your business absolute frequency table like I do, or you download a reasonably up-to-date data set and load it. I downloaded a fairly recent summary for Domains Under Management (DUM) as provided by ICANN, and loaded the top 20 into the aptly named tld_dum table.

emhn=> select * from tld_dum;
  tld   |    dum
--------+-----------
 com    | 250376358
 cn     |  29891265
 tk     |  27486586
 de     |  24561352
 uk     |  21690698
 net    |  21550050
 org    |  17248197
 xyz    |  13132943
 ru     |  11531064
 top    |  10386174
 ga     |   9692899
 info   |   8854119
 icu    |   8071155
 cf     |   7969228
 ml     |   7950780
 nl     |   7407664
 br     |   7117537
 online |   6731924
 fr     |   6127045
 gq     |   5985001
(20 rows)

Let’s figure out the proportion each TLD has over the total number of DUMs. The math is straightforward: divide the TLD’s count by the total sum, and show it as a percentage. This can be written using WINDOW functions as

create view v_proportions as
  select tld
       , (100.0 * dum) / (sum(dum) over ()) as percentage
    from tld_dum;

Looking at the results

emhn=> select * from v_proportions limit 10;
 tld |     percentage
-----+---------------------
 com | 49.7013150290190881
 cn  |  5.9336080700594433
 tk  |  5.4562638452398355
 de  |  4.8755861098140426
 uk  |  4.3057428549117017
 net |  4.2778233236426931
 org |  3.4238778758000064
 xyz |  2.6069735278326520
 ru  |  2.2889902587519104
 top |  2.0617222410440498
(10 rows)

we now have a way to check whether or not our random selection matches our expectations: proportions in our generated set must match these proportions.

Throwing dice the right way

Sure, you could repeatedly use random() to produce random floating-point numbers, scale them, and look where each value lands on the distribution. This is not only slow but also prone to all the precision errors associated with floating-point arithmetic. Undesirable for my millions-of-domains tests.

You don’t play AD&D with a continuous random number generator. You use discrete dice to compute proportions. Expectations are tabulated as integer values, making them extremely fast to check without rounding or precision errors.

Let’s create a view v_throwing_dice that generates an infinite set of rows, each one being an independently generated dice throw. As usual, I’ll focus on the interesting parts only, allowing your actual intelligence to fill the blanks and make it work on your system.

For starters we need a dice whose rolls range from 0 to the total DUM minus one. Since random() generates a floating-point number in [0..1), we only need to know the cardinality of our dice,

  dice as (
    select sum(tld_dum.dum) as cardinality
      from tld_dum
  ) 

Taking advantage of recursive CTEs, let’s produce an infinite set of rows having a sequential counter and value for each roll.

  recursive thrower as (
    select 1 as n
         , floor(random() * dice.cardinality)::bigint as value
      from dice
    union all
    select n+1
         , floor(random() * dice.cardinality)::bigint as value
      from thrower, dice
  )

Recursive CTEs work the same way recursion works for functions in programming languages: one or more base cases, and one or more recursive (or inductive) cases. Except they work on sets of rows, and you get to choose whether or not to remove duplicates. In this case, the lone base case is the first roll, having constant counter 1 and a value randomly generated in the desired range. On its own, it would make the resulting set have a single row. The inductive case uses the table itself (notice thrower in the FROM clause) to generate the next roll.
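Filling in the blanks, one way to assemble the complete view could read as follows. This is a sketch: it uses floor() so every roll stays strictly below the total (round() could overshoot into no interval at all), and UNION ALL since the sequential counter already makes every row distinct.

```sql
create view v_throwing_dice as
  with recursive
  dice as (
    select sum(dum) as cardinality
      from tld_dum
  ),
  thrower as (
    select 1 as n
         , floor(random() * dice.cardinality)::bigint as value
      from dice
    union all
    select n + 1
         , floor(random() * dice.cardinality)::bigint
      from thrower, dice
  )
  select n, value
    from thrower;
```

The recursion has no terminating condition on purpose: PostgreSQL produces rows incrementally, so a LIMIT on the consuming query is what stops it.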

We can check it is working by querying the view with a LIMIT.

emhn=> select * from v_throwing_dice limit 10;
 n  |   value
----+-----------
  1 | 478851732
  2 | 427883508
  3 | 294048689
  4 |   1379955
  5 | 280591245
  6 | 414892780
  7 |   5061995
  8 | 277754851
  9 | 386606479
 10 | 209763117
(10 rows)

and if you run the above query more than once, you will get different values, as each use of random() is independent. The value will always be between 0 and the DUM total minus one.

Great Expectations

We have built a way to generate an arbitrarily long set of rolls with values between 0 and the total DUM minus one. Continuing with our AD&D analogy, we need to come up with the table that, given our roll, tells us which TLD label to use.

If we had three possible outcomes, one with a 15% chance of happening, another with a 40% chance, and the last one with a 45% chance, and we wanted to use a 100-sided dice to decide, we’d possibly have a table stating

Roll 1d100
 1-15  Fail
16-55  Fail miserably
56-00  Fail miserably while the DM cackles

Therefore, we need to «stack» the TLDs’ expected frequencies as discrete intervals. Table tld_dum holds absolute frequencies, so we need to come up with the intervals. The first interval is easy: it starts at 0 and finishes at the first TLD’s frequency minus one. Each subsequent interval starts right after the previous one ends, and spans that particular TLD’s frequency. So on and so forth. You don’t need arrays to do this. You don’t need variables either. You only need WINDOW functions

  stacker as (
  select tld
       , coalesce(sum(dum) over w, 0) as lo
       , dum - 1 + COALESCE(sum(dum) over w, 0) as hi
    from tld_dum
  window w as ( order by dum
                rows unbounded preceding
                exclude current row )
  )

There’s an aggregate window for each row in the result. The window is defined by taking rows in ascending order of dum, using all preceding rows but excluding the current one. Using all preceding rows lets us compute the running total directly with sum(dum), which serves as the lower bound lo. Excluding the current row helps figure out the upper bound hi by simple addition. This works for all intervals from the second onwards. But for the first interval there are no preceding rows: that makes for an empty window with no sum, explaining the use of COALESCE to provide the 0 that starts all intervals.
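Wrapped as the standalone view queried below, the stacker could look more or less like this sketch:

```sql
create view v_discrete_intervals as
  select tld
       , coalesce(sum(dum) over w, 0) as lo
       , dum - 1 + coalesce(sum(dum) over w, 0) as hi
    from tld_dum
  window w as ( order by dum
                rows unbounded preceding
                exclude current row );
```

Note the WINDOW clause names the frame once and both aggregates share it, instead of repeating the frame definition inline.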

emhn=> select sum(dum) from tld_dum ;
    sum
-----------
 503762039
 (1 row)

emhn=> select * from v_discrete_intervals ;
  tld   |    lo     |    hi     
--------+-----------+-----------
 gq     |         0 |   5985000
 fr     |   5985001 |  12112045
 online |  12112046 |  18843969
 br     |  18843970 |  25961506
 nl     |  25961507 |  33369170
 ml     |  33369171 |  41319950
 cf     |  41319951 |  49289178
 icu    |  49289179 |  57360333
 info   |  57360334 |  66214452
 ga     |  66214453 |  75907351
 top    |  75907352 |  86293525
 ru     |  86293526 |  97824589
 xyz    |  97824590 | 110957532
 org    | 110957533 | 128205729
 net    | 128205730 | 149755779
 uk     | 149755780 | 171446477
 de     | 171446478 | 196007829
 tk     | 196007830 | 223494415
 cn     | 223494416 | 253385680
 com    | 253385681 | 503762038
(20 rows)

How many random TLDs are you looking for?

We can generate as many dice throws as we need, and we are sure their values will land in one of the intervals. For each dice throw, just check which interval its value landed on, and that’s your random TLD label.

Combine all of the above partial queries into a single view named v_random_tld, having this as its body

select tld
  from stacker s, thrower d
 where d.value between s.lo and s.hi
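For completeness, here is one way the assembled view might read. Again a sketch, not gospel: floor() keeps every roll inside the stacked intervals, UNION ALL keeps the roll stream cheap, and the recursion only stops because of the LIMIT applied at query time.

```sql
create view v_random_tld as
  with recursive
  dice as (
    select sum(dum) as cardinality
      from tld_dum
  ),
  thrower as (
    select 1 as n
         , floor(random() * dice.cardinality)::bigint as value
      from dice
    union all
    select n + 1
         , floor(random() * dice.cardinality)::bigint
      from thrower, dice
  ),
  stacker as (
    select tld
         , coalesce(sum(dum) over w, 0) as lo
         , dum - 1 + coalesce(sum(dum) over w, 0) as hi
      from tld_dum
    window w as ( order by dum
                  rows unbounded preceding
                  exclude current row )
  )
  select tld
    from stacker s, thrower d
   where d.value between s.lo and s.hi;
```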

We can use it to ask for any number of random TLD labels that match the expected distribution.

emhn=> select tld from v_random_tld limit 20;
 tld
-----
 com
 cn
 org
 tk
 ru
 com
 tk
 icu
 com
 com
 com
 com
 com
 com
 com
 com
 tk
 br
 com
 com
(20 rows)

But… are we doing it right?

Recall view v_proportions: it gives the expected percentage each TLD label should have if we were to conduct a very large number of samples. Well, if we use v_random_tld to generate a lot of random TLD labels, the resulting frequencies should be close to the expected frequencies, right?

I created v_are_we_doing_it_right to generate ten million random TLD labels and compute their relative frequencies

create view v_are_we_doing_it_right as
  with
  picked as (
    select * from v_random_tld limit 10000000
  ),
  frequencies as (
    select tld
         , count(*) as f
      from picked
  group by tld
  )
select tld
     , (100.0 * f) / (sum(f) over ()) as percentage
  from frequencies
  order by percentage desc;

and… guess what?

emhn=> select * from v_are_we_doing_it_right ;
  tld   |     percentage
--------+---------------------
 com    | 49.6945400000000000
 cn     |  5.9374600000000000
 tk     |  5.4555200000000000
 de     |  4.8801500000000000
 uk     |  4.3065800000000000
 net    |  4.2673300000000000
 org    |  3.4308200000000000
 xyz    |  2.6007900000000000
 ru     |  2.2895700000000000
 top    |  2.0601900000000000
 ga     |  1.9228900000000000
 info   |  1.7535000000000000
 icu    |  1.5974000000000000
 cf     |  1.5818600000000000
 ml     |  1.5812500000000000
 nl     |  1.4766000000000000
 br     |  1.4190300000000000
 online |  1.3374800000000000
 fr     |  1.2162300000000000
 gq     |  1.1908100000000000
(20 rows)

We are doing it right.

It’s sets all the way down

Try joining v_throwing_dice and v_discrete_intervals yourself. It does not work like v_random_tld, even though the latter is built from the bodies of v_throwing_dice and v_discrete_intervals tacked together. Can you figure out why?

Finally, I mentioned this technique is being used to generate tests. By definition, tests need to be repeatable in order to be meaningful when comparing operations before and after a change. To make this machinery repeatable, you’d need pseudo-random number generation with a configurable «seed». PostgreSQL has support for this behavior, its implementation left as an exercise for the reader.
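As a starting hint for that exercise: PostgreSQL’s setseed() fixes the sequence that subsequent calls to random() produce within the same session. A sketch, assuming the v_random_tld view above:

```sql
-- setseed() takes a value in [-1.0, 1.0]; subsequent random()
-- calls in this session then follow a deterministic sequence
select setseed(0.20250531);
select tld from v_random_tld limit 20;
```

Re-running both statements with the same seed should reproduce the same stream of rolls, and hence the same labels.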

Roll for intelligence. Do not touch my dice.