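This article covers SQL techniques for speeding up distinct count, along with commonly used efficient approximation algorithms and their implementation on PostgreSQL.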

UV vs. PV

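On the web, UV and PV are computed all the time. PV stands for Page View: the number of times a page has been opened. (Video sites such as YouTube care a great deal about how many times a video has been played, which is a PV.) UV stands for Unique Visitor. (Articles in WeChat Moments or on WeChat official accounts are counted by how many people have read them, which is a UV; WeChat labels the figure as PV, but the author's own tests indicate it is actually UV.) Both concepts matter. A Taobao seller running a promotion, for example, usually wants to know how many times an item was viewed and how many different people viewed the promotion page. How to uniquely identify a natural person on the Internet is itself a hard problem with no fully accurate solution so far; a common approach is user name plus cookie, which is not explored further here.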

count distinct vs. count group by

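In many situations, especially for text columns, a plain count distinct query is very inefficient, and doing a group by first and then a count often improves performance. Experiments show, however, that count distinct and count group by do not perform the same across column types, and that their efficiency also depends on how heavily the values in the target data set repeat.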

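This section uses several experiments to show how different queries perform in different scenarios and analyzes the reasons for the differences. (All experiments in this article were run on PostgreSQL 9.3.5.)
count distinct and count group by are each run against columns of three types: bigint, macaddr, and text.
First, create a table with the following structure.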

| Column | Type | Modifiers |
|---|---|---|
| mac_bigint | bigint | |
| mac_macaddr | macaddr | |
| mac_text | text | |

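Then insert 10 million rows such that mac_bigint is the decimal bigint obtained by stripping the colons from mac_macaddr and converting the resulting hexadecimal number, and mac_text is the text form of mac_macaddr. This guarantees that queries on the three columns return the same result and have the same complexity.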
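The article does not show how the test data was loaded. The following is only a sketch of one way to build such a table: the table name testmac is taken from the queries below, while the generation logic and the pool size of 10000 are assumptions made for illustration. Each row draws one integer n, and all three columns are derived from it so that they encode the same address.

```sql
-- Hypothetical setup script; not the author's original loader.
create table testmac (
    mac_bigint  bigint,
    mac_macaddr macaddr,
    mac_text    text
);

-- Draw n from a pool of 10000 integers, so about 10000 distinct
-- addresses appear among the 10 million rows. Enlarge the pool
-- for the high-cardinality scenario.
insert into testmac
select
    n,                                         -- decimal form
    lpad(to_hex(n), 12, '0')::macaddr,         -- 12 hex digits parsed as a MAC
    (lpad(to_hex(n), 12, '0')::macaddr)::text  -- colon-separated text form
from
    (select
        floor(random() * 10000)::bigint as n
    from
        generate_series(1, 10000000)
    offset 0) s;  -- offset 0 keeps the subquery separate, so n is drawn once per row
```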

The count distinct SQL is as follows.

select 
    count(distinct mac_macaddr) 
from 
    testmac

The count group by SQL is as follows.

select
    count(*)
from
    (select
        mac_macaddr
    from
        testmac
    group by
        1) foo
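One detail to keep in mind when swapping one form for the other: count(distinct col) ignores NULLs, whereas group by produces a group for NULL, which count(*) then counts. The two queries therefore agree only when the column contains no NULLs, which is presumably the case in these experiments. A minimal sketch of a variant that matches count distinct even with NULLs present, assuming the same testmac table:

```sql
-- count(mac_macaddr) skips the NULL group, so the result equals
-- count(distinct mac_macaddr) even when the column contains NULLs.
select
    count(mac_macaddr)
from
    (select
        mac_macaddr
    from
        testmac
    group by
        1) foo;
```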

When the number of distinct values is large (more than 3 million distinct values among 10 million rows), the query times in milliseconds are shown in the table below.

| query / column type | macaddr | bigint | text |
|---|---|---|---|
| count distinct | 24668.023 | 13890.051 | 149048.911 |
| count group by | 32152.808 | 25929.555 | 159212.700 |

When the number of distinct values is small (only 10,000 distinct values among 10 million rows), the query times in milliseconds are shown in the table below.

| query / column type | macaddr | bigint | text |
|---|---|---|---|
| count distinct | 20006.681 | 9984.763 | 225208.133 |
| count group by | 2529.420 | 2554.720 | 3701.869 |

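The two experiments show that when the number of distinct values is small, count group by is consistently faster than count distinct, and the gap is most pronounced for the text type. When the number of distinct values is large, count group by is instead slower than a plain count distinct. What causes the difference? Taking the macaddr type as an example, compare the query plans of count group by on the two data sets.
When the result set is small, the planner uses HashAggregate.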

explain analyze select count(*) from (select mac_macaddr from testmac_small group by 1) foo;
                                        QUERY PLAN
 Aggregate  (cost=668465.04..668465.05 rows=1 width=0) (actual time=9166.486..9166.486 rows=1 loops=1)
   ->  HashAggregate  (cost=668296.74..668371.54 rows=7480 width=6) (actual time=9161.796..9164.393 rows=10001 loops=1)
         ->  Seq Scan on testmac_small  (cost=0.00..572898.79 rows=38159179 width=6) (actual time=323.338..5091.112 rows=10000000 loops=1)

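When the result set is large, the hash table can no longer be kept in memory, so the planner does not use HashAggregate and instead groups the rows after sorting them (the Group node above a Sort node in the plan below). Because the data set is too large for an in-memory quicksort, the sort runs as a merge sort on disk (Sort Method: external merge), which greatly increases the I/O cost.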

explain analyze select count(*) from (select mac_macaddr from testmac group by 1) foo;
                                        QUERY PLAN
 Aggregate  (cost=1881542.62..1881542.63 rows=1 width=0) (actual time=34288.232..34288.232 rows=1 loops=1)
   ->  Group  (cost=1794262.09..1844329.41 rows=2977057 width=6) (actual time=25291.372..33481.228 rows=3671797 loops=1)
         ->  Sort  (cost=1794262.09..1819295.75 rows=10013464 width=6) (actual time=25291.366..29907.351 rows=10000000 loops=1)
               Sort Key: testmac.mac_macaddr
               Sort Method: external merge  Disk: 156440kB
               ->  Seq Scan on testmac  (cost=0.00..219206.64 rows=10013464 width=6) (actual time=0.082..4312.053 rows=10000000 loops=1)
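Both the choice of HashAggregate and the choice between an in-memory and an on-disk sort depend on the work_mem setting. As a rough sketch, and assuming the server has memory to spare, raising work_mem for the current session and re-running the plan shows whether the external merge goes away. The value 512MB is only an illustrative guess, not a figure from the experiments above, and the planner may still choose a sort if its row estimates say the hash table will not fit.

```sql
-- Inspect the current per-operation memory budget.
show work_mem;

-- Raise it for this session only (illustrative value).
set work_mem = '512MB';

-- Re-run the plan and check the Sort Method line, or whether
-- the plan now uses HashAggregate.
explain analyze select count(*) from (select mac_macaddr from testmac group by 1) foo;

-- Go back to the configured default.
reset work_mem;
```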

Efficient approximation algorithms for distinct count

For a data set of 1 million rows in which every row is distinct, the running time and memory use of several algorithms are as follows.

| algorithm | result | error | time (ms) | memory (B) |
|---|---|---|---|---|
| count(distinct) | 1000000 | 0% | 14026 | ? |
| Adaptive Sampling | 1008128 | 0.8% | 8653 | 57627 |
| Self-learning Bitmap | 991651 | 0.9% | 1151 | 65571 |
| Bloom filter | 788052 | 22% | 2400 | 1198164 |
| Probabilistic Counting | 1139925 | 14% | 3613 | 95 |
| PCSA | 841735 | 16% | 842 | 495 |

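For a data set of 1 million rows with only 100 distinct values, the running time and memory use of several approximation algorithms are as follows.

| algorithm | result | error | time (ms) | memory (B) |
|---|---|---|---|---|
| count(distinct) | 100 | 0% | 75306 | ? |
| Adaptive Sampling | 100 | 0% | 1491 | 57627 |
| Self-learning Bitmap | 101 | 1% | 1031 | 65571 |
| Bloom filter | 100 | 0% | 1675 | 1198164 |
| Probabilistic Counting | 95 | 5% | 3613 | 95 |
| PCSA | 98 | 2% | 852 | 495 |

The two experiments show that most of the approximation algorithms work well. They are much faster than a plain count distinct, and they use little memory while still giving very good results. Adaptive Sampling and Self-learning Bitmap stand out: their error stayed at or below 1% in both experiments, while they ran from about 1.6 times (Adaptive Sampling on the all-distinct data set) to about 73 times (Self-learning Bitmap on the 100-distinct data set) faster than a plain count distinct.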
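The SQL interface of the algorithms in the tables is not shown here. As one concrete way to try an approximate counter from SQL, the following sketch uses the hll type that the next section relies on. It assumes the postgresql-hll extension is installed and reuses the testmac table from earlier. HyperLogLog is not one of the algorithms measured above, so none of the timing or error figures in the tables apply to it.

```sql
-- Assumes the postgresql-hll extension is available on the server.
create extension if not exists hll;

-- Hash every value, fold the hashes into a single hll sketch,
-- then read the estimated number of distinct values from it.
select
    hll_cardinality(hll_add_agg(hll_hash_text(mac_text)))
from
    testmac;
```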

Merging distinct count results

First create a table named fact with the following structure.

| Column | Type | Modifiers |
|---|---|---|
| day | date | |
| user_id | integer | |
| sales | numeric | |

Insert three years of data with 100,000 distinct user_id values in total and 100 million rows overall (roughly 100,000 rows per day).

insert into fact
select
    current_date - (random()*1095)::integer * '1 day'::interval,
    (random()*100000)::integer + 1,
    random() * 10000 + 500
from
    generate_series(1, 100000000, 1);

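Querying the total number of distinct users directly from the fact table takes 115143.217 ms.
Using hll, create a daily_unique_user_hll table that stores each day's distinct-user information in a column of type hll.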

create table daily_unique_user_hll 
as select
    day, 
    hll_add_agg(hll_hash_integer(user_id))
from 
    fact
group by 1;

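The daily aggregation table above can be used to compute the unique user count for any date range. For example, computing the number of distinct users over the whole three years takes 17.485 ms and returns 101044, an error of (101044-100000)/100000 = 1.044%.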

explain analyze select hll_cardinality(hll_union_agg(hll_add_agg)) from daily_unique_user_hll;
                                   QUERY PLAN
 Aggregate  (cost=196.70..196.72 rows=1 width=32) (actual time=16.772..16.772 rows=1 loops=1)
   ->  Seq Scan on daily_unique_user_hll  (cost=0.00..193.96 rows=1096 width=32) (actual time=0.298..3.251 rows=1096 loops=1)
 Planning time: 0.081 ms
 Execution time: 16.851 ms
 Time: 17.485 ms
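Restricting the same query to a date range only requires a filter on day. A minimal sketch, assuming the daily_unique_user_hll table created above, whose hll column is named hll_add_agg because the create table statement gave it no alias. The 30-day window is an arbitrary example.

```sql
-- Approximate number of distinct users over the last 30 days.
select
    hll_cardinality(hll_union_agg(hll_add_agg))
from
    daily_unique_user_hll
where
    day >= current_date - 30
    and day < current_date;
```

Only the matching daily rows are read, at most one per day, so the cost grows with the number of days in the range rather than with the number of rows in fact.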

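Computing the same value with count distinct directly on the fact table takes as long as 127807.105 ms.

The experiment shows that the hll type makes distinct counts mergeable: an hll value can be stored for each part of the data set, and merging those hll values quickly yields the distinct count for the whole data set. The merged query takes roughly 1/7300 of the time that count distinct needs on the raw data (17.485 ms versus 127807.105 ms), with a very small error of about 1%.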

Summary
