The Size Distribution of Websites∗
Steven Schmeiser†
Mount Holyoke College
October 2014
Abstract
The upper tail of the size distribution of websites follows a power law with slope close to
one (Zipf’s law). This finding is robust to measuring website size by unique visitors and page
views, and holds for the United States, Germany, and the world. Web traffic in China has less
support for a power law.
∗ This letter benefited greatly from discussions with Ashley Miller, Katherine Schmeiser, and Erzo G.J. Luttmer. All errors are my own.
† Email: [email protected]
1 Introduction
Many size distributions of interest in economics take the form of a power law with a slope
coefficient ζ near one (Zipf’s law). Well-studied examples of this phenomenon include city sizes¹
and the size distribution of firms². Here, I use OLS regressions and MLE estimates of the Pareto
distribution to show that the upper tail of the website size distribution also follows a power law with ζ
close to one. The observation that the upper tail of websites follows Zipf’s law provides a strong
constraint for models that either predict or exogenously specify a website size distribution, and
contributes to the growing list of empirical power laws in economics.
I analyze data on the top 30,000 sites visited worldwide, as well as the top 30,000 sites viewed
by users in China, Germany, and the United States. Size is measured two ways: by number of
unique visitors, and by number of page views. Size distributions for Germany, the U.S., and the
world all follow a power law, while the evidence for China is weaker.
To my knowledge this is the first paper to document the size distribution of websites in the
economics literature. A few notable examples from computer science include Adamic and Huberman
(2000), who obtain data on sites accessed by America Online users in 1997 and find evidence of
a power law distribution, but do not report the slope coefficient. Adamic and Huberman (2002)
revisit the AOL data and plot the data against a Pareto distribution with slope 1. Breslau et al.
(1999) conduct six traces on academic, ISP, and corporate networks between 1996 and 1998 and
find that web requests from a fixed group of users follow a power law distribution with slope ranging
from 0.64 to 0.83 depending on the network. Last, Albert et al. (1999) investigate the topological
properties of the World-Wide Web and find that the tail distributions of incoming and outgoing
links on web pages follow power laws with slopes 2.1 and 2.45, respectively. Here, I use recent
data collected from an Internet data company and find a slope coefficient close to one for many
combinations of geography and tail size.
2 Data
The data is gathered from Alexa Top Sites.³ Alexa ranks websites according to their Alexa Traffic
Rank, which is a combination of reach and page views (defined below). The rankings are calculated
based on a three-month period ending in September 2014 and are aggregated to the domain level.
For instance,
www.example.com/index.html
www.example.com/subdir/anotherpage.html and
subdomain.example.com/index.html
are all included under example.com.⁴ The data is collected from users of Alexa’s browser toolbar
and other web data sources.⁵

¹ Gabaix (1999), Gabaix and Ioannides (2004), Holmes and Lee (2010)
² Luttmer (2007), Axtell (2001)
³ http://aws.amazon.com/alexa-top-sites/. Alexa is an Amazon company.
⁴ The company does note that it separates out personal home pages and blogs when possible.
The data includes two measures: reach and page views. Reach is defined as the number of
unique visitors per million users that a website receives on a given day. Reach divided by 1,000,000
gives the fraction of the web-using population that visits the site. Page views are defined as the
number of URL requests for a site, per million page requests. Page views divided by 1,000,000
gives the fraction of page requests that were directed at the given site. Multiple requests from
the same user for the same URL on the same day are only counted once. This could result in
under-reporting of sites that dynamically update content that users access at the same URL, such
as facebook.com/. Both reach and page views are reasonable measures of website size. Reach
emphasizes the size of a website’s audience, while page views proxy for both the amount of activity
a site receives and opportunities to display advertisements (a primary source of website revenue).
I collect the top 30,000 sites for the world and three countries: China, Germany, and the United
States. Country is determined by the location of the visitor, not the location of the website. The
reach and page view measures described above are for a given geography. For example, in the
Germany data, a site with reach of 100,000 would indicate that ten percent of German web users
visit that site.
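The per-million normalization can be made concrete with a short sketch. The function names are illustrative and are not part of Alexa’s service; the only input taken from the text is the example of a site with a reach of 100,000 in the Germany data.

```python
PER_MILLION = 1_000_000


def reach_to_share(reach_per_million: float) -> float:
    """Fraction of web users in a geography who visit the site on a given day."""
    return reach_per_million / PER_MILLION


def page_views_to_share(page_views_per_million: float) -> float:
    """Fraction of all page requests in a geography that go to the site."""
    return page_views_per_million / PER_MILLION


# Example from the text: a reach of 100,000 in the Germany data.
german_share = reach_to_share(100_000)
print(f"{german_share:.0%} of German web users visit the site")
```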
Verisign estimates that of the 2.5 billion Internet users worldwide, China has the most (618
million) and the U.S. has the second most (254 million).⁶ The top 30,000 sites account for about
76% of world page views, 95% in China, 80% in Germany, and 82% in the United States. However,
30,000 sites represent only a small fraction of the 271 million web domains.⁷ Summary statistics
for the data are presented in Table 1. Within a country, the reach (per million) of the top site is
greater than the reach of the top site worldwide, as the top site in a country is more closely targeted to
the geographic market being measured. As can be seen in the table, an order of magnitude increase in rank roughly
corresponds to an order of magnitude decrease in size for many of the combinations of geography
and size measure.
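As a rough check on the order-of-magnitude observation, the sketch below computes the slope implied by each pair of consecutive rows in Table 1 for world reach. This is a back-of-the-envelope calculation under the assumption that two points define a local log-log slope; it is not the estimation procedure used in the Analysis section, and the function name is illustrative.

```python
import math

# Ranks reported in Table 1 and the corresponding world reach values
# (unique visitors per million users).
RANKS = [1, 10, 100, 1_000, 10_000, 30_000]
WORLD_REACH = [496_000, 64_930, 10_430, 1_380, 163, 29]


def local_slopes(ranks, sizes):
    """Slope of ln(rank) against ln(size) between consecutive table rows.

    The sign is flipped so that a value of 1 corresponds to Zipf's law.
    """
    slopes = []
    points = list(zip(ranks, sizes))
    for (r0, s0), (r1, s1) in zip(points, points[1:]):
        slopes.append(-(math.log(r1) - math.log(r0)) / (math.log(s1) - math.log(s0)))
    return slopes


for (low, high), slope in zip(zip(RANKS, RANKS[1:]), local_slopes(RANKS, WORLD_REACH)):
    print(f"rank {low:>6,} -> {high:>6,}: implied slope {slope:.2f}")
```

For this column the implied slopes stay somewhat above one through rank 10,000 and drop well below one between ranks 10,000 and 30,000.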
               World                 China                 Germany               U.S.
               Reach      Pageviews  Reach      Pageviews  Reach      Pageviews  Reach      Pageviews
Site 1         496,000    113,250    779,700    106,510    778,000    97,480     834,200    194,900
Site 10        64,930     5,675      185,400    15,433     87,100     4,497      99,500     6,562
Site 100       10,430     648        8,500      876        12,600     591        13,490     556
Site 1,000     1,380      61         700        66         1,838      74         1,640      57
Site 10,000    163        6.38       170        9.4        190        8.5        177        6.5
Site 30,000    29         0.8        12         0.41       53         1.1        40         0.95

Table 1: Summary statistics for reach and pageviews. Reach and pageviews are both measured on
a per-million basis within the specified geographic market.

⁵ http://aws.amazon.com/awis/faqs/
⁶ http://www.verisigninc.com/assets/domain-name-report-april2014.pdf
⁷ ibid.

3 Analysis
Let i ∈ {1, 2, 3, ...} denote the rank of a website, with i = 1 being the largest, and let S(i) denote
the size of the i’th largest site. Regressing the log of rank on log of size is a common method to
measure the fit of data to a power law. I run the regression

$$ \ln i = \mathrm{const} - \hat{\zeta} \ln S_{(i)} + \epsilon_i \qquad (1) $$

with the null hypothesis that ζ = 1 to find the slope coefficient on each combination of geographic
area, size measure, and tail sizes (1,000, 10,000, and 30,000). As outlined in Gabaix and Ioannides
(2004), standard errors reported by statistical software packages are incorrect due to correlation in
the residuals generated by the ranking procedure. I therefore construct standard errors according
to the process outlined in Gabaix and Ioannides (2004) by running 20,000 Monte Carlo simulations
for each N. Results are reported in Table 2. The null hypothesis is rejected for most combinations
of tail size and geography; however, it is not rejected for the U.S. with tail size 30,000 when size is measured
by reach, and it is not rejected for the world or the U.S. for small tail sizes when size is measured by
page views.
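The rank-size regression and the simulated standard errors can be sketched as follows. This is a minimal illustration and not the code behind Table 2. It assumes that the Monte Carlo step draws samples from an exact Pareto distribution with ζ = 1 and takes the standard deviation of the OLS estimates across replications. The function names and the replication count are illustrative.

```python
import math
import random
import statistics


def pareto_sample(n, zeta, rng):
    """Draw n sizes from an exact Pareto distribution with tail exponent zeta."""
    # 1 - rng.random() lies in (0, 1], so the power is always defined.
    return [(1.0 - rng.random()) ** (-1.0 / zeta) for _ in range(n)]


def ols_zeta(sizes):
    """Estimate zeta by regressing ln(rank) on ln(size), as in equation (1)."""
    ordered = sorted(sizes, reverse=True)  # rank 1 is the largest site
    x = [math.log(s) for s in ordered]
    y = [math.log(i) for i in range(1, len(ordered) + 1)]
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return -sxy / sxx  # the regression slope is -zeta


def monte_carlo_se(n, simulations, zeta=1.0, seed=0):
    """Mean and standard deviation of the OLS estimate across simulated samples."""
    rng = random.Random(seed)
    estimates = [ols_zeta(pareto_sample(n, zeta, rng)) for _ in range(simulations)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


# 200 replications keep the sketch fast; the paper uses 20,000 for each N.
mean_estimate, se = monte_carlo_se(n=1_000, simulations=200)
print(f"mean estimate {mean_estimate:.3f}, Monte Carlo standard error {se:.3f}")
```

Because the simulated standard error depends only on N under this assumption, the same value applies to every geography and size measure at a given tail size.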
With large sample sizes, it is perhaps not surprising that the null hypothesis is rejected in
many cases. Gabaix (2008) cautions against testing whether or not data follows Zipf’s law by
statistical rejection, and rather advocates judging by fit. Additionally, Gabaix (2008) suggests that
any ζ ∈ [0.8, 1.2] is a useful result that merits further theoretical investigation. Using this metric,
all estimates except for a few of China’s follow Zipf’s law. Figures 1 and 2 plot the data along with
the OLS slope estimates for all combinations of geography and tail size. As seen in the plots for
30,000 sites, the smallest sites tend to be smaller than a power law would suggest. Additionally,
the largest sites tend to be too small when size is measured by reach, and too large when size is
measured by page views. In general, data for Germany, the U.S., and the world fit a straight line very
well, while data for China deviate substantially from a power law.
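The [0.8, 1.2] criterion can be applied mechanically to the OLS estimates. The sketch below uses the values as read from Table 2 and lists the estimates that fall outside the band. The dictionary layout and function name are illustrative.

```python
# OLS estimates from Table 2, keyed by (size measure, geography); each tuple
# holds the estimates for tail sizes 1,000, 10,000, and 30,000.
TAIL_SIZES = (1_000, 10_000, 30_000)
OLS_ESTIMATES = {
    ("reach", "World"): (1.137, 1.098, 1.050),
    ("reach", "China"): (0.873, 1.260, 0.814),
    ("reach", "Germany"): (1.176, 1.064, 1.045),
    ("reach", "U.S."): (1.096, 1.045, 0.998),
    ("pageviews", "World"): (0.970, 1.010, 0.892),
    ("pageviews", "China"): (0.908, 1.039, 0.665),
    ("pageviews", "Germany"): (1.090, 1.091, 0.872),
    ("pageviews", "U.S."): (0.994, 1.044, 0.903),
}
ZIPF_BAND = (0.8, 1.2)  # range suggested by Gabaix (2008)


def outside_band(estimates, band=ZIPF_BAND):
    """Return (measure, geography, N, estimate) for estimates outside the band."""
    low, high = band
    return [
        (measure, geography, n, zeta)
        for (measure, geography), row in estimates.items()
        for n, zeta in zip(TAIL_SIZES, row)
        if not low <= zeta <= high
    ]


for measure, geography, n, zeta in outside_band(OLS_ESTIMATES):
    print(f"{geography}, {measure}, N = {n:,}: estimate {zeta:.3f} is outside the band")
```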
Next, I find maximum likelihood estimates for the Pareto distribution and judge goodness of
fit using visual and quantitative methods. For the Pareto distribution, I use the Hill estimator and
associated standard error described in Gabaix (2008), which is the maximum likelihood estimator under the null
hypothesis of a perfect power law. Hill’s estimator is given by

$$ \hat{\zeta} = \frac{N - 2}{\sum_{i=1}^{N-1} \left( \ln S_{(i)} - \ln S_{(N)} \right)} \qquad (2) $$

with standard error $\hat{\zeta}(N - 3)^{-1/2}$. Table 3 reports the results. Similar to the OLS results, most, but
not all, of the estimates reject the null hypothesis. However, all geographies except China have
estimates close to one for tail sizes of 1,000 and 10,000. The Pareto distribution does not fit the
data well for a tail of 30,000 sites, as the numerous low-ranked sites are too small to fit a power
law. This is visually apparent in Figures 3 and 4.
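A minimal sketch of the Hill estimator follows. The standard error is taken to be ζ̂(N − 3)^(−1/2), the form given in Gabaix (2008); this is an assumption of the sketch, as are the function name and the synthetic test data.

```python
import math
import random


def hill_estimator(sizes):
    """Hill estimate of zeta and its standard error for the N largest sizes.

    Computes zeta_hat = (N - 2) / sum_{i=1}^{N-1} (ln S_(i) - ln S_(N)),
    where S_(N) is the smallest observation in the tail.
    """
    ordered = sorted(sizes, reverse=True)
    n = len(ordered)
    if n < 4:
        raise ValueError("need at least four observations")
    log_min = math.log(ordered[-1])
    total = sum(math.log(s) - log_min for s in ordered[:-1])
    zeta_hat = (n - 2) / total
    # Assumed standard error: zeta_hat * (N - 3)^(-1/2).
    std_error = zeta_hat / math.sqrt(n - 3)
    return zeta_hat, std_error


# Synthetic check on an exact Pareto sample with zeta = 1 (not the Alexa data).
rng = random.Random(0)
sample = [(1.0 - rng.random()) ** -1.0 for _ in range(10_000)]
zeta_hat, std_error = hill_estimator(sample)
print(f"Hill estimate {zeta_hat:.3f} (standard error {std_error:.3f})")
```

The standard errors in Table 3 are close to the estimate divided by the square root of N, which is the order of magnitude this formula produces.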
Last, I find the maximum likelihood estimates of the mean and standard deviation of the
log-normal distribution. I test fit using the Kolmogorov-Smirnov (KS) test, which measures the
maximum deviation of the empirical CDF from the estimated log-normal CDF. Results are reported
in Table 4. The log-normal distribution is not a good fit for any combination of geography, tail
size, and size measure. This is perhaps unsurprising, as I am only examining the upper tail of
the website distribution rather than the full distribution. An industry brief by Verisign reports
271 million registered domains at the end of 2013, so even the largest tail size considered here
represents the extreme upper tail.⁸ Eeckhout (2004) finds evidence that the distribution of all city
sizes follows a log-normal distribution. Rossi-Hansberg and Wright (2007) and Holmes and Lee
(2010) both find that when the entire distribution is included, the log-log plot of the size distribution
(for firms and six-by-six-mile squares, respectively) takes on a concave shape. This can already be
seen here in the plots for N = 30,000, where the smallest websites start to bend down in a concave
shape. The log-normal distribution may therefore provide a better fit as more of the website size
distribution is included.
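The log-normal fit and the KS statistic can be sketched with the standard library alone. The code below is illustrative: it computes the maximum likelihood estimates of the mean and standard deviation of log size and the maximum gap between the empirical CDF and the fitted CDF, and it does not compute p-values. The synthetic samples and function names are assumptions of the sketch.

```python
import math
import random


def fit_lognormal(sizes):
    """Maximum likelihood estimates of the mean and std. dev. of ln(size)."""
    logs = [math.log(s) for s in sizes]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_cdf(x, mu, sigma):
    """CDF of a log-normal distribution evaluated at x > 0."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sizes, mu, sigma):
    """Maximum deviation between the empirical CDF and the fitted CDF."""
    ordered = sorted(sizes)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered, start=1):
        f = lognormal_cdf(x, mu, sigma)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        worst = max(worst, i / n - f, f - (i - 1) / n)
    return worst


# Synthetic data: one log-normal sample and one exact Pareto tail (zeta = 1).
rng = random.Random(0)
lognormal_sizes = [rng.lognormvariate(5.0, 1.0) for _ in range(5_000)]
pareto_tail = [(1.0 - rng.random()) ** -1.0 for _ in range(5_000)]

ks_lognormal = ks_statistic(lognormal_sizes, *fit_lognormal(lognormal_sizes))
ks_pareto = ks_statistic(pareto_tail, *fit_lognormal(pareto_tail))
print(f"KS, log-normal sample: {ks_lognormal:.3f}")
print(f"KS, Pareto tail:       {ks_pareto:.3f}")
```

On the synthetic Pareto tail the statistic is much larger than on the log-normal sample, which is the pattern one would expect when a log-normal is fitted to power-law data.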
4 Conclusion
This letter demonstrates that the upper tail of the website size distribution follows Zipf’s law in
the United States, Germany, and the world. Data for China deviates from Zipf’s law. The power
law nature of website size is a striking feature that can guide the development of models that either
generate a website size distribution or take a size distribution as given.
These results also suggest a few interesting theoretical questions. First, why do websites follow
Zipf’s law? New models of random growth specific to the Internet or of the distribution of consumer
tastes and the resulting consumer demand for websites are interesting avenues of exploration.
Second, why does the data for China look so different? Compared to the United States and
Germany, China has heavy Internet regulation and censorship that may distort viewing habits.
Does censorship lead to the observed deviations from a power law?
⁸ http://www.verisigninc.com/assets/domain-name-report-april2014.pdf

References
L. A. Adamic and B. A. Huberman. The nature of markets in the world wide web. Quarterly
Journal of Electronic Commerce, 1(1):5–12, 2000.
L. A. Adamic and B. A. Huberman. Zipf’s law and the internet. Glottometrics, 3(1):143–150, 2002.
R. Albert, H. Jeong, and A.-L. Barabási. Internet: Diameter of the world-wide web. Nature, 401
(6749):130–131, 1999.
R. L. Axtell. Zipf distribution of U.S. firm sizes. Science, 293(5536):1818–1820, 2001.
L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions:
evidence and implications. In INFOCOM ’99. Eighteenth Annual Joint Conference of the IEEE
Computer and Communications Societies. Proceedings, volume 1, pages 126–134, 1999.
J. Eeckhout. Gibrat’s law for (all) cities. The American Economic Review, 94(5):1429–1451,
2004.
X. Gabaix. Zipf’s law for cities: An explanation. The Quarterly Journal of Economics, 114(3):
739–767, 1999.
X. Gabaix. Power laws in economics and finance. NBER Working Paper
No. 14299, 2008.
X. Gabaix and Y. M. Ioannides. The evolution of city size distributions. Handbook of Regional and
Urban Economics, 4:2341–2378, 2004.
T. J. Holmes and S. Lee. Cities as six-by-six-mile squares: Zipf’s law? In Agglomeration Economics,
pages 105–131. University of Chicago Press, 2010.
E. G. J. Luttmer. Selection, growth, and the size distribution of firms. The Quarterly Journal of
Economics, 122(3):1103–1144, 2007.
E. Rossi-Hansberg and M. L. J. Wright. Establishment size dynamics in the aggregate economy.
The American Economic Review, 97(5):1639–1666, 2007.
                  World                      China                      Germany                    U.S.
N =               1,000    10,000   30,000   1,000    10,000   30,000   1,000    10,000   30,000   1,000    10,000   30,000
Reach      ζ̂      1.137    1.098    1.050    0.873    1.260    0.814    1.176    1.064    1.045    1.096    1.045    0.998
                  (0.044)  (0.014)  (0.008)  (0.044)  (0.014)  (0.008)  (0.044)  (0.014)  (0.008)  (0.044)  (0.014)  (0.008)
           R²     0.9988   0.9997   0.9959   0.9942   0.9491   0.9074   0.9982   0.9982   0.9968   0.9954   0.9992   0.9960
Pageviews  ζ̂      0.970    1.010    0.892    0.908    1.039    0.665    1.090    1.091    0.872    0.994    1.044    0.903
                  (0.044)  (0.014)  (0.008)  (0.044)  (0.014)  (0.008)  (0.044)  (0.014)  (0.008)  (0.044)  (0.014)  (0.008)
           R²     0.9979   0.9995   0.9810   0.9982   0.9900   0.9029   0.9913   0.9978   0.9694   0.9955   0.9991   0.9778

Table 2: OLS estimates of the slope coefficient ζ, with the null hypothesis that the data follows
Zipf’s law. Standard errors (in parentheses) are constructed according to Gabaix and Ioannides (2004). Estimates in bold are those for
which the null hypothesis is not rejected.
                  World                      China                      Germany                    U.S.
N =               1,000    10,000   30,000   1,000    10,000   30,000   1,000    10,000   30,000   1,000    10,000   30,000
Reach      ζ̂      1.141    1.093    0.628    1.019    1.710    0.523    1.178    1.025    0.830    1.043    1.033    0.741
                  (0.036)  (0.011)  (0.004)  (0.032)  (0.017)  (0.003)  (0.037)  (0.010)  (0.005)  (0.033)  (0.010)  (0.004)
Pageviews  ζ̂      0.985    1.004    0.541    0.894    1.241    0.422    1.151    1.023    0.573    1.011    1.048    0.599
                  (0.031)  (0.010)  (0.003)  (0.028)  (0.012)  (0.002)  (0.036)  (0.010)  (0.003)  (0.032)  (0.010)  (0.003)

Table 3: Hill estimates of ζ. Standard errors (in parentheses) are constructed according to Gabaix (2008). Estimates in bold
are those for which the null hypothesis is not rejected.
                  World                      China                      Germany                    U.S.
N =               1,000    10,000   30,000   1,000    10,000   30,000   1,000    10,000   30,000   1,000    10,000   30,000
Reach      μ̂      8.105    6.008    4.960    7.530    5.721    4.396    8.364    6.223    5.175    8.359    6.144    5.039
           σ̂      0.749    0.825    0.902    1.265    0.595    1.366    0.700    0.878    0.910    0.803    0.911    0.998
           KS     0.143    0.145    0.123    0.219    0.247    0.095    0.150    0.144    0.143    0.121    0.144    0.121
Pageviews  μ̂      5.123    2.849    1.625    5.306    3.046    1.477    5.176    3.118    1.841    5.035    2.826    1.617
           σ̂      1.030    0.974    1.230    1.174    0.912    2.035    0.809    0.834    1.272    0.977    0.912    1.197
           KS     0.163    0.153    0.094    0.171    0.190    0.124    0.157    0.123    0.095    0.156    0.148    0.085

Table 4: Log-normal estimates and KS tests. All estimates reject the null hypothesis that the data
was generated from a log-normal distribution with p-values < 2.2e−16.
Figure 1: Linear regression plots for reach. Panels (a) to (l): World, China, Germany, and United States, each with N = 1,000, N = 10,000, and N = 30,000.
Figure 2: Linear regression plots for page views. Panels (a) to (l): World, China, Germany, and United States, each with N = 1,000, N = 10,000, and N = 30,000.
Figure 3: Maximum likelihood plots for reach. Panels (a) to (l): World, China, Germany, and United States, each with N = 1,000, N = 10,000, and N = 30,000.
Figure 4: Maximum likelihood plots for page views. Panels (a) to (l): World, China, Germany, and United States, each with N = 1,000, N = 10,000, and N = 30,000.