The Size Distribution of Websites∗

Steven Schmeiser†
Mount Holyoke College

October 2014

Abstract

The upper tail of the size distribution of websites follows a power law with slope close to one (Zipf’s law). This finding is robust to measuring website size by unique visitors and page views, and holds for the United States, Germany, and the world. The evidence for a power law is weaker for web traffic in China.

∗ This letter benefited greatly from discussions with Ashley Miller, Katherine Schmeiser, and Erzo G.J. Luttmer. All errors are my own.
† Email: [email protected]

1 Introduction

Many size distributions of interest in economics take the form of a power law with a slope coefficient ζ near one (Zipf’s law). Well-studied examples of this phenomenon include city sizes¹ and the size distribution of firms². Here, I use OLS regressions and MLE estimates of the Pareto distribution to show that the upper tail of the website size distribution also follows a power law with ζ close to one. The observation that the upper tail of the website size distribution follows Zipf’s law provides a strong constraint for models that either predict or exogenously specify a website size distribution, and contributes to the growing list of empirical power laws in economics.

I analyze data on the top 30,000 sites visited worldwide, as well as the top 30,000 sites viewed by users in China, Germany, and the United States. Size is measured two ways: by number of unique visitors, and by number of page views. Size distributions for Germany, the U.S., and the world all follow a power law, while the evidence for China is weaker.

To my knowledge this is the first paper to document the size distribution of websites in the economics literature. A few notable examples from computer science include Adamic and Huberman (2000), who obtain data on sites accessed by America Online users in 1997 and find evidence of a power law distribution, but do not report the slope coefficient.
Adamic and Huberman (2002) revisit the AOL data and plot the data against a Pareto distribution with slope 1. Breslau et al. (1999) conduct six traces on academic, ISP, and corporate networks between 1996 and 1998 and find that web requests from a fixed group of users follow a power law distribution with slope ranging from 0.64 to 0.83 depending on the network. Last, Albert et al. (1999) investigate the topological properties of the World-Wide Web and find that the tail distributions of incoming and outgoing links on web pages follow power laws with slopes 2.1 and 2.45, respectively. Here, I use recent data collected from an Internet data company and find a slope coefficient close to one for many combinations of geography and tail size.

¹ Gabaix (1999), Gabaix and Ioannides (2004), Holmes and Lee (2010).
² Luttmer (2007), Axtell (2001).

2 Data

The data is gathered from Alexa Top Sites³. Alexa ranks websites according to their Alexa Traffic Rank, which is a combination of reach and page views (defined below). The rankings are calculated based on a three-month period ending in September 2014 and are aggregated to the domain level. For instance, www.example.com/index.html, www.example.com/subdir/anotherpage.html, and subdomain.example.com/index.html are all included under example.com.⁴

³ http://aws.amazon.com/alexa-top-sites/. Alexa is an Amazon company.
⁴ The company does note that it separates out personal home pages and blogs when possible.

The data is collected from users of Alexa’s browser toolbar and other web data sources.⁵ The data includes two measures: reach and page views. Reach is defined as the number of unique visitors per million users that a website receives on a given day. Reach divided by 1,000,000 gives the fraction of the web-using population that visits the site. Page views are defined as the number of URL requests for a site, per million page requests.
Page views divided by 1,000,000 gives the fraction of page requests that were directed at the given site. Multiple requests from the same user for the same URL on the same day are only counted once. This could result in under-reporting of sites that dynamically update content that users access at the same URL, such as facebook.com/.

Both reach and page views are reasonable measures of website size. Reach emphasizes the size of a website’s audience, while page views proxy for both the amount of activity a site receives and opportunities to display advertisements (a primary source of website revenue).

I collect the top 30,000 sites for the world and three countries: China, Germany, and the United States. Country is determined by the location of the visitor, not the location of the website. The reach and page view measures described above are for a given geography. For example, in the Germany data, a site with reach of 100,000 would indicate that ten percent of German web users visit that site. Verisign estimates that of the 2.5 billion Internet users worldwide, China has the most (618 million) and the U.S. has the second most (254 million).⁶ The top 30,000 sites account for about 76% of world page views, 95% in China, 80% in Germany, and 82% in the United States. However, 30,000 sites represent only a small fraction of the 271 million web domains.⁷

Summary statistics for the data are presented in Table 1. Within a country, the reach (per million) of the top site is greater than the reach of the top site worldwide, because a country’s top site is targeted more closely to the population over which reach is measured. As can be seen in the table, an order of magnitude increase in rank roughly corresponds to an order of magnitude decrease in size for many of the combinations of geography and size measure.

3 Analysis

Let i ∈ {1, 2, 3, ...} denote the rank of a website, with i = 1 being the largest, and let S(i) denote the size of the i’th largest site.
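As a rough illustration of the rank-size pattern noted for Table 1, the sketch below computes the slope implied by each pair of adjacent rows in the World reach column: the change in log rank divided by the (negative) change in log size. The numbers are copied from Table 1. The function name `implied_slopes` is hypothetical, and the calculation is only a back-of-the-envelope check, not the estimation procedure used in this letter.

```python
import math

# World reach (per million users) at selected ranks, copied from Table 1.
ranks = [1, 10, 100, 1_000, 10_000, 30_000]
world_reach = [496_000, 64_930, 10_430, 1_380, 163, 29]


def implied_slopes(ranks, sizes):
    """Slope between adjacent (rank, size) points on a log-log plot.

    A value near 1 means that a tenfold increase in rank comes with
    roughly a tenfold decrease in size.
    """
    points = list(zip(ranks, sizes))
    slopes = []
    for (r0, s0), (r1, s1) in zip(points, points[1:]):
        slopes.append(math.log(r1 / r0) / math.log(s0 / s1))
    return slopes


slopes = implied_slopes(ranks, world_reach)
for r0, r1, z in zip(ranks, ranks[1:], slopes):
    print(f"ranks {r0:>6,} to {r1:>6,}: implied slope {z:.2f}")

# Reach is per million users, so dividing by 1,000,000 gives the fraction
# of the web-using population that visits the top site.
top_site_share = world_reach[0] / 1_000_000
```

For this column the first four implied slopes lie between about 1.1 and 1.3, while the last one (ranks 10,000 to 30,000) is about 0.64, so the smallest sites in the sample shrink faster than the pattern among the larger sites would suggest.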
Regressing the log of rank on the log of size is a common method to measure the fit of data to a power law. I run the regression

\[ \ln i = \mathrm{const} - \hat{\zeta} \ln S(i) + \varepsilon_i \tag{1} \]

with the null hypothesis that ζ = 1 to find the slope coefficient for each combination of geographic area, size measure, and tail size (1,000, 10,000, and 30,000).

⁵ http://aws.amazon.com/awis/faqs/
⁶ http://www.verisigninc.com/assets/domain-name-report-april2014.pdf
⁷ Ibid.

                    World                China               Germany                U.S.
               Reach  Pageviews     Reach  Pageviews     Reach  Pageviews     Reach  Pageviews
Site 1       496,000    113,250   779,700    106,510   778,000     97,480   834,200    194,900
Site 10       64,930      5,675   185,400     15,433    87,100      4,497    99,500      6,562
Site 100      10,430        648     8,500        876    12,600        591    13,490        556
Site 1,000     1,380         61       700         66     1,838         74     1,640         57
Site 10,000      163       6.38       170        9.4       190        8.5       177        6.5
Site 30,000       29        0.8        12       0.41        53        1.1        40       0.95

Table 1: Summary statistics for reach and pageviews. Reach and pageviews are both measured on a per-million basis within the specified geographic market.

As outlined in Gabaix and Ioannides (2004), standard errors reported by statistical software packages are incorrect due to correlation in the residuals generated by the ranking procedure. I therefore construct standard errors according to the process outlined in Gabaix and Ioannides (2004) by running 20,000 Monte Carlo simulations for each N. Results are reported in Table 2. The null hypothesis is rejected for most combinations of tail size and geography; however, it is not rejected for the U.S. with tail size 30,000 when size is measured by reach, and it is not rejected for the world or the U.S. for small tail sizes when size is measured by page views. With large sample sizes, it is perhaps not surprising that the null hypothesis is rejected in many cases. Gabaix (2008) cautions against testing whether or not data follows Zipf’s law by statistical rejection, and instead advocates judging by fit.
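The regression in equation (1) and a simulated standard error can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Table 2: it assumes that the Monte Carlo step draws N sizes from a Pareto distribution with ζ = 1 (the null hypothesis), reruns the regression on each draw, and takes the standard deviation of the resulting estimates. The function names `ols_zeta` and `mc_standard_error`, the seed, and the use of NumPy are all choices made for this example.

```python
import numpy as np


def ols_zeta(sizes):
    """Estimate zeta by regressing ln(rank) on ln(size), as in equation (1)."""
    s = np.sort(np.asarray(sizes, dtype=float))[::-1]  # largest site first
    ranks = np.arange(1, s.size + 1)                   # i = 1 is the largest
    slope, _intercept = np.polyfit(np.log(s), np.log(ranks), 1)
    return -slope  # equation (1) carries a minus sign on zeta


def mc_standard_error(n, n_sims=20_000, zeta=1.0, seed=0):
    """Standard deviation of the OLS estimate across simulated Pareto samples.

    Assumption: each simulated sample has n sizes drawn from a Pareto
    distribution with tail exponent `zeta` and minimum size 1.
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_sims)
    for k in range(n_sims):
        u = 1.0 - rng.random(n)        # uniform on (0, 1]
        sizes = u ** (-1.0 / zeta)     # inverse-CDF draw: P(S > s) = s^(-zeta)
        estimates[k] = ols_zeta(sizes)
    return estimates.std(ddof=1)
```

A call such as `mc_standard_error(1000)` corresponds to the N = 1,000 tail; a smaller `n_sims` gives a quicker but noisier figure.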
Additionally, Gabaix (2008) suggests that any ζ ∈ [0.8, 1.2] is a useful result that merits further theoretical investigation. Using this metric, all estimates except for a few of China’s are consistent with Zipf’s law. Figures 1 and 2 plot the data along with the OLS slope estimates for all combinations of geography and tail size. As seen in the plots for 30,000 sites, the smallest sites tend to be smaller than a power law would suggest. Additionally, the largest sites tend to be too small when size is measured by reach, and too large when size is measured by page views. In general, data for Germany, the U.S., and the world fit a straight line very well, while data for China has large deviations from a power law.

Next, I find maximum likelihood estimates for the Pareto distribution and judge goodness of fit using visual and quantitative methods. For the Pareto distribution, I use the Hill estimator and associated standard error described in Gabaix (2008), which is the maximum likelihood estimator under the null hypothesis of a perfect power law. Hill’s estimator is given by

\[ \hat{\zeta} = \frac{N - 2}{\sum_{i=1}^{N-1} \left( \ln S(i) - \ln S(N) \right)} \tag{2} \]

with standard error $\hat{\zeta} (N/2)^{-1/2}$. Table 3 reports the results. Similar to the OLS results, most, but not all, of the estimates reject the null hypothesis. However, all geographies except China have estimates close to one for tail sizes of 1,000 and 10,000. The Pareto distribution does not fit the data well for a tail of 30,000 sites, as the numerous low-ranked sites are too small to fit a power law. This is visually apparent in Figures 3 and 4.

Last, I find the maximum likelihood estimates of the mean and standard deviation of the log-normal distribution. I test fit using the Kolmogorov-Smirnov (KS) test, which measures the maximum deviation of the empirical CDF from the estimated log-normal CDF. Results are reported in Table 4. The log-normal distribution is not a good fit for any combination of geography, tail size, and size measure.
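The two maximum likelihood steps described above can be sketched in a few lines. This is an illustration under assumptions rather than the code behind Tables 3 and 4: `hill_zeta` implements equation (2) directly, and `lognormal_ks` fits the mean and standard deviation of log size and then computes the KS distance by hand, without a p-value. Both function names are hypothetical.

```python
import math

import numpy as np


def hill_zeta(sizes):
    """Hill estimate of zeta following equation (2)."""
    # ln S(1) >= ln S(2) >= ... >= ln S(N)
    log_s = np.log(np.sort(np.asarray(sizes, dtype=float))[::-1])
    n = log_s.size
    # Sum over i = 1..N-1 of ln S(i) - ln S(N).
    return (n - 2) / np.sum(log_s[:-1] - log_s[-1])


def lognormal_ks(sizes):
    """Fit a log-normal by maximum likelihood; return (mu, sigma, KS distance)."""
    log_s = np.sort(np.log(np.asarray(sizes, dtype=float)))  # ascending order
    n = log_s.size
    mu = log_s.mean()
    sigma = log_s.std()  # ddof=0 is the maximum likelihood estimate
    # Fitted CDF evaluated at each observation (normal CDF of ln S).
    cdf = np.array(
        [0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0)))) for x in log_s]
    )
    above = np.arange(1, n + 1) / n - cdf  # empirical CDF just after each point
    below = cdf - np.arange(0, n) / n      # empirical CDF just before each point
    return mu, sigma, max(above.max(), below.max())
```

Both functions take the raw sizes of the top N sites, so the same array can be passed to each for a given geography, size measure, and tail size.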
This is perhaps unsurprising, as I am only examining the upper tail of the website distribution rather than the full distribution. An industry brief by Verisign reports 271 million registered domains at the end of 2013, so even the largest tail size considered here represents the extreme upper tail.⁸ Eeckhout (2004) finds evidence that the distribution of all city sizes follows a log-normal distribution. Rossi-Hansberg and Wright (2007) and Holmes and Lee (2010) both find that when the entire distribution is included, the log-log plot of the size distribution (for firms and six-by-six mile squares, respectively) takes on a concave shape. This can already be seen here in the plots for N = 30,000, where the smallest websites start to bend down in a concave shape. The log-normal distribution may therefore provide a better fit as more of the website size distribution is included.

4 Conclusion

This letter demonstrates that the upper tail of the website size distribution follows Zipf’s law in the United States, Germany, and the world. Data for China deviates from Zipf’s law. The power law nature of website size is a striking feature that can guide the development of models that either generate a website size distribution or take a size distribution as given.

These results also suggest a few interesting theoretical questions. First, why do websites follow Zipf’s law? New models of random growth specific to the Internet or of the distribution of consumer tastes and the resulting consumer demand for websites are interesting avenues of exploration. Second, why does the data for China look so different? Compared to the United States and Germany, China has heavy Internet regulation and censorship that may distort viewing habits. Does censorship lead to the observed deviations from a power law?

References

L. A. Adamic and B. A. Huberman. The nature of markets in the world wide web. Quarterly Journal of Electronic Commerce, 1(1):5–12, 2000.
L. A. Adamic and B. A. Huberman.
Zipf’s law and the internet. Glottometrics, 3(1):143–150, 2002.
R. Albert, H. Jeong, and A.-L. Barabasi. Internet: Diameter of the world-wide web. Nature, 401(6749):130–131, 1999.
R. L. Axtell. Zipf distribution of U.S. firm sizes. Science, 293(5536):1818–1820, 2001.
L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: evidence and implications. In INFOCOM ’99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings, volume 1, pages 126–134, 1999.
J. Eeckhout. Gibrat’s law for (all) cities. The American Economic Review, 94(5):1429–1451, 2004.
X. Gabaix. Zipf’s law for cities: An explanation. The Quarterly Journal of Economics, 114(3):739–767, 1999.
X. Gabaix. Power laws in economics and finance. NBER Working Paper No. 14299, 2008.
X. Gabaix and Y. M. Ioannides. The evolution of city size distributions. Handbook of Regional and Urban Economics, 4:2341–2378, 2004.
T. J. Holmes and S. Lee. Cities as six-by-six-mile squares: Zipf’s law? In Agglomeration Economics, pages 105–131. University of Chicago Press, 2010.
E. G. J. Luttmer. Selection, growth, and the size distribution of firms. The Quarterly Journal of Economics, 122(3):1103–1144, 2007.
E. Rossi-Hansberg and M. L. J. Wright. Establishment size dynamics in the aggregate economy. The American Economic Review, 97(5):1639–1666, 2007.

⁸ http://www.verisigninc.com/assets/domain-name-report-april2014.pdf

                          Reach                    Pageviews
Geography       N    ζ̂ (s.e.)         R²      ζ̂ (s.e.)         R²
World       1,000    1.137 (0.044)  0.9988    0.970 (0.044)  0.9979
World      10,000    1.098 (0.014)  0.9997    1.010 (0.014)  0.9995
World      30,000    1.050 (0.008)  0.9959    0.892 (0.008)  0.9810
China       1,000    0.873 (0.044)  0.9942    0.908 (0.044)  0.9982
China      10,000    1.260 (0.014)  0.9491    1.039 (0.014)  0.9900
China      30,000    0.814 (0.008)  0.9074    0.665 (0.008)  0.9029
Germany     1,000    1.176 (0.044)  0.9982    1.090 (0.044)  0.9913
Germany    10,000    1.064 (0.014)  0.9982    1.091 (0.014)  0.9978
Germany    30,000    1.045 (0.008)  0.9968    0.872 (0.008)  0.9694
U.S.        1,000    1.096 (0.044)  0.9954    0.994 (0.044)  0.9955
U.S.       10,000    1.045 (0.014)  0.9992    1.044 (0.014)  0.9991
U.S.       30,000    0.998 (0.008)  0.9960    0.903 (0.008)  0.9778

Table 2: OLS estimates of the slope coefficient ζ, with the null hypothesis that the data follows Zipf’s law. Standard errors (in parentheses) are constructed according to Gabaix and Ioannides (2004). Estimates in bold are those for which the null hypothesis is not rejected.

                     Reach            Pageviews
Geography       N    ζ̂ (s.e.)         ζ̂ (s.e.)
World       1,000    1.141 (0.036)    0.985 (0.031)
World      10,000    1.093 (0.011)    1.004 (0.010)
World      30,000    0.628 (0.004)    0.541 (0.003)
China       1,000    1.019 (0.032)    0.894 (0.028)
China      10,000    1.710 (0.017)    1.241 (0.012)
China      30,000    0.523 (0.003)    0.422 (0.002)
Germany     1,000    1.178 (0.037)    1.151 (0.036)
Germany    10,000    1.025 (0.010)    1.023 (0.010)
Germany    30,000    0.830 (0.005)    0.573 (0.003)
U.S.        1,000    1.043 (0.033)    1.011 (0.032)
U.S.       10,000    1.033 (0.010)    1.048 (0.010)
U.S.       30,000    0.741 (0.004)    0.599 (0.003)

Table 3: Hill estimates of ζ. Standard errors (in parentheses) are constructed according to Gabaix (2008). Estimates in bold are those for which the null hypothesis is not rejected.

                          Reach                  Pageviews
Geography       N       µ̂      σ̂     KS        µ̂      σ̂     KS
World       1,000    8.105  0.749  0.143    5.123  1.030  0.163
World      10,000    6.008  0.825  0.145    2.849  0.974  0.153
World      30,000    4.960  0.902  0.123    1.625  1.230  0.094
China       1,000    7.530  1.265  0.219    5.306  1.174  0.171
China      10,000    5.721  0.595  0.247    3.046  0.912  0.190
China      30,000    4.396  1.366  0.095    1.477  2.035  0.124
Germany     1,000    8.364  0.700  0.150    5.176  0.809  0.157
Germany    10,000    6.223  0.878  0.144    3.118  0.834  0.123
Germany    30,000    5.175  0.910  0.143    1.841  1.272  0.095
U.S.        1,000    8.359  0.803  0.121    5.035  0.977  0.156
U.S.       10,000    6.144  0.911  0.144    2.826  0.912  0.148
U.S.       30,000    5.039  0.998  0.121    1.617  1.197  0.085

Table 4: Log-normal estimates and KS tests. All estimates reject the null hypothesis that the data was generated from a log-normal distribution with p-values < 2.2e-16.

Figure 1: Linear regression plots for reach. Panels (a) to (l): World, China, Germany, and United States, each for N = 1,000, N = 10,000, and N = 30,000.

Figure 2: Linear regression plots for page views. Panels (a) to (l): World, China, Germany, and United States, each for N = 1,000, N = 10,000, and N = 30,000.

Figure 3: Maximum likelihood plots for reach. Panels (a) to (l): World, China, Germany, and United States, each for N = 1,000, N = 10,000, and N = 30,000.

Figure 4: Maximum likelihood plots for page views. Panels (a) to (l): World, China, Germany, and United States, each for N = 1,000, N = 10,000, and N = 30,000.