Heads in the Clouds? Measuring Universities' Migration to Public Clouds: Implications for Privacy & Academic Freedom

With the emergence of remote education and work in universities due to COVID-19, the ‘zoomification’ of higher education, i.e., the migration of universities to the clouds, reached the public discourse. Ongoing discussions reason about how this shift will take control over students’ data away from universities, and may ultimately harm the privacy of researchers and students alike. However, there has been no comprehensive measurement of universities’ use of public clouds and reliance on Software-as-a-Service offerings to assess how far this migration has already progressed. We perform a longitudinal study of the migration to public clouds among universities in the U.S. and Europe, as well as institutions listed in the Times Higher Education (THE) Top100, between January 2015 and October 2022. We find that cloud adoption differs between countries, with one cluster (Germany, France, Austria, Switzerland) showing a limited move to clouds, while the other (U.S., U.K., the Netherlands, THE Top100) frequently outsources universities’ core functions and services, starting long before the COVID-19 pandemic. We attribute this clustering to several socio-economic factors in the respective countries, including the general culture of higher education and the administrative paradigm taken towards running universities. We then analyze and interpret our results, finding that the implications reach beyond individuals’ privacy towards questions of academic independence and integrity.


INTRODUCTION
Over the past decade, we have seen a shift in IT operations towards the use of cloud infrastructures [104,140]. Instead of running IT services with on-site teams and on infrastructure owned by organizations, services are now often deployed on public cloud infrastructure. Especially for web services, the model of using Software-as-a-Service (SaaS) has become prominent. However, this operational paradigm shift also leads to a change in control. While, before, user data would remain on infrastructure controlled by an organization, this data is now stored and processed by an external operator.
For universities using cloud infrastructures, this leads to hard challenges. These stretch from limiting their ability to audit, to implement privacy-by-design (i.e., privacy guarantees ensured through technical means), or to ensure privacy-as-compliance (e.g., in terms of following privacy regulations [50,156]), to impacting a university's ability to obtain meaningful informed consent when it employs cloud operators. Over the past year, for example, much debate surrounded the use of Zoom as the now de facto standard for remote lectures. Zoom only started to systematically attend to privacy and security concerns raised by educational institutions when pressure was handed down to the company from investors [103]. Still, universities that adopted Zoom for their lectures practically reduced students' consent choices to either using Zoom, and having their personal data processed by Zoom, or not participating in lectures.
The infrastructural and data control acquired by companies like Zoom have a knock-on effect on academic freedom. In 2020, Zoom ultimately prevented faculty and students at New York University from conducting a guest lecture (incidentally, on censorship by Zoom and other tech companies) using their Zoom license [117]. The question hence expands beyond 'what private data do universities share with cloud platforms' to include 'in what way can these cloud platforms use their infrastructural position and data practices to influence academic processes in universities.' The adoption of educational technology ('EdTech'), i.e., the use of "market-facing digital technologies in education" [109], has already prompted critical studies from the social sciences, warning about blurring lines between public educational institutions and private corporations as a threat to academic self-governance [91,109,132,144]. Despite these concerns, there are no comprehensive measurements of how reliant universities are on public cloud infrastructures. We address this gap by measuring cloud adoption in universities since 2015 in seven countries (the U.S., the U.K., Germany, Switzerland, Austria, the Netherlands, and France) and in the Times Higher Education (THE) Top100. We measure universities' hosting on cloud platforms, their use of cloud-based email providers, cloud-based learning management systems (LMS), and cloud-based video and lecturing tools.
We find that universities in the Netherlands, the U.K., the U.S., and in the THE Top100 are significantly more prone to depend on cloud infrastructures, while those in France and Germany rely far more on in-house services. We attribute these differences to a diverse set of socio-economic factors, including a historically different understanding of what higher education means, the university functions (research, education, administration) the IT infrastructure is aligned with, and the value placed on academic independence. We further observe that universities' migration to centralized clouds (Google/Amazon/Microsoft) does not show a clear pandemic effect as observed for the Internet as a whole [62]. The notable exception is video conferencing tools, where we see a clear uptick of adoption across the board, except for the U.S., where especially Zoom adoption was on the rise years before the pandemic. In summary, we make the following contributions:
• We are the first to map out the cloud dependence of universities in Europe, the U.S., and the THE Top100, and find that it is an ongoing process predating the COVID-19 pandemic.
• We document and attribute differences between countries to different paradigms for university IT and higher education.
• We find that data and infrastructure control have implications for privacy and beyond, also threatening academic freedom.

UNIVERSITY IT
Generally speaking, universities are organizations with a purpose or function, which can be supported by IT pillars [154]. Commonly, these major functions are: Education, Research, and, to enable these two, Administration [141]. While these functions may seem intuitively discrete, they partially overlap, also in the tools and applications used to address their needs. Universities look towards cloud infrastructure as a way to reduce their own IT investments, and potentially even a chance to free up and monetize assets, e.g., IPv4 addresses [121,134]. While the use of specific tools may lead the university to enter into agreements with a multitude of companies, many of these tools themselves are hosted on one of the three largest cloud platforms: Google Cloud, Amazon EC2, and Microsoft Azure.

Education. IT infrastructure for education includes all tools that enable students to learn. Traditionally, this means all systems used for assessment and learning management systems (LMS), e.g., Moodle [32]. While educational software for remote teaching already received attention before the COVID-19 pandemic in the context of blended learning and MOOCs (Massive Open Online Courses), COVID-19 increased the importance of learning infrastructure, like video chat and streaming solutions, as well as examination and proctoring software. In most universities, these tools are offered institution-wide as centralized services, usually with the support of a central IT department. In addition, specific programs might need additional infrastructure, e.g., a program on system and network engineering may also need dedicated server rooms and networking labs [33], often offered in a decentralized manner.
Several vendors offer cloud-based LMS, which allow universities to outsource one of their largest systems (in terms of users). Even though tools for self-hosted remote lectures exist, the common perception associates remote lecturing mainly with Zoom, and, to a lesser extent, other cloud-based platforms like Microsoft Teams and Cisco WebEx. Similarly, proctoring solutions -a concept of questionable ethics [46,142] -are almost exclusively provided as cloud-hosted services.
Research. Research IT infrastructure is often more dependent on the individual needs of researchers, and therefore tends to be decentralized. Applications here range from the, in our field, common experimental systems (IoT test labs, network measurement infrastructure, and machines vulnerable to certain exploits) to IT systems used to control a diverse set of research instruments, such as electron microscopes or chemical processing lines. In addition, super computing capabilities [71], data storage and open data platforms [173], and research software that support quantitative and qualitative methods, e.g., survey and statistics tools [31] are often outsourced or centrally provided.
Cloud services can replace both types of research infrastructure. Researchers may use Platforms-as-a-Service for running measurement and experimental systems, especially when using GPU-supported compute. Furthermore, universities may provide outsourced and cloud-hosted instances of survey and interview platforms as a service for their researchers. Amazon's Mechanical Turk, especially, has become a common tool in human factors work, ranging from social sciences to usable security and privacy [122].

Administration. The administrative function of a university entails all services and operations needed to support (not execute) its primary functions for education and research. This means budgeting and accounting tools, HR systems including personnel management databases and applicant management systems, and also student admissions. Furthermore, this entails foundational services like email, and the operation of a university's network. Similarly, telephony and business communication tools (before the pandemic, tools like Skype-for-Business (SfB) and Microsoft Sharepoint, as well as Microsoft Teams and other video chat solutions that now overlap with educational tooling) traditionally fall into this category.
Applications for specific use cases (hiring, student admission, finance and accounting) are complex and highly business critical. Hence, outsourcing allows universities to reduce the needed local expertise to run these tools, while shifting the responsibility in case they become inoperable. Especially for tasks like email or security management, cloud setups promise higher reliability.

METHODOLOGY OVERVIEW
We first describe our general methodology in terms of the dataset and selected institutions. We describe specific aspects for the individual services in the corresponding sections ( §4 on cloud infrastructure, §5 email, §6 LMS, and §7 video conferencing).

Dataset
We use Farsight's Security Information Exchange (SIE) dataset [61] to measure (i) to what extent universities depend on cloud infrastructure, and (ii) how this dependency developed over time. Farsight collects this dataset via recursive DNS resolvers of ISPs. Collaborating ISPs can install a sensor, which sends all cache misses [95,111] of their clients (see Table 4 in Appendix A) to Farsight.
We use a historic dataset of all cache misses observed by participating DNS resolvers, spanning from January 1, 2015 to October 31, 2022 in per-month slices. A unique cache miss is defined by the tuple of <rrname, rrtype, bailiwick, rdata> (see Appendix A). As we only receive cache misses, we cannot make statements about the popularity of domain names. Therefore, we focus our analysis on establishing a lower bound on the use of cloud resources: in more practical terms, we determine whether an organization utilizes specific cloud resources, but not how much. We provide a comprehensive description of DNS and the Farsight dataset in Appendix A.
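In practice, this per-month deduplication and the resulting lower-bound question ("did we see any name under an organization's domain this month?") can be sketched as follows; a minimal sketch in Python, assuming cache-miss records arrive as dicts with the four tuple fields plus a month key (a simplification of the actual SIE schema):

```python
from collections import defaultdict

def monthly_unique_tuples(records):
    """Group cache-miss records into per-month sets of unique
    <rrname, rrtype, bailiwick, rdata> tuples."""
    months = defaultdict(set)
    for r in records:
        key = (r["rrname"], r["rrtype"], r["bailiwick"], r["rdata"])
        months[r["month"]].add(key)
    return months

def org_seen_in_month(months, month, org_domain):
    """Lower bound: did we see *any* name under org_domain that month?
    Says nothing about how often the name was requested."""
    apex = org_domain.rstrip(".") + "."
    suffix = "." + apex
    return any(t[0] == apex or t[0].endswith(suffix)
               for t in months.get(month, ()))
```

Because identical cache misses collapse into one tuple per month, counting tuples only ever establishes presence, never popularity.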
Compared to actively collected large-scale DNS datasets, for example OpenINTEL [79,124], the Farsight dataset enables us to look deeper into the DNS tree of individual organizations. As we see all names that were requested by clients behind DNS recursors participating as sensors, we can see application-specific names (e.g., application.example.com.) that are not part of the set of names gathered by active measurement platforms (as they need a priori knowledge of these names). Specifically, these platforms request a known set of record types and names for all domains listed in top-level-domain zone files to which they have access [79,124].
To illustrate this with a non-exhaustive example for example.com: these platforms will regularly request the NS, MX, A, and AAAA records for example.com, as well as A and AAAA records for www.example.com. However, lms.students.example.com will not be included, because the subdomain students.example.com is not listed in the authoritative zone file. In contrast, the Farsight SIE dataset contains data on lms.students.example.com if at least one client behind a sensor requested that name during the measurement period and data was successfully returned. For the limitations of this approach, please see §10.
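The coverage difference can be illustrated with a small sketch; the two-name active query set is a deliberate simplification of what active platforms like OpenINTEL request:

```python
def active_query_set(domain):
    """Names an active measurement platform would typically query for
    a registered domain (simplified here to apex + www)."""
    return {domain, "www." + domain}

def names_only_in_passive(passive_names, domain):
    """Names under `domain` observed in the passive dataset but not
    covered by the active query set, i.e., deeper parts of the DNS
    tree that require a priori knowledge to query actively."""
    active = active_query_set(domain)
    return {
        n for n in passive_names
        if (n == domain or n.endswith("." + domain)) and n not in active
    }
```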
Ethical Considerations. To not collect personally identifiable information (PII), the Farsight passive DNS dataset consists only of cache misses found at recursive DNS servers, and lists neither the recursive resolver a record was seen on nor the client that requested it [61]. Furthermore, we only process DNS entries under universities' domains and under domains of major cloud services (zoom.us etc.). We followed established best practices for handling passive datasets, as outlined by Allman et al. [5]. In our analysis, we only look at specific names under university domains (see §4-§8), which are only related to services and not individual users, and only investigate IP addresses of cloud platforms (see §4).
A hypothetical scenario (usually filtered out by Farsight) that may still leak PII is a university using dynamic DNS updates for user networks, see RFC 2136 [155]. For example, a user's machine with the hostname 'Firstnames-iPad' may obtain a DHCP [57] lease and have this hostname published under the university's domain. Following established guidelines [14,55], we conducted a harm-benefit analysis. We found that, as we take additional measures against accidentally handling PII (as described above) and work with a historic dataset that has been collected under the premise of not containing PII, the benefits of using this dataset to investigate cloud adoption, given its far-reaching implications as discussed in §9, outweigh the limited and mitigated potential harm.

Selection of Institutions
We focus on universities (PhD awarding institutions) in the global north, specifically the U.S., Germany, Switzerland, Austria, the U.K., the Netherlands, and France, see Table 1. Appendix C lists all institutions and corresponding domains for each category.
We are familiar with the laws and educational systems of these countries, which we saw as a precondition to interpreting the data. We also hoped to contrast the effect of GDPR across countries, but found no conclusive evidence. For international comparison, we included institutions listed in the THE Top100 for 2020 [147]. The universities we studied predominantly use services dependent on dominant cloud providers from the U.S. An expansion of this study to include universities and cloud providers from other parts of the world is ongoing research with collaborators from those regions.
For the U.S., we selected all R1 [39] and R2 [40] universities based on the Carnegie Classification of Institutions of Higher Education (also listed on Wikipedia [159]). For the remaining countries, i.e., Germany [161], the U.K. [164], Switzerland [162], Austria [160], France [158], and the Netherlands [163], we rely on the Wikipedia pages listing universities. We argue that this is a sufficiently reliable source for this information, given its general nature. We further manually investigated each listed university to identify their associated domain name(s). We do not claim completeness, but instead try to estimate a lower bound with our measurements. If a university uses multiple domains, or used a different domain in the past (especially common in France due to a history of reorganization of the university system), we check all domains and aggregate the results under the name of the institution.
To ensure that our data is not influenced by institutions into which we only have limited visibility, we excluded all institutions for which we did not see at least ten distinct names 1 in at least one month within our seven-year dataset. The institutions we filtered due to limited visibility are: (i) in the Netherlands the Theological University Apeldoorn, a small topic-specific university with no considerable IT infrastructure, (ii) in France 16 domains, which are remnants from before the merging processes of universities in the late 1990s/early 2000s, (iii) in the U.K. 28 domains, the Courtauld Institute of Art and 27 domains which belong to universities that are included in the dataset with other domains they predominantly use, e.g., ox.ac.uk. instead of oxford.ac.uk., and (iv) in Austria four domains, one of which is a secondary domain for the University of Salzburg, which is included via its mainly used domain uni-salzburg.at., and three small private universities in Vienna.
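The visibility filter described above can be sketched as follows, assuming a mapping from institutions to per-month sets of distinct observed names (a hypothetical structure for illustration, not our actual pipeline):

```python
def filter_low_visibility(name_counts, threshold=10):
    """Keep only institutions for which we saw at least `threshold`
    distinct names in at least one month of the dataset.

    `name_counts` maps institution -> {month: set of distinct names}.
    """
    return {
        inst: months
        for inst, months in name_counts.items()
        if any(len(names) >= threshold for names in months.values())
    }
```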

Visibility of Private Cloud Use
One question that may arise is whether we also observe private cloud use, e.g., an individual researcher using Gmail, or if such private cloud use from a university campus adds noise to our data. As the dataset we use is an aggregate view of DNS requests made, and we only look at names under universities' domains, we cannot infer usage patterns on a campus or those of individual users. Specifically, a user visiting mail.google.com in their browser, no matter if on campus or not, would have their browser make a DNS request for mail.google.com. The aggregation of all requests with the same data in that month leads to a single (simplified) entry for mail.google.com. From this entry, it cannot be inferred which user made the request, where they made it, or when exactly they made it. Hence, we do not even include these requests in our work. Instead, we select records under the domain of a university to infer whether it is using Google services for its inbound email. For example, for 'Example University' using the domain 'example.com', we would find an MX entry in our dataset pointing from example.com to alt1.aspmx.l.google.com. This means that, during the month in which this DNS entry was observed, mail for, e.g., user@example.com would have to be delivered to the mail server alt1.aspmx.l.google.com. For other analyses, we analogously use different RRtypes under universities' domain names. Hence, private use of cloud services does not show up in our dataset. To get a perspective including individual researchers' use of cloud products for a single university, a measurement study looking at a university's network uplink would have to be conducted, similar to the concurrent study by Karamollahi et al. [88].
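The inference from MX targets observed under a university's domain to the operator handling its inbound email can be sketched as follows; the suffix-to-operator mapping is illustrative, not our complete matching table:

```python
# Illustrative (not exhaustive) mapping of MX target suffixes to the
# operators handling inbound email; an assumption for this sketch.
CLOUD_MAIL_SUFFIXES = {
    "google.com.": "Google",
    "outlook.com.": "Microsoft",
    "pphosted.com.": "Proofpoint",
}

def classify_inbound_mail(mx_targets):
    """Map the MX targets observed under a university's domain to the
    set of cloud operators handling its inbound email."""
    operators = set()
    for target in mx_targets:
        name = target.lower().rstrip(".") + "."
        for suffix, operator in CLOUD_MAIL_SUFFIXES.items():
            if name == suffix or name.endswith("." + suffix):
                operators.add(operator)
    return operators or {"Other/Private"}
```

Note that this only identifies who receives a university's inbound email; mailbox storage and user access may still be handled elsewhere.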

Quantitative vs. Qualitative Approach
In our work, we utilize a quantitative approach to measure cloud adoption across a sample of over 600 universities. Arguably, a qualitative approach could have provided more in-depth insights into the organizational motives, while being more direct, and allowing for a larger temporal sample, i.e., observations on situations before the start of our quantitative dataset. Nevertheless, in order to attain a comparative perspective between regions, collecting qualitative data from over 600 universities constitutes a significant effort, and necessitates recruitment channels for these specific universities as well as local language proficiency. Qualitative investigations of the observed differences can then become the subject of subsequent studies, also reducing the number of interviews necessary, as these can be restricted to more targeted research questions and observed differences in regions. Furthermore, a qualitative approach requires linear effort for including further regions and universities in future analyses. In contrast, our quantitative approach can be easily scaled beyond the list of universities we studied.

CLOUD USE OVERVIEW
In this section, we provide a first overview of universities' reliance on cloud infrastructure of the 'Big Three' (Amazon, Google, and Microsoft). We want to understand to what extent names under universities' domains point toward these infrastructures, regardless of their popularity. This way, we not only capture the most frequented names (for example, the main website, or resources commonly used by students) but also capture, e.g., HR and administration tools, along with systems used for research. Hence, we look at whether universities have at least one name under one of their domains that points to each of the three providers above.

Service Deployment. We start by measuring whether at least one generic service under a university's domain runs on cloud infrastructure. Hosts and services on the Internet generally need a name, and this name [28] usually points to an IPv4 address [120], IPv6 address [53], or both (dual stack) [133], commonly enabled by DNS [110] via A and AAAA resource records, respectively. The servers, or hosts, a service runs on can then be addressed by their corresponding IP address (IPv4 or IPv6). IP addresses are commonly registered to an organization that uses them [81] via their Regional Internet Registry (RIR). When a service is run on hosts in a cloud provider's infrastructure, the IP address via which it is reachable will identify that cloud provider.
This means that if the infrastructure for studentadmin.example.com runs on Amazon EC2 infrastructure, its A and/or AAAA records will point to an IP address owned by Amazon. For this, it is not necessary that studentadmin.example.com provides a web service (a service offered via HTTP(S) [65]); it could also be any other common network service, like a file or authentication server. This also means that a university may use multiple cloud providers at the same time, if, e.g., hrservice.example.com runs on Microsoft infrastructure and its A and/or AAAA records correspondingly point there.

Methodology. For each university, we collect all A, AAAA, and CNAME resource records (RRs) for its domains. We then try to resolve all CNAME RRs from the dataset of the corresponding month in which they were observed. If we are unable to resolve a CNAME to an IPv4 or IPv6 address, we match RRs for products regularly hosted in certain infrastructure based on the hoster's name patterns. For example, we consider CNAMEs like www.example.com. IN CNAME ec2-203-0-113-25.compute-1.amazonaws.com. as hosted by Amazon. We then use the AS59645 BTTF historic bulk whois service [139] to identify the Autonomous System (AS) that was announcing a specific IP address during the month in which we observed it in our dataset. The AS59645 BTTF historic bulk whois service leverages several historic datasets to provide information on the ASes that announced specific IP addresses at up to one-day resolution, spanning the period from May 2005 (for IPv4) and January 2007 (for IPv6) up until today. Please see the BTTF whois paper by Streibelt et al. for a detailed description of the methodology [139].

Results. Figure 1 presents an overview of our findings. On a macroscopic level, we already see major differences between institutions from different countries. Having at least one system located at a major cloud provider is common for the U.S., the U.K., and the Netherlands.
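The CNAME-chasing step of the methodology above can be sketched as follows, assuming a monthly snapshot represented as a dict from names to observed (rrtype, rdata) pairs; the AS-level attribution via the BTTF whois service is stubbed out as a name-pattern fallback with an illustrative, non-exhaustive suffix list:

```python
# Illustrative suffixes of hoster-operated names (an assumption for
# this sketch, not our full matching table).
PROVIDER_NAME_SUFFIXES = {
    ".amazonaws.com.": "Amazon",
    ".azurewebsites.net.": "Microsoft",
    ".googleusercontent.com.": "Google",
}

def resolve_in_snapshot(name, snapshot, max_depth=8):
    """Follow CNAME chains within one monthly snapshot until an
    address record or a dead end is reached."""
    for _ in range(max_depth):
        rrs = snapshot.get(name, [])
        addrs = [rdata for rrtype, rdata in rrs if rrtype in ("A", "AAAA")]
        if addrs:
            return addrs
        cnames = [rdata for rrtype, rdata in rrs if rrtype == "CNAME"]
        if not cnames:
            return []
        name = cnames[0]  # follow the chain one step further
    return []

def provider_by_name(name):
    """Fallback for unresolvable CNAME targets: match the target
    against hoster name patterns (e.g., ec2-*.amazonaws.com)."""
    for suffix, provider in PROVIDER_NAME_SUFFIXES.items():
        if name.endswith(suffix):
            return provider
    return None
```

Resolved addresses would then be handed to the historic whois lookup to attribute the announcing AS for the month in question.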
The THE Top100 show a pattern similar to that of the U.S. Cloud usage in these three countries and the THE Top100 shows a high share of services hosted at Amazon. We find that the U.S. developed towards a situation where all three major operators are used at universities at the same time, rising from 79 institutions (30.38%) in January 2015 to 227 (87.31%) in October 2022. For the Netherlands and the U.K., we see a lower share of Google over time, starting at 30 (26.09%) of all institutions for the U.K. and 4 (21.05%) for the Netherlands in January 2015, reaching 62 (53.91%) and 10 (52.63%), respectively, in October 2022. Note that, in the U.K., the adoption of Amazon-hosted cloud services took place between 2015 and 2017, with the largest adoption happening in 2016. We conjecture that this is related to AWS being included in the U.K.'s government cloud from late 2013 onwards [86], as prior outsourcing arrangements are unlikely to be quickly changed to new offerings [67]. Contrary to the U.S., where commonly all three major providers are used, a combination of Amazon and Microsoft is more common in the Netherlands and the U.K.
France, Germany, and Austria form a clear contrast to this picture. All three of these countries have a lower cloud usage, with less than 50% of universities relying on cloud providers for any services for most of the measurement period: 2 (2.47%) to 40 (49.38%) for Germany, 10 (13.51%) to 32 (43.24%) for France, and 0 to 20 (58.82%) for Austria from January 2015 to October 2022. Switzerland, starting at 5 (35.71%) in January 2015 and reaching 12 (85.71%) in October 2022, developed from a middle ground between these two clusters towards the first one.
Note that the uptick of Microsoft-related infrastructure in Germany in December 2020 relates to the occurrence of names like (lync)autodiscover.example.com that point to Microsoft Azure addresses. Without this increase, we observed at least 24 (29.63%) of institutions in Germany using public cloud infrastructure in November 2020. We conjecture (also see §7) that this connects to the wider introduction of Microsoft Office 365, or the activation of new features in an existing MS Teams installation (the specific name is a necessary condition for Skype-for-Business use, but may also occur for an Office 365 or Teams deployment). See, for example, an announcement of the Ruhr University Bochum [129].
In general, we find that cloud infrastructure dependence across all sampled countries is on the rise. However, in the Netherlands, the U.S., the THE Top100, and the U.K., we find this increase from a high level, i.e., U.S., U.K., Dutch, and THE Top100 universities already frequently used cloud infrastructure before January 2015. Still, we find an increase in cloud usage for these countries.
We note that we do not find a 'pandemic effect' [62] in the use of cloud infrastructure across institutions. Instead, the migration of higher education to the cloud seems to be an ongoing process that started more than five years ago. Furthermore, we find that the use of cloud resources fundamentally differs between countries. We revisit this pattern in §9, as we can observe similar effects for other cloud infrastructures as well.

CLOUD-BASED EMAIL
Here, we investigate universities' use of cloud-based email infrastructure. Email is arguably one of the most essential services on the Internet for professional communication. It regularly carries significant PII, when students have questions on courses, or seek advice in professional and personal matters, and it transports grades and course assignments, but also job applications, research data, academic discourse, and ideas.
Email is a common gateway for attackers to convince users to install malware or redirect them to phishing sites [77,135,169]. Hence, spam and malware filtering are common services offered by outsourced email platforms, and usually a significant selling point in moving to cloud-based email providers [51]. However, as Patrick Breyer, a member of the European Parliament, recently noted, this also means that the operator is in control of which emails are and which are not delivered to users. 2 The strict inbound rules of major providers, which can lead to false positives [68,108], mean that universities using these services outsource, along with the service, the decision of which emails reach their faculty and students.

Service Deployment. Email is one of the most complex protocols currently used on the Internet [80]; for a more in-depth explanation we refer to related work, e.g., see Holzbauer et al. [80] for details on the configuration of modern email sending setups.
Here, we restrict ourselves to a description of inbound email handling. To receive email for a domain, one has to set MX records in that domain that provide a (prioritized) list of names of servers that accept emails for the domain. When a cloud-based email service provider is being used, the MX records of a domain point at email servers of said cloud provider. So, when Exchange in the Cloud or Office 365 are being used at an institution, the MX records point at servers with names under outlook.com. In addition, there are various email security appliances that upload received emails to cloud setups for, e.g., security and spam checks. For these to work effectively, additional DNS records for, e.g., DMARC have to be set to direct information to other services of said operator.

Methodology. To identify whether universities use a cloud-based email service, we investigate their MX records. MX records are DNS entries that determine the email servers responsible for receiving emails for a domain [90]. Hence, we only measure who handles inbound email for a university. Their user email access and email storage may be handled on-site or via another cloud-based solution. Still, this means that all email to this institution flows via the identified service operator.
To identify the used operators, we follow the methodology of Henze et al. [76]. We first check if, for any of the second-level domains (SLDs) of a university, any of the MX records points to a domain associated with a cloud-based email provider (see Table 2). If we do not find an MX record for any of the SLDs, we descend further down the DNS tree. This happens, for example, if an institution has dedicated sub-domains for email, similar to using staff.example.com and students.example.com. Hence, if a sub-domain points the MX record at a cloud provider, we also consider the university to be using a cloud provider for email. Please note that the existence of an MX record pointing to a cloud provider does not have interaction effects with the measurements in §4, as we only utilize A, AAAA, and CNAME records, but not MX records there. Nevertheless, use of a cloud email provider might lead to other services, e.g., webmail, having names allocated under a university's domain which then point towards a cloud provider in the sense of what we measured in §4.
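The descent down the DNS tree described above can be sketched as follows, assuming an mx_lookup mapping from names to observed MX targets and a provider_of callback mapping an MX target to a known operator (hypothetical structures for illustration, not our actual pipeline):

```python
def classify_university_email(observed_names, mx_lookup, provider_of):
    """Classify a university's inbound email handling for one month.

    Checks names shallow-to-deep: the SLD(s) first, then observed
    sub-domains (e.g., staff.example.com). Returns the matched
    providers, 'Other/Private' if MX targets match no known provider,
    or 'No MX' if no MX record was observed at all.
    """
    for name in sorted(observed_names, key=lambda n: n.count(".")):
        targets = mx_lookup.get(name)
        if not targets:
            continue  # no MX here; descend further down the tree
        providers = set()
        for target in targets:
            provider = provider_of(target)
            if provider:
                providers.add(provider)
        return providers or {"Other/Private"}
    return {"No MX"}
```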
We also check whether a university uses Proofpoint's email security solution, which analyzes all incoming emails for an organization to filter out spam and malicious emails. It can either be used as a hosted solution where email is redirected via Proofpoint's servers, similar to products from Cisco, or via an appliance installed on-site that uploads emails and attachments for analysis to the Proofpoint cloud. We identify hosted setups via their MX records, while measuring appliance usage indirectly by evaluating DMARC [93] records. If the rua or ruf 3 of a university points to an email address under emaildefense.proofpoint.com, we assume that it uses Proofpoint's appliance-based services. If we do not find an MX record that points to hosts under a cloud provider's domain, or a DMARC record indicating the use of Proofpoint, we count the institution as 'Other/Private.' This approach may underestimate the number of cloud providers we find. Furthermore, if we are unable to observe an MX record for an institution included in our dataset for a given month, we mark this as 'No MX.'

Validation. We manually retrieved the MX records of universities found using cloud-based email and verified that they indeed point to the identified cloud provider.

Results. When looking at the results of our measurements (see Figure 2), we find that they align with our observations from §4. The U.S., the U.K., the Netherlands, and the THE Top100 are again the countries with the most frequent use of cloud-based email providers, reaching 81 (70.43%) for the U.K., 59 (59.00%) for the THE Top100, and 13 (68.42%) for the Netherlands in October 2022.
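The DMARC-based detection of Proofpoint appliances can be sketched as a simple record parser; this is a hedged approximation of DMARC tag parsing (RFC 7489), not a full implementation:

```python
def uses_proofpoint_appliance(dmarc_record):
    """Heuristic: does a DMARC TXT record direct aggregate (rua) or
    forensic (ruf) reports to an address under
    emaildefense.proofpoint.com?"""
    tags = {}
    for part in dmarc_record.split(";"):
        key, _, value = part.strip().partition("=")
        if value:
            tags[key.strip().lower()] = value.strip()
    for tag in ("rua", "ruf"):
        for uri in tags.get(tag, "").split(","):
            uri = uri.strip().lower()
            domain = uri.rpartition("@")[2]
            if uri.startswith("mailto:") and (
                domain == "emaildefense.proofpoint.com"
                or domain.endswith(".emaildefense.proofpoint.com")
            ):
                return True
    return False
```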
The Netherlands presents an interesting case here, as we see an increase in Microsoft-based email hosting between late 2018 and early 2020. Manually going through the websites of these universities revealed that they either posted news items shortly before this time announcing a plan to migrate to Microsoft services, or directly announced such a migration at this time. As for early adopters, Utrecht University had been using Gmail for its students, while using a self-hosted solution for staff. In 2018, they then decided to migrate students' and staff's email to Microsoft to create a common platform and - as mentioned in their press release - improve 'security' [150]. We assume that this relates to concerns about Google in the context of the GDPR. Similarly, Nyenrode University migrated to Microsoft as part of a larger strategy to unify their IT infrastructure [167]. The larger increase in 2019 may then connect to a letter from the Dutch Ministry of Justice and Security sent to parliament, which essentially notes that privacy concerns regarding Microsoft cloud products have been resolved in negotiations [69]. This letter is then, for example, explicitly cited by TU Eindhoven as a reason why earlier concerns about privacy and security no longer apply and why they now migrate their email infrastructure to Microsoft [149].
In the U.S., a total of five companies control email services for 220 (84.62%) of all R1 and R2 universities in 2020. Again, Germany and France have a lower use of cloud resources, with neither of those countries exceeding 20% in October 2022: 2 (2.47%) for Germany and 7 (9.46%) for France. Both Austria and Switzerland have a higher adoption of cloud-based email services than Germany and France, with 3 (21.43%) for Switzerland and 9 (26.47%) for Austria in October 2022, both of them staying well below 50% adoption. We see a slight upward trend in the U.S. and the THE Top100, and a notable increase in the U.K. from 67 (58.26%) in January 2015 to 81 (70.43%) in October 2022. For the remaining countries, adoption of cloud email seems to stagnate over our measurement period.
The two most prominent operators are Google, likely with their Classroom product, a work suite containing email, documents, and integration with Chromebooks, as well as Microsoft with cloud-hosted Exchange/Office365/Teams. Other cloud providers only play a notable role in the U.K., where they occupy 11.30% of the market in October 2022. The most prominent smaller cloud providers are FireEye and Trend Micro. We find that Proofpoint is most prominent in the U.S. and the THE Top100, where we see the service being used by 22/30 (appliance/hosted; 8.46%/11.54%) and 11/12 (11.00%/12.00%) institutions in October 2022, respectively. We also see Proofpoint moving into the Dutch market, with the first universities deploying their products in September 2019.

CLOUD-BASED LMS
We now take a look at universities' use of cloud-based Learning Management Systems (LMS), i.e., online tools that allow lecturers to manage and automate courses, ranging from registration, via content provision, to the assessment and examination of enrolled students. As such, these systems provide some of the core functionality of what a university does. However, these systems also hold the most sensitive data stored about students: grades, deliverables, and overall study performance.
Putting these systems into cloud infrastructure potentially provides access to this confidential data to unauthorized entities, e.g., via the CLOUD Act [126]. At the same time, it also prevents students from effectively consenting to their data being processed by cloud companies, as an opt-out is only possible by not studying at a university using one of these products. Furthermore, these systems are also especially susceptible if a cloud provider decides to enforce its own policies and principles. If, for example, a U.S.-based LMS provider decides to enforce U.S. sanctions against citizens of specific countries for an LMS, including customers outside the U.S., it can effectively dictate which students a university enrolls by controlling the 'means of study.' Given the precedent of GitHub restricting accounts [1] of developers located in Crimea, Cuba, Iran, North Korea, and Syria to comply with U.S. trade sanctions, this is by no means a hypothetical scenario.
Service Deployment. Cloud-hosted LMS are commonly run on servers operated by the company providing the LMS as a SaaS. However, to integrate these solutions with the organization for which they are provided, they are commonly aliased to a name under a university's domain name using a CNAME record, see, e.g., the documentation of Brightspace [29] and Blackboard [23]. Hence, with a CNAME of the form lms.example.com IN CNAME university-name.brightspace.com., users can use the LMS by directing their browser at lms.example.com, providing a consistent appearance for services used by an organization.

Methodology. We focus on four large providers of cloud-based LMS: Brightspace (Desire2Learn, brightspace.com), Courseleaf (courseleaf.com), Blackboard (blackboard.com), and Canvas (Instructure, instructure.com). These tools provision their services by having a name in a university's zone pointing a CNAME to their infrastructure, e.g., for Canvas canvas.example.com. IN CNAME example-com.instructure.com. To measure whether a university uses one of these LMS, we check whether we find a CNAME with a target that is below one of the domains used by the above cloud LMS. Note that we also count a SaaS-hosted LMS with servers located with Amazon, Google, or Microsoft as a cloud-hosted service in §4. Naturally, we do not see whether a university uses an on-site LMS, like Moodle, or a locally hosted version of Blackboard.

Validation. We manually went over matches for December 2021 and visited the identified LMS sites to verify that these universities indeed run the cloud-based LMS.

Results. We find that cloud-hosted LMS are mostly relevant in the U.K., the U.S., and the Netherlands; for the latter two, see Figure 3. We find no instances of cloud-hosted LMS in Germany and France, and only two in Austria. In Switzerland, we only find a single Canvas instance at the University of St. Gallen, which has been in operation since January 2019.
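The CNAME-based detection described in the methodology above can be sketched as follows; suffix matching is simplified, and the domain list mirrors the four providers named in the text:

```python
# Sketch: flag a university as using a cloud LMS if any CNAME under its
# zone points below one of the LMS providers' domains.

LMS_DOMAINS = {
    "brightspace.com": "Brightspace",
    "courseleaf.com": "Courseleaf",
    "blackboard.com": "Blackboard",
    "instructure.com": "Canvas",
}

def detect_cloud_lms(cnames):
    """cnames: iterable of (owner_name, target) CNAME pairs observed for
    one university. Returns the set of detected cloud LMS providers."""
    found = set()
    for _owner, target in cnames:
        host = target.rstrip(".").lower()
        for suffix, lms in LMS_DOMAINS.items():
            if host == suffix or host.endswith("." + suffix):
                found.add(lms)
    return found
```

A record such as canvas.example.edu. IN CNAME example-edu.instructure.com. is thus attributed to Canvas, while CNAMEs pointing elsewhere (including self-hosted Moodle aliases) are not counted.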
We revisit the question of what universities in these countries are using instead in §9. For the THE Top100, the use of cloud-hosted LMS is mostly due to U.S. universities. In fact, 37 of the 62 universities in the THE Top100 that use a cloud-based LMS in October 2022 are U.S. universities, while U.S. universities only make up 40 universities in the THE Top100. The remaining 25 institutions using cloud-based LMS in the THE Top100 are from the Netherlands (6), the U.K. (6), Canada (3), Australia (4), Hong Kong (2), Singapore (2), Sweden (1), and Belgium (1). Courseleaf caters exclusively to the U.S. market, as we find no instances outside of the U.S. We also find a steady growth in the use of cloud-based LMS between January 2015 and October 2022: in the U.S. from 87

VIDEO & REMOTE LECTURING TOOLS
Tools for video chatting and VoIP, for example Skype-for-Business (SfB), have long been significant in professional communication. However, with COVID-19, academic core activities - teaching, meetings, and conferences - became dependent on them, a development often framed as the 'zoomification' of education.
Hence, here, we review universities' reliance on cloud-hosted video chat solutions. Following §2, we look at common video call tools like SfB, Cisco WebEx, Adobe Connect, and Zoom. Furthermore, we estimate the use of Microsoft Teams, but due to its implementation are limited to an upper-bound estimate.
Service Deployment. Cloud-based video chat tools are commonly hosted under the provider's domain name, e.g., zoom.us for Zoom. Large customers can, however, authenticate their domain using a challenge-response mechanism via TXT records for their own domain, allowing the consolidation of users under that domain [170]. In addition, organizational users can create custom 'vanity' sub-domains under the service's second-level domain. This is obligatory if the organization, like most universities, uses a Single-Sign-On (SSO) system, which necessitates a custom landing page [171]. In addition, there are also options where a part of the video communication platform can be hosted on-premise by a customer, while account management remains in the public cloud infrastructure [172]. Adobe Connect, WebEx, and Zoom follow this approach.
MS Teams and SfB may also contain some on-premise components, requiring specific DNS records to enable cloud integration of these products [106,107]. Furthermore, organizations can authenticate their domain to Microsoft using a dedicated TXT record.
Methodology. To identify universities' use of centralized video chat solutions, we follow three different approaches, depending on the platform we are looking at. For Zoom (zoom.us), Cisco WebEx (webex.com), and Adobe Connect (adobeconnect.com), we follow the naming scheme of these services for clients under their domains, i.e., we check if a name exists whose first label is (i) the SLD of a university, (ii) the SLD and TLD of a university with a hyphen in between, or (iii) the SLD of a university plus -live (see Table 3). If we find a corresponding name lookup in our dataset during a month, we consider a university as using this service during that month. Furthermore, we consider universities as using Zoom if they have a Zoom verification TXT record (see Appendix A).
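The candidate-name construction for these three services can be sketched as follows. The single `partition` call assumes a two-label university domain, so multi-label suffixes like ac.uk would need extra handling:

```python
# Sketch: generate the candidate vanity names looked up for each
# university and video platform, following the three naming variants:
# (i) SLD, (ii) SLD-TLD with a hyphen in between, (iii) SLD plus "-live".

SERVICE_DOMAINS = ["zoom.us", "webex.com", "adobeconnect.com"]

def vanity_candidates(university_domain):
    """university_domain, e.g. 'example.com', -> candidate FQDNs.

    Simplification: assumes a two-label domain; multi-label public
    suffixes (e.g., ac.uk) are not handled in this sketch.
    """
    sld, _, tld = university_domain.partition(".")
    labels = [sld, f"{sld}-{tld}", f"{sld}-live"]
    return [f"{label}.{svc}" for svc in SERVICE_DOMAINS for label in labels]
```

For example.com this yields nine candidates, among them example.zoom.us, example-com.zoom.us, and example-live.webex.com.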
To establish whether a university uses SfB, we check for the DNS entries required when operating SfB [107], specifically lyncdiscover.example.com, with example.com replaced by a university's domain. This overlaps with SfB's prior product name, Microsoft Lync. Finally, we check for universities which may be using Microsoft Teams. Unlike SfB, Microsoft Teams does not require special DNS entries that make its use uniquely identifiable [106], even though the voice component requires the same DNS entries as SfB [107]. However, to be able to use Microsoft Teams, an operator still has to set a Microsoft cloud verification token of the form MS=ms12345678. Even though the presence of this record does not mean a site does use Microsoft Teams - it may use other Microsoft products as well - we also count the number of sites using this token and report the number of additional universities that may be exclusively using Microsoft Teams, i.e., that do not use any of the other tools (SfB, Zoom, WebEx, or Adobe Connect).

Validation. We manually verified all Zoom links for July 2021 by visiting the corresponding Zoom page and ensuring that it forces users to log in via the related university's SSO system. If this was not the case, we used web searches to identify whether the related university refers to this Zoom subdomain for events on any of its websites. Of the 363 Zoom links we verified, 12 (3.31%) were incorrectly attributed or could not be verified through other channels. We excluded these false positives from our analysis.
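The SfB and Teams heuristics above can be sketched as follows; note that the token check only yields an upper bound on possible Teams use:

```python
import re

# Sketch: SfB is identified via the mandatory lyncdiscover name; a
# Microsoft verification token in TXT records only marks *possible*
# Teams use, since it is also set for other Microsoft cloud products.

MS_TOKEN = re.compile(r"^MS=ms\d+$", re.IGNORECASE)

def uses_sfb(names_in_zone, university_domain):
    """True if the SfB-mandated lyncdiscover name exists in the zone."""
    return f"lyncdiscover.{university_domain}" in names_in_zone

def maybe_teams_only(txt_records, uses_other_tools: bool) -> bool:
    """True if the MS verification token is present and no other video
    tool (SfB, Zoom, WebEx, or Adobe Connect) was detected."""
    has_token = any(MS_TOKEN.match(r.strip()) for r in txt_records)
    return has_token and not uses_other_tools
```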
Results. Taking a macroscopic look at our data, we again see a similar segmentation as in the previous cases of general cloud usage, email, and cloud-based LMS (see Figure 4). We see a heavy adoption of SfB (from 2015 to 2021) in the Netherlands (from one, with a large increase in mid-2015, to 12 (63.16%)), the U.S. (110 (42.31%) to 199 (76.54%)), the U.K. (9 (7.83%) to 62 (53.91%)), and the THE Top100 (30 (30.00%) to 71 (71.00%)). At the same time, we see close to no SfB instances in France, and limited adoption in the remaining countries: 5 (35.71%) in Switzerland, 11 (32.35%) in Austria, and 25 (30.86%) in Germany. Note that in Germany we observed an increase of 20 institutions using SfB between November and December 2020, most likely due to the introduction of Microsoft Teams, which partially uses DNS entries overlapping with those for SfB. We conjecture that this overall picture connects to different operational paradigms between universities in these two clusters, also in terms of administrative centralization (see §2).
When we look at the adoption of the other three platforms, we find an interesting picture, also in relation to the COVID-19 pandemic. In the U.S., we find that the adoption of Zoom and, to a slightly lesser extent, WebEx has been an ongoing process that started back in 2016, leading to 212 (81.54%) U.S. universities using Zoom and 71 (27.31%) using WebEx in October 2022. However, in comparison to December 2019, these numbers 'only' rose by 68 from 144 for Zoom and by 3 from 68 for WebEx, meaning that the pandemic effect is not as large as in other countries, mostly due to the already high adoption of Zoom in the U.S. Adobe Connect, in general, has a market share similar to WebEx, with 130 (50.00%) U.S. universities using it in October 2022. U.S. universities seem to generally use a multitude of video chat solutions, with 97 (37.31%) using two, 84 (32.31%) three, and 32 (12.31%) all four of the surveyed tools in October 2022.
This effect can again be found to a similar extent in the THE Top100. Please note that only 40 universities in the THE Top100 are U.S. universities. Here, we also see a continuous adoption of Zoom starting in 2016, leading up to 79 (79.00%) institutions using Zoom in October 2022. We also observe an apparent lack of a significant pandemic effect, and a large diversity of employed tools across universities, with 38 using two, 29 three, and 12 all of the surveyed video chat solutions.
We see a pandemic effect among the remaining countries, especially in terms of Zoom adoption. While Zoom played essentially no role at European universities before February 2020, its adoption quickly increased with the move to remote teaching. Interesting observations here are that most European universities commit to a single video teaching platform (either Zoom or WebEx), and that the onset of these tools was sudden, i.e., within a month at the beginning of 2020. France shows a slower increase focused on Zoom, contrary to other European countries, where we also observe an increase in WebEx. In the end, we find that in October 2022 Zoom/WebEx use at German universities is at 49 (60.49%)/32 (39.51%), in the U.K. 51 (44.35%)/19 (16.52%), in the Netherlands 13 (68.42%)/2 (10.53%), in Austria 14 (41.18%)/9 (26.47%), in Switzerland 11 (78.57%)/5 (35.71%), and in France 31 (41.89%)/7 (9.46%).
Looking at the possible upper bound of universities using Microsoft Teams without the SfB/voice and video chat component, we find that this number is close to zero for the U.S. (4/1.54%), Germany (3/3.70%), and Switzerland (0) in October 2022. In the U.K. (10/8.70%), the THE Top100 (8/8.00%), Austria (2/5.88%), and the Netherlands (2/10.53%), we see a modest number of additional institutions that might be using Microsoft Teams. France is the only country where we find a comparatively large number of potential Microsoft Teams users who do not use any of the other solutions, including SfB, with 13 (17.57%) institutions in October 2022. This difference has been stable over the past years, and is thus likely not related to an increase in Teams adoption by universities not already using Microsoft cloud services (or providing access to Microsoft software licenses to users from their domain) at the beginning of 2020.

SELF-HOSTING IN GERMANY
The differences we observe from §4 to §7 raise the question of what digital learning tools universities use instead of cloud products, e.g., in Germany. Hence, we look at the use of common self-hosted alternatives for LMS (Moodle [32] and Stud.IP [10,73]) and video chats (BigBlueButton [22]) in Germany, which are reportedly deployed in 90% of higher education institutions [43]. Self-hosted tools are not necessarily more privacy-preserving by default than the offerings of large cloud providers. However, control over data nevertheless remains with the university hosting them, and universities are able to audit and - if necessary - reconfigure and patch these tools to conform to privacy regulations and requirements. This could, for example, be seen with BigBlueButton, where the user group around German universities made significant contributions towards privacy-preserving operation once privacy limitations in its design became apparent [21,82].
Service Deployment. Self-hosted services are typically deployed on servers located in a university's data center. As with all services, see §4, these systems and associated services need an IP address and a name to be easily accessible to users. Best practice for naming systems in a professional setting is the use of a hybrid naming scheme, i.e., a scheme in which systems are named partially in a functional way, e.g., mail, survey, or lms, in combination with a formulaic component [97]. With this hybrid scheme, a mail system might have the name mail023, for being the 23rd mail server, leading to the FQDN mail023.example.com. To make such systems more accessible to users, frontend systems commonly also receive an additional functional alias via a CNAME [97]. In the above example, a frontend alias might be mail.example.com, which may also be a load balancer to distribute load and scale the mail setup horizontally [97]. While technically not advised as best practice, functional names are also commonly modeled after the utilized software instead of the function of the software [97].
Hence, for the three systems we study in this section, operators are not restricted to specific naming. Yet, common operational practice makes it likely that systems are either provisioned under partially hybrid functional-formulaic names based on the utilized software stack, or at least have an alias with a semi-functional naming scheme, i.e., a scheme that utilizes the software name instead of the system's function for naming. This is also a practice we observed in §6, where, e.g., SaaS instances of Brightspace were commonly aliased from brightspace.example.com instead of the purely functional lms.example.com.
Methodology. To estimate self-hosted LMS and BigBlueButton use in Germany and the U.S., we count the number of universities that either have Moodle/Stud.IP or BigBlueButton related names under their domain. For Moodle and Stud.IP, we count a university as having Moodle or Stud.IP if there is at least one name containing either moodle or studip. For BigBlueButton we count all universities that have at least one name containing bbb, bigbluebutton, scalelite (the load balancer component of BigBlueButton), or greenlight (a common BigBlueButton frontend).
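This name matching can be sketched as follows; the marker substrings are exactly those listed above:

```python
# Sketch: count a university as self-hosting a tool if any name under its
# domain contains one of the tool's marker substrings. Matching is fuzzy
# by design and may both overmatch and undermatch.

MARKERS = {
    "Moodle/Stud.IP": ("moodle", "studip"),
    "BigBlueButton": ("bbb", "bigbluebutton", "scalelite", "greenlight"),
}

def self_hosted_tools(names):
    """names: iterable of DNS names observed under one university's domain.
    Returns the set of tools for which a marker substring matched."""
    found = set()
    for name in names:
        n = name.lower()
        for tool, needles in MARKERS.items():
            if any(needle in n for needle in needles):
                found.add(tool)
    return found
```

A name like konferenz.example.edu, which hosts BigBlueButton at some German universities, would not be matched, illustrating the undermatching discussed next.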
Note that our matching is fuzzy: we may overmatch on hostnames that contain product names without running the associated service, while we may also undermatch when universities host these tools under different names. For example, in Germany, we often found BigBlueButton systems being called konferenz, the German word for conference, explaining the difference between our measured 71.60% and the 90% reported in the media [43].

Validation. We manually verified the use of BBB by visiting the BBB pages of the 59 German universities observed using BBB in December 2021, checking if they run BBB-related software. We found that all of them did, indeed, run BBB-related software.
Results. In October 2022, 65 (80.25%) universities in Germany have names related to Moodle or Stud.IP, vs. 146 (56.15%) for Moodle in the U.S. Similarly, we find 58 (71.60%) universities in Germany having BigBlueButton-related names under their domain, while this is the case for around 10% in the U.S., see Figure 6. We see a pandemic effect for BigBlueButton in Germany, starting in February 2020.

DISCUSSION

Cloud Infrastructures and Power
The last decade has seen big tech companies homing in on cloud infrastructures as an alternative source of growth [60,151]. This growth relies on two effects: realizing the value proposition of trading Capital Expenses (CapEx) and local Operational Expenses (OpEx) for lower OpEx paid as cloud service charges, and - for individual cloud providers - attaining a market position that makes them 'the default' platform to be used [157].
The increasing dependency of big tech on cloud computing for their financial success means that they use political, economic, and technical resources to ensure that the clouds are 'the default' infrastructure in as many domains as possible. Their political force is brought to bear via international initiatives, e.g., New Pedagogies for Deep Learning, a global partnership between the OECD, Gates Foundation, Pearson, and Microsoft [165]; government partnerships, e.g., the U.K. government has incentivised schools to opt for platforms that are both free to use and bundled with government-funded technical assistance [165]; and lobbying efforts [114]. Cloud providers use economic incentives by touting the benefits of economies of scale, by financing and physically migrating data to the cloud, and by providing free services that can bypass regular procurement rules. They can capture educational IT either by providing storage, compute, and communication platforms, or by becoming the default infrastructure for smaller EdTech companies [87]. As our results in §4 show, universities increasingly use cloud services provided by Amazon, Google, and Microsoft. The fact that we also see a combination of different cloud services hints at cloud platforms being introduced through other EdTech services.
On the flip side, this migration leads to 'cloud lock-in,' i.e., a dependency on cloud services that persists even when terms and conditions change. For example, Google [136] discontinued free unlimited cloud storage, limiting, e.g., the University of Washington and the University of Utah to 100 TB of total shared data for each university.
The trend of big tech monopolies shifting from "being mere owners of information, ... to becoming owners of the infrastructures of society" [137] has prompted an ongoing public discussion about the implications of this 'platform capitalism' for different aspects of society [30,44,137], yet without zooming in on its implications for higher education. At first sight, the political economic advantages put forth by cloud companies make good bedfellows with the economized management of universities. However, this also comes with power shifts. Mirrlees and Alvi [109] argue that universities focus on cutting costs, while allowing the big five (Apple, Alphabet/Google, Amazon, Microsoft, Facebook) and a growing ecosystem of start-ups, e.g., in the area of MOOCs, to compete with, and ultimately replace, public education. Most universities do not have the economic or political power to insert their own values and interests into such a market, unless they coordinate on these issues. The international initiatives these companies support make up informal policy networks that increasingly dominate educational policy [152]. Aside from the potential impact on democratic societies and educational values, these networks are likely to promote certain forms of education, e.g., the individualised pursuit of 'mastery' enacted primarily through adaptive software, at the expense of education that, e.g., promotes interpersonal dialogue and relations with others [165]. In the bigger scheme of things, there are also concerns about 'platform imperialism:' U.S.-based companies could use their global digital default infrastructure to exert 'soft power' and economic control, influencing global norms and values of digital cultures [84] and steering curricula and research activities.

Cloud Use vs. Academic Freedom
The dependence of universities on cloud platforms for teaching, communication, and research that we observed has serious implications for academic freedom. If education and research depend on an external cloud service, researchers may become bound to comply with requirements set by these organizations. We recently saw Google's handling of Timnit Gebru over a paper not 'deemed worthy for publication' by the company [146], as well as other instances of Google telling its researchers to put a positive spin on 'sensitive topics,' or to remove references to Google products [52]. One might argue that this concerns employees of Google, but it also raises the question of whether cloud operators could leverage their power to influence critical university research in a similar way. In fact, Google has already been in the spotlight for sponsoring favorable research that is in line with its business and policy interests [45,113], both in the U.S. [37] and in Europe [38]. More generally, Abdalla et al. [3] discussed 'Big Tech' funding for research on the societal impact and ethics of AI as a way to influence research questions (and answers). In the area of EdTech research, Mirrlees and Alvi [109] observed a lack of critical research, likely because of "...little incentive to 'bite the hand that feeds' " [131].
Conceivably, a major cloud provider could simply indicate that a continued business relation with a university may not be desirable if the institution and its researchers continue to voice positions critical of that cloud provider. That institution would then face the dilemma of either 'aligning' its researchers or undertaking a costly migration of essential services. Such a migration could easily cost millions while severely interrupting research and teaching.
Similar cases can be made for cloud operators enforcing their business rules in terms of, e.g., global sanctions, as in the case of GitHub [1]. This may effectively put universities in a position where they either bar students from sanctioned countries from attending the university, or at least from using their digital learning environment. Similarly, the reliance on platforms and their policies might impede global research collaborations [42]. Thus, this centralization of power may indeed inadvertently threaten core functions of universities. Hence, the question we have to ask as academics is not whether cloud operators would use these powers. Instead, we have to ask ourselves if we are willing to risk that they could.

Privacy and Academia
The move to the cloud raises a number of concerns with respect to the application of privacy by design and compliance. Past studies show that educational institutions do not fare well in making transparent to their faculty, staff, and students the data collection and processing practices of cloud providers [85,98,102]. This can, e.g., happen when a university implements a blanket privacy policy for all digital tooling, including all cloud services. Depending on the diversity of data collection and processing these services entail, privacy policies may become generic, potentially falling short of legal transparency requirements [123]. It may also not be clear to the university what data is going to the cloud. Universities may evaluate cloud providers and make data agreements with them, but ensuring these are effective can be challenging. Besides vague privacy policies [92], cloud services come with the promise of being plug-and-play and, recursively, they leverage the benefits of service architectures, often bundling dozens of third parties [70]. As a result, cloud service providers may fail to make their data flows transparent. The promise of plug-and-play also means that university IT departments are often not given the time or resources to evaluate services. Even when privacy evaluations are performed, they stand against the digital branding efforts of the university and the partnerships between public institutions and cloud providers [165].
When students, faculty, and staff access these services, they are not asked for explicit consent. Universities can, e.g., in the case of the GDPR [156], use legitimate interest, public tasks, or performance of a contract to justify data flows to cloud services. Hence, students and faculty may not have a (meaningful) option to opt out of these services. When there is an opt-out process, people may be incentivized not to use it, e.g., by reserving it for 'severe cases,' creating time and capacity burdens for faculty and staff. When incentive structures are set up by design and by policy to push people onto cloud infrastructures, it is hard to speak of choice. Hence, public education institutions may end up leveraging their structures to on-board students, faculty, and staff as cloud service consumers [98].
If universities continue to outsource core functions to cloud platforms, students will no longer have a choice on whether they want to expose some of their most private information to these major cloud providers. Considering that these cloud services are economically under pressure to monetize either the data they collect (e.g., by creating a recruiting business [143]) or the infrastructural dependency they create, the practices being established here are concerning. As Hasan and Fritz [75] argued, platforms such as LMS (see §6) can collect a wealth of sensitive behavioral and demographic student data, which can be abused for advertisement or surveillance. They also observed that, even when this data is not collected directly, student demographics can still be inferred, potentially through combination with data from other sources. Thus, universities may have to consider whether it is ethical or legal to create an environment where informed consent to data collection is, essentially, no longer possible.

Universities as Enterprise Networks
In §7 we observe that regions with major cloud adoption also saw a major adoption of SfB early on. Revisiting §2, we noted that tools like SfB would be expected in centralized enterprise IT. Hence, we argue that SfB adoption can serve as a proxy to assess the general operational paradigm of a university, i.e., if it is run more like an enterprise network or a university network.
This mechanic - the alignment of IT infrastructures with administration leading to centralization and to organizations behaving similarly - is a well-documented effect in the field of Information Systems (IS). DiMaggio and Powell [54] discuss how bureaucratization via coercive, mimetic, and normative processes leads to a structural alignment of organizations in a market; see also Scott for a more recent reflection on these theories [130]. Avgerou [11] transferred this institutional perspective to the introduction of IT systems and their connection to organizational change. To synthesize, the findings from IS indicate an effect in organizations where administrative alignment leads to IT transformation as a goal in itself, lacking "adequate legitimacy" [11], without any "contribution to the process of organizational change" [11].
Following SfB as a proxy, we conjecture that we observe an increased adoption of cloud technology for countries in which the university system has seen a stronger commoditization -the U.S., the U.K., the Netherlands, and THE Top100 -as also discussed by Bosetti and Walker [27]. In these countries, organizational alignment led to a situation where academic leaders governing a body of scholars were replaced by administrators and business managers overseeing university operations. These new managers imported and integrated enterprise tools and culture into the heart of public education, leading towards more cloud adoption.

Self-Hosting Challenges & Future Work
Concerning self-hosting vs. cloud infrastructures, there are opposing perspectives that have to be discussed. On the one hand, there are positions claiming that, for universities and other organizations, the use of cloud infrastructure provides a wealth of benefits. The core idea of cloud infrastructure is the tenant's ability to quickly increase and decrease utilization based on actual demand, while only paying for the resources they actually use. Following that point, there is an abundance of reports on cloud operators' websites describing that, in contrast to self-hosting, cloud infrastructure enables more features and better adaptability [115], transforming organizations "from 'IT can't do that' to a 'can do' situation" [105], as claimed in a Microsoft customer success story. Similarly, customer stories from Amazon claim that cloud hosting decreases an organization's carbon footprint and fiscal spend by reducing on-site personnel and facilities, along with this increase in agility [6]. Amazon's customers even note that integrating cloud services in education also aids students' future employability [7].
On the other hand, self-hosting needs local expertise to build usable and secure infrastructure, creating a conundrum between the privacy issues of cloud infrastructure and cloud operators' comparatively larger staffing, which enables better security and reliability [64]. Similarly, self-hosted solutions may lag behind in observed efficacy, stability, and usability. For example, email grew so complex [80] that even experts with decades of experience are unable to get their self-hosted mail setup to deliver to major email providers like Outlook.com and Gmail [63]. Yet, Fenollosa attributes this to these operators' strict filtering against smaller operators in an attempt to reduce spam [63]; see §5.
Similarly, in terms of the efficacy of specific tools, while some work notes benefits of classroom implementations provided by Google (Google Classroom), more recent work finds application-independent boundary conditions more relevant for usability, i.e., the availability of Internet access and a lecturer's engagement with a platform [72]. To the best of the authors' knowledge, there are no recent publications in established venues in security and privacy, human-computer interaction, or educational technology research that compare the usability of open-source self-hosted digital teaching tools against their cloud-hosted counterparts. Still, research from the early to late 2000s on Linux desktop software notes structural root causes for limited usability, e.g., a lack of user research [118] and community-based initiatives lacking the organizational structure needed for common approaches to usability [12,13,19]. Furthermore, looking at decision makers' perspectives via a qualitative study focused on the perceived benefits of cloud migrations, Lal et al. find that clouds are seen as providing better scalability, higher flexibility, and more usable interfaces [94]. Still, concerning cost savings, executives at Hey.com claim that staffing reductions in engineering have to be offset by additions in other areas [49].
Hence, operating self-hosted infrastructure is certainly neither easy nor guaranteed to be successful. Looking at cases where self-hosting was successful, we find greater adaptability to be frequently mentioned. For example, the head of IT at the University of Osnabrück, a university committed to self-hosting and open source for decades, notes that their self-hosting approach was ultimately cheaper and allowed them to react to the COVID-19 pandemic much more seamlessly than other universities that procured cloud products [18]. Similarly, the authors of Blacklight, an open-source literature search and indexing software for university libraries initially started at the University of Virginia, explicitly note adaptability as a major reason to start the project [66].
However, both cases highlight that preserving universities' digital sovereignty, especially given a reduction of local competencies in favor of cloud infrastructure, is not an easy task, but a long-term policy and resource commitment. In both cases, the concerned universities made a long-term commitment, essentially running their infrastructure and the relevant open-source software like in-house applications at large organizations, thereby creating the organizational structure previously found lacking in purely community-based open-source projects [12,13,19]. Hence, current cost savings and efficacy, e.g., in Osnabrück or for organizations using Blacklight [116], stem from a decades-long investment in the accumulation of competences by implementing a long-term strategy in which IT is seen as a support facility for teaching and research [18]. With no tangible return on investment in the short term, with changes, as all changes, potentially showing a (perceived) reduction in system efficacy at first [97], and with benefits potentially taking decades to materialize, this approach can be challenging to justify and sustain.
Future Work. Given the split perspective on self-hosting vs. cloud infrastructure outlined above, several directions for future research emerge. The challenges and benefits of self-hosting as well as of cloud infrastructure have to be evaluated critically and analytically. Corporate promotional material is no more a comparative source on the benefits of cloud infrastructure than a single counterexample is proof of the general feasibility of self-hosting. Especially in the latter case, scrutiny will have to be applied to the question of which combination of factors, including local and state policy, led to the current success in that isolated case, and whether and how these factors can be facilitated more generally. Similarly, the scientific community should execute structured evaluations of, e.g., public digital infrastructures' usability, to provide an independent empirical basis for discussions on the efficacy of proprietary and cloud-based vs. open-source and self-hosted systems. Furthermore, business, organizational, and societal factors in the progressing adoption of cloud infrastructure have to move into the focus of future research; see, e.g., the work of Srnicek [137].

LIMITATIONS
The Farsight SIE dataset may not contain all cloud-related names if these are not queried by a client behind a sensor. While the instances of cloud hosting we identify are certainly there, more universities may be using major cloud providers without this showing up in the dataset. Similarly, we cannot make statements on the popularity of names, as Farsight SIE only collects DNS cache misses [61]. Furthermore, the number of universities among the surveyed countries differs (14 in Switzerland, 260 in the U.S.), which may amplify the effect of individual institutions' choices in smaller countries. Our work partly relies on heuristics and the automated analysis and classification of historic data, e.g., in the identification of Zoom, WebEx, and Adobe Connect domains, the use of Proofpoint, and the estimation of Moodle, Stud.IP, and BigBlueButton instances. Hence, we manually revisited our results and verified our findings against live data, as documented in a validation paragraph in each methodology section.
Given the large effect sizes, the alignment of ratio changes between different countries, our additional spot-checks, and our coverage of domain names, we are confident that our results paint an accurate picture of universities' cloud use since January 2015.

RELATED WORK
Cloud Infrastructure Measurements. Similar to us, Borgolte et al. [25] use the Farsight SIE dataset to identify domains pointing at cloud infrastructure. Jacquemart et al. [83] performed active DNS measurements on the most popular domains according to Alexa to measure the adoption of cloud services from 2013-2018. Portier et al. [119] and van der Toorn et al. [148] identify cloud service usage via TXT records. Streibelt et al. [138] and Calder et al. [36] use the EDNS0 extension to map cloud infrastructure. Doan et al. [56] combined active DNS measurements with crawling and rendering webpages to measure the consolidation of web resource hosting by Content Delivery Networks (CDNs) and cloud hosts. Henze et al. [76] focused on the adoption of cloud-based email services, identifying them based on email headers in a dataset collected from mailing lists, spam traps, and volunteer users. Vermeer et al. provide a general taxonomy of asset discovery techniques similar to our targeted asset discovery using a passive dataset [153].
COVID-19 and the Internet. With COVID-19, it became apparent that the continued lock-down situation would have an extended effect on the Internet. As such, several researchers studied this effect, including the increased utilization of cloud-based services. Feldmann et al. [62] studied the impact of the COVID-19 pandemic through the lens of a major Internet Exchange Point from a European perspective, while Liu et al. [99] performed a similar study on changes in network traffic patterns in the U.S. Boettger et al. [24] provided a similar perspective from the vantage point of the Facebook social network. Along the same lines, Lutu et al. [101] investigated the impact of COVID-19 on mobile network traffic.
Educational Technology in the Cloud. Cohney et al. [47] perform a study into the privacy implications of virtual classroom technology.
Contrary to us, they root their evaluation of technology use in a self-reported study among 49 instructors and administrators at U.S. universities, obtaining results similar to our Internet measurement data. In addition, they also analyze the privacy policies of common virtual classroom tools, as well as 50 public Data Privacy Addenda (DPAs) in which universities negotiate their own terms with platform operators. However, they also note that these terms only apply to institution-wide contracts, and that individual instructors might use other platforms without being aware of the privacy implications. From a student perspective, Balash et al. [15] focused on online proctoring services. They observed an institutional power dynamic and students' implicit trust in the tools selected by their university: the assumption that a university vets a tool or platform before using it lends it credibility. Emami-Naeini et al. [59] studied user attitudes towards video conferencing tools, including in educational settings. They also noted the participants' lack of agency when it comes to platform selection.
Similar to us, Komljenovic [91] theoretically analyzes the implications of the progressing centralization and platformization of educational technology, particularly noting the de-institutionalization of public education accelerated by centralized platforms. Zeide and Nissenbaum [168] analyze (before the COVID-19 pandemic) learner privacy in MOOCs and virtual education, finding it to often violate established norms of privacy and education, supporting our assessment that the 'zoomification' of education is a long-standing process predating the COVID-19 pandemic. Besides these major related publications, several small-scale evaluations, often limited to specific tools (usually Zoom), were undertaken in recent years; they have been summarized by Cohney et al. [47].
Finally, similar to us, Angiolini et al. [9] identified data protection challenges of remote teaching from a legal perspective, noting the conflict between platforms' business models and the public interest goals of universities, as well as threats to academic freedom and the right to education.

CONCLUSION
We investigated the reliance of universities on cloud infrastructure in seven countries and in the Times Higher Education Top100. We found that the move to the cloud has been ongoing for the past several years and, apart from video lecturing tools, was not heavily influenced by the COVID-19 pandemic. Our results also highlight that university systems differ substantially in their propensity to migrate to the cloud. We conjecture that this ties in with a multitude of factors, including the academic and administrative culture and the history of university IT in the corresponding countries. Furthermore, we discuss the potential impact of this progressing development on the very essence of academic freedom.
In the end, as academics, we have to ask ourselves: Now that we know, are we content with this development, and can we live with its broader implications? If not, we have to find ways to counteract these developments by investing in decentralized capabilities for independent research and teaching infrastructure, learning from (certainly not perfect) cases like in Germany.

ACKNOWLEDGMENTS
This work has been supported by the European Commission via the H2020 program in project CyberSecurity4Europe (Grant No. #830929). Our work was enabled by the use of Slack and Signal (both hosted on Amazon EC2), Overleaf (hosted on Google Cloud), GitHub (owned and hosted by Microsoft), as well as self-hosted BigBlueButton and Gitea instances. We thank Florian Streibelt for collaborating with us on implementing BTTF whois [139], which we leverage in §4, and Farsight Security, Inc. (now DomainTools) for providing access to the Farsight Security Information Exchange's passive DNS data feed. Without this data, the project would not have been possible. The authors express their gratitude to the anonymous reviewers and the shepherd of this paper for their continuous input during the review and shepherding process. Their input was instrumental in shaping the flow of the discussion in §9.5 and motivating the creation of the BTTF whois service. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the authors' host institutions, Farsight Security, Inc., DomainTools, or those of the European Commission.

A FARSIGHT METHODOLOGY OVERVIEW
In this section, we provide a primer on the Farsight dataset and relevant aspects of the Domain Name System (DNS) to make our methodology accessible to a wider group of readers. Our primer assumes that the reader is familiar with the concept of IPv4/IPv6 addresses and the common analogy of DNS functioning as a form of phone book to look up IP addresses. We first discuss DNS, common DNS terminology, and DNS resolution, i.e., how a client uses the DNS to resolve a name to a value. There, we will see that the DNS is not only a tool for looking up IP addresses, but a globally distributed, error-tolerant database used for various forms of lookups. Finally, we discuss the Farsight dataset and how it is collected.

A.1 DNS
Here, we first introduce the basics of DNS. Please see Table 4 for an overview of DNS related terms and abbreviations we use.
The DNS is a tree-shaped hierarchy of names [110,111], each consisting of multiple labels delimited by dots [110], with the root of the tree at the end of the name; see also Figure 7. Names that reach up to the root, i.e., have an empty right-most label, are also called FQDNs (Fully Qualified Domain Names). The final '.' separating the empty root label is usually omitted when spelling out FQDNs [20]. One most commonly encounters names as part of a URI on the web [20], i.e., in the form https://www.example.com/. A zone can contain names (as leaf nodes) that form RRsets, each consisting of a name and all resource records of one specific RRtype for that name; a name can have multiple RRsets for different RRtypes [111]. Similarly, a zone (parent) can contain a delegation to one or multiple other authoritative DNS servers for a zone below itself (child), creating a zone-cut [111]. A zone is 'authoritative' for the names within or below itself if no zone-cut takes place [111]. We list and describe the most commonly used RRtypes in Table 5.
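To illustrate the naming scheme above, here is a minimal Python sketch; the helper names `labels_from_root` and `is_fqdn` are ours for illustration, not part of any DNS library.

```python
def labels_from_root(name: str) -> list[str]:
    """Split a name into its labels, ordered from the root downwards.

    'www.example.com.' -> ['', 'com', 'example', 'www']
    (the empty label at the front represents the root).
    """
    return name.split(".")[::-1]

def is_fqdn(name: str) -> bool:
    # A name reaching up to the root has an empty right-most label,
    # i.e., it ends with the (usually omitted) final '.'.
    return name.endswith(".")

print(labels_from_root("www.example.com."))  # ['', 'com', 'example', 'www']
print(is_fqdn("www.example.com."))           # True
print(is_fqdn("www.example.com"))            # False
```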

A.2 DNS Resolution
The process of retrieving the RRset for a name in the DNS is called 'resolving' that name [110]. DNS servers that recursively iterate through the DNS tree to retrieve a reply [110] are called 'recursors' or 'recursive resolvers.' Operating systems usually contain a so-called 'stub' resolver [112], which simply forwards DNS resolution requests to a configured recursive resolver, for example one provided by the end-user's Internet service provider, or one of the well-known public resolvers, e.g., 1.1.1.1 (Cloudflare), 8.8.8.8 (Google), or 9.9.9.9 (Packet Clearing House/IBM). These then resolve a requested name for a given RRtype (together: a query) for the client and return the answer, i.e., the retrieved RRset [111].
Resolution. Recursion takes place by the recursive resolver asking at least one authoritative DNS server for each zone on the path from the root to the name, starting at the root; see RFC1034 and RFC1035 [110,111]. A recursive resolver usually contains a static 'root hint' that lists the IP addresses of the DNS servers authoritative for the root ('.'). When a recursive resolver asks an authoritative server for an RRset in a zone for which that server is not authoritative, while it is authoritative for a zone which contains a delegation for a child that is closer to the requested name, it returns the name of that zone and the responsible authoritative DNS servers. For example, if a recursive resolver has to resolve the IPv4 address for www.example.com., it will first ask the root-servers for A www.example.com.. As these are not authoritative for example.com., they will reply with the RRset containing the NS for com.. The recursive resolver will then ask these for www.example.com., and they will reply with the RRset containing the NS authoritative for example.com.. Finally, the recursive resolver can ask the authoritative servers for example.com. for www.example.com.. As these are authoritative for the zone, they will return the requested rdata, e.g., the IPv4 address of www.example.com., if www.example.com. is not further delegated and an RRset for the requested RRtype at www.example.com. exists. Names either below or matching a zone are 'in-bailiwick' for that zone [78]. Please see Figure 8 for an overview of this process.

Figure 8: Resolution of www.example.com. by a recursive DNS resolver: (1) The resolver asks the root-servers for www.example.com.; (2) the resolver is redirected to the authoritative NS for com.; (3) the resolver asks the com. NS for www.example.com.; (4) the resolver is redirected to the authoritative NS for example.com.; (5) the resolver asks the example.com. NS for www.example.com.; (6) the resolver receives the A RRset for www.example.com. in response.

Table 4: List of common DNS terms and abbreviations. See RFC8499 for a comprehensive list and as a standard reference [78].

A
ADDITIONAL: Additional information in a DNS response; may consist of one or multiple RRsets [111].
ANSWER: The part of a DNS response that contains one or multiple RRsets holding the answer to the query. Commonly only present if the queried server is authoritative for the QNAME, or a recursive resolver [111].
Apex: All RRsets whose RRname is equal to the zone are 'at the apex' of a zone.
Authoritative DNS Server: A DNS server to which a zone is delegated, which can answer queries based on its local zone file.

B
Bailiwick: Names either below or matching a zone are 'in-bailiwick' for that zone.
BIND notation: The common notation of RRs for a zone in the form <FQDN> <CLASS> <RRtype> <RDATA>. The syntax is more complex, but we will use this most simple form throughout the paper.

C
Cache: A local temporary storage on recursive resolvers populated with earlier retrieved RRsets whose TTL has not yet expired [8,111].
Caching: The process of committing retrieved RRsets to a cache, but also serving answers from this cache.
Cache-Hit: A query for which a recursive resolver is able to provide an answer from its cache.
Cache-Miss: A query for which a recursive resolver is unable to provide an answer from its cache and has to perform recursion instead.
Child: A zone that has been delegated by a parent, i.e., a zone that is deeper in the tree than its parent.
CLASS: The DNS class. This is essentially always IN for Internet [111], even though other classes (CH for Chaos [2], HS for Hesiod [2], NONE [155], and ANY [111]) do exist.

D
Delegation: The process of pointing to different authoritative servers for a child of the current zone.
DNS (Domain Name System): A system to resolve names to a variety of data points, which replaced /etc/hosts [74].

E
Expire: A value in SOA records instructing secondary authoritative servers how long (in seconds) they should wait after a failed zone transfer until they stop being authoritative for a zone.

F
FQDN (Fully Qualified Domain Name): A name (see below) containing all labels from the terminal label to the root (the empty label above the TLD) [111]. Hence, all FQDNs are names, but not all names are FQDNs.

G
Glue: A and AAAA RRsets sent along with NS records that are in-bailiwick for a delegated zone by the authoritative NS for the parent, in response to a recursive resolver trying to resolve a name in or below the child zone; this enables said recursive resolver to reach the in-bailiwick NS, as their authoritative A and AAAA records would have to be provided by themselves [17].

N
Name: A '.'-delimited set of labels, ordered by the distance to the root of the DNS tree from left (greatest) to right (smallest).
Negative response caching TTL: A value provided in SOA records that instructs a recursive resolver on how long it should cache the non-existence of records [8].
NS (Nameserver): A server implementing the DNS protocol, commonly used for authoritative DNS servers.

P
Parent: A zone which is above a child in the DNS hierarchy and delegates the child by reporting the NS authoritative for the child zone.
Primary: The authoritative server of a zone that holds the primary copy of the zone file and distributes it to secondaries via zone transfers.

Q
QNAME (Query Name): The FQDN in a DNS query.
Query: A DNS request, either from a stub to a recursive resolver or from a recursive resolver to an authoritative server.
QUESTION: The part of a DNS query or response that contains the combination of RRtype and RRname which was the initial query.

R
Root: The top label of the DNS tree, i.e., the root of the tree.
Root-Server: DNS servers that are authoritative for names at the root, i.e., TLDs [35].
RR (Resource Record): An entry at a node (label) within the DNS, consisting of the RRname, CLASS, RRtype, TTL, and its RRdata.
RRname (Resource Record Name): The FQDN associated with a specific RR.
RRset (Resource Record Set): The set of all RRs that have the same RRtype and RRname.
RRtype (Resource Record Type): The type of an RR, see Table 5.

S
Serial: An identifier for the version of a zone in the SOA record. This is an integer and must be monotonically increasing. Commonly, the syntax for this value is YYYYMMDD00 for the first serial created on a day, continuously incremented over the day. Thus, seeing the same serial on two authoritative servers for one zone means that the zone files should be in sync, and no zone transfer is needed.
Secondary: A server that is authoritative for a zone, but receives the zone via a zone transfer from a primary.
SLD (Second Level Domain): A domain that is a child of a TLD.
Stub: See Stub Resolver.
Stub Resolver: A DNS server that does not perform recursion but instead just forwards queries it receives to a recursive resolver.
Subdomain: Generally, a domain below a parent, similar to a child, but only used for zones that are at least below SLDs.

T
TLD (Top Level Domain): Domains that are children of the root.
TTL (Time-to-Live): The time a received RRset may be used to answer queries for its RRname and RRtype from the cache.

Z
Zone: A zone represents a part of the DNS tree and contains a collection of RRsets for which it is authoritative. This means that the names of these RRsets are in-bailiwick of the zone, and authority for these names has not been delegated elsewhere, i.e., the zone contains the final, hence authoritative, answer for queries for these names. For references to corner cases, see RFC8499, Section 7 [78].
Zone file: While nowadays database-backed DNS servers are more common, zones used to be stored in a single text file in BIND notation. Hence, this name is still being used for the data store of zone data [100].
Zone-Cut: The point in an FQDN where authority is delegated from one zone to another.
Zone Transfer: Traditionally, there were primary and secondary authoritative servers. Changes would be made on the primary and then distributed to secondaries via zone transfers. Using only DNS, this can be done using an AXFR RRtype query, to which the primary replies (to authorized secondaries) with a copy of its zone file, i.e., all RRsets of the zone within the ANSWER section. Alternatively, an IXFR can be used, where the secondary provides the primary with its current SOA serial, and the primary then only sends the difference between the zone file with the primary's serial and the one with the secondary's serial [96].
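The referral-following walk described above can be illustrated with a small, self-contained Python simulation. The zone tree below is hard-coded toy data (no real DNS traffic is involved), and the function and variable names are ours.

```python
# Toy delegation tree: each zone either answers authoritatively from its
# records or delegates to a child zone closer to the requested name.
ZONES = {
    ".": {"delegations": {"com."}, "records": {}},
    "com.": {"delegations": {"example.com."}, "records": {}},
    "example.com.": {
        "delegations": set(),
        "records": {("www.example.com.", "A"): "192.0.2.80"},
    },
}

def resolve(qname: str, rrtype: str) -> str:
    zone = "."  # start at the root, as a real recursive resolver would
    while True:
        z = ZONES[zone]
        if (qname, rrtype) in z["records"]:
            return z["records"][(qname, rrtype)]  # authoritative answer
        # Follow the delegation (zone-cut) closest to the requested name.
        nxt = [child for child in z["delegations"] if qname.endswith(child)]
        if not nxt:
            raise LookupError(f"NXDOMAIN: {qname}")
        zone = max(nxt, key=len)  # the child zone deepest in the tree

print(resolve("www.example.com.", "A"))  # 192.0.2.80
```

Note that the suffix check is a simplification; real resolvers compare label by label rather than by string suffix.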
Caching. Recursive resolution is a comparatively long and latency-dependent process. As such, recursive resolvers employ caching to improve their response times. If a resolver successfully retrieved an RRset, it will put this RRset into its local cache. If a subsequent request from a stub resolver for that RRset reaches the recursive resolver, the resolver will not perform recursion, but instead reply from its cache. The amount of time an RRset remains in a recursive resolver's cache depends on the configured Time-To-Live (TTL) of that RRset. If a request can be answered from a recursive resolver's cache, it is called a cache-hit, while cases where the RRset is not part of the local cache are called cache-misses.
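The cache-hit vs. cache-miss behavior described above can be sketched as follows. This is a simplified illustration with a single stubbed answer and a fixed 300-second TTL, not how production resolvers are implemented.

```python
import time

class RecursiveResolver:
    def __init__(self):
        self.cache = {}   # (qname, rrtype) -> (rrset, expiry timestamp)
        self.misses = 0

    def _recurse(self, qname, rrtype):
        self.misses += 1          # a cache-miss triggers recursion
        return "192.0.2.80", 300  # stub answer: rdata plus a 300 s TTL

    def query(self, qname, rrtype, now=None):
        now = time.monotonic() if now is None else now
        key = (qname, rrtype)
        if key in self.cache and self.cache[key][1] > now:
            return self.cache[key][0]  # cache-hit: answer from cache
        rrset, ttl = self._recurse(qname, rrtype)
        self.cache[key] = (rrset, now + ttl)
        return rrset

r = RecursiveResolver()
r.query("www.example.com.", "A", now=0)    # cache-miss -> recursion
r.query("www.example.com.", "A", now=10)   # cache-hit within the TTL
r.query("www.example.com.", "A", now=400)  # TTL expired -> cache-miss again
print(r.misses)  # 2
```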

A.3 The Farsight Dataset
The Farsight Security Information Exchange (Farsight SIE) dataset is a dataset of DNS requests shared by Farsight Security, Inc. (now DomainTools) to allow researchers and security professionals to handle digital threats and study the Internet [61]. In the past, it has been used to characterize general use of the Internet, characterize malware, and study specific security vulnerabilities. Here, we describe how this dataset is collected and what data it contains.
Collection. The dataset is collected at collaborating recursive resolvers (called 'sensors') around the world. Each sensor reports all cache-misses, along with the retrieved data, to Farsight. Farsight then further aggregates this data so that individual sensors cannot be inferred. Please see Figure 9 for an overview. By ensuring that neither individual clients nor specific sensors can be inferred from the aggregate data view, Farsight prevents the exposure of personally identifiable information. More broadly speaking, from the dataset collected by Farsight it is possible to infer that a specific name exists, but not which user requested it, or even where a specific user requested that name. Furthermore, due to the use of cache-misses, the exact popularity of names cannot be inferred.
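This aggregation step can be sketched as follows; the sensor names and DNS records below are made up for illustration, but the principle of dropping the sensor identity before counting matches the description above.

```python
from collections import Counter

# Per-sensor cache-miss reports: (sensor_id, rrname, rrtype, bailiwick, rdata)
reports = [
    ("sensor-a", "www.example.com.", "A", "example.com.", "192.0.2.80"),
    ("sensor-b", "www.example.com.", "A", "example.com.", "192.0.2.80"),
    ("sensor-a", "lms.example.com.", "CNAME", "example.com.", "x.cloud.test."),
]

# Collapse into global counts per unique tuple, dropping the sensor id, so
# neither the reporting sensor nor the requesting client is visible.
aggregate = Counter(r[1:] for r in reports)

for tup, count in aggregate.items():
    print(count, *tup)
# 2 www.example.com. A example.com. 192.0.2.80
# 1 lms.example.com. CNAME example.com. x.cloud.test.
```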
Dataset Structure. We use a historic dataset of all cache misses observed by participating DNS resolvers, spanning January 1, 2015 to October 31, 2022 in per-month slices. A unique cache miss is defined by the tuple of <rrname, rrtype, bailiwick, rdata>. See Table 6 for an overview of the dataset, and the description below for a detailed explanation.
• count: The aggregate number of times cache misses for this unique tuple of rrname, rrtype, bailiwick, and rdata (sorted) have been observed by Farsight sensors. There is no distinction between one sensor having seen a tuple 10 times or 10 sensors each having seen it once. Furthermore, the count depends on the TTL of the RRset, as a higher TTL leads to fewer cache misses. Hence, the count only provides an indication of request frequency, which is why we do not rely on it in our study. Instead, we focus our analysis on establishing a lower bound on the use of cloud resources, or, in more practical terms, we determine if an organization uses specific cloud resources, but not how much it uses them.
• time_first: The first time in a month the unique tuple of rrname, rrtype, bailiwick, and rdata (sorted) was seen.
• time_last: The last time in a month the unique tuple of rrname, rrtype, bailiwick, and rdata (sorted) was seen.
• rrname: The name for which the rrset has been requested.
• bailiwick: The zone from which a reply to a query was received, e.g., considering the example from Figure 8.

Table 5: Overview of the most commonly used RRtypes (excerpt).

NS: Each zone must have at least one (by policy usually at least two for TLDs and SLDs) NS records identifying the authoritative DNS servers for this zone. These have to be set in the zone's apex as well as in the parent (creating the delegation). If the names used in the NS records of a zone are within that zone, the parent must also provide A and/or AAAA records for these names. Even though the parent is not authoritative, it will send these RRsets as 'ADDITIONAL' information along with QNAMEs in or below example.com., so that recursive resolvers have a hint on where to find the nameservers for a domain with in-bailiwick NS. See 'Glue' in Table 4. NS records must point to names that have an A or AAAA record [111]. CNAMEs are not allowed in NS records [17,111].
PTR: PTR or 'pointer' records point to another part of the DNS tree. They are commonly used as 'reverse pointers' for IPv4 and IPv6 addresses, mapping these to FQDNs independent of forward lookups [4]. For each IPv4 address, there is a representation below in-addr.arpa. [58,111], and for each IPv6 address one under ip6.arpa. [34], but the existence of a PTR pointing to, e.g., web01.example.com. does not imply or require the existence of a corresponding A or AAAA record.
TXT: By placing a provided token at an RRname within a zone, one can prove ownership of that zone. This mechanic is used by several online and cloud services, including for authorizing TLS certificates [16], e.g.:
example.com. IN TXT "adobe-idp-site-verification=6c3..."
This list is not exhaustive.

Table 6: List of data fields in the Farsight SIE dataset. In our study, we work with monthly slices of the dataset. For an overview of common rrtype and rdata values, please see Table 5, and for an overview of DNS terminology, see Table 4.

Visibility of the Dataset. The data collection approach of the Farsight dataset also explains why it is better suited for our research question than actively collected DNS datasets. Compared to actively collected DNS datasets, for example OpenINTEL [79,124], the Farsight dataset enables us to look deeper into the DNS tree of individual organizations, i.e., we will observe a specific name, e.g., random.subdomain.service.example.com, as soon as that record is requested at least once by a system using a recursive resolver that acts as a sensor for Farsight. In turn, active measurement platforms use a known list of domains retrieved from the zone files of top-level domains and will regularly request specific records below said domains. Using example.com. here, this might be the NS, MX, A, and AAAA records for example.com. Furthermore, they may request the A and AAAA records for www.example.com and a restricted set of well-known names. Hence, datasets collected by these platforms will not contain data on names like lms.students.example.com, because the subdomain students.example.com is not listed in the authoritative zone file of the top-level domain. Contrary to that, the Farsight SIE dataset will contain data on lms.students.example.com if at least one client behind a sensor requested that name during the measurement period and data was successfully returned. At the same time, this also means that we miss specific names or institutions if the corresponding DNS resources have not been requested by a client behind a sensor contributing to the dataset.
However, this does not pose a problem in the context of our objective to identify a lower bound of cloud usage in universities, as those records we do observe are certainly there. Also, as we discuss in §3.3, our measurements are not polluted by private cloud usage (e.g., Gmail) on university campuses.
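The lower-bound logic can be illustrated with a short sketch: if any observed record for an institution's zone matches a known cloud suffix, the institution counts as using that provider; the absence of a match proves nothing. The suffix-to-provider mapping and the records below are hypothetical examples, not our actual classification rules.

```python
# Hypothetical mapping of rdata suffixes to cloud providers.
CLOUD_SUFFIXES = {
    ".zoom.us.": "Zoom",
    ".outlook.com.": "Microsoft 365",
    ".googlehosted.com.": "Google Workspace",
}

def providers_seen(records):
    """records: iterable of (rrname, rrtype, rdata) tuples for one zone.

    Returns the set of providers for which at least one record matched,
    i.e., a lower bound on the institution's cloud use.
    """
    seen = set()
    for rrname, rrtype, rdata in records:
        for suffix, provider in CLOUD_SUFFIXES.items():
            if rdata.endswith(suffix):
                seen.add(provider)
    return seen

obs = [
    ("uni.example.", "MX", "uni-example.mail.protection.outlook.com."),
    ("www.uni.example.", "A", "192.0.2.10"),
]
print(providers_seen(obs))  # {'Microsoft 365'}
```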

B ONLINE DOMAIN-LIST & ARTIFACT
Hosted on GitHub (Microsoft

Figure 9: The collection methodology for the Farsight dataset, following https://www.farsightsecurity.com/technical/passive-dns/passive-dns-faq/. The Farsight dataset collects cache-misses encountered by recursive resolvers participating as sensors. Critical PII (DNS request times and timings, frequency of lookups, and individual clients' IP addresses) is not included in the dataset, as it is not part of what is collected from cache-misses. All data produced by sensors is aggregated to the count of occurrences of specific cache-misses based on the unique tuple of <rrname, rrtype, bailiwick, rdata>. Furthermore, additional filtering takes place to remove well-known privacy risks and potential PII from the dataset, for example, DNS queries and replies used for IP-in-DNS tunneling.