TWo-IN-one-SSE: Fast, Scalable and Storage-Efficient Searchable Symmetric Encryption for Conjunctive and Disjunctive Boolean Queries

Searchable Symmetric Encryption (SSE) supports efficient yet secure query processing over outsourced symmetrically encrypted databases without the need for decryption. A longstanding open question has been the following: can we design a fast, scalable, linear storage and low-leakage SSE scheme that efficiently supports arbitrary Boolean queries over encrypted databases? In this paper, we present the design, analysis and prototype implementation of the first SSE scheme that efficiently supports conjunctive, disjunctive and more general Boolean queries (in both the conjunctive and disjunctive normal forms) while scaling smoothly to extremely large encrypted databases, and while incurring linear storage overheads and supporting extremely fast query processing in practice. We quantify the leakage of our proposal via a rigorous cryptographic analysis and argue that it achieves security against a well-known class of leakage-abuse and volume analysis attacks. Finally, we demonstrate the storage-efficiency and scalability of our proposed scheme by presenting experimental results of a prototype implementation of our scheme over large real-world databases.


INTRODUCTION
The advent of cloud computing potentially allows individuals and organizations to outsource storage and processing of large volumes of data to third party servers. However, this leads to concerns surrounding the confidentiality of the data stored on the third party cloud servers and outsourced access for processing.
Consider, for instance, a client that offloads an encrypted database of (potentially sensitive) emails to an untrusted cloud server. At a later point of time, the client might want to issue a query of the form: retrieve all emails received from xyz@foobar.org or abc@foobar.org and with "research" in the subject field. Ideally, the client should be able to perform this task without revealing any sensitive information to the server, such as the sources and contents of the emails, the keywords underlying a given query, the distribution of keywords across emails, etc. Unfortunately, existing techniques such as Fully Homomorphic Encryption (FHE) [14] and Oblivious RAM (ORAM) [16], which potentially support such an "ideal" notion of privacy, are currently unsuitable for wide-scale practical deployment due to high performance and storage overheads.
* Also with ETH Zürich, Switzerland and Visa Research USA (part of the work was done while the author was affiliated with these institutions).
Searchable Symmetric Encryption. Searchable Symmetric Encryption (SSE) [3, 4, 6-13, 15, 19, 21-23, 29, 30] is the study of provisioning symmetric-key encryption schemes with search capabilities. The goal of SSE is two-fold: (a) to allow a (potentially untrusted) server to execute keyword search queries directly on a collection of a client's encrypted documents in an efficient manner, and (b) to ensure client privacy by minimising the amount of information "leakage" to the server in the process. Some examples of leakage include the database size, query pattern (which queries correspond to the same keyword) and the access pattern (the set of file identifiers matching a given query).
SSE for Boolean Queries. The example query over an email database that we outlined above is an instance of what we call a Boolean query, in the sense that it can be viewed as a Boolean formula involving certain equality predicates over keywords, connected by AND and OR operators. In this paper, we broadly investigate the following question: Can we design a fast, scalable, storage-efficient and low-leakage SSE scheme for general Boolean queries?
This seemingly natural question has, somewhat surprisingly, also been a longstanding open question. In particular, while significant progress has been made in designing efficient SSE schemes for simpler sub-classes of Boolean queries (such as atomic equality predicates and conjunctions of keywords), the handful of existing SSE schemes supporting disjunctive and general Boolean queries incur extremely large encrypted storage overheads (quadratic in the size of the plain database), which makes them impractical for real-world deployment. We briefly summarise the state-of-the-art on both conjunctive and disjunctive SSE below followed by main contributions of this work.

Background and Related Work
Initial constructions of SSE focused on single keyword search and their related extensions. Recent SSE algorithms now support multi-keyword search. We briefly go over the current status of multi-keyword SSE constructions below.
SSE for Conjunctions. In a seminal work [7], Cash et al. proposed Oblivious Cross Tags (OXT), an efficient, highly scalable and low-leakage (not leaking more than benign information) SSE scheme supporting conjunctive keyword queries over encrypted document collections. Since then, a number of SSE schemes supporting conjunctive keyword queries with a variety of leakage vs efficiency trade-offs have been proposed in different settings [6,23,29]. Unfortunately, these schemes are neither efficient nor low-leakage when processing disjunctive or general Boolean queries. For example, when processing a disjunctive query over (w 1 , . . . , w n ), these solutions leak the set of documents matching each w i individually, which can be devastating in the face of existing leakage-abuse attacks [2,5,25,32].
SSE for Disjunctions. While SSE for single and conjunctive keyword queries has been studied quite extensively, SSE for disjunctive queries has received much less attention. To the best of our knowledge, only IEX-2Lev and IEX-ZMF due to Kamara and Moataz [19] support reasonably efficient query processing without incurring potentially devastating leakage.
The IEX family of schemes [19,26] has a few disadvantages that we outline here. First, its performance for conjunctive queries over real-world databases is significantly worse as compared to OXT. Secondly, it is incompatible with OXT and its follow-up schemes [6,23,29]; so it does not lead to a common solution that supports both conjunctive and disjunctive queries efficiently. Finally, the IEX family of schemes incurs a (worst-case) storage overhead that grows quadratically with the number of keywords in the database. This makes it impractical for deployment over real-world databases.

Our Contributions
In this paper, we present the design, analysis and prototype implementation of the first SSE scheme that efficiently supports conjunctive, disjunctive and more general (and complex) Boolean queries (in both the conjunctive and disjunctive normal forms) while scaling smoothly to extremely large encrypted databases, and while incurring linear storage overheads and low query processing overheads in practice. Our scheme is named TWo-IN-one-SSE, or TWINSSE in short. We expand on our contributions and techniques below.
Supporting Conjunctive "and" Disjunctive Queries. Our core technical contribution is a novel mechanism for designing SSE schemes that support both conjunctive and disjunctive keyword searches in a fully compatible manner. At a high level, we achieve this as follows. Given any conjunctive SSE scheme (i.e., any generic SSE scheme that only supports conjunctive queries), we present a generic black-box transformation that yields an SSE scheme supporting conjunctive, disjunctive and general Boolean queries in the conjunctive normal form (CNF) and disjunctive normal form (DNF). Our transformation does not rely on any special properties of the underlying conjunctive SSE scheme. This allows it to be instantiated from any existing conjunctive SSE scheme, including OXT [7]. To the best of our knowledge, such a generic transformation from a conjunctive SSE scheme to an SSE scheme for general and complex Boolean queries has not been studied before in the SSE literature 1 .
A Naïve Approach. The naïve approach for supporting disjunctive queries in a generic way using a system that only supports conjunctive queries is to allow for "negative searches", wherein given a keyword, we can efficiently retrieve the set of documents that do not contain the keyword. Then, given a disjunctive query of the form $q = (w_1 \vee \ldots \vee w_n)$, we can transform q into the conjunctive query $\bar{q} = (\bar{w}_1 \wedge \ldots \wedge \bar{w}_n)$, where for each i, $\bar{w}_i$ denotes the "negated keyword" that (hypothetically) occurs in every document that does not contain $w_i$; consequently, $\bar{q}$ can be viewed as the negated counterpart to the original query q, and the result of q is the complement of the result of $\bar{q}$.
This approach has two major disadvantages. First, it requires us to design data structures that support efficiently retrieving, for each keyword, not only the set of documents it occurs in, but also the set of documents that it does not occur in, while maintaining data and query privacy. This is likely to lead to massive blowup in storage. Secondly, and most crucially, for disjunctive queries involving less frequent keywords (which is what we expect from a very large proportion of the queries), the overall computational and communication complexity suffers a huge blowup, since it is now proportional to the result set for the negated query, which would be almost the entire set of documents. Our aim is to design a generic transformation mechanism that is significantly more efficient. To this end, we introduce and use a novel concept called meta-keywords.
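The second disadvantage can be made concrete with a small plaintext sketch. The data, the names `all_ids`, `db` and `negated`, and the use of plain Python sets in place of an encrypted index are all illustrative assumptions, not part of the scheme.

```python
# Toy inverted index: keyword -> set of document ids (made-up data).
all_ids = set(range(1, 1001))
db = {"rare1": {3, 17}, "rare2": {17, 42, 99}}

def negated(word):
    """Documents that do NOT contain `word` (the hypothetical negated keyword)."""
    return all_ids - db[word]

# Naive plan for q = rare1 OR rare2:
# evaluate the negated conjunction, then take the complement.
negated_result = negated("rare1") & negated("rare2")
disjunction = all_ids - negated_result

print(len(negated_result), sorted(disjunction))
```

In this toy setting the negated conjunction touches 996 of the 1000 documents in order to answer a query whose actual result has only four documents.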
Using Meta-Keywords. The technical centrepiece of our generic transformation is the concept of meta-keywords, which we introduce in this paper. At a high level, a meta-keyword $mkw_i$ is a disjunction of certain carefully chosen keywords of the form $mkw_i = (w_{i_1} \vee w_{i_2} \vee \ldots \vee w_{i_\ell})$, that we pre-process and store at setup in an inverted search index².
Our core technical observation is the following: given a database with N keywords, there exists an O(N)-sized set S of meta-keywords such that for any disjunctive query of the form $q = (w_1 \vee \ldots \vee w_n)$ for n ≤ N, there exists a meta-query which is a conjunction over meta-keywords of the form $q' = (mkw_1 \wedge \ldots \wedge mkw_n)$ such that $mkw_1, \ldots, mkw_n \in S$, and such that Result-Set(q) ⊆ Result-Set(q′).
The non-triviality of our approach lies in addressing the following challenges simultaneously:
• Coverage: Designing an O(N)-sized meta-keyword set that "covers" an $O(2^N)$-sized space of all possible disjunctive queries.
• Efficiency: Minimizing the overheads due to filtering of "spurious" documents in the result "meta-set" (i.e. ensuring that Result-Set(q′) is as close to Result-Set(q) as possible).
• Security: Minimizing leakage by ensuring that the meta-keywords reveal as little information as possible about the underlying keywords being actually queried.
While achieving these requirements simultaneously appears challenging at first sight, we develop a systematic and formal approach that allows us to achieve them for any database; more formally, given a database, we show how to convert the same into a meta-database equipped with a linearly-sized set of meta-keywords that meets all of the aforementioned requirements. We formalise these properties in Sections 3 and 4.

TWINSSE.
We use the aforementioned transformation to design our overall solution, which we call TWo-IN-one-SSE, or TWINSSE in short. As the astute reader might have already guessed, given a conjunctive SSE scheme, our design of TWINSSE uses the following two-step approach:
• Step-1: Given any database DB, convert it into the corresponding meta-database $\overline{DB}$, where $\overline{DB}$ can be viewed as a database equipped with two kinds of keywords: the original keywords and the meta-keywords.
• Step-2: Apply the conjunctive SSE scheme to encrypt and query the meta-database $\overline{DB}$.
Note that conjunctive query processing over the meta-database $\overline{DB}$ proceeds exactly as it would over DB, and requires no additional query planning on the part of the client. Disjunctive query processing is more involved because it requires the client to plan the meta-query. In Section 4, we formally describe how this can be done using O(n) computation (which is the information-theoretic minimum for any n-word disjunctive query). Finally, in Appendix G, we describe a hybrid query planning approach that allows handling general Boolean queries in CNF and DNF expressions in an efficient manner. We note here that the ability to efficiently handle CNF and DNF queries effectively allows TWINSSE to handle complex Boolean queries (involving both conjunctive and disjunctive clauses) by casting them into either CNF or DNF formulae over keywords.
We then present a concrete instantiation of TWINSSE with OXT as the baseline conjunctive SSE scheme. We denote this instantiation as TWINSSE OXT . Details of TWINSSE OXT and, in particular, its handling of disjunctive queries, are presented in Section 4.3. Additionally, we present an elaborate discussion on executing complex Boolean queries using TWINSSE OXT in Appendix G.
Leakage Analysis. We formally detail the leakage profile of TWINSSE OXT in Appendix D. In order to analyse the impact of this leakage on the security of TWINSSE OXT , we perform a detailed cryptanalysis of TWINSSE OXT in Appendix F. In particular, we show that known leakage-abuse attacks [5,32], volume analysis-based attacks [2], and the state-of-the-art SAP attack [25] fail against TWINSSE OXT in practical adversarial settings.
Experimental Evaluation. We present a C++ implementation of TWINSSE OXT along with performance figures in Section 5. We experimented over the Enron email corpus 34 for compatibility with previous SSE literature. The data set contains around 170K keywords, 500K documents and 20 million unique keyword-document

PRELIMINARIES AND BACKGROUND
In this section we introduce the notations used in the rest of the paper, as well as preliminary background material on SSE.

Notations
We write $x \xleftarrow{R} \chi$ to represent that an element x is sampled uniformly at random from a set/distribution $\chi$. The output x of a deterministic algorithm A is denoted by x = A and the output x′ of a randomized algorithm A′ is denoted by x′ ← A′. For a, b ∈ Z such that a, b ≥ 0, we denote by [a] and [a, b] the set of integers lying between 1 and a (both inclusive), and the set of integers lying between a and b (both inclusive), respectively.
Databases. Let ∆ = {w 1 , . . . , w N } be a dictionary of keywords, and let F = { f 1 , . . . , f D } be a collection of documents, such that each document f i is associated with a unique identifier id i and contains keywords from ∆. We assume that standard set operations including union and intersection are allowed over ∆. We denote by DB a database of identifier-keyword pairs, such that (id, w) ∈ DB if and only if the document with identifier id contains the keyword w. We denote by DB(w) the set of all identifiers corresponding to documents containing w. We denote by |∆| the number of distinct keywords in DB, by |DB| the number of distinct (id, w) pairs in DB, and by |DB(w)| the number of documents containing w.
Conjunctive and Disjunctive Queries. We represent a conjunctive query over n distinct keywords $w_1, \ldots, w_n$ as $q = w_1 \wedge \ldots \wedge w_n$ and define the set DB(q) as $DB(q) = \bigcap_{i=1}^{n} DB(w_i)$. Similarly, we represent a disjunctive query over n distinct keywords $w_1, \ldots, w_n$ as $q = (w_1 \vee \ldots \vee w_n)$ and define the set DB(q) as $DB(q) = \bigcup_{i=1}^{n} DB(w_i)$. Throughout the paper, we use the notation $R_q = DB(q)$ to represent the result of searching a query q (irrespective of the query type), unless otherwise specified.
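As a plaintext illustration of this notation, the following minimal sketch models DB as a set of (id, w) pairs and evaluates DB(w) and DB(q) with ordinary set operations. The sample data and the helper names `db_w`, `conj` and `disj` are assumptions made for the example.

```python
# DB as a set of (id, w) pairs, as in the text; the contents are made up.
DB = {(1, "alice"), (1, "research"), (2, "bob"), (2, "research"), (3, "alice")}

def db_w(w):
    """DB(w): identifiers of all documents containing keyword w."""
    return {doc_id for (doc_id, kw) in DB if kw == w}

def conj(*ws):
    """DB(q) for q = w_1 AND ... AND w_n (intersection of the DB(w_i))."""
    return set.intersection(*(db_w(w) for w in ws))

def disj(*ws):
    """DB(q) for q = w_1 OR ... OR w_n (union of the DB(w_i))."""
    return set.union(*(db_w(w) for w in ws))

print(len(DB), conj("alice", "research"), disj("alice", "bob"))
```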

Searchable Symmetric Encryption
Any SSE scheme [7,11] consists of a polynomial-time algorithm Setup executed by the client, and an interactive protocol Search executed jointly by the client and the server:
• Setup(1^λ, DB): Takes as input the security parameter λ and a database DB, and outputs the tuple (sk, st, EDB), where sk is the client's secret-key, st is the client's internal state, and EDB is the encrypted database.
• Search(sk, st, q; EDB): The client takes as input the secret-key sk, its state st and a query q, while the server takes as input the encrypted database EDB. At the end of the protocol, the client outputs DB(q)⁵.
Correctness. An SSE scheme is said to be correct if for every database DB and for every query q, the output of the Search protocol contains DB(q) with overwhelming probability.
Security. We refer to [7,11] for the standard simulation security definition of SSE against semi-honest adversaries in the real-world/ideal-world paradigm.

TWINSSE: SIMPLIFIED VERSION
In this section, we introduce TWINSSE. For ease of representation, we first present a simplified version, which we refer to as TWINSSE Basic .

The Core Tool: Meta-Keywords
We begin by describing the core technical tool for our construction, which we refer to as meta-keywords. Let ∆ = {w 1 , . . . , w N } be a dictionary of keywords (N is the total number of keywords in DB), and assume (without loss of generality) that these keywords are arranged in increasing order of frequency, i.e., $|DB(w_1)| \le |DB(w_2)| \le \ldots \le |DB(w_N)|$.
Super-Keyword. We begin by defining a super-keyword, which is simply a disjunction of some subset of the keywords in ∆. Formally, a super-keyword $\widehat{w}$ is represented by a bit-string of the form $\widehat{w} = b_1 b_2 \ldots b_N \in \{0,1\}^N$, where $b_i = 1$ if and only if the keyword $w_i$ is part of the disjunction. Note that one could equivalently represent $\widehat{w}$ using the actual constituent keywords; we use the bit-string representation because it makes the description of our strategy easier to follow, and also more efficiently implementable.
Meta-Keyword. We now define a meta-keyword. At a high level, a meta-keyword is a "special" super-keyword with a single contiguous stretch of 0-entries in its Boolean representation. Formally, a meta-keyword is defined as follows: for indices 1 ≤ i ≤ j ≤ N, the meta-keyword $mkw_{i,j}$ is the super-keyword whose bit-string has 0-entries exactly at positions i through j.
In other words, for the meta-keyword $mkw_{i,j}$, we have $mkw_{i,j} = 1^{i-1}\, 0^{j-i+1}\, 1^{N-j}$. Informally, one can view a meta-keyword $mkw_{i,j}$ as a disjunction over ∆ = {w 1 , . . . , w N } excluding a contiguous sequence of keywords $(w_i, w_{i+1}, \ldots, w_j)$.
We also let $mkw^* = 1^N$ denote the special "all-ones" meta-keyword. Finally, let $S_{mkw,\Delta} = \{mkw_{i,j}\}_{i \le j} \cup \{mkw^*\}$ be the set of all meta-keywords over the dictionary ∆. It is easy to see that for |∆| = N, we have $|S_{mkw,\Delta}| = O(N^2)$.
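The size of the meta-keyword set can be checked by enumerating it directly. The sketch below assumes the bit-string reading used above (a single stretch of 0-entries at positions i through j, 1-indexed); the function names are hypothetical.

```python
def meta_keyword(i, j, n):
    """Bit-string for mkw_{i,j}: ones everywhere except positions i..j (1-indexed)."""
    return "1" * (i - 1) + "0" * (j - i + 1) + "1" * (n - j)

def all_meta_keywords(n):
    """All mkw_{i,j} with i <= j over an n-keyword dictionary, plus the all-ones mkw*."""
    mkws = {(i, j): meta_keyword(i, j, n)
            for i in range(1, n + 1) for j in range(i, n + 1)}
    mkws["*"] = "1" * n
    return mkws

S = all_meta_keywords(4)
print(len(S), S[(2, 3)])
```

Under this reading the set has N(N+1)/2 + 1 elements, which is 11 for N = 4 and grows quadratically in N.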
Using a Meta-Keyword. A reader might wonder why we choose the above definition of a meta-keyword. To begin with, note that pre-processing and storing an inverted search index consisting of all super-keywords along with the original keywords would allow us to trivially answer conjunctive, disjunctive, and more general Boolean queries in a fully compatible manner. However, this would require exponential storage, and is therefore not practically feasible.
Hence, our approach is to look for a poly-sized subset of the set of all possible super-keywords that, if pre-processed and stored as part of the inverted search index, would allow us to "cover" any disjunctive query. It turns out that the set of meta-keywords is indeed this set. We make this explicit by stating the following (informal) claim. We subsequently make this claim more formal and prove it.
Claim 3.1 (Informal). For any disjunctive query of the form $q = (w_{\ell_1} \vee \ldots \vee w_{\ell_n})$ (where n ≤ N), there exists a meta-query which is a conjunction over meta-keywords of the form $q_{mkw} = (mkw_{i_1,j_1} \wedge \ldots \wedge mkw_{i_n,j_n})$ such that $mkw_{i_1,j_1}, \ldots, mkw_{i_n,j_n} \in S_{mkw,\Delta}$, and such that $DB(q) \subseteq DB(q_{mkw})$.
Using Claim 3.1 (Overview). As an astute reader might have already observed, this claim allows us to convert any disjunctive query over the original set of keywords into a conjunctive query over the set of meta-keywords. Consequently, given a database DB over a dictionary ∆, suppose we pre-process DB at setup to construct the set of meta-keywords $S_{mkw,\Delta}$, and build an augmented meta-database $\overline{DB}$ over the meta-dictionary $\overline{\Delta} = \Delta \cup S_{mkw,\Delta}$ (consisting of both the original keywords and the meta-keywords). We can then use this augmented (plaintext) database together with any conjunctive SSE scheme in a black-box manner to build an SSE scheme that supports both conjunctive and disjunctive queries. The only price we pay is the $O(N^2)$ storage overhead; we subsequently show how to reduce this to O(N).

Meta-Keywords as "Covering" Set
We now formalize and prove Claim 3.1. Before detailing the formal proof, we illustrate why this claim is true via a simple toy-example.
Toy-Example. Consider a database DB with 10 documents (indexed as {id 1 , . . . , id 10 }) over a four-keyword dictionary Now consider the following example disjunctive queries q and the corresponding meta-queries q mkw .
Formal Statement. We now state the following formal version of Claim 3.1.
Observe that for the specific examples stated above, the conjunctive meta-query $q_{mkw}$ exactly follows the generic conjunctive meta-query laid out in the above lemma. Here, $\ell_k$ denotes the index of the k-th query keyword in ∆, whereas $(i_k, j_k)$ denote the start and end indices of the stretch of absent keywords in each mkw. Following the above mkw formulation, we see that $mkw^*$ occurs only if all keywords in ∆ are present in the query q, i.e., when the number n of keywords in q is the same as the number of keywords in ∆.
In this case, the index of the last query keyword in ∆ (or $\ell_n$) is equal to the number of query keywords n, and only $mkw^*$ is selected. This is a rather unusual case that rarely occurs in real applications. Note that, at a high level, to prove this lemma it suffices to prove that for each keyword $w_{\ell_k}$ in the queried disjunction q, we have $DB(w_{\ell_k}) \subseteq DB(q_{mkw})$. In more detail, we would like to prove that for each keyword $w_{\ell_k}$ in the queried disjunction q, we have $DB(w_{\ell_k}) \subseteq DB(mkw_{i_{k'},j_{k'}})$ for each $k' \in [0, n]$. The proof follows from the fact that for each k ∈ [n], the following must be true:
• The index corresponding to the keyword $w_{\ell_k}$ has a 1-entry in every non-empty meta-keyword in the set $\{mkw_{i_{k'},j_{k'}}\}_{k' < k}$. This is because the "stretch" of 0-entries in each such meta-keyword ends before the index $\ell_k$.
• The index corresponding to the keyword $w_{\ell_k}$ has a 1-entry in every non-empty meta-keyword in the set $\{mkw_{i_{k'},j_{k'}}\}_{k' \ge k}$. This is because the "stretch" of 0-entries in each such meta-keyword starts after the index $\ell_k$.
• Finally, the index corresponding to the keyword $w_{\ell_k}$ has, by default, a 1-entry in $mkw^*$, the "all-ones" meta-keyword.
Combining these observations, we get that for each keyword $w_{\ell_k}$ in the queried disjunction q, we have $DB(w_{\ell_k}) \subseteq DB(mkw_{i_{k'},j_{k'}})$ for each $k' \in [0, n]$, as desired. Figure 1 captures the aforementioned intuition pictorially. We defer the formal proof of Lemma 3.1 to Appendix A due to space constraints.
Figure 1: Expressing disjunctive query q in terms of mkw-s with a single stretch of 0s (keywords ordered by increasing frequency). In this example, ∆ = {w 1 , . . . , w 12 }, and $q = w_1 \vee w_3 \vee w_6 \vee w_{10} \vee w_{12}$ (where $\ell_1 = 1, \ell_2 = 3, \ell_3 = 6, \ell_4 = 10, \ell_5 = 12$ and n = 5). Note that each mkw has ws present at the same places where a w is present in q. The stretches of 0s (absence of ws) ensure that when the mkws are ANDed together (searched in a conjunctive manner), only ws in the original query q remain. (Gray and white cells represent 1 and 0, respectively.)
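The construction behind this argument can be sketched in a few lines. The code below is one reading of the meta-query construction, inferred from the proof sketch and the Figure 1 example rather than taken from the formal lemma: each meta-keyword's stretch of 0s fills the gap between two consecutive query indices, empty gaps are skipped, and only the all-ones meta-keyword is used when no gap exists. The names `meta_query`, `to_bits` and `and_all` are hypothetical.

```python
def meta_query(query_idx, n):
    """Zero-stretches (i, j) of the meta-keywords covering a disjunctive query.

    query_idx: 1-based dictionary indices of the queried keywords.
    n: dictionary size N.
    """
    bounds = [0] + sorted(query_idx) + [n + 1]
    stretches = [(lo + 1, hi - 1)
                 for lo, hi in zip(bounds, bounds[1:]) if hi - lo > 1]
    return stretches or ["*"]  # "*" stands for the all-ones meta-keyword

def to_bits(stretch, n):
    """Bit-vector of a meta-keyword given its zero-stretch."""
    if stretch == "*":
        return [1] * n
    i, j = stretch
    return [0 if i <= p <= j else 1 for p in range(1, n + 1)]

def and_all(stretches, n):
    """Position-wise AND of the meta-keyword bit-vectors."""
    rows = [to_bits(s, n) for s in stretches]
    return [min(col) for col in zip(*rows)]

# Figure 1 example: q = w1 OR w3 OR w6 OR w10 OR w12 over 12 keywords.
plan = meta_query([1, 3, 6, 10, 12], 12)
survivors = [p for p, b in enumerate(and_all(plan, 12), start=1) if b]
print(plan, survivors)
```

The positions that survive the AND are exactly the queried indices, which matches the figure. Note that this is a statement about bit-strings only: the document set DB(q mkw) can still contain spurious identifiers, which is why the client filters locally.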

TWINSSE Basic
We now put everything together in our basic scheme TWINSSE Basic . Let CSSE = (CSSE.Setup, CSSE.Search) be any generic conjunctive SSE scheme. Given CSSE, we construct TWINSSE Basic = (TWINSSE Basic .Setup, TWINSSE Basic .Search) as described subsequently. Our description here is slightly informal due to space constraints, but captures the overall idea of our approach. More details are available with the final construction in Section 4. We also present brief details of processing purely conjunctive or purely disjunctive queries; we defer the discussion on our treatment of general Boolean formulae to Appendix G.
TWINSSE Basic .Setup(1^λ, DB): Given a database DB over a dictionary ∆, construct the set of meta-keywords $S_{mkw,\Delta}$ as described above. Let $\overline{DB}$ denote the meta-database over $\overline{\Delta} = \Delta \cup S_{mkw,\Delta}$ (consisting of both the original keywords and the meta-keywords). Output (sk, st, EDB) ← CSSE.Setup(1^λ, $\overline{DB}$).
TWINSSE Basic .Search(sk, st, q; EDB): Given a query q, proceed as follows:
• If q is a purely conjunctive query, output DB(q) = CSSE.Search(sk, st, q; EDB).
• If q is a purely disjunctive query, construct the conjunctive meta-query $q_{mkw}$ as described in Lemma 3.1, recover $DB(q_{mkw})$ = CSSE.Search(sk, st, $q_{mkw}$; EDB), and locally filter $DB(q) \subseteq DB(q_{mkw})$.
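The flow of the basic scheme can be traced on plaintext data. In the sketch below, a plain dictionary of sets stands in for the encrypted database and `conjunctive_search` stands in for CSSE.Search; no encryption is performed, and all names, the toy data and the gap-based meta-query planning are assumptions made for illustration.

```python
def build_meta_db(db, order):
    """Inverted index over keywords and meta-keywords.

    db: dict id -> set of keywords; order: keywords by increasing frequency.
    """
    n = len(order)
    index = {w: {d for d, ws in db.items() if w in ws} for w in order}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            kept = order[: i - 1] + order[j:]  # keywords outside the zero-stretch
            index[("mkw", i, j)] = set().union(*(index[w] for w in kept))
    index[("mkw", "*")] = set().union(*(index[w] for w in order))
    return index

def conjunctive_search(index, terms):
    """Plaintext stand-in for CSSE.Search on a conjunction of terms."""
    return set.intersection(*(index[t] for t in terms))

def disjunctive_search(db, index, order, query):
    """Plan the meta-query, run it conjunctively, then filter locally."""
    pos = sorted(order.index(w) + 1 for w in query)
    bounds = [0] + pos + [len(order) + 1]
    terms = [("mkw", lo + 1, hi - 1)
             for lo, hi in zip(bounds, bounds[1:]) if hi - lo > 1]
    terms = terms or [("mkw", "*")]
    candidates = conjunctive_search(index, terms)
    exact = {d for d in candidates if db[d] & set(query)}
    return candidates, exact

db = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: {"b", "d"}, 5: {"d"}, 6: {"c", "d"}}
order = ["a", "b", "c", "d"]  # frequencies 1, 2, 2, 3
index = build_meta_db(db, order)
candidates, exact = disjunctive_search(db, index, order, {"a", "c"})
print(sorted(candidates), sorted(exact))
```

Here document 4 (containing only b and d) matches both meta-keywords and is returned as a spurious candidate, which the local filtering step removes. In a real deployment the filter would operate on the retrieved documents rather than on a plaintext `db`.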
Correctness. Correctness is immediate from Lemma 3.1 (see Appendix A) and correctness of the CSSE scheme. However, this includes the trivial case of returning the entire database upon searching a disjunctive query. To avoid such trivial inclusions, we bound the returned result set size close to the actual result set via a precision parameter. We define precision η by the following ratio: $\eta = |R_q| / |\widehat{R}_q|$.
At a high level, this precision parameter η measures how close the obtained result set is to the actual result set: a fraction 1 − η of the returned ids is spurious. Thus, the correctness can now be defined by the following statement. For a functionally correct and exact⁶ conjunctive SSE scheme CSSE, a plaintext database DB, and a disjunctive query q with the corresponding transformed meta-query $q_{mkw}$, TWINSSE is functionally correct if $R_q \subseteq \widehat{R}_q$ and $|\widehat{R}_q| \le \frac{1}{\eta} \cdot |R_q|$ (0.85 < η ≤ 1), where $\widehat{R}_q = DB(q_{mkw})$ is the result set returned by TWINSSE Basic and $R_q = DB(q)$.
Note that the lower bound of η is obtained empirically from experiments over real databases. We present such experimental details in Section 5. The lower bound can be adjusted to accommodate larger $\widehat{R}_q$ for different databases if required.
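A small numeric sketch of the precision parameter follows. It assumes η is the ratio of the actual result-set size to the returned result-set size, which is the reading consistent with the stated bound on the returned set; the function name and the sample sets are hypothetical.

```python
def precision(actual, returned):
    """eta = |R_q| / |returned set|, assuming actual is a subset of returned."""
    if not actual <= returned:
        raise ValueError("returned set must contain the actual result set")
    return len(actual) / len(returned) if returned else 1.0

actual = {1, 3, 6}        # DB(q)
returned = {1, 3, 4, 6}   # DB(q_mkw), with one spurious id
eta = precision(actual, returned)
print(eta, len(returned) <= len(actual) / eta)
```

This toy value (0.75) falls below the empirical 0.85 lower bound quoted in the text, which was measured on real databases rather than on examples of this size.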
Storage Overhead. TWINSSE Basic incurs $O(N^2)$ storage overhead to store the meta-keywords, where N is the number of keywords in the original plaintext database. This follows immediately from the fact that the number of meta-keywords is $O(N^2)$. This is undesirable in practice as it affects the scalability of the construction for large real databases. Currently, the schemes designed for disjunctive queries (such as IEX) require quadratic storage, often leading to storage blow-up for large databases. Our final construction in Section 4 reduces the $O(N^2)$ storage overhead to O(N), a necessary and significant reduction for deployment over large real databases.
Search Overhead. The disjunctive search uses a meta-keyword as the least-frequent term for searching with the CSSE search routine. Since each mkw is a "union" of constituent ws, on average the frequency of the least-frequent mkw is larger than that of the least-frequent term in a conjunctive query over the same ws. As a result, this basic method can potentially result in worst-case linear search overhead, which would be highly undesirable. However, we avoid such overheads by choosing an underlying CSSE scheme that ensures sub-linear search complexity.
⁶ An exact solution returns only the documents belonging to the actual query result.

TWINSSE: FINAL VERSION
In this section, we present our final scheme, TWINSSE, which improves upon TWINSSE Basic with respect to storage requirements as well as search overheads. At the core of both these improvements lies an additional technique that we describe next: "frequency-based bucketization" of keywords. We note that similar techniques have been used in the SSE literature [17], albeit almost entirely for frequency padding and leakage-reduction. To the best of our knowledge, we are the first to show that bucketization can also be used to reduce storage and search overheads in SSE schemes.

Keyword Bucketization at Setup
We now describe our strategy for frequency-based keyword bucketization and intra-bucket meta-keyword generation at setup. We then use this updated meta-keyword generation strategy to formally describe the new setup algorithm -TWINSSE.Setup.
Bucketization. Let ∆ = {w 1 , . . . , w N } be a dictionary of keywords, and assume that these keywords are arranged in increasing order of frequency. Also, let n′ = O(1) be any arbitrarily chosen constant. We partition the keyword space into $n_B = N/n'$ "buckets" of size n′ each ($n_B = \lceil N/n' \rceil$ if N is not a multiple of n′), where the k-th bucket is defined formally as the keyword subset $\Delta_k = \{w_{(k-1)n'+1}, \ldots, w_{\min(kn', N)}\}$. Note that since all keywords are arranged in increasing order of frequency, each bucket from $\Delta_1$ through $\Delta_{n_B}$ progressively consists of keywords with increasing frequency ranges. We note that this is similar to the bucketization strategy employed in [17].
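The partitioning step can be sketched as plain list slicing. The function name `bucketize` and the ten-keyword dictionary are illustrative assumptions; the input list is assumed to be already sorted by increasing frequency.

```python
import math

def bucketize(sorted_keywords, bucket_size):
    """Split a frequency-sorted keyword list into ceil(N / n') consecutive buckets."""
    n_b = math.ceil(len(sorted_keywords) / bucket_size)
    return [sorted_keywords[k * bucket_size:(k + 1) * bucket_size]
            for k in range(n_b)]

buckets = bucketize(list("abcdefghij"), 4)
print(len(buckets), [len(b) for b in buckets])
```

With N = 10 and n′ = 4 this yields three buckets, the last of which holds only the two remaining keywords.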
Intra-Bucket Meta-Keyword Generation. Having partitioned the keyword space into frequency-based buckets, we now proceed as follows:
• For each bucket $\Delta_k$, we generate an intra-bucket meta-keyword set $S_{mkw,k}$ of size $O(|\Delta_k|^2) = O((n')^2)$. This is done exactly as in TWINSSE Basic , i.e., following the meta-keyword generation strategy in Lemma 3.1.
• We then define the overall set of meta-keywords as the collection of intra-bucket meta-keywords from all buckets, i.e., $S_{mkw,\Delta} = \bigcup_{k \in [n_B]} S_{mkw,k}$.
Observe that $|S_{mkw,\Delta}| = n_B \cdot O((n')^2) = O(N \cdot n')$. However, n′ = O(1) is a constant, and hence, unlike in the basic solution described in Section 3, now $|S_{mkw,\Delta}| = O(N)$. In other words, we now have a linear-sized meta-keyword set, which forms the key stepping stone towards avoiding a quadratic storage overhead. We design our proposed TWINSSE to work for any choice of n′ (ideally, n′ should be a small constant to avoid high storage overheads); we use n′ = 10 for our prototype implementation and experimentation over real-world databases in Section 5. We present brief discussion and empirical evaluations on the Enron dataset in Section 5 to select a suitable value for n′.
TWINSSE.Setup: We now put these ideas together to formally describe TWINSSE.Setup in Algorithm 1, which in turn again uses any generic conjunctive SSE scheme in a black-box way. The key changes from the basic scheme in Section 3 are highlighted in red for ease of exposition (in fact, TWINSSE Basic Setup can be viewed as a special case of TWINSSE Setup where all keywords are placed in the same bucket, i.e., $n_B = 1$ and n′ = N).

Algorithm 1 TWINSSE.Setup
return EDB, sk, st
5: Server receives EDB
6: Client keeps (sk, st, $n_B$)
Note that Algorithm 1 uses as a sub-routine Algorithm 2, which formally describes the meta-database generation based on the keyword bucketization and intra-bucket meta-keyword generation procedures described earlier. Overall, the working of Algorithm 1 can be divided into two steps: (a) generate the meta-database with the intra-bucket meta-keywords using Algorithm 2, and (b) generate the client state and the encrypted meta-database using CSSE.Setup in a black-box way (note that this second step is the same as in TWINSSE Basic ; the only alteration is in the generation of the meta-database, which now uses linearly many meta-keywords).
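The effect of bucketization on the number of meta-keywords can be checked with a short count. The sketch assumes that each bucket of size m contributes m(m+1)/2 meta-keywords plus its own all-ones meta-keyword, mirroring the per-dictionary definition of the basic scheme; the function names are hypothetical.

```python
def count_basic(n):
    """Meta-keywords over one n-keyword dictionary: all i <= j pairs plus mkw*."""
    return n * (n + 1) // 2 + 1

def count_bucketized(n, bucket_size):
    """Total meta-keywords when the dictionary is split into buckets."""
    full, rest = divmod(n, bucket_size)
    total = full * count_basic(bucket_size)
    if rest:  # last, smaller bucket
        total += count_basic(rest)
    return total

print(count_basic(12), count_bucketized(12, 4))
print(count_basic(1000), count_bucketized(1000, 10))
```

For a fixed bucket size the bucketized count grows linearly in N (56 meta-keywords per bucket of 10 under this counting), whereas the unbucketized count grows quadratically.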

Updated Query Planning
We now describe the updated query planning strategy that takes into account the above mentioned meta-keyword generation process. We use this updated query planning strategy to build TWINSSE.Search routine (the query planning for conjunctive queries remains same as in TWINSSE Basic ).
At a high level, we partition a disjunctive query into "regions", where each region consists of the keywords in the query that belong to the same bucket. Formally, given a query $q = (w_{\ell_1} \vee \ldots \vee w_{\ell_n})$, we define for each $k \in [n_B]$ the set $Q_k = \{w_{\ell_t} : t \in [n],\ w_{\ell_t} \in \Delta_k\}$. In other words, $Q_k$ consists of all the keywords in the disjunction q that belong to the k-th bucket. It is easy to see that we can re-write q as a disjunction over sub-queries as follows: $q = \bigvee_{k \in [n_B]} q_k$, where $q_k = \bigvee_{w \in Q_k} w$.
Note that expressing each meta-keyword as a bit-string allows an efficient transformation from the original disjunctive query into the corresponding conjunction of meta-keywords through simple bit-wise set/clear operations. This transformation incurs only negligible additional computational overhead in practice.
Figure 2: Query planning with intra-bucket meta-keywords (keywords ordered by increasing frequency). The superscript k in $mkw^{(k)}_{i,j}$ represents the bucket index the $mkw_{i,j}$ belongs to. Note that ANDing mkws within a bucket retains only the ws in q covered by that bucket.
Note that for some k, $Q_k$ could be an empty set; in this case, the sub-query $q_k$ is also empty. Based on the above representation, the query planning strategy in TWINSSE works as follows:
• Step-1: Partition q into sub-queries $\{q_k\}_{k \in [n_B]}$ as described above.
• Step-2: For each sub-query $q_k$, construct a conjunctive (sub-) meta-query $q_{mkw,k}$ as described in Section 3, using the intra-bucket meta-keywords corresponding to the k-th bucket, i.e., the intra-bucket meta-keywords in $S_{mkw,k}$.
• Step-3: Finally, we define the overall meta-query $q_{mkw}$ as $q_{mkw} = \bigvee_{k \in [n_B]} q_{mkw,k}$, and re-construct $DB(q_{mkw})$ as $DB(q_{mkw}) = \bigcup_{k \in [n_B]} DB(q_{mkw,k})$, where the recovery of each $DB(q_{mkw,k})$ happens via an independent (and parallel) execution of the same search protocol as in TWINSSE Basic .
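The three planning steps can be sketched as follows. This is an illustrative reading, not the paper's Algorithm 4: keywords are identified by their 1-based position in the frequency-sorted dictionary, each sub-meta-query is represented by the zero-stretches of its meta-keywords in bucket-local coordinates, and `gen_meta_query` is a hypothetical name.

```python
def gen_meta_query(query_pos, n, bucket_size):
    """Plan a disjunctive query as per-bucket conjunctive sub-meta-queries.

    query_pos: 1-based global dictionary indices of the queried keywords.
    Returns {bucket index k: list of local zero-stretches (i, j), or ["*"]}.
    """
    # Step 1: partition the query by bucket, using bucket-local positions.
    regions = {}
    for p in sorted(query_pos):
        k = (p - 1) // bucket_size + 1
        regions.setdefault(k, []).append((p - 1) % bucket_size + 1)
    # Step 2: build one conjunctive sub-meta-query per non-empty region.
    plan = {}
    for k, local in regions.items():
        size = min(bucket_size, n - (k - 1) * bucket_size)  # last bucket may be short
        bounds = [0] + local + [size + 1]
        stretches = [(lo + 1, hi - 1)
                     for lo, hi in zip(bounds, bounds[1:]) if hi - lo > 1]
        plan[k] = stretches or ["*"]
    # Step 3: the overall meta-query is the disjunction of the entries of `plan`.
    return plan

plan = gen_meta_query([1, 3, 6, 10, 12], n=12, bucket_size=4)
print(plan)
```

For the twelve-keyword example query with n′ = 4, each of the three buckets receives a sub-meta-query with two meta-keywords, and buckets that contain no queried keyword do not appear in the plan at all.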
The above query planning strategy is summarized pictorially in Figure 2 (note that the superscript $k$ in $mkw^{(k)}_{i,j}$ represents the bucket index for the meta-keyword $mkw_{i,j}$). In comparison with the example figure (Figure 1) in Section 3, we note that the meta-keywords are now chosen from a smaller set of size ≈ (4 × 12) = 48, as compared to a set of size ≈ 12^2 = 144 in Figure 1.

Algorithm 2 GenMetaDB
Input: DB, n′, n_B
Output: the meta-database
function GenMetaDB(DB, n′, n_B)
    Extract ∆ from DB and sort ws in ∆ in increasing order of frequency
    Partition the sorted ws into n_B bins of n′ ws each. The last bin may not contain n′ ws; keep only as many as are left.

TWINSSE.Search: We now put these ideas together to formally describe TWINSSE.Search in Algorithm 3, which in turn uses Algorithm 4 as a sub-routine (we only summarize the processing of disjunctive queries since conjunctive queries are processed as in TWINSSE Basic). The key changes from the basic scheme in Section 3 are highlighted in red for ease of exposition (again, TWINSSE Basic.Search can be viewed as a special case of TWINSSE.Search where all keywords are placed in the same bucket, i.e., n_B = 1 and n′ = N).

Algorithm 3 TWINSSE.Search
Client and server: for each non-empty q_k (in uniformly random order), the client and server engage in the search protocol as below.
    for each non-empty q_mkw,k (in random order) do
        DB(q_mkw,k) ← CSSE.Search(sk, st, q_mkw,k; EDB)
        At the end of the protocol, the client receives DB(q_mkw,k).
Client:
    Initialize DB(q_mkw) ← EMPTY-SET
    for each DB(q_mkw,k) from the search protocol do
        DB(q_mkw) ← DB(q_mkw) ∪ DB(q_mkw,k)
    The client locally filters DB(q) ⊆ DB(q_mkw).
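The control flow of the disjunctive search (random ordering of sub-meta-queries, union of the partial results, and client-side filtering) can be sketched as follows. This is a plaintext illustration only: `conj_search` is a hypothetical stand-in for CSSE.Search, `is_match` is a hypothetical stand-in for the client's local filtering step, and the sub-queries are run sequentially rather than in parallel.

```python
import random

def twinsse_search(sub_meta_queries, conj_search, is_match):
    """Toy orchestration of a disjunctive search over buckets.

    sub_meta_queries : dict mapping bucket index k to a conjunctive
                       sub-meta-query (a list of meta-keywords, possibly empty)
    conj_search      : callable returning the id set of a conjunctive query
                       (placeholder for CSSE.Search)
    is_match         : callable id -> bool used to drop spurious ids
                       (placeholder for client-side filtering)
    Returns (filtered result DB(q), unfiltered result DB(q_mkw)).
    """
    # Only non-empty sub-meta-queries are issued, in random order.
    order = [k for k, q in sub_meta_queries.items() if q]
    random.shuffle(order)

    retrieved = set()
    for k in order:
        retrieved |= conj_search(sub_meta_queries[k])

    filtered = {doc_id for doc_id in retrieved if is_match(doc_id)}
    return filtered, retrieved
```

A usage example with a toy meta-keyword index: with `index = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {7, 8}}` and `conj_search = lambda q: set.intersection(*(index[m] for m in q))`, the sub-meta-queries `{0: ["m1", "m2"], 1: [], 2: ["m3"]}` retrieve the ids 2, 3, 7 and 8 before filtering.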

Algorithm 4 GenMQuery
    Sort ws in q in increasing order of frequency
    Partition query q into a set of sub-queries P_q as q_1 || · · · || q_{n_B}, such that q_k contains ws only from ∆_k for k = 1, · · · , n_B
    for q_k ∈ P_q do
        construct q_mkw,k from q_k using the intra-bucket meta-keywords in S_mkw,k
    return q_mkw

Algorithm 4 formally captures the updated disjunctive query planning strategy based on query partitioning and intra-bucket meta-keywords, as described earlier. Note that in Algorithm 3, each conjunctive sub-meta-query q_mkw,k is executed in parallel using the search algorithm CSSE.Search of the underlying conjunctive SSE scheme CSSE, and the final result-set corresponding to the overall meta-query is constructed locally at the client by taking the union over the result-sets corresponding to each conjunctive sub-meta-query.
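The per-bucket step of the query planning can be sketched with bit-masks. The sketch assumes, as suggested by the proof in Appendix A, that each intra-bucket meta-keyword covers all keywords of the bucket except one contiguous stretch; a meta-keyword is encoded as an integer whose bit at position p is set when the p-th keyword of the bucket is covered. The function name `gen_sub_meta_query` is hypothetical.

```python
def gen_sub_meta_query(bucket, query_keywords):
    """Build the conjunctive sub-meta-query for one bucket.

    bucket         : list of keywords in the bucket (sorted by frequency)
    query_keywords : non-empty collection of query keywords from this bucket
    Returns a list of bit-masks whose bit-wise AND has ones exactly at the
    positions of the query keywords.
    """
    if not query_keywords:
        raise ValueError("empty sub-queries are skipped, not translated")
    m = len(bucket)
    full = (1 << m) - 1
    positions = sorted(bucket.index(w) for w in query_keywords)

    masks = []
    prev = -1
    # Walk over the gaps between consecutive query keywords (and the
    # gap after the last one); each non-empty gap yields one meta-keyword.
    for p in positions + [m]:
        if p - prev > 1:
            gap = ((1 << (p - prev - 1)) - 1) << (prev + 1)
            masks.append(full & ~gap)  # clear the bits of the gap
        prev = p
    # No gaps means every keyword of the bucket is queried: use all-ones.
    return masks or [full]
```

When every keyword of the bucket appears in the query there is no gap to clear, and the sketch falls back to the single all-ones mask, which plays the role of the per-bucket mkw* discussed later in this section.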
Correctness. We state the following theorem for the correctness of TWINSSE. The proof essentially follows from the same arguments as the proof of correctness for TWINSSE Basic in Section 3 and is presented in Appendix B.

Instantiation from the OXT Protocol and Complexity Analysis
In its most general form, our proposed TWINSSE scheme can be concretely instantiated using any conjunctive SSE scheme. In this section, we analyze a concrete instance of TWINSSE based on the OXT protocol [7], which we call TWINSSE OXT . We analyze TWINSSE OXT asymptotically in terms of storage requirements and search overheads. Our analysis does not require understanding the internal details of OXT beyond what is already stated in this section; the reader may refer to [7] for more details. Finally, we refer the reader to Section 5 for experimental validation of the analysis presented here over the Enron email corpus.
Storage Requirements (Server). The (worst-case) server-side storage requirement for TWINSSE OXT is O(n′|DB|), where |DB| is the number of distinct identifier-keyword pairs in DB, and n′ = O(1) denotes the (constant) size of each keyword bucket. Linearization through keyword bucketization thus incurs an O(n′)-fold increase in storage overhead over OXT. We view this as a necessary trade-off for the additional ability to support disjunctive queries efficiently yet securely. In comparison, the IEX family of schemes incurs (worst-case) quadratic storage overhead, more precisely O(|∆||DB|), where |∆| denotes the number of keywords in DB. However, n′ (or, equivalently, the number of buckets) needs to be chosen carefully to keep the storage overhead linear (which also keeps the leakage from multiple buckets at a minimum). A high value of n′ incurs a higher storage overhead but less leakage, since there are fewer buckets (as outlined in Section 4.1). Conversely, a small value of n′ results in more buckets, leading to less storage but more leakage. We selected n′ in the range 10-15 based on empirical evaluations over real data sets, which retains a linear storage overhead. These experimental results are provided in Section 5.
Storage Requirements (Client). The client-side storage requirement for TWINSSE OXT is O(n ′ |∆| log |DB|). This again represents an O(n ′ )-fold increase in storage overhead over the original OXT scheme (where n ′ = O(1) is a constant), which has a client-side storage requirement of O(|∆| log |DB|).
We also note here that TWINSSE OXT requires O(1) storage for the secret key(s) at the client-end (this is a purely client-side overhead, not associated with the server-side storage). In contrast, the IEX family of schemes require O(|∆|) secure storage for secret keys (one key per keyword due to individual multi-map structure required for each keyword index), which is likely to be costly for extremely large databases. We emphasise that this requirement is only for secure storage to store secret keys on the client-side. The client-side storage overhead mentioned in the previous paragraph accounts for storing only auxiliary information required during query processing, and this does not require secure storage.
Search Complexity. We now present an asymptotic analysis of the search complexity (more concretely, the computational and communication requirements during search query processing) of TWINSSE OXT. We divide our analysis into two parts: conjunctive queries and disjunctive queries.

Conjunctive queries. Let $q = (w_1 \wedge \ldots \wedge w_n)$ be a conjunctive query, where $w_1$ is the least frequent keyword. When processing $q$ using TWINSSE OXT, the computational costs (at both the client and the server) as well as the communication requirements between the client and the server scale linearly as $O(n|DB(w_1)|)$. This is exactly the same as in OXT, and is hence worst-case sub-linear.

Disjunctive queries. Let $q = (w_1 \vee \ldots \vee w_n)$ be a disjunctive query. Also, let $q_{mkw} = \bigvee_{k \in [n_B]} q_{mkw,k}$ be the corresponding meta-query, and assume without loss of generality that $mkw^{(k)}_{i_k,j_k}$ is the least frequent meta-keyword within $q_{mkw,k}$ for each $k \in [n_B]$ (such that $q_{mkw,k}$ is non-empty). When processing $q$ using TWINSSE OXT, the computational costs (at both the client and the server) as well as the communication requirements between the client and the server scale linearly as $O(\gamma)$, where

$$\gamma = \sum_{k \in [n_B]} |q_k| \cdot \left|DB\left(mkw^{(k)}_{i_k,j_k}\right)\right|,$$

and where $|q_k|$ denotes the number of meta-keywords in the conjunctive sub-meta-query $q_k$ ($|q_k| = 0$ when $q_k$ is empty). Note that this is essentially a generalization of the analysis of search query overheads for TWINSSE Basic in Section 3, where all keywords belong to the same bucket (i.e., $n_B = 1$). We provide a comparative summary of storage and search overhead for TWINSSE and IEX in Table 1 for quick reference.
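As a rough worked illustration of the disjunctive search cost, the sketch below evaluates a quantity of the form described above: for every non-empty sub-meta-query, the number of meta-keywords multiplied by the frequency of its least frequent meta-keyword, summed over buckets. This mirrors the O(n|DB(w_1)|) cost of a conjunctive OXT query; the function name `search_cost` and the toy frequencies are hypothetical.

```python
def search_cost(sub_meta_queries, freq):
    """Estimate gamma for a disjunctive query.

    sub_meta_queries : list of conjunctive sub-meta-queries, each a list of
                       meta-keywords (an empty list models an empty q_k)
    freq             : dict mapping meta-keyword -> |DB(meta-keyword)|
    """
    total = 0
    for mkws in sub_meta_queries:
        if not mkws:
            continue  # empty sub-queries contribute nothing
        least_frequent = min(freq[m] for m in mkws)
        total += len(mkws) * least_frequent
    return total
```

For example, with frequencies `{"m1": 5, "m2": 100, "m3": 40}` and sub-meta-queries `[["m1", "m2"], [], ["m3"]]`, the estimate is 2 · 5 + 1 · 40 = 50.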
Spurious document identifiers. It turns out that keyword bucketization also significantly reduces the search overheads (both computational and communication) due to spurious document identifiers in DB(q mkw ). In particular, recall our observation with respect to TWINSSE Basic from Section 3: the fraction of spurious identifiers retrieved is directly proportional to the average number of common documents over every keyword-pair in the database. However, in our improved solution, the database is partitioned into buckets, and all keywords within the same bucket have essentially similar frequency ranges. This means, in particular, that an overwhelmingly large fraction of buckets either contain all low-frequency keywords (in which case, the spurious document-set is essentially null, since such keywords almost never co-occur across documents [7]), or very high-frequency keywords (in which case, such keywords occur in almost all documents, and the proportion of spurious documents is low by default). We generalise the aforementioned observations into the following (informal) claim about the search complexity incurred by TWINSSE OXT , which is essentially an extension of our claim for TWINSSE Basic . We validate this claim with experimental results over the Enron email corpus in Section 5. We also extend the analysis and experimental evaluations to more general/complicated Boolean queries (CNF or DNF formulae) in Appendix G. Our experiments show that searches in TWINSSE OXT incur at most 15% overhead due to spurious identifiers in the result set. In order to filter out the set of spurious documents from the final result set, we can resort to the same strategies as used by state-of-the-art volume hiding SSE constructions (e.g. SSE schemes obtained naturally from the encrypted multi-map constructions proposed in [20,28], where the client obtains a mixture of real and "fake" identifiers at the end of the query phase to hide the true query response volume from the server).
Note that TWINSSE uses a separate mkw* for each bucket, where mkw*_i denotes the mkw* for the i-th bucket. Following the definition of mkw*, mkw*_i represents the disjunction of all ws in the i-th bucket. Using a separate mkw* per bucket allows TWINSSE to support non-SNF queries more efficiently than OXT. OXT uses a single mkw* to convert any non-SNF query into an SNF query, and in this process incurs a worst-case linear search overhead. In contrast, TWINSSE uses mkw*_i if and only if a query involves all keywords from the i-th bucket (in this case, it is the only optimal choice). Such queries can be considered corner cases that rarely occur in realistic searches. Consequently, the overhead that TWINSSE incurs due to spurious ids is typically only 15-20% on average for realistic non-SNF queries. Additionally, this process in TWINSSE OXT allows for parallel execution of independent sub-queries over different buckets. Adopting a similar approach with OXT for parallel execution would incur more leakage from each queried bucket, because the exact result set reveals the volume pattern. Since TWINSSE OXT produces a noisy result set due to the spurious ids, its volume pattern leakage is lower than that of OXT.

Security of TWINSSE
We present an informal discussion on the security of our construction here. Detailed formal security discussion and leakage analysis (including experimental evaluations) are available in Appendices C, D, E and F. The security of TWINSSE is modelled in the semi-honest adversarial setting, where the server is assumed to be an honest-but-curious entity (that is, the server follows the algorithmic specifications exactly, but can record information for later analysis).
Informally, TWINSSE inherits its security properties and leakage profile from the underlying CSSE construction. We assume that the underlying CSSE construction is an adaptively secure sublinear conjunctive SSE scheme that is secure against a semi-honest adversary A, and that the leakage of CSSE is characterised by the leakage function $\mathcal{L}_{CSSE}$. The leakage function $\mathcal{L}_{CSSE}$ is an ensemble of the leakage functions for Setup and Search individually, expressed in the following way.

$$\mathcal{L}_{CSSE} = \{\mathcal{L}^{Setup}_{CSSE}, \mathcal{L}^{Search}_{CSSE}\}$$

Given the above CSSE leakage functions, the security of TWINSSE can be analysed using the TWINSSE leakage function $\mathcal{L}_{TWINSSE}$ in the same adaptive semi-honest adversarial model. Similar to $\mathcal{L}_{CSSE}$, $\mathcal{L}_{TWINSSE}$ is composed of two separate leakage functions for Setup and Search, as expressed below, that capture the leakage from TWINSSE execution in the meta-keyword setting.

$$\mathcal{L}_{TWINSSE} = \{\mathcal{L}^{Setup}_{TWINSSE}, \mathcal{L}^{Search}_{TWINSSE}\}$$

The leakage $\mathcal{L}_{TWINSSE}$ matches that of CSSE over meta-keywords, with $n_B$ (the number of buckets) as an additional benign component. In other words, we show that $\mathcal{L}_{TWINSSE}$ is equal to $\mathcal{L}_{CSSE}$ evaluated in the context of meta-keywords and $n_B$. At a high level, $\mathcal{L}^{Setup}_{TWINSSE}$ takes as input the meta-database generated by GenMetaDB during setup instead of the plaintext database DB. Similarly, the search leakage encapsulates leakages from both conjunctive and disjunctive queries. We quantify these separately through two individual leakage function instances: one for conjunctive queries, and one for disjunctive queries over meta-keywords, where the meta-keywords are generated using the GenMQuery routine. We show that the leakage for the conjunctive case is exactly the same as that of the CSSE construction, and that for disjunctive queries it has a similar leakage profile, but over meta-keywords. We provide a detailed formal analysis of $\mathcal{L}_{TWINSSE}$ in Appendix D.
Security of TWINSSE OXT. The security analysis of TWINSSE OXT follows from the security notions of generic TWINSSE, as informally discussed above (formally in Appendix D and Appendix E). Due to lack of space, we move this discussion (including proofs) to Appendix D. We also present a leakage-based cryptanalysis of the TWINSSE OXT scheme via experiments over the Enron email corpus in Appendix F.
Comparison with IEX. At a high level, TWINSSE OXT avoids two kinds of leakage that the IEX family of schemes incurs for any query. To begin with, IEX leaks to the server the exact size of the result set pertaining to a query (also referred to as the size pattern leakage). As already mentioned, due to the presence of spurious document identifiers in the result set, TWINSSE OXT inherently hides the size pattern from the server. More crucially, IEX incurs significant sub-query leakage. For example, given a disjunctive query of the form q = (w_1 ∨ w_2), where w_1 is the more frequent keyword, it leaks to the server: (a) the frequency of the more frequent keyword, i.e., |DB(w_1)|, and (b) the number of documents that contain w_2 but not w_1, i.e., |DB(w_2) \ DB(w_1)|. In contrast, TWINSSE OXT leaks only the frequency of the least frequent meta-keyword (in this example, the meta-keyword corresponding to w_1), and no information about the other meta-keywords in the conjunction (in this example, no information about w_2). In other words, TWINSSE OXT incurs less leakage than the IEX family of schemes during search queries.
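To make the two sub-query leakage quantities in the IEX example concrete, the following sketch computes them over a toy plaintext index. It only evaluates the set sizes named in the text; the function name `iex_disjunction_leakage` and the toy data are hypothetical, and nothing here models the actual IEX protocol.

```python
def iex_disjunction_leakage(db, w1, w2):
    """Return (|DB(w_more)|, |DB(w_less) minus DB(w_more)|) for q = w1 OR w2.

    db : dict mapping keyword -> set of document ids
    The more frequent of the two keywords plays the role of w_1 in the text.
    """
    more, less = (w1, w2) if len(db[w1]) >= len(db[w2]) else (w2, w1)
    return len(db[more]), len(db[less] - db[more])
```

For a toy index with DB(x) = {1, 2, 3, 4} and DB(y) = {3, 4, 5}, the two quantities are 4 and 1.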

EXPERIMENTAL RESULTS
We describe a prototype implementation of TWINSSE OXT and evaluate its performance over real-world databases. We present experimental results comparing the storage requirements and search performance of TWINSSE OXT with that of IEX-2Lev [19].
Data Set and Platform. We used the Enron email data set for our experiments. The Enron email data set contained 517,401 documents (emails) and 20 million keyword-document pairs, with a total size of 1.9 GB. The complete TWINSSE OXT implementation was done in C++ (compiled with GCC 9) with native multi-threading support, and we used Redis as the database backend. We ran the experiments on a single node with an Intel Xeon E5-2690 v4 2.6 GHz CPU, 128 GB RAM and 512 GB SSD storage.

Implementation Details. We created the meta-keyword database (or the transformed database) MDB from the parsed Enron database DB. The plain Enron database DB contains w-s and id-s in inverted index form. The transformed database MDB also contains the mkws and the associated id-s in inverted index layout. Since there are a large number of w-s in DB, the length of each binary string mkw is large. Hence, we hash those strings prior to writing to MDB. This MDB is further encrypted using the underlying OXT setup to generate the encrypted meta-keyword database EDB.
We report the actual size of EDB, which is offloaded to the server, in Figure 3. The query translation process first generates these mkws in binary string format, and we hash those prior to searching over the encrypted meta-keyword database EDB.

Figure 4: Comparison of end-to-end search latency vs frequency of the variable term (|v|). Observe that, for conjunctive queries, TWINSSE OXT closely follows OXT in practice, and the latency is significantly less than that of IEX-2Lev. We note here that a fundamental difference between IEX-2Lev and TWINSSE OXT is that the search cost of IEX-2Lev scales (by design) with the frequency of the most frequent conjunct, while that of TWINSSE OXT scales with the frequency of the least frequent conjunct; this is the main reason why TWINSSE OXT outperforms IEX-2Lev by a significant margin for conjunctive queries.
Evaluation of Storage Overhead. One of the fundamental aspects of our implementation is that TWINSSE OXT improves upon the quadratic storage overhead of IEX-2Lev and scales linearly with the size of the plaintext database. IEX-2Lev exploits the small size of the mutual intersections for all pairs of keywords in DB, and its storage overhead scales with the size of these intersections. In a sparse data set, the size of these intersections for most pairs of keywords is very low. However, if the database is not sparse, this results in large intersections for pairs of w-s and the overhead becomes truly quadratic for IEX-2Lev. The storage overhead remains linear for TWINSSE OXT, whereas that of IEX-2Lev becomes quadratic, leading to storage blow-up. The storage overhead of IEX-2Lev is 60× more than that of TWINSSE OXT. Despite the additional storage required for the meta-keywords, TWINSSE OXT has better storage overhead than IEX-2Lev in the worst-case distribution of DB.
Effect of linearization. As discussed in Section 4.3, the choice of n ′ greatly influences the storage overhead. Since the distribution of ws and ids varies across different databases (for example, a medical database's distribution differs from a tax record database), it is quite challenging (and inefficient) to obtain an analytic expression for n ′ that works for multiple databases. We rely on an empirically chosen value of n ′ that suitably works for different databases without blowing up the storage. We present experimental results in Figure 5 to illustrate the effect of varying n ′ . We fix n ′ at 10 for our final experiments from this evaluation.
Evaluation of End-to-End Search Latency. Figures 4 and 6 compare the end-to-end search latency of TWINSSE OXT with that of IEX-2Lev for conjunctive and disjunctive queries, respectively.
Conjunctive queries. The search performance of TWINSSE OXT is inherited from OXT and is therefore identical to that of OXT, as shown in Figure 4. To validate this, we consider a two-keyword query of the form q = a ∧ v, where a and v are two keywords from DB. Without loss of generality, we consider the first term of q (here, a) to be the least frequent keyword. We vary the frequency of v (referred to as the variable term) across different queries, whereas the frequency of a is kept constant (the constant term). The plot shows constant time overhead for conjunctive queries of this form with TWINSSE OXT, which is identical to OXT. The same figure also plots the IEX-2Lev conjunctive search time, which shows that TWINSSE OXT is around 10× faster on average.
Disjunctive queries. The plots in Figure 6 compare the end-to-end query time with the final result size for disjunctive queries of different Hamming weights. Observe that the query time increases with the number of id-s in the retrieved result set (inclusive of the spurious id-s for a disjunctive query), due to the increased frequency of the least-frequent mkws in these queries.
As discussed earlier, the frequency of the least-frequent mkw is independent of the frequency of the least-frequent w in a query. Hence, we plot against the overall result size, which represents the computation overhead. In disjunctive queries, the union of the id-s grows with the number of keywords present in the query. Therefore, plotting against the result size provides an accurate measure of the computation cost for disjunctive queries. Nonetheless, the OXT sublinear search complexity is maintained, which we verified in our experiments.
The average query time increases with the number of ws in the actual disjunctive query q. This increase is attributable to the larger number of mkws for each query, and to the underlying OXT, which scales linearly with the number of keywords (in this case mkws) in the conjunctive query. The end-to-end disjunctive search latency for TWINSSE OXT is a few hundred milliseconds over the Enron database for queries with moderate result size.
We provide an end-to-end query performance comparison of TWINSSE OXT with IEX-2Lev in Figure 7. For queries with smaller result sizes, TWINSSE OXT achieves almost identical end-to-end query latency as IEX-2Lev. For queries with larger result sizes, IEX-2Lev performs slightly better. This is primarily because of the usage of relatively costly elliptic-curve cryptography-based operations in TWINSSE OXT (a consequence of using OXT as a black-box, which uses such operations); IEX-2Lev, on the other hand, uses purely symmetric-key crypto-primitives. We view this as an efficiency trade-off; note that TWINSSE OXT outperforms IEX-2Lev significantly both in terms of storage requirements (as demonstrated in Figure 3) and end-to-end latency for conjunctive queries (as demonstrated in Figure 4 ). Hence, from the point of view of practical performance across a wide class of Boolean queries generally encountered in practice and scalability to extremely large databases, TWINSSE OXT outperforms IEX-2Lev.
Experiments on the Wiki database. We also present experimental results for performance and storage overhead evaluation over the Wikimedia dataset in Appendix H. We observed similar results with the Wikimedia dataset as with the Enron dataset presented above.
Evaluation of Result Precision. In the context of information retrieval, precision (denoted by η in Section 3.3) is the fraction of relevant documents among the retrieved documents. Figure 8 compares the average precision of the retrieved result set (inclusive of spurious id-s) for disjunctive queries q with different numbers of keywords. Observe that the average precision in most cases is above 85%, which implies that at least 85% of the documents returned are relevant to the disjunctive query q (that is, they belong to the actual result set R_q without spurious id-s). The plot also illustrates that scaling the database does not affect the average precision of the retrieved documents. Hence, the quality of the query result of TWINSSE OXT does not degrade even for huge databases, which is crucial for practical applications.
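The precision measure used here can be computed as in the following sketch, which takes the exact result set and the retrieved result set (inclusive of spurious ids) as plain Python sets. The function name `precision` is hypothetical, and returning 1.0 for an empty retrieved set is a convention chosen for this sketch rather than something stated in the text.

```python
def precision(true_result, retrieved):
    """Fraction of retrieved document ids that belong to the true result.

    true_result : set of ids that actually match the query
    retrieved   : set of ids returned by the search (may contain spurious ids)
    """
    if not retrieved:
        return 1.0  # assumed convention: nothing retrieved, nothing spurious
    return len(true_result & retrieved) / len(retrieved)
```

For instance, if the true result is {1, 2, 3} and the search returns {1, 2, 3, 4}, three of the four retrieved ids are relevant and the precision is 0.75.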
Note that IEX is an exact solution with 100% result precision: it returns the exact result set without spurious ids. However, IEX incurs extremely high storage overhead, which makes it impossible to deploy with large real datasets. In contrast, TWINSSE has less than 100% result precision (85%-90%, as shown in Figure 8), but compensates for the loss in precision with storage savings (10×-50× less than IEX, as shown in Figure 3).
Complex Boolean Queries. We defer the elaborate discussion on processing complex Boolean queries (as CNF and DNF formulae) using TWINSSE OXT to Appendix G due to lack of space. We present here the corresponding experiments to evaluate the performance of our prototype implementation of TWINSSE OXT when processing such queries over the Enron dataset (on the same computing platform as discussed in Section 5).
DNF queries. We considered multiple queries with two clauses and three clauses, with each clause having two keywords. The end-to-end query time is plotted in Figure 15, where the blue curve represents the query time for two-clause queries and the red curve represents the query time for three-clause queries. Observe that the query time for both two-clause and three-clause queries increases with the number of id-s in the final result set. This increase can be attributed to the large result size of the individual conjunctive clauses. Also note that the query time is higher for three-clause queries due to the additional conjunctive clauses, and follows the same trend of increasing with the final result size.

Figure 9: TWINSSE OXT performance with result set size on Enron data set for DNF queries (x-axis: size of final result set (|R_q|); y-axis: end-to-end query time (s)).

CNF queries. For experimenting with CNF queries, we considered two-clause queries with two and three keywords per clause. Since the Enron data set is relatively sparse in nature, it often results in a small or empty intersection when the query has a higher number of clauses. We plotted the end-to-end query time against the size of the final result set in Figure 16 for both cases: two-keyword clauses and three-keyword clauses. The blue curve represents the end-to-end query time for the queries with two keywords per clause. Similarly, the red curve represents the end-to-end query time for queries with three-keyword clauses. Observe that, for CNF queries as well, the end-to-end query time increases with the final result size, due to the increased size of the initial result set retrieved. For the three-keyword clauses, the query time is higher than for the two-keyword clauses due to the larger size of the initial result set retrieved for the disjunctive clauses.

Normally, CNF and DNF expressions in generic form involve the negation of terms. TWINSSE avoids negations for plain disjunctive and DNF queries. This is purposefully done to avoid quadratic storage due to negated keywords, as discussed in Section 1.2. However, TWINSSE OXT supports negations in CNF by utilising the XSet data structure as a part of the underlying OXT construction.
In particular, TWINSSE OXT can process keyword negation for any clause in a CNF query, except for the least-frequent one (which is used to retrieve encrypted ids), by evaluating the formula directly over the XSet structure. Since the retrieval is done using the least frequent clause, TWINSSE OXT avoids the inherent linear search overhead incurred by this kind of queries in OXT.
The authors of OXT [7] described a way to process CNF/DNF queries using a special keyword (similar to mkw* in TWINSSE) as the s-term (special term, or the keyword with minimum frequency in the query). This process essentially retrieves all entries from the database due to this special keyword and filters according to the query. Naturally, this approach has linear complexity, which leads to poor performance and high leakage (a very high number of ids are retrieved that do not belong to the final result). In contrast, TWINSSE OXT still offers a sublinear search for complex queries. The all-1 mkw* is defined per bucket in the TWINSSE OXT scheme, which avoids such costly retrieval as only specific buckets are accessed and, consequently, leaks much less compared to OXT.

Figure 10: TWINSSE OXT performance with result set size on Enron data set for CNF queries (y-axis: end-to-end query time (s)).
Storage Overhead for Dense Databases. Finally, we compare the server-side storage overheads of TWINSSE OXT and IEX-2Lev for a special class of "dense databases" where for any pair of keywords, the number of documents containing both keywords is large (see Appendix I for an example of dense database and elaborate discussion). We use a synthetic dense database that follows Zipf's law [24]. Our experimental results in Appendix I ( Figure 20) show that IEX-2Lev incurs 70× higher storage overhead than TWINSSE OXT for this dense database, clearly indicating that TWINSSE OXT offers significantly greater scalability for such databases in practice.

SUPPORTING DYNAMIC DATABASES
In this paper, we described TWINSSE OXT for static databases. This leaves open the question of extending TWINSSE OXT to dynamic databases, and supporting updates efficiently yet securely over these. We note here that for dynamic databases where the set of keywords across all documents remains fixed (or, more generally, undergoes updates infrequently), the set of meta-keywords also does not change (frequently) over time. In this setting, it is possible to extend TWINSSE OXT to dynamic databases by simply substituting the underlying OXT scheme with a dynamic conjunctive SSE scheme with desirable efficiency and security guarantees (e.g., ODXT from [29]). However, such an extension becomes challenging for dynamic databases where the set of keywords (and hence, the set of meta-keywords) also gets updated frequently. We leave it as an interesting future question to extend TWINSSE OXT to such dynamic databases.

Research in Sciences), India and Centre on Hardware Security Entrepreneurship Research & Development, Meity India for partially supporting the work.
A PROOF OF LEMMA 3.1
Proof of Lemma 3.1. We show in the formal proof of Lemma 3.1 that each mkw constructed following the description of Lemma 3.1 covers each w in q (the DB(q) part). Other ws that are not in q are filtered out (due to the intersection in Lemma 3.1).
We start with the following conjunctive meta-keyword expression of q mkw for a particular query q as given in the Lemma 3.1.
By the definition of meta-keywords (Definition 3.1), the following relation holds.

DB(w l )
We rewrite Equation (1) in the following way.
Observe that the union inside the right-hand side of the above expression keeps all ws except a stretch of ws (from index i_k to j_k) for each value of k, inside the outer intersection of n + 1 terms. Since the intersection of these unions reduces to a small but finite set of id-s, the following relation holds,

The proof of correctness for TWINSSE Basic (and TWINSSE as well) follows from the correctness of CSSE. The correctness of CSSE ensures that a conjunctive query q = w_1 ∧ · · · ∧ w_n over an encrypted database satisfies the following relations.

$$EDB = CSSE.Setup(DB)$$

$$DB(w_1) \cap \cdots \cap DB(w_n) = CSSE.Search(q, EDB)$$

We state the proof for TWINSSE Basic first. Then we show that it extends directly to the main TWINSSE scheme (the final bucketized version).
Proof for TWINSSE Basic. The proof for TWINSSE Basic follows directly from the proof of Lemma 3.1. Consider a disjunctive query q as stated below.

$$q = w_1 \vee \cdots \vee w_n$$

The equivalent conjunctive expression of meta-keywords can be expressed as below.
We write the following relation from Lemma 3.1.

DB(w l )
It is easy to see from the above equation that DB(q) ⊆ DB(q_mkw). Hence, all ids of the actual result set of the disjunctive query q are included in the result set obtained from the query using TWINSSE Basic.Search, which proves the correctness of TWINSSE Basic.
Proof of TWINSSE. Recall from Section 4, that in TWINSSE all ws from ∆ are partitioned into n B buckets of uniform size, and we execute the basic meta-keyword generation method developed in TWINSSE Basic over each partition independently. Only those partitions with query meta-keywords are accessed during search.
Assume that the dictionary of ws, ∆, is partitioned in the following way, where n_B is the number of buckets and each bucket ∆_u can be expressed in the following way.
The number of ws in each bucket is denoted by n′. The set of mkws in each bucket ∆_k is represented by S_mkw,k. TWINSSE Basic is executed over each of these buckets individually to generate the encrypted database.
The query expression follows from the TWINSSE construction with the above structure (discussed in Section 4).
DB(q mkw,u ) and the actual query q can be partitioned in the following way.
We expand the above expression to individual buckets.

C ADAPTIVE SECURITY OF SSE
The adaptive security of any SSE scheme is parameterized by a leakage function $\mathcal{L} = (\mathcal{L}^{Setup}, \mathcal{L}^{Search})$, where $\mathcal{L}^{Setup}$ encapsulates the leakage to an adversarial server during the setup phase, and $\mathcal{L}^{Search}$ encapsulates the leakage to an adversarial server during each execution of the search protocol.

Informally, an SSE scheme is adaptively secure with respect to a leakage function $\mathcal{L}$ if the adversarial server provably learns no more information about DB other than that encapsulated by $\mathcal{L}$. Formally, an SSE scheme is said to be adaptively secure with respect to a leakage function $\mathcal{L}$ if for any stateful PPT adversary Adv that issues a maximum of Q = poly(λ) queries, there exists a stateful probabilistic polynomial-time simulator Sim = (SimSetup, SimSearch) such that the following holds:

$$\left| \Pr\left[\mathbf{Real}^{SSE}_{Adv}(\lambda, Q) = 1\right] - \Pr\left[\mathbf{Ideal}^{SSE}_{Adv,Sim}(\lambda, Q) = 1\right] \right| \leq \mathsf{negl}(\lambda),$$

where the "real" experiment Real SSE and the "ideal" experiment Ideal SSE are as described in Algorithm 5 and Algorithm 6.

Algorithms 5 and 6 (experiments Real SSE and Ideal SSE; final steps)
    Let τ_k denote the view of the adversary after the k-th query
    b ← Adv(λ, EDB_Q, τ_1, . . . , τ_Q)
    return b

D DETAILED ANALYSIS AND DISCUSSION ON THE LEAKAGE OF TWINSSE OXT
In Section 4.4, we informally described the leakage profile for TWINSSE built in a black-box way from any generic conjunctive SSE scheme CSSE. In this section, we formally detail the leakage profile for the specific instantiation of TWINSSE based on the OXT scheme, namely TWINSSE OXT . We then present a discussion on the leakage profiles for both TWINSSE and TWINSSE OXT .

D.1 Security of TWINSSE
We present a formal description of the security guarantees of our generic construction TWINSSE. Concretely, we state the following theorem.
Theorem D.1 (Security of TWINSSE). Assuming that CSSE is an (adaptively) secure SSE scheme with respect to the leakage function L CSSE = (L Setup CSSE , L Search CSSE ), TWINSSE is an (adaptively) secure SSE scheme with respect to the leakage function L TWINSSE = (L Setup TWINSSE , L Search TWINSSE ), where for any plaintext database DB, any search query q, and any pair of bucketization parameters (n ′ , n B ), we have where DB = GenMetaDB(DB, n ′ , n B ), and q mkw,k = GenMQuery(q, n ′ , n B ).
Proof. We defer the formal proof of this theorem to Appendix E.

D.2 Leakage Profile of TWINSSE OXT
In this section, we describe the leakage profile of TWINSSE OXT . We begin by recalling from [7] the leakage profile of the original OXT scheme. We then build upon it to describe the leakage profile of TWINSSE OXT , which is actually very similar in spirit to the leakage profile of OXT.
Setup Leakage. The setup leakage in the OXT scheme consists of the size of the database DB, which is nothing but the total number of keyword-document pairs in the database DB, formally defined as |DB| = Σ i∈[N] |DB(w i )|, where ∆ = {w 1 , . . . , w N } is the dictionary over which the database DB is defined.
Search Leakages. Next, we summarize the leakages incurred by OXT during conjunctive keyword search queries.
Result Pattern Leakage. The server learns the final set of document identifiers matching the query. Formally, for a conjunctive query q = (w 1 ∧ . . . ∧ w n ), the result pattern leakage RP is defined as RP(q) = DB(w 1 ) ∩ . . . ∩ DB(w n ).
Size Pattern Leakage. The server learns the frequency of the s-term (where s-term refers to the least frequent keyword in the conjunction). Formally, for a conjunctive query q = (w 1 ∧ . . . ∧ w n ), where w 1 is the least frequent keyword in the conjunction, the size pattern SP is defined as SP(q) = |DB(w 1 )|.
Equality Pattern Leakage. The server learns if two (or more) conjunctive queries have the same s-term. Formally, for a sequence of conjunctive queries (q 1 , . . . , q M ), where for each i ∈ [M], we have q i = (w i,1 ∧ . . . ∧ w i,n i ), where w i,1 is the least frequent keyword in the conjunction, the equality pattern leakage EP is defined as an M × M matrix where for each i, j ∈ [M], we have EP[i, j] = 1 if w i,1 = w j,1 , and EP[i, j] = 0 otherwise.
Conditional Intersection Pattern Leakage. The server learns if two (or more) conjunctive queries have one or more x-terms in common (where x-term refers to any keyword other than the least frequent keyword in the conjunction); more concretely, if two (or more) conjunctive queries have one or more x-terms in common, then the server learns the intersection of the sets of documents containing the corresponding s-terms. Formally, for a sequence of conjunctive queries (q 1 , . . . , q M ), where for each i ∈ [M], we have q i = (w i,1 ∧ . . . ∧ w i,n i ), where w i,1 is the least frequent keyword in the conjunction, the conditional intersection pattern leakage IP is defined as an M × M matrix of lists, where for each i, j ∈ [M], we have IP[i, j] = DB(w i,1 ) ∩ DB(w j,1 ) if q i and q j have at least one x-term in common, and IP[i, j] = ∅ otherwise.
Security of TWINSSE OXT . We now formalize the security of TWINSSE OXT in terms of the leakage profiles described above. We do this using a formal theorem, which may be viewed as a specialization of Theorem D.1 to a specific instantiation of TWINSSE based on OXT.
Once again, this theorem is based on the (adaptive) simulation-security definition of SSE in the real world-ideal world paradigm, described in Appendix C. q mkw,k ,ℓ = GenMQuery(q ℓ,1 , n ′ , n B ).
Proof. We defer the formal proof of this theorem to Appendix E.
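To make the OXT leakage components concrete, the following sketch evaluates the setup leakage and the RP, SP, EP and IP leakages on a small plaintext database. It is a simplified reading of the prose definitions in this subsection rather than a restatement of the formal definitions in [7]: the function names are hypothetical, frequency ties are broken alphabetically, and the diagonal of IP is left empty by assumption.

```python
def db_of(db, w):
    """Set of document ids containing keyword w."""
    return {doc for doc, kws in db.items() if w in kws}


def setup_leakage(db):
    """Total number of keyword-document pairs."""
    return sum(len(kws) for kws in db.values())


def order_by_frequency(db, query):
    """s-term (least frequent keyword) first, then the x-terms."""
    return sorted(query, key=lambda w: (len(db_of(db, w)), w))


def result_pattern(db, query):
    return set.intersection(*(db_of(db, w) for w in query))


def size_pattern(db, query):
    return len(db_of(db, order_by_frequency(db, query)[0]))


def equality_pattern(db, queries):
    s_terms = [order_by_frequency(db, q)[0] for q in queries]
    return [[int(a == b) for b in s_terms] for a in s_terms]


def intersection_pattern(db, queries):
    ordered = [order_by_frequency(db, q) for q in queries]
    m = len(queries)
    ip = [[set() for _ in range(m)] for _ in range(m)]
    for i in range(m):
        for j in range(m):
            shared_x_terms = set(ordered[i][1:]) & set(ordered[j][1:])
            if i != j and shared_x_terms:
                ip[i][j] = db_of(db, ordered[i][0]) & db_of(db, ordered[j][0])
    return ip


db = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c", "d"}, 4: {"b", "d"}}
queries = [["a", "c"], ["a", "d"]]   # s-terms are c and d; shared x-term is a
rp = result_pattern(db, queries[0])
sp = size_pattern(db, queries[0])
ep = equality_pattern(db, queries)
ip = intersection_pattern(db, queries)
```

In this toy example the two queries have different s-terms (so EP is the identity matrix) but share the x-term a, so the server would learn DB(c) ∩ DB(d).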

D.3 Discussion on the Leakage Profile of TWINSSE OXT
In this subsection, we present a more in-depth analysis of the leakage profile for TWINSSE OXT during conjunctive and disjunctive search queries, and its implications.
Output Leakage. We begin by noting that the output leakage (alternatively, the result pattern leakage) is incurred by nearly all existing SSE schemes, including static and dynamic schemes, in the setting of both single and conjunctive keyword searches (such as in [4,7,8,11,23,31]). This is usually considered acceptable in the SSE literature; indeed the few known data/query recovery attacks that manage to exploit this leakage ([2,5,18,32]) assume extremely strong adversarial models where the adversary has partial knowledge of the plaintext database/queries.
s-Term Leakages. We focus next on the leakages related to the s-term, namely the size and equality pattern leakages. We begin by noting that these leakages are somewhat inherent to the design paradigm of OXT, which we base our instantiation of TWINSSE on. Even in the simpler setting of single keyword search, most existing schemes [3,4,6,8,10,11,31] also incur size and equality pattern leakages; the only constructions not to incur such leakages seem to rely on the use of ORAM-style data structures [4,8]. Fortifying TWINSSE OXT with such data structures in an attempt to prevent this leakage is an interesting open challenge, although this would probably have to trade-off with some degradation in search performance (mostly in terms of communication complexity and number of rounds of communication during searches). It is also possible (and perhaps conceptually simpler) to mask this leakage by using volume-hiding techniques such as padding and encrypted multi-maps (EMMs) [1,11,20,27,28]. This would incur a degradation in search performance, and it is up to the designer to decide on a suitable trade-off between performance and leakage.
However, we would like to point out that there are no known data/query recovery attacks on SSE schemes that specifically exploit leakages related to the s-term. Hence, even without the aforementioned fortifications, our TWINSSE OXT scheme does not appear to be vulnerable to any known attacks due to the leakages related to the s-term, in realistic/practical adversarial settings.
x-Term Leakages. Next, we focus on the x-term leakages. We again note that these leakages are somewhat inherent to the design paradigm of OXT, which we base our instantiation of TWINSSE on. The only known attack on conjunctive SSE schemes that exploits a form of x-term leakages is the file injection attack proposed by Zhang et al. in [32]. More concretely, the adversarial server must be able to compute |DB(w 1 ) ∩ DB(w i )| when processing the search query.
We note however that for file injection attacks to work efficiently, the adversarial server must recover, for every x-term w i , the result size corresponding to each sub-query of the form w 1 ∩ w i . However, the x-term leakage profile of TWINSSE OXT is not sufficient to compute this term, since the set of xtoken values sent to the server is randomly permuted inside the underlying OXT instantiation precisely to mask such inference-style attacks. Further, one could also instantiate our generic construction of TWINSSE from other conjunctive SSE schemes such as HXT [23] that improve upon OXT in terms of provable security against leakage-based cryptanalysis and file-injection attacks.
Finally, fortifying implementations of TWINSSE OXT by using either ORAM-style data structures or adopting volume-hiding techniques such as padding/EMMs may be useful in masking this leakage even further; however, even without such additional fortifications, it appears that our TWINSSE OXT scheme is not vulnerable to file injection attacks, or any other known attacks for that matter, due to the leakages related to the x-term, in realistic/practical adversarial settings.
Leakage Cryptanalysis. Looking ahead, in Appendix F, we present a leakage-based cryptanalysis of the TWINSSE OXT scheme via experiments over the Enron email corpus. Our experiments help substantiate that the leakages incurred by the disjunctive search protocol in TWINSSE OXT are reasonably benign in practice and are quite resistant to even the most powerful leakage-based cryptanalysis techniques in the SSE literature over real-world databases, such as those in [5,32]. We leave it as an open question to extend the analysis using the more advanced leakage cryptanalysis techniques, such as those proposed in [2,25].

E SECURITY PROOFS FOR TWINSSE AND TWINSSE OXT
In this section, we formally prove the security of TWINSSE and TWINSSE OXT with respect to the generic and specific leakage profiles described in Theorems D.1 and D.2, respectively.

E.1 Proof of Theorem D.1 (Security Analysis of TWINSSE)
We provide a simulation-based proof for TWINSSE. We assume that the underlying adaptively secure CSSE has the following leakage profile.
We express the leakage of TWINSSE as,
We show that TWINSSE is secure against an adaptive semi-honest adversary A, which has access to leakages from TWINSSE. We build a simulator SIM to simulate the EDB generation by TWINSSE.Setup, and the transcripts for queries over EDB. The simulator simulates the transcript τ i for each query q i . The simulator receives as input only the output of the leakage function L TWINSSE , namely the setup leakage L Setup TWINSSE and the search leakage L Search TWINSSE .
Simulating TWINSSE.Setup: The following public parameters are available to SIM CSSE as a part of L Setup TWINSSE .
{DB, n ′ , n B }
The simulator outputs its version of EDB according to the simulation process of CSSE (we assumed that CSSE is provably simulation secure).
Since CSSE is proven simulation secure, it follows from the simulation security guarantee of CSSE that ct EDB is indistinguishable from the one generated in the real experiment.
Simulating TWINSSE.Search: For conjunctive queries, the adversary does not gain any advantage from L Search TWINSSE compared to L Search CSSE , since the search leakage is exactly the same as that of CSSE. For disjunctive queries, we consider the effect of querying using q mkw .
For disjunctive queries, we argue that the adversary A does not gain any information about the original disjunctive query with this simulation experiment. The distribution of DB (hence, also for EDB) is abstracted from DB by the meta-keywords. The search leakage of CSSE is characterised by L CSSE , provided by the CSSE construction. Since CSSE in TWINSSE executes over meta-keywords only, this leakage is expressed in the context of meta-keywords as below.
L ′ CSSE = L CSSE (meta-keywords)
With this leakage information of CSSE, the search leakage of TWINSSE can be expressed as below.
, n ′ }
The parameters n B and n ′ are derived from N (number of keywords), which is available during setup. Therefore, the search leakage of TWINSSE is the same as that of the underlying CSSE, which can be summarised as below.
Since search in TWINSSE and in CSSE has the same leakage profile in the context of meta-keywords, no additional information is leaked beyond the CSSE leakage.

E.2 Proof of Theorem D.2 (Security Analysis of TWINSSE OXT )
We resort to a simulation-based security analysis for TWINSSE OXT . We assume a semi-honest adversary A which has access to the standard SSE leakages in an adaptive model. The security analysis of TWINSSE relies upon the semantic security notions provided by CSSE. TWINSSE inherits these notions through the core conjunctive SSE instance (OXT, in the case of TWINSSE OXT ). We assume that OXT achieves the following properties with efficient performance.
(1) The primitives used in the construction of OXT satisfy the standard security assumptions.
(2) OXT is non-adaptively and adaptively secure under the above assumptions.
We consider the following leakage profile for OXT.
Here, L Setup OXT captures the leakage from OXT.Setup, and L Search OXT encapsulates the leakage from OXT.Search. More precisely, these can be expressed as,
We define the leakage profile of TWINSSE OXT with respect to the above definitions and assumptions as below.
Here, { RP, SP, EP, IP} are the {RP, SP, EP, IP} leakages in the context of meta-keywords. For conjunctive queries, it is exactly the same as OXT.
Since OXT is simulation secure against these leakages, simulation security of TWINSSE OXT for conjunctive queries is straightforwardly implied by that of OXT.
For disjunctive queries, the query transformation process is carried out locally by the client, and the actual search is completed using the OXT.Search protocol. Hence, we can write the TWINSSE OXT .Search leakage as
We build a simulator SIM to simulate the EDB generation by TWINSSE OXT from DB, and the transcripts for query search over EDB. The simulator simulates the transcript τ i for each query q i ∈ Q. The simulator receives as input only the output of the leakage function L TWINSSE OXT , with the setup leakage L Setup

{|EDB|, | ∆|}
The simulator outputs its version of EDB according to the simulation process of OXT (we assumed that OXT is provably simulation secure).
ct EDB = SIM OXT .Setup(|MDB|, | ∆|)
It follows from the simulation security guarantee of OXT that ct EDB is indistinguishable from the one generated in the real experiment.
Simulating Search: For conjunctive queries, the leakage L Search TWINSSE OXT is exactly the same as L Search OXT . Hence, we can write the following.
By the simulation security guarantee of OXT, TWINSSE OXT is secure against these leakages.
For disjunctive queries, we argue that the adversary A does not gain any information about the original disjunctive query except |q|. The distribution of MDB (encrypted to EDB) is abstracted from DB through the meta-keywords. We resort to a more conservative analysis for this proof, since keywords cannot be directly inferred from meta-keywords in a manner that applies to any database in general. The position of each w in an mkw is fixed according to the frequency of w, which is unique for a DB. The lemmas below capture the worst cases, where an inference can be established between the query keywords and the corresponding meta-keywords without any additional knowledge of the plain database.
Lemma E.1, Lemma E.2, and Lemma E.3 relate the disjunctive q with w i ∈ ∆ to the conjunctive q with mkw i ∈ ∆.
Proof. The proof of Lemma E.1 is given in Section E.3.1. □
Lemma E.2. Consider two disjunctive queries q 0 and q 1 of the same length t, whose mkw expressions, as defined in Lemma E.1, are both of length t + 1. If the mkws at indices k 0 in q 0 and k 1 in q 1 are the same, then w k 0 −1,q 0 = w k 1 −1,q 1 and w k 0 ,q 0 = w k 1 ,q 1 .
Proof. The proof of Lemma E.3 is given in Section E.3.3. □
Recall that the query transformation is executed by the client locally. The search is executed as a two-party protocol between the client and the server using the meta-keywords. The server learns |q| trivially from q mkw through the number of meta-keywords. From Lemmas E.1, E.2, and E.3, an adversary can infer the position of the same ws in two queries of the same length or of different lengths if both queries have a common mkw in them.
However, from SP, the server can only infer whether or not the least-frequent mkws in the mkw expressions of two qs are identical. The mkw expressions in each of the three lemmas require the mkws to be placed in increasing order of the starting index of the 0's stretch, whereas the actual query expression for OXT has the least-frequent mkw first. No direct inference can be drawn between the least-frequent mkw and the query expressions in the lemmas. Hence, an adversary A cannot distinguish between a common meta-keyword and a distinct meta-keyword.
In the case where the least-frequent mkw is also the first one in the query expression of the lemmas, the first keyword is also the same for both queries. This is equivalent to the case of two conjunctive queries over keywords having the same least-frequent w. Therefore, the leakage from TWINSSE Search can be limited to the OXT pattern leakages only, as expressed below.
Since OXT is proven simulation secure, it follows from the simulation security guarantee that A gains no additional advantage over the real experiment.

E.3 Proofs of the Lemmas
We present the proofs of the lemmas presented earlier in this section. We follow the notations and conventions as used in the main body of the paper.

E.3.1 Proof of Lemma E.1.
Proof. By construction, each meta-keyword mkw i has the original keywords appearing in sorted order in the binary string representation (increasing order of frequency from left to right). Assume that the k'th meta-keyword mkw k is the same for both the queries q 0 and q 1 . Without loss of generality, a meta-keyword in the basic O(N²) (TWINSSE Basic ) method can be formed as where 1 ≤ r < s ≤ n, and b i = 0 for r < i < s.
To have an mkw of this form, q must have two keywords at indices r and s, and none in between (for q 0 and q 1 both). Since the mkws are constructed using ws in sorted order, if both queries q 0 and q 1 have the same r and same s (as one mkw is the same), the keywords w r and w s in both q 0 and q 1 are also the same. Hence, we have w k −1,q 0 = w k −1,q 1 and w k ,q 0 = w k ,q 1 . □

E.3.2 Proof of Lemma E.2.
Proof. We assume the common mkw of q 0 and q 1 can be expressed as where 1 ≤ r < s ≤ n, and b i = 0 for r < i < s. The mkw appears at indices k 0 in q 0 and at k 1 in q 1 . Since the indices of ws in the mkw strings are in sorted order (increasing frequency) and remains fixed for all mkws, the ws at index r and index s are the same for both q 0 and q 1 . However, as the index of mkw is different in q 0 and q 1 , the number of preceding ws before index r in q 0 and q 1 are different, equal to k 0 − 2 and k 1 − 2 respectively. Hence, for q 0 , r is equal to k 0 − 1, and equal to k 1 − 1 in q 1 . Following the above argument, we have w k 0 −1,q 0 = w k 1 −1,q 1 and w k 0 ,q 0 = w k 1 ,q 1 . □

[Figures 11 and 12: Probability of query recovery versus fraction of adversarially chosen/injected files, for the leakage-abuse attack [5] and the file-injection attack [32].]

Our experiments help substantiate that the leakages incurred by the conjunctive and disjunctive search protocols in TWINSSE OXT are benign in practice and are resistant to even the most powerful leakage-based cryptanalysis techniques in the SSE literature over real-world databases, such as leakage-abuse attacks [5] and file-injection attacks [32]. In particular, we experimentally establish the following claim:
Claim F.1 (Informal). The conjunctive and disjunctive search protocols in TWINSSE OXT resist leakage-abuse attacks [5] and file-injection attacks [32], and are benign in practice.
We substantiate this claim by leakage cryptanalysis experiments targeting the conjunctive and disjunctive keyword search protocols in TWINSSE OXT . We evaluate the probability that the adversary guesses correctly the keywords w 1 and w 2 underlying a two-conjunction query q = (w 1 ∧ w 2 ) (resp., a two-disjunction query q = (w 1 ∨ w 2 )) by one of two well-known and extensively studied cryptanalysis methodologies in the SSE literature: the leakage-abuse attack of Cash et al. [5] and the file-injection attack of Zhang et al. [32]. The experiments were conducted over the same Enron email corpus as was used for the performance evaluation experiments in Section 5. The attacks operate in the chosen/injected file model (the strongest possible attack setting, where a certain fraction of the files in the database are adversarially generated). The corresponding results are plotted in Figure 11 and Figure 12 for conjunctive and disjunctive queries, respectively. Throughout, we use a bucket size n ′ = 10 (same as for the performance evaluation experiments in Section 5) for the disjunctive experiments.
We note here that fortifying implementations of TWINSSE OXT by using either ORAM-style data structures or adopting volume-hiding techniques such as padding/encrypted multi-maps [20,28] may be useful in masking leakage even further. However, even without such additional fortifications, TWINSSE OXT resists leakage-abuse and file-injection attacks in the strongest possible attacker setting, as demonstrated by the aforementioned experiments.
Volumetric Known-Data Attacks: We further evaluate TWINSSE OXT against the known-data volume analysis attacks presented by Blackstone et al. [2], where we analyse TWINSSE OXT against the SelVolAn attack. This specific class of attacks exploits the total volume pattern of the queries to recover the keywords, assuming that a fraction of the total data (quantified by the "known data rate" δ) is available to the adversary. More precisely, it tries to associate the queried tags available to the server with known keywords. Since TWINSSE OXT produces noisy volumes due to the presence of spurious ids, the recovery rate is expected to be low in these evaluations. We use the LEAKER framework to execute the SelVolAn attack.
We plot the attack results in Figure 13, which depicts the query recovery accuracy for the SelVolAn attack with varying known data rate δ. Note that, as stated earlier, the recovery rate for the SelVolAn attack is low for our construction due to the presence of spurious ids, which result in a noisy volume pattern. Furthermore, the server receives the volume pattern information of the meta-keywords, not of the actual query keywords. From the adversary's perspective, this requires additional auxiliary information to map recorded encrypted meta-keyword tags to probable meta-keywords pre-computed by the adversary from a partial set of the keyword universe (available as auxiliary data). Precise mapping would require the full set of keywords as auxiliary information (indicating a high δ value, an extremely strong assumption) to form the actual meta-keywords on the adversary's side.
Search and Access Pattern (SAP) based Attack: We also evaluate TWINSSE OXT against the state-of-the-art SAP attack by Oya et al. [25]. The SAP attack exploits the search pattern (the sequence in which the queries are searched) and the access pattern (the particular addresses/elements that are "touched" by the server for each queried tag), and combines these two to recover the association among queried tags (recorded by the adversarial server) and the probable keywords available from auxiliary information. In this case too, the server receives a noisy access pattern, which prevents the adversary from precisely mapping a recorded tag to a probable keyword (meta-keyword in TWINSSE OXT ) available as auxiliary data. We validate this through experimental evaluations with TWINSSE OXT . We used the code made available by the authors for this evaluation.
The attack results are presented in Figure 14, which depicts the attack accuracy (as a fraction of correct "recorded tag"-"probable meta-keyword" associations to all such reconstructed associations) with varying combination coefficient (α). At a high level, α represents the fraction of frequency information of the total auxiliary information available to the adversary used in the attack. In this case, the adversary recovers the association among queried tags and probable meta-keywords, not tags and actual keywords. Since reconstructing the actual meta-keywords requires the adversary to have the exact same keyword universe, it is unlikely that the adversary would be successful in associating the recovered meta-keywords with the subset of the keyword universe available to her (as auxiliary information).

Figure 13: Leakage Analysis of TWINSSE OXT - SelVolAn attack (|∆| = 10³). The amount of information available to the adversary (the known data rate δ, as a fraction of the total data) is varied and plotted on the x-axis; the y-axis shows the mkw recovery accuracy (×100%). The volume pattern from meta-keyword queries was supplied as the leaked information.

G GENERAL BOOLEAN QUERIES (CNF AND DNF) IN TWINSSE AND TWINSSE OXT
In Section 4, we described how TWINSSE and its instantiation from OXT, namely TWINSSE OXT , handle purely conjunctive and purely disjunctive queries. In this section, we describe how TWINSSE can be extended to address general Boolean queries in either the conjunctive normal form (CNF) or the disjunctive normal form (DNF).
We note here that OXT does support Boolean queries beyond simple conjunctions, albeit where the query must be in a restricted searchable normal form (SNF) [7]; our transformation is significantly more general in the sense that it extends to any CNF or DNF formula over keywords, well beyond the scope of SNF queries.
We begin by describing how to handle DNF queries, because, similar to its purely conjunctive and disjunctive counterparts, DNF queries are also handled by TWINSSE (and hence, by extension, TWINSSE OXT ) by making fully black-box usage of the underlying conjunctive SSE scheme. Subsequently, we show how to address CNF queries. This is slightly more involved, and makes non-black-box usage of the underlying conjunctive SSE scheme (we describe a specific strategy for TWINSSE OXT to handle CNF queries that relies on a special data structure used by the OXT scheme).

G.1 Handling Boolean Queries in DNF Form
In Boolean logic, a disjunctive normal form (DNF) is a canonical normal form of a logical formula consisting of a disjunction of conjunctions (alternatively, OR of AND clauses). Formally, any query q that is a Boolean formula over keywords in DNF takes the form q = ∨ ℓ∈[L] q ℓ , where each q ℓ = (w ℓ,1 ∧ . . . ∧ w ℓ,t ℓ ) for ℓ ∈ [L] is a conjunctive clause. Our approach to handle a DNF query is straightforward, and closely resembles, at a high level, our strategy for handling disjunctive queries via query partitioning in TWINSSE. Let CSSE = (CSSE.Setup, CSSE.Search) be any generic conjunctive SSE scheme. The search algorithm processes q via the following steps (the setup algorithm remains the same as TWINSSE.Setup described in Algorithm 1, Section 4):
• Client: Parse a DNF query as q = ∨ ℓ∈[L] q ℓ .
• Client + Server: For each ℓ ∈ [L] (either in parallel or in uniformly random order), compute the result set DB(q ℓ ) by executing CSSE.Search for the conjunctive clause q ℓ over EDB, where EDB is the encrypted meta-database output by TWINSSE.Setup.
• Client: Locally compute at the client the final result set DB(q) = ∪ ℓ∈[L] DB(q ℓ ).
Correctness. Correctness of search follows immediately from the correctness guarantees of the underlying conjunctive SSE scheme CSSE.
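The DNF search steps can be sketched as follows. The function `conjunctive_search` is a plaintext stand-in for an invocation of CSSE.Search (it operates on an unencrypted toy database and is purely illustrative), and the representation of a DNF query as a list of keyword lists is our assumption.

```python
def conjunctive_search(db, clause):
    """Plaintext stand-in for CSSE.Search on one conjunctive clause."""
    return {doc for doc, kws in db.items() if all(w in kws for w in clause)}


def dnf_search(db, dnf_query):
    """Process each conjunctive clause independently, then take the union."""
    per_clause = [conjunctive_search(db, clause) for clause in dnf_query]
    # The final union is computed locally by the client.
    return set().union(*per_clause)


db = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c", "d"}, 4: {"b", "d"}}
# q = (a AND c) OR (b AND d)
result = dnf_search(db, [["a", "c"], ["b", "d"]])
```

Since every clause is handled by a separate conjunctive search, the clauses can be issued in parallel or in random order without affecting the final result.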
Search Complexity. We present an (asymptotic) analysis of the complexity of handling DNF search queries (more concretely, the computational and communication requirements during DNF query processing) when we instantiate TWINSSE using the OXT protocol from [7], i.e., in TWINSSE OXT . Let q be a DNF query of the form q = ∨ ℓ∈[L] (w ℓ,1 ∧ . . . ∧ w ℓ,t ℓ ), where we assume, without loss of generality, that for each ℓ ∈ [L], w ℓ,1 is the least frequent conjunct in the conjunctive clause q ℓ . When processing q using TWINSSE OXT , the computational costs (at both the client and the server) as well as the communication requirements between the client and the server scale linearly as O(γ DNF ), where γ DNF = Σ ℓ∈[L] t ℓ · |DB(mkw ℓ,1 )|.
Note that this is very similar in flavor to the analysis of disjunctive search query overheads for TWINSSE OXT in Section 4.
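As a worked illustration of this cost estimate, the sketch below evaluates the sum, over all clauses, of the clause length t ℓ multiplied by the frequency of the least frequent term of that clause. This is our reading of the cost expression; the function name `gamma_dnf` is hypothetical, and plaintext keyword frequencies are used as a stand-in for the frequencies of the corresponding entries in the meta-database.

```python
def frequency(db, w):
    """Number of documents containing w."""
    return sum(1 for kws in db.values() if w in kws)


def gamma_dnf(db, dnf_query):
    """Sum over clauses of t_l times the frequency of the clause's s-term."""
    total = 0
    for clause in dnf_query:
        s_term_frequency = min(frequency(db, w) for w in clause)
        total += len(clause) * s_term_frequency
    return total


db = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c", "d"}, 4: {"b", "d"}}
# (a AND c): t = 2, s-term c with frequency 2 -> 4
# (b AND d): t = 2, s-term d with frequency 2 -> 4
cost = gamma_dnf(db, [["a", "c"], ["b", "d"]])
```

The estimate depends only on the least frequent term of each clause, which is why clauses containing at least one rare term remain cheap to process.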
Leakage Analysis. We state the following theorems for the leakage from TWINSSE and TWINSSE OXT when processing Boolean queries in DNF form.
Theorem G.1 (DNF Query Processing in TWINSSE). Assuming that CSSE is an (adaptively) secure SSE scheme with respect to the leakage function L CSSE = (L Setup CSSE , L Search CSSE ), the leakage incurred by TWINSSE when processing a DNF query as described above is L Search,DNF TWINSSE ), where for any DNF query q = ∨ ℓ∈[L] q ℓ , we have .
Theorem G.2 (DNF Query Processing in TWINSSE OXT ). The leakage incurred by TWINSSE OXT when processing a DNF query as described above is L Search,DNF TWINSSE OXT , where for any sequence of DNF queries Q = (q 1 , . . . , q M ) such that q m = ∨ ℓ∈[L m ] q m,ℓ for each m ∈ [M], we have where RP, SP, EP and IP leakages for conjunctive queries are as defined in Appendix D.
The proofs of these theorems are very similar to the proofs of Theorems D.1 and D.2 described earlier in Appendix E, and are hence not detailed separately.

G.2 Handling Boolean Queries in CNF Form
In Boolean logic, a conjunctive normal form (CNF) is a canonical normal form of a logical formula consisting of a conjunction of disjunctions (alternatively, AND of OR clauses). Formally, any query q that is a Boolean formula over keywords in the CNF form takes the form q = ∧ ℓ∈[L] q ℓ , where each q ℓ = (w ℓ,1 ∨ . . . ∨ w ℓ,t ℓ ) for ℓ ∈ [L] is a disjunctive clause. Our approach to handle a CNF query is slightly more involved, and makes use of some specific features of the OXT protocol to ensure sub-linear search overheads in practice. Hence, the subsequent description of how to handle CNF queries is specific to TWINSSE OXT . We leave it as an interesting open question to investigate a generic solution using any conjunctive SSE scheme in a black-box manner.
We now describe our proposed strategy for handling CNF queries in TWINSSE OXT . Before delving into the details, we need to recall some details of the original OXT scheme due to Cash et al. [7]. We refer the reader to [7] for details of the OXT scheme; however, we will try to make the description here as self-contained as possible. The OXT protocol maintains on the server (as part of the encrypted database EDB) a special data structure called a "cross-tag set" (or XSet in short). The XSet consists of several "cross-tags", where each cross-tag xtag id,w corresponds to a document identity-keyword pair (id, w), where xtag id,w ∈ XSet if and only if w ∈ DB(id).
In our handling of CNF queries in TWINSSE OXT , we make black-box usage of the following sub-functions provided by any implementation of the OXT protocol:
• OXT.GenXTag(sk, id, w): The client can use the secret key generated at setup by OXT.Setup to generate xtag id,w for any document identifier id and keyword w.
• OXT.SearchXTag(xtag id,w ; XSet): On receipt of a cross-tag xtag id,w from the client, the server can look up the XSet efficiently to return a bit β ∈ {0, 1}, where β = 1 if xtag id,w ∈ XSet, and β = 0 otherwise.
Given these sub-routines, our proposal for processing a CNF query q proceeds via the steps outlined below (the setup algorithm again remains the same as TWINSSE.Setup described in Algorithm 1, Section 4, albeit for CSSE = OXT). Note that unlike purely conjunctive/disjunctive queries and DNF queries, all of which required a single round search protocol, our processing of CNF queries now requires two rounds of communication between the client and the server.
• Client: Parse a CNF query as q = ∧ ℓ∈[L] q ℓ .
• Client: Identify the candidate disjunctive clause q ℓ with the smallest result set (this can be computed in a straightforward manner from the client state st output by OXT.Setup, which has the frequency of each keyword in the dictionary).
• Client+Server (Round-1): Compute the result set DB(q ℓ ) corresponding to the disjunctive clause q ℓ over EDB, where EDB is the encrypted meta-database output by TWINSSE OXT .Setup, by directly using the disjunctive search protocol described in Algorithm 3 (Section 4) with CSSE = OXT.
• Client: For each id ∈ DB(q ℓ ) and each w i,ℓ ′ for ℓ ′ ≠ ℓ in the query q, compute xtag id,w i,ℓ ′ = OXT.GenXTag(sk, id, w i,ℓ ′ ).
Output the final result set R q = {id ∈ DB(q ℓ ) such that β id = 1}.
Correctness. Correctness of search follows immediately from the correctness guarantees of TWINSSE OXT (Theorem 4.1), and the correctness guarantees of the OXT protocol itself.
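The two-round CNF strategy can be sketched as follows, under several loudly stated assumptions. The cross-tag is modelled with HMAC-SHA256 purely as a stand-in (the OXT scheme of [7] derives cross-tags differently), the round-1 disjunctive search is replaced by a plaintext filter, the candidate clause is chosen by its plaintext result size rather than from the client state st, and the rule that an id is kept when every remaining clause has at least one matching cross-tag is our reading of how the bits β are combined. All function names are hypothetical.

```python
import hashlib
import hmac


def gen_xtag(sk, doc_id, w):
    """Hypothetical stand-in for OXT.GenXTag (HMAC is used for illustration)."""
    return hmac.new(sk, f"{doc_id}|{w}".encode(), hashlib.sha256).digest()


def build_xset(sk, db):
    """XSet holds one cross-tag per (id, w) pair with w in document id."""
    return {gen_xtag(sk, d, w) for d, kws in db.items() for w in kws}


def search_xtag(xtag, xset):
    """Stand-in for OXT.SearchXTag: returns the membership bit."""
    return 1 if xtag in xset else 0


def clause_result(db, clause):
    """Plaintext stand-in for the round-1 disjunctive search."""
    return {d for d, kws in db.items() if any(w in kws for w in clause)}


def cnf_search(sk, db, xset, cnf_query):
    # Client: choose the candidate clause with the smallest result set.
    cand = min(range(len(cnf_query)),
               key=lambda l: len(clause_result(db, cnf_query[l])))
    round1 = clause_result(db, cnf_query[cand])
    final = set()
    for doc_id in round1:
        keep = True
        for l, clause in enumerate(cnf_query):
            if l == cand:
                continue
            # Round 2: one cross-tag lookup per keyword of the other clauses.
            bits = [search_xtag(gen_xtag(sk, doc_id, w), xset) for w in clause]
            if not any(bits):
                keep = False
                break
        if keep:
            final.add(doc_id)
    return final


sk = b"demo-key"
db = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c", "d"}, 4: {"b", "d"}}
xset = build_xset(sk, db)
# q = (c OR d) AND (b)
result = cnf_search(sk, db, xset, [["c", "d"], ["b"]])
```

The number of cross-tag lookups in the second round is bounded by the size of the round-1 result set times the number of keywords in the remaining clauses, which is independent of the frequencies of those remaining clauses.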
Search Complexity. We now present an (asymptotic) analysis of the complexity of handling CNF search queries (more concretely, the computational and communication requirements during CNF query processing) in TWINSSE OXT . Let q be a CNF query of the form $q = q_1 \wedge \cdots \wedge q_L$ with $q_\ell = (w_{1,\ell} \vee \cdots \vee w_{t_\ell,\ell})$ for each $\ell \in [L]$, where we assume, without loss of generality, that $q_1 = (w_{1,1} \vee \cdots \vee w_{t_1,1})$ is the disjunctive clause with the smallest result set. Let $q_{\mathsf{mkw}} = \bigvee_{k \in [n_B]} q_{\mathsf{mkw},k}$ be the corresponding meta-query when the disjunctive search query corresponding to $q_1$ is processed using TWINSSE OXT .Search, and assume without loss of generality that $\mathsf{mkw}^{(k)}_{i_k,j_k}$ is the least frequent meta-keyword within $q_{\mathsf{mkw},k}$ for each $k \in [n_B]$ (such that $q_{\mathsf{mkw},k}$ is non-empty).
When processing q using TWINSSE OXT , the computational costs (at both the client and the server) as well as the communication requirements between the client and the server scale linearly as $O(\gamma_0 + \gamma_1)$, where $\gamma_0$ is the overhead of the round-1 disjunctive search for $q_1$ (expressed, as in Section 4, in terms of $|q_k|$, the number of meta-keywords in the conjunctive sub-meta-query $q_k$, with $|q_k| = 0$ when $q_k$ is empty), and $\gamma_1 = |\mathsf{DB}(q_1)| \cdot \sum_{\ell \in [2,L]} t_\ell$.
Note that the term $\gamma_0$ is computed exactly as in the analysis of disjunctive search query overheads for TWINSSE OXT in Section 4. Moreover, the term $\gamma_1$, which represents the computational and communication complexities incurred in round 2 of the CNF query processing (using OXT.GenXTag and OXT.SearchXTag), is independent of the frequencies of any of the disjunctive clauses other than the "least frequent clause" $q_1$.
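As a small worked illustration of the round-2 term, the following sketch counts the crosstags that round 2 generates and checks: one per candidate id and per keyword in every clause other than the least frequent one. The function name `gamma_1` and the example numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def gamma_1(pivot_result_size: int, other_clause_sizes: Sequence[int]) -> int:
    # pivot_result_size:  |DB(q_1)|, the size of the round-1 result set.
    # other_clause_sizes: t_2, ..., t_L, the keyword counts of the
    #                     remaining disjunctive clauses.
    return pivot_result_size * sum(other_clause_sizes)


# Hypothetical query with L = 3 clauses: |DB(q_1)| = 100, t_2 = 2, t_3 = 3.
print(gamma_1(100, [2, 3]))  # prints 500
```

The frequencies of the keywords in clauses 2 through L do not appear among the inputs, which matches the observation that this term depends only on the least frequent clause and on the keyword counts of the others.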
Leakage Analysis. We state the following theorems for the leakage from TWINSSE OXT when processing Boolean queries in CNF form. Here, the RP, SP, EP and IP leakages for conjunctive queries are as defined in Appendix D, and Q mkw,1 is a sequence of (sub-)meta-queries. The proof of this theorem is again very similar to the proof of Theorem D.2 described earlier in Appendix E, and is hence not detailed separately.
Figure 15: TWINSSE OXT performance with result set size on Enron dataset for DNF queries. (Axes: size of final result set |R q | vs. end-to-end query time (s); curves: (w1 ∧ w2) ∨ (w3 ∧ w4) ∨ (w4 ∧ w6) and (w1 ∧ w2) ∨ (w3 ∧ w4).)

G.3 Experimental Results over the Enron Email Dataset
We provide experimental results for CNF and DNF queries using TWINSSE OXT in this section. We experimented over the Enron dataset on the same platform (discussed in Section 5) with our implementation of TWINSSE OXT .
DNF queries. We considered multiple queries with two clauses and three clauses, with each clause having two keywords. The end-to-end query time is plotted in Figure 15, where the blue curve represents the query time for two-clause queries and the red curve represents the query time for three-clause queries. Observe that the query time for both two-clause and three-clause queries increases with the number of ids in the final result set. This increase can be attributed to the larger result sets of the individual conjunctive clauses. Also note that the query time is higher for three-clause queries due to the additional conjunctive clause, and follows the same trend of increasing with the final result size.
CNF queries. For experimenting with CNF queries, we considered two-clause queries with two keywords and three keywords per clause. Since the Enron dataset is relatively sparse, queries with a higher number of clauses often result in a small or empty intersection. We plotted the end-to-end query time against the size of the final result set in Figure 16 for both cases: two-keyword clauses and three-keyword clauses. The blue curve represents the end-to-end query time for the queries with two keywords per clause. Similarly, the red curve represents the end-to-end query time for queries with three-keyword clauses. Observe that, for CNF queries also, the end-to-end query time increases with the final result size, due to the increased size of the initial result set. For the three-keyword clauses, the query time is higher than for the two-keyword clauses due to the increased size of the initial result set obtained by the disjunctive query.
Figure 16: TWINSSE OXT performance with result set size on Enron dataset for CNF queries. (y-axis: end-to-end query time (s); curves: (w1 ∨ w2 ∨ w3) ∧ (w4 ∨ w5 ∨ w6) and (w1 ∨ w2) ∧ (w3 ∨ w4).)

H EXPERIMENTAL RESULTS OVER THE WIKIMEDIA DUMP
We present additional experimental results for TWINSSE OXT over Wikimedia databases in this section. We varied the database size from 6K keywords (60K w-id pairs in the plain index) to 80K keywords (8.2 million w-id pairs in the plain index), and we plot the server storage overhead in Figure 17 and the performance figures in Figure 18 and Figure 19. The comparative storage overhead plot (in log scale) in Figure 17 illustrates the quadratic storage overhead of IEX-2Lev, whereas the overhead remains linear for TWINSSE OXT . This storage overhead profile validates the primary contribution of our work, and also illustrates its applicability to different databases (results on the Enron dataset are presented in Section 5 of the main text).

I EVALUATION OF STORAGE OVERHEAD WITH SYNTHETIC DATABASE
We discussed in Section 5 (Figure 3) that TWINSSE OXT significantly improves on IEX-2Lev in terms of storage overhead on the Enron database. Figure 20 compares the storage overhead of TWINSSE OXT and IEX-2Lev on a synthetic database that follows Zipf's law, and Figure 21 compares the storage overhead of TWINSSE OXT and IEX-2Lev on a synthetic database that follows a uniform distribution. These databases contain more documents per keyword than the Enron database. This implies that the intersections of keyword pairs are much larger than in the Enron database. The storage overhead of IEX-2Lev hence degrades even more.
To clarify this, consider the following example of a realistic database that is dense in the sense described above and in Section 5. Note that any relational database is dense if each attribute is low-entropy (i.e., takes only a few values), so that each attribute-value pair (equivalent to a keyword) occurs in a very large number of records (equivalent to documents). Consider the following Covid-19 patient database (Table 2), where each attribute-value pair likely occurs in a large number of patient records. Observe that querying any of the attributes would return a large number of records from this example database. Our experimental results show that IEX-2Lev incurs 70× higher storage overhead than TWINSSE OXT for the synthetic database following Zipf's law (Figure 20) and approximately 150× higher storage overhead for the database following a uniform distribution (Figure 21). The search time also increases for both schemes; however, the main advantage of TWINSSE OXT compared to IEX is in reduced storage, not in search overheads (which still remain sublinear for TWINSSE OXT ).
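The effect of density on pairwise keyword intersections can be illustrated with a toy computation. The sketch below compares the number of w-id pairs in a plain index with the total size of all pairwise keyword intersections, for a sparse and a dense toy database. The total intersection size is used here only as a rough proxy for the cross-keyword data that grows with density; it is not a model of the actual IEX-2Lev storage layout, and the attribute names and records are hypothetical.

```python
from itertools import combinations


def plain_index_size(index):
    # Number of (keyword, id) pairs in the plain inverted index.
    return sum(len(ids) for ids in index.values())


def pairwise_intersection_size(index):
    # Total size of DB(w) ∩ DB(w') over all unordered keyword pairs.
    # ASSUMPTION: used only as a rough proxy for pair-based storage.
    return sum(len(index[a] & index[b])
               for a, b in combinations(sorted(index), 2))


# Sparse toy database: keywords rarely share documents.
sparse = {"a": {1, 2}, "b": {3, 4}, "c": {5, 6}, "d": {7, 8}}

# Dense toy database: two hypothetical low-entropy attributes over 8 records,
# so every attribute-value pair occurs in half of all records.
dense = {
    "attr1=x": {1, 2, 3, 4},
    "attr1=y": {5, 6, 7, 8},
    "attr2=p": {1, 2, 5, 6},
    "attr2=q": {3, 4, 7, 8},
}

print(plain_index_size(sparse), pairwise_intersection_size(sparse))  # prints 8 0
print(plain_index_size(dense), pairwise_intersection_size(dense))    # prints 16 8
```

Doubling the plain index from 8 to 16 pairs takes the intersection total from 0 to 8 in this toy case, which is the qualitative behaviour the dense synthetic databases are meant to expose.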