Strengthening Privacy-Preserving Record Linkage using Diffusion

Linking personal records from different databases is an essential step in many data workflows. Privacy-preserving record linkage (PPRL) techniques have been developed to link persons despite errors in the identifiers without violating their privacy. Designing efficient PPRL schemes with high linkage quality and a strong level of privacy protection is challenging. PPRL based on Bloom filter (BF) encoding is currently one of the most popular approaches, as it offers high efficiency and linkage quality. However, these schemes have turned out to be vulnerable to several attacks, with pattern mining and graph matching attacks considered to be the most serious by far. While several proposals have been made to strengthen BF-based PPRL schemes against these attacks, all of them either lack a proper security analysis or do not preserve the high efficiency and linkage quality. This paper shows that both problems can be addressed by extending the scheme with an appropriate linear diffusion layer. As opposed to previous schemes, we provide an extensive theoretical and experimental analysis which confirms that the resulting scheme provides high efficiency and linkage quality and significantly increases security against the attacks mentioned above.


Introduction
Vast amounts of data have been collected due to the digital transformation of commercial and administrative processes. This data is used for many purposes, including research. Notably, new research interests have been created by the ability to link multiple databases. The process of linking entities from different databases while still preserving privacy is called privacy-preserving record linkage (PPRL) [8]. The three main challenges of a PPRL technique are (1) scalability or efficiency, (2) linkage quality, and (3) level of privacy protection.
To provide privacy, PPRL schemes usually encode the records and execute the linkage process on the encoded records only. Among the different encoding techniques used in PPRL, the use of Bloom filter (BF) encoding has been proposed in [19]. Due to its high efficiency and linkage quality, BF-based PPRL is now widely used in real applications [2,5]. However, research [14,17,25,26] has also shown that such schemes can be vulnerable to a variety of attacks. In particular, the pattern mining attack [26] and the graph matching attack [25] are considered the most practical and advanced threats.
Consequently, various Bloom filter hardening techniques have been proposed to enhance the scheme's security, including salting, balancing, adding random noise, and Rule 90 [8,22]. However, all these techniques reduce the linkage quality or the efficiency [8,10]. Moreover, as remarked in [8], no proper security analysis has been provided for any of these hardening techniques. Thus, it was an open question whether an extension of BF-based PPRL exists that (i) preserves its advantages and (ii) increases security against the two attacks mentioned above.
In this work, we positively answer this question. The core idea is to adopt an established concept from cipher design and append a linear diffusion layer at the end of the BF-based PPRL scheme. Consequently, we refer to this concept as a BFD scheme (Bloom filter with diffusion). In a nutshell, the idea is that each output bit of the encoding process is a linear combination of secretly chosen BF bits.
Even though this idea seems to be obvious, it has not been fully explored so far. For instance, a similar approach was considered in [22]. The authors proposed to apply the so-called 'Rule 90' [28], so that each bit position in the Bloom filter is replaced by the XOR of its predecessor and successor bits.
In [18], randomly sampling bits from Bloom filters and XORing them was suggested. However, no security analysis has been given for any of these proposals (or for any other hardening technique proposed so far).
In contrast, one of the major contributions of this work is the extensive theoretical and experimental analysis of the proposed scheme. Our theoretical analysis shows superior security compared to the basic BF-based PPRL approach while preserving good linkage quality. As a side result, it also shows that the diffusion induced by 'Rule 90' as suggested in [22] is insufficient. Some of the analysis techniques may have value on their own and could be applied to other schemes as well.
Besides theoretical analysis, extensive experiments confirm the excellent properties of the BFD. Thanks to the simplicity of the extension, high efficiency is also still given. To the best of our knowledge, this is the first (BF-based) effective PPRL scheme providing good linkage quality where a thorough security analysis has been conducted. Moreover, we discuss how concrete parameter settings can be derived.
The rest of the paper is organized as follows. In Section 2, we describe the concept of privacy-preserving record linkage (PPRL), emphasizing the variant based on Bloom filter encoding. In Section 3, we explain the commonly considered attacker model and the two most relevant attacks, namely pattern mining and graph matching attacks. Section 4 gives the rationale of the design and a formal description of our BFD scheme. The security analysis of the scheme is given in Section 5. In Section 6, we evaluate the scheme for practical applications by investigating efficiency and linkage quality. Parameter choices are discussed in Section 7. Section 8 summarizes the contribution and gives an outlook on potential future work.

Privacy-Preserving Record Linkage

General Description
Privacy-preserving record linkage (PPRL) refers to identifying records in different databases that refer to the same person without revealing the identity of any individual in one of these databases.
Without loss of generality, we restrict our description to the case of two databases. We consider two database holders, who own databases D and D′, respectively. One assumes that each record in any database can be written as r = (ID, λ, µ) where ID is some randomly generated, unique identifier, λ is the so-called linkage data that will be used for linking records, e.g., name, and µ is the microdata that is relevant for the analysis. Essentially, the task of record linkage is to identify records r = (ID, λ, µ) ∈ D and r′ = (ID′, λ′, µ′) ∈ D′ that according to λ and λ′ refer to the same person and hence shall be linked.
In PPRL, this process is executed by a third party called the 'linkage unit' L. To prevent any re-identification of individuals from λ, each database holder applies an encoding algorithm to transform the plain linkage data λ into encoded linkage data, written as [λ]. The linkage unit L receives the encoded linkage data [λ] along with their identifiers ID from the database holders and conducts the linkage using a linkage algorithm. That is, for any pair of tuples (ID, [λ]) and (ID′, [λ′]), the linkage unit L computes a similarity function sim on the encoded linkage data. If sim([λ], [λ′]) is above a predefined threshold, the records are classified as a potential match, and (ID, ID′) is recorded and sent to another party for actually joining the databases D and D′.

PPRL based on Bloom Filters
As the linkage process in a PPRL scheme is applied to the encoded linkage data, the choice of the encoding procedure has a significant impact on the quality of the whole scheme. One encoding procedure that has become very popular in this context is the Bloom filter. Bloom filters (BF) [3] were developed to test the membership of elements in a set without the need to access the set directly. A BF is parameterized by a bit string of length ℓ_BF (also called the length of the BF), a universe U of elements, and k hash functions h_1, …, h_k. Given a set S ⊆ U, first a bit string B of length ℓ_BF is initialized where each entry is set to zero, i.e., B[i] = 0 for i = 1, …, ℓ_BF. Then, for each s ∈ S, (up to) k indexes h_1(s), …, h_k(s) ∈ {1, …, ℓ_BF} are computed and the bits in B at these positions are set to one. That is, B[h_i(s)] := 1 for each s ∈ S and each hash function h_i.
In the case of BF-based PPRL schemes, Bloom filters are used to encode the linkage data λ as follows. Attribute values contained in λ are represented over some common alphabet Σ. A q-gram (for some integer q ≥ 1) is a string of length q over characters from Σ. To encode λ into a Bloom filter, it is interpreted as a sequence over Σ, and then, using a sliding window approach, the set of q-grams contained in λ is extracted. For instance, when using q = 2 (also known as bigrams), the string "filter" is converted into the set of bigrams S = {fi, il, lt, te, er}. Then, S is encoded into an ℓ_BF-bit string as described above, using k hash functions h_i : U = Σ^q → {1, …, ℓ_BF}. An example is given in Fig. 1.
Intuitively, two linkage data λ and λ′ are similar if the corresponding sets of q-grams strongly overlap. However, for privacy reasons, the linkage unit L does not have access to these sets directly but only sees the encoded linkage data B = [λ] and B′ = [λ′]. In BF-based PPRL schemes, the similarity of these binary strings is captured by the Dice coefficient. Formally, the Dice coefficient of two binary strings B and B′ is defined as
Dice(B, B′) := 2 · |supp(B) ∩ supp(B′)| / (|supp(B)| + |supp(B′)|)    (1)
where supp(B) := {i | B[i] ≠ 0} refers to the set of indices of the bits that are equal to 1. If the Dice coefficient is sufficiently large, i.e., greater than or equal to a pre-defined threshold τ, two records (ID, λ, µ) and (ID′, λ′, µ′) should be linked.
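The encoding and comparison steps can be illustrated by the following minimal Python sketch. It is not the implementation used in this paper: the choice of HMAC-SHA256 with a counter prefix as the family of k keyed hash functions, the key value, and the function names `qgrams`, `bloom_encode`, and `dice` are assumptions made for illustration only. Positions are 0-based here, whereas the text counts from 1.

```python
import hashlib
import hmac


def qgrams(text, q=2):
    """Return the set of q-grams obtained with a sliding window over text."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def bloom_encode(text, length=1000, k=10, key=b"shared-secret", q=2):
    """Encode text into a Bloom filter given as a list of 0/1 bits.

    Assumption: the i-th hash function is HMAC-SHA256 under a shared key,
    applied to the counter i followed by the q-gram.
    """
    bits = [0] * length
    for gram in qgrams(text, q):
        for i in range(k):
            digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits[int.from_bytes(digest, "big") % length] = 1
    return bits


def dice(a, b):
    """Dice coefficient: twice the common 1-positions over the total number of 1s."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


print(sorted(qgrams("filter")))  # ['er', 'fi', 'il', 'lt', 'te']
print(dice(bloom_encode("filter"), bloom_encode("filters")))
print(dice(bloom_encode("filter"), bloom_encode("window")))
```

With these settings, strings sharing most of their bigrams should yield a Dice coefficient close to 1, while unrelated strings only agree on accidental hash collisions.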

Security

Model
The workflow of PPRL is that each database holder encodes its database and sends the result to the linkage unit L. No database holder sees the data of any other database holder. However, database holders may agree on shared secrets, e.g., the deployed hash functions in the case of BF-based PPRL, before the encoding step. The linkage unit L has access to all encoded databases and possibly knows certain meta-data about the data sets, e.g., that they belong to a specific hospital, the individuals are residents of a particular area, etc. The goal of L as an attacker is to correctly re-identify individuals represented in the databases from the encoded linkage data better than pure guessing [8].
The overall goal of this work is to propose an extension for BF-based PPRL schemes and to show that this modification makes it more secure while preserving its practicability. Ideally, one would rely here on a concise security definition and provide a proof of security based on this definition. Unfortunately, we would then face two problems.
First, the only formal security definition we are aware of is the one provided in [12]. However, it models an interactive scenario while we consider a one-shot-scenario where each database holder obfuscates his database once and sends the result to the linkage unit. The reason for this choice is that this is the predominant use case in practice. For this scenario, there is no formal security definition (yet). For the same reason, we also rule out interactive PPRL based on multiparty computation, e.g., see [24].
Second, proofs of security reduce the security of one cryptographic scheme to the (alleged) difficulty of another problem. Experience shows that provably secure schemes either rely on problems that are not thoroughly analyzed or the resulting schemes are less efficient than other, non-provably secure schemes. Actually, no symmetric cryptographic scheme in practical use, e.g., the encryption standard AES or the hash functions SHA2/SHA3, is provably secure.
Instead, the conventional approach in the symmetric cryptography community is to analyze the resistance of schemes against the best-known attacks. In this work, we follow this approach. This is also motivated by the fact that BF-PPRL exhibits the characteristics of a symmetric-key cryptographic scheme. On the one hand, secrets are shared between the database holders. On the other hand, the efficiency of PPRL is of high importance.
While several attacks have been found against BF-based PPRL (for an overview, see [13,27]), the pattern mining [26] and graph matching attacks [25] are considered to be the most powerful ones. Consequently, we aim to adapt BF-based PPRL to make these attacks less effective without sacrificing efficiency and linkage quality.
In the following, we describe pattern mining attacks (Sec. 3.2) and graph matching attacks (Sec. 3.3). In particular, we identify for each attack a necessary property of BF-based PPRL. These properties will also serve as the basis of the security analysis (Sec. 5).

Pattern Mining Attacks
A pattern mining attack [26] is one of the most practical and advanced attack methods as it is quite effective while assuming a weak attacker model. In this attack, the attacker has access to an encoded database and has some knowledge about the distribution of q-grams in plaintext records, e.g., statistics from a population. In particular, the attacker knows the frequency of the most common q-grams.
To explain the attack, we introduce some notation that we will re-use later in Section 5.2 for the security analysis. Let γ denote an arbitrary q-gram, and let I(γ) := {h_1(γ), …, h_k(γ)} denote the set of Bloom filter positions that are set to 1 when γ is encoded. Note that I(γ) usually is unknown to an attacker, due to the use of secret hash functions. However, the attack exploits the fact that frequent q-grams incur frequent '1'-patterns in the Bloom filters. Given this, let γ* denote the most frequent q-gram according to the known distribution of q-grams in plaintext records. The attacker searches the Bloom filters for the most frequent patterns, i.e., a set of indices I* that are jointly equal to 1 with high frequency. The assumption is that the most frequent pattern results from the most frequent q-gram, i.e., I(γ*) ⊆ I*. In particular, if the event E_I* holds for a given Bloom filter, i.e., all bits indexed by I* are equal to 1, one assumes that the underlying record includes γ*. That is, one q-gram is identified from the underlying records.
Analogously, the second-most-frequent pattern is identified and so on. Thus, step by step, certain q-grams are identified in the encoded linkage data and eventually may result in the re-identification of individuals.
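To make the search for a frequent '1'-pattern more tangible, the following Python sketch grows an index set greedily: it repeatedly adds the position that is most often set among the Bloom filters that already show the current pattern. This greedy routine is only a simplified stand-in chosen for illustration; it is not the frequent pattern mining procedure of [26]. The synthetic data, the planted positions, and the function name `greedy_frequent_pattern` are hypothetical.

```python
import random
from collections import Counter


def greedy_frequent_pattern(filters, size):
    """Greedily grow an index set whose bits are jointly 1 in many filters.

    Returns the index set and the fraction of filters that show the pattern.
    """
    pattern, support = [], list(filters)
    for _ in range(size):
        counts = Counter()
        for bits in support:
            counts.update(i for i, b in enumerate(bits) if b and i not in pattern)
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        pattern.append(best)
        # keep only the filters that still match the pattern found so far
        support = [bits for bits in support if bits[best]]
    return set(pattern), len(support) / len(filters)


# Synthetic example: every second filter contains a "frequent q-gram"
# that sets the planted positions; all other bits are sparse noise.
rng = random.Random(0)
planted = {3, 17, 42}
filters = []
for n in range(200):
    bits = [1 if rng.random() < 0.1 else 0 for _ in range(64)]
    if n % 2 == 0:
        for i in planted:
            bits[i] = 1
    filters.append(bits)

pattern, frequency = greedy_frequent_pattern(filters, size=3)
print(sorted(pattern), frequency)
```

In this toy setting the planted positions dominate the counts, so the routine recovers them; on real data an attacker would then align the recovered pattern with the most frequent q-gram of the known plaintext distribution.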
In [27], the authors provide a taxonomy of all existing attacks and their results. Using the publicly available North Carolina Voter Registration (NCVR) database, [27] reports that at most 27,665 out of 222,251 records are correctly re-identified by pattern mining attacks.

Graph Matching Attacks
A graph matching attack [25] is considered to be the most powerful attack against PPRL schemes, but assumes a stronger attacker model. The attacker has access to a plaintext database D and an encoded database [D′], being the encoding of some database D′. Such a scenario might be given if, for example, an attacker has some knowledge about a super-set of the records contained in D′, e.g., D′ being a database of patient records from a hospital while D is a publicly available phone book.
For a graph matching attack, the databases D and D′ need to have a significant overlap (in the sense of linkability). This is expressed by the overlap rate 2 · |Matches(D, D′)| / (|D| + |D′|), where Matches(D, D′) denotes the set of all pairs in D × D′ that should/can be linked. The overlap rate ranges from 0% to 100%, the latter being the case if D equals D′. In [25], it was demonstrated that graph matching attacks are very powerful if the overlap rate is high, i.e., very close to 100%. Given this, the attacker aims to identify elements in D′ that can be linked to records in D. As D is given in plaintext, this results in a re-identification of the individual encoded in [D′].
A second requirement for a graph matching attack is that the encoding scheme preserves the level of similarity. In the case of BF-based PPRL, consider any pair of plaintext records with linkage data λ and λ′ and let B and B′ be the corresponding Bloom filter encodings. Let sim be the function that expresses the similarity between plaintext records while the Dice coefficient expresses the similarity between the Bloom filters (cf. Eq. 1). Then, a graph matching attack exploits that these notions of similarity are correlated. For example, in the case of BF-based PPRL, it holds that Dice(B, B′) grows linearly with sim(λ, λ′) (cf. Fig. 2). Thus, if a set of records in D is also contained in [D′] (in encoded form), the relation between them in terms of similarity should have a similar structure in both databases.
The attacker constructs "similarity graphs" for each database as follows. The nodes of the graphs are the (encoded) entries, and the edges between nodes are weighted with the level of similarity of their endpoints (the Dice coefficient in the case of BF-based PPRL). The idea is now to look for sub-graphs in both graphs that are structurally similar, i.e., match. The assumption is that this indicates that the nodes of the sub-graph constructed from [D′] match the nodes in the sub-graph derived from D. As the latter is given in plaintext, these records from [D′] have been re-identified.
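The construction of such a similarity graph can be sketched in Python as follows. The example builds the graph for a plaintext database using the Dice similarity of bigram sets; the same routine could be applied to encoded records by passing a Dice function on bit strings instead. The pruning threshold, the toy names, and the function names are assumptions for illustration and are not taken from [25]; the actual sub-graph matching step is not shown.

```python
from itertools import combinations


def qgrams(text, q=2):
    """Set of q-grams of a string (sliding window)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_dice(a, b, q=2):
    """Dice similarity of the q-gram sets of two plaintext strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga or gb else 0.0


def similarity_graph(items, sim, threshold=0.0):
    """Return {(u, v): weight} for all node pairs whose similarity reaches threshold.

    Assumption: edges below the threshold are pruned to keep the graph sparse.
    """
    edges = {}
    for u, v in combinations(range(len(items)), 2):
        weight = sim(items[u], items[v])
        if weight >= threshold:
            edges[(u, v)] = weight
    return edges


names = ["smith", "smyth", "smithe", "jones"]
edges = similarity_graph(names, qgram_dice, threshold=0.4)
print(edges)
```

Here the three similar names form a weighted triangle, while the unrelated name stays isolated; an attacker would look for the same shape in the graph built from the encoded database.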
Using the same database called 'NCVR' as in [26], the authors successfully re-identified 55.1% of the records using graph matching attacks according to [27]. In [15], the authors pointed out that a successful attack should result in more correct re-identifications than random guessing. An equation given by [15] suggests one re-identification as the expected value for our setting. Thus, any privacy attack that correctly identifies significantly more than one record should be considered a success, at least to some extent.

The BFD Scheme

Design Rationale
As motivated in Section 1, the challenge we address in this paper is to extend the BF-based PPRL scheme explained in Section 2.2 such that (i) the extension does not change the original scheme too much to preserve its benefits but (ii) one can show the security is given against pattern mining attacks (Sec. 3.2) and graph matching attacks (Sec. 3.3).
As a pattern mining attack is similar to a frequency attack against the substitution cipher, it is reasonable to consider as a countermeasure the same protection mechanism that also helps to thwart frequency attacks: diffusion. In cryptography, the purpose of diffusion [23] is to hide the statistical relationship between the ciphertext and the plaintext. Commonly, this is achieved by requiring that small changes in the input have a potentially large impact on the output. In the case of BF-based PPRL, we use this concept to break the strong correlation between q-grams (input) and bits of the encoded linkage data (output). More precisely, a linear diffusion layer is appended to the computation of the Bloom filters. Given some Bloom filter B of length ℓ_BF, a bit string E of length ℓ_ELD, also called the encoded linkage data, is constructed where each bit in E is the XOR-sum of selected bits from B. An example is shown in Fig. 3. There, BF refers to the Bloom filter alone while BFD stands for the encoding of the BF by applying a diffusion layer.
The diffusion layer may also protect against graph matching attacks. Consider two Bloom filters B and B′ and their respective encoded linkage data E and E′. To preserve linkage quality, it needs to hold that a high value of Dice(B, B′) implies a high value of Dice(E, E′) as well. This requirement stems from the fact that we use the same criteria for deciding whether or not encoded linkage data shall be linked. However, the lower Dice(B, B′) is, the less it shall be correlated to the value of Dice(E, E′). If this is given, the sub-graphs in [D′] no longer reflect the structure in D, and hence the attack would fail.

Description
In the following, we provide a complete description of BFD, being the BF-based PPRL scheme extended by a diffusion layer. The BFD scheme comprises three algorithms: (i) an algorithm for choosing the diffusion layer (Sec

Choosing the Diffusion Layer.
During the setup phase of a BF-based PPRL, i.e., when the database holders agree on shared secrets such as the deployed hash functions, also the diffusion layer needs to be selected. We assume this to be done by one database holder, and the result is shared with the others.
Note that the diffusion layer transforms a Bloom filter B ∈ {0, 1}^ℓ_BF into encoded linkage data E ∈ {0, 1}^ℓ_ELD where each bit of E is the XOR-sum of some bits from B. Thus, the diffusion layer is specified by a list I of ℓ_ELD sets I_j ⊂ {1, …, ℓ_BF}.
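As a small illustration of how such a list of index sets acts on a Bloom filter, consider the following Python sketch. The function name `diffuse`, the toy filter, and the three index sets are hypothetical, and positions are 0-based.

```python
def diffuse(bloom, index_sets):
    """Return E, where E[j] is the XOR of bloom[i] over all i in index_sets[j]."""
    return [sum(bloom[i] for i in idx) % 2 for idx in index_sets]


layer = [{0, 1, 3}, {2, 4, 5}, {1, 2, 3}]

B = [1, 0, 1, 1, 0, 0]
print(diffuse(B, layer))          # [0, 1, 0]

# Flipping the single Bloom filter bit B[3] changes every output bit
# whose index set contains position 3 (here: E[0] and E[2]).
B_flipped = [1, 0, 1, 0, 0, 0]
print(diffuse(B_flipped, layer))  # [1, 1, 1]
```

The second call shows the intended diffusion effect on a small scale: one changed input bit affects several output bits.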
Alg. 1 displays the algorithm for selecting the diffusion layer. To maximize the diffusion effect, we aim for two effects. First, the change of any Bloom filter bit should impact as many bits in the encoded linkage data as possible. Second, each bit should depend on as many Bloom filter bits as possible.
To accomplish the first effect, Alg. 1 ensures in lines 4 and 9 that each I j is exactly of size t. As we are going to discuss in Sec. 3 and Sec. 6, the parameter t will be one of the main parameters to adjust the level of security and linkage quality.
To achieve the second effect, one would need to ensure that for any indices j_1 and j_2, the intersection I_{j_1} ∩ I_{j_2} is as small as possible. However, this would require solving an involved combinatorial problem. To keep this procedure as efficient as possible, we opt for a greedy approach instead. That is, we aim for each index in {1, …, ℓ_BF} to appear on average in the same number of index sets I_j. This is accomplished by using a set Indices that keeps track of which indices are still available. After being initialized to {1, …, ℓ_BF} (line 1), pairwise disjoint index sets I_j of size t are removed from Indices as long as possible (lines 3-5). Once the number of remaining indices is not sufficient (lines 6-11), the remaining indices are used for the current index set (line 7), Indices is re-initialized to {1, …, ℓ_BF} (line 8), and the procedure continues as before.
Algorithm 1 (excerpt): 3: if |Indices| ≥ t then 4: Choose random I_j ⊆ Indices with |I_j| = t
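The greedy selection described above can be sketched in Python as follows. This is a reading of the prose description, not a transcription of Alg. 1: in particular, how the current index set is topped up to size t after the pool has been re-initialized (here: random indices from the fresh pool that are not yet in the set) is an assumption, as are the function name `choose_diffusion_layer` and the 0-based positions.

```python
import random


def choose_diffusion_layer(l_bf, l_eld, t, rng=None):
    """Select l_eld index sets of size t over the positions 0..l_bf-1."""
    if not 0 < t <= l_bf:
        raise ValueError("t must satisfy 0 < t <= l_bf")
    rng = rng or random.Random()
    indices = set(range(l_bf))          # pool of still unused positions
    layer = []
    for _ in range(l_eld):
        if len(indices) >= t:
            chosen = set(rng.sample(sorted(indices), t))
            indices -= chosen
        else:
            chosen = set(indices)       # use up the remaining positions
            indices = set(range(l_bf))  # re-initialise the pool
            # Assumption: fill the set up to size t from the fresh pool.
            extra = rng.sample(sorted(indices - chosen), t - len(chosen))
            chosen.update(extra)
            indices.difference_update(extra)
        layer.append(chosen)
    return layer


layer = choose_diffusion_layer(l_bf=20, l_eld=12, t=6, rng=random.Random(1))
usage = [sum(i in s for s in layer) for i in range(20)]
print(layer[0], usage)
```

In this sketch, every position is used once before any position is used a second time, so the usage counts of any two positions differ by at most one.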

Preliminaries
The following section provides a theoretical and experimental security analysis of the proposed diffusion extension against pattern mining attacks and graph matching attacks.
Piling-Up Lemma. In the theoretical analysis, we make use of the so-called piling-up lemma:
Lemma 1 (Piling-up Lemma [16]). Given independent binary variables X_1, …, X_n with Pr(X_i = 0) = 1/2 + ε_i for i = 1, …, n, it holds that
Pr(X_1 ⊕ … ⊕ X_n = 0) = 1/2 + 2^(n−1) · ∏_{i=1}^{n} ε_i.
The value −0.5 ≤ ε_i ≤ 0.5 is called the 'bias' of the variable. The smaller |ε_i|, the less biased the random variable X_i is and the closer it is to the uniform distribution.
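The piling-up lemma can be checked numerically on small examples. The following Python sketch compares the well-known closed form 1/2 + 2^(n−1) · ∏ ε_i with the exact probability obtained by enumerating all outcomes; the bias values and function names are arbitrary illustrative choices.

```python
from itertools import product
from math import prod


def xor_zero_probability(biases):
    """Exact Pr(X_1 xor ... xor X_n = 0) by enumerating all 2^n outcomes."""
    total = 0.0
    for outcome in product((0, 1), repeat=len(biases)):
        if sum(outcome) % 2 == 0:
            # Pr(X_i = 0) = 1/2 + eps_i and Pr(X_i = 1) = 1/2 - eps_i
            total += prod(0.5 + e if x == 0 else 0.5 - e
                          for x, e in zip(outcome, biases))
    return total


def piling_up(biases):
    """Closed form 1/2 + 2^(n-1) * prod(eps_i)."""
    return 0.5 + 2 ** (len(biases) - 1) * prod(biases)


eps = [0.3, -0.1, 0.25, 0.4]
print(xor_zero_probability(eps), piling_up(eps))  # both approximately 0.476
```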
Proof. The claim in (6) is a straightforward consequence of the piling-up lemma. Now, let J ⊆ I ⊂ {1, …, n}. As |ε_i| ≤ 1/2, it holds that |2 · ε_i| ≤ 1 for any i. With ε_I = 2^(|I|−1) · ∏_{i∈I} ε_i as given by the piling-up lemma, we have |ε_I| = 2^(|I|−1) · ∏_{i∈I} |ε_i| = |ε_J| · ∏_{i∈I\J} |2 · ε_i| ≤ |ε_J|.
In the experiments, the records have been sampled from the publicly available North Carolina Voter Registration database of around 220,000 records. This database is not only popular in textbooks [8] but has already been used for parameter choice recommendations [4] of BF-based schemes and to investigate pattern mining attacks [9,26] and graph matching attacks [25].
This database has been preprocessed as explained in [7]. The records we used contained several attributes, including names, addresses, date of birth, state, etc. Following the suggestion of [4], attributes to be used as linkage data are first name, last name, street address, and city. We also introduced some errors to check for the linkage quality for similar yet different records. As the ground truth was given, the linkage quality could be computed.
Following earlier work [9,19,25,26], for Bloom filters we use the settings ℓ_BF ∈ {500, 1000} (size of the Bloom filter) and k ∈ {5, 10} (number of hash functions). We used databases of size 2000. This database size is sufficient to conduct the attacks since the expected difference in proportions between different samples from the same population of names would be small. BFs were encoded using random hashing [21] and the CLK approach [20]. The length ℓ_ELD of the encoded linkage data has been set to the length of the Bloom filters (see also the parameter discussion in Sec. 7).
For the implementation, we used Python 3.8.6. The experiments were run on a computer with a 64-bit Intel Core i7-10750H 2.60 GHz CPU, 32 GBytes of memory, and Ubuntu 20.04.

Pattern Mining Attack
Pattern mining attacks exploit that the encoding of a q-gram results in a (possibly characteristic) bit pattern, i.e., a selection of positions where the bits are all equal to 1 (see also (3)). Thus, to thwart such attacks, one needs to ensure that no characteristic patterns occur in the encoded database anymore. Let γ denote any q-gram, r some plaintext record, B its Bloom filter encoding, and E the resulting encoded linkage data. We can reformulate the necessary condition (3) for pattern mining attacks against the Bloom filter as: γ occurs in r ⇒ Pr(B[i] = 1 for all i ∈ I(γ)) = 1. To strengthen our scheme against this type of attack, we aim for the encoded linkage data E not to exhibit such a dependency. In other words, no q-gram induces an observable pattern into the encoded linkage data E.
In the following, we first show in Sec. 5.2.1 that for any bit E[i] in the encoded linkage data, its value tends to be uniformly distributed with increasing parameter t (see also Fig. 4). Next, we focus on the case that the underlying Bloom filter exhibits a specific pattern. We show that even under this condition, the bit values E[i] still tend towards the uniform distribution (Sec. 5.2.2) and that for any two different bits in the encoded linkage data, the probability of these being the same also gets close to 1/2 (Sec. 5.2.3).

Distribution.
First, we estimate the distribution of the bit values in the encoded linkage data in dependence on the distribution of the Bloom filter bits. Note that having a close-to-uniform distribution of bits is a necessary (but not sufficient) condition for not having any characteristic patterns. Consequently, we determine how close the distribution of single bits in the encoded linkage data can get to the uniform distribution in the best case. For the sake of simplicity, we base our analysis on the assumption that the Bloom filter bits are pairwise independent. While this is not mathematically true, we consider it as a reasonable approximation as these bits are computed from keyed hash-functions (which are independent) and multiple q-grams (which are probably not independent, but where the distribution is unknown). Moreover, this approximation is also backed by our experiments.
In the following, let B denote the Bloom filter of length ℓ_BF that has been computed at the beginning of Alg. 2 and E the encoded linkage data of length ℓ_ELD that has been derived from B. We assume that for any i ∈ {1, …, ℓ_BF}, there exists a bias ε_i such that Pr(B[i] = 0) = 1/2 + ε_i. The reason for introducing ε_i is to be able to discuss later on how much the distribution of bits deviates from the uniform distribution. Note that we do not make any assumptions on ε_i. Now, let j ∈ {1, …, ℓ_ELD} be an arbitrary but fixed index of a bit in the encoded linkage data. From the definition of E[j] and the piling-up lemma (Lemma 1), it follows that Pr(E[j] = 0) = 1/2 + ε_{I_j}, where ε_{I_j} is defined as explained in Lemma 2. The higher the absolute bias |ε_{I_j}|, the better for an attacker. Thus, we are interested in an upper bound of this value. Recall that for any J ⊆ I_j, it holds that |ε_{I_j}| ≤ |ε_J| (cf. Lemma 2). In particular, it follows that |ε_{I_j}| ≤ ε_min with ε_min := min_{i∈I_j} {|ε_i|}. That is, the bias of Pr(E[j] = 0) is at most as high as the smallest bias ε_i of all bits indexed by I_j (and also any product of selected biases). That means, except for the extreme case that all biases have an absolute value of 1/2, the bias will go down. In particular, this effect increases with increasing index list size t.
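To get a feeling for how quickly the bias shrinks, the following Python sketch evaluates the piling-up formula for the simplified case that all t Bloom filter bits share the same bias. The value ε = 0.3, i.e., Pr(B[i] = 0) = 0.8, and the assumption of equal and independent biases are purely illustrative and are not measured values from our experiments.

```python
def xor_bias(eps, t):
    """Bias of the XOR of t independent bits that each have bias eps."""
    return 2 ** (t - 1) * eps ** t


for t in (1, 5, 10, 20):
    print(t, xor_bias(0.3, t))
```

Since 2^(t−1) · ε^t = (2ε)^t / 2, the bias decays geometrically in t whenever |ε| < 1/2; for ε = 0.3 it drops from 0.3 at t = 1 to roughly 0.039 at t = 5.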
To validate this and get an idea of specific values, we measured these values experimentally. In our experiment, we set the length ℓ_BF of the Bloom filters to 500 and 1000 and the number k of hash functions to 5 and 10. Then, using the database described at the beginning of Sec. 3 with 2000 randomly sampled records, we observed the maximum bias ε := max_{j∈{1,…,ℓ_ELD}} {|ε_{I_j}|} over all bits in the encoded linkage data as t increases. The results are shown in Fig. 4. As one can see, the maximum |ε_{I_j}| decreases quite sharply, as expected. For example, for t ≥ 20, the maxima are all less than 0.1.
We now consider to what extent a pattern in the Bloom filter may result in a pattern in the encoded linkage data. We assume in the following that the underlying Bloom filters exhibit a certain pattern, i.e., a set of indices I* where the Bloom filter bits are equal to 1. Let E_I* denote the event for an encoded linkage data that the underlying Bloom filter shows this pattern. That is, we have E_I* :⇔ B[i] = 1 for all i ∈ I*. We restrict ourselves to the case that the pattern has been induced by a single q-gram, as this is the approach used in pattern mining attacks [26] and represents patterns with the highest frequency. This means that the size of I* is bounded by the number of hash functions k used for the Bloom filter encoding. Now let us fix some bit index j ∈ {1, …, ℓ_ELD} in the encoded linkage data. By definition of the bits, it holds that E[j] = ⊕_{i∈I_j} B[i]. Obviously, if I_j ∩ I* = ∅, then the pattern has no impact on the value of E[j], that is, Pr(E[j] = 0 | E_I*) = Pr(E[j] = 0). Hence, we restrict ourselves to the case that I_j ∩ I* ≠ ∅ and define I+ := I_j ∩ I* and I− := I_j \ I*.
This allows us to rewrite E[j] as follows: E[j] = (⊕_{i∈I+} B[i]) ⊕ (⊕_{i∈I−} B[i]). As B[i] = 1 for all i ∈ I*, the value ⊕_{i∈I+} B[i] is equal to 0 if and only if |I+| is even. As we are only interested in the absolute value of the bias of Pr(E[j] = 0) and not its concrete value, we can assume without loss of generality that ⊕_{i∈I+} B[i] = 0. Considering ⊕_{i∈I+} B[i] = 1 instead would result in the same bias but with inverted sign.
Recall from Algorithm 1 that |I_j| = t = |I+| + |I−|, and from the assumptions we have |I+| ≤ |I*| = k. Thus, if t ≤ k, it may happen that I_j = I* and I− = ∅. In this case, the value of E[j] is fixed by the event E_I* and one cannot exclude the existence of an observable pattern in E. Therefore, we assume t ≥ k from now on.
Under this condition, and using the same line of arguments as before, we get Pr(E[j] = 0 | E_I*) = 1/2 + ε_{I−}. Here as well, |ε_{I−}| is upper bounded by the smallest bias ε_min := min_{i∈I−} {|ε_i|} and in general is expected to decrease significantly when the size of I− increases. In particular, the size of I− is at least t − k and increases with increasing t (and hence most likely will decrease |ε_{I−}|).
Using the same parameter settings as previously discussed, we studied the maximum of |ε_{I−}| as t − k increases. The results are shown in Fig. 5. In general, the experiments support the theoretical result that increasing t will decrease |ε_{I−}|. For example, for k = 5, ℓ_BF = 500, and t − k ≥ 5, which means that t ≥ 10, the resulting |ε_{I−}| will be less than 0.1.
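The split of an index set into a part inside and a part outside the pattern can be turned into a small calculation. The following Python sketch computes the bias of Pr(E[j] = 0) under the condition that all pattern bits equal 1, assuming, as in the analysis above, independent Bloom filter bits whose biases are unaffected by the condition. The function name and the numerical values are hypothetical.

```python
from math import prod


def conditional_bias(index_set, pattern, eps):
    """Bias of Pr(E[j] = 0 | all bits indexed by `pattern` are 1).

    eps[i] is the bias of Bloom filter bit i, i.e. Pr(B[i] = 0) = 1/2 + eps[i].
    """
    inside = index_set & pattern     # bits fixed to 1 by the pattern
    outside = index_set - pattern    # bits that remain random
    # Piling-up lemma on the random part (empty product gives bias 1/2).
    bias = 2 ** (len(outside) - 1) * prod(eps[i] for i in outside)
    # An odd number of fixed 1-bits flips the parity and hence the sign.
    return bias if len(inside) % 2 == 0 else -bias


eps = [0.3] * 12
pattern = {0, 1, 2}                                           # k = 3
print(conditional_bias({0, 1, 2, 3, 4, 5}, pattern, eps))     # about -0.108
print(conditional_bias({0, 1, 2}, pattern, eps))              # -0.5: bit fully determined
print(conditional_bias({0, 1, 4, 5, 6, 7, 8, 9}, pattern, eps))
```

The second call corresponds to the degenerate case where the index set coincides with the pattern, so the output bit is fixed; the third call shows how a larger number of non-pattern positions drives the bias towards zero.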

Conditional Co-Occurrence.
The previous analysis shows that if t is sufficiently high, the bias of Pr(E[j] = 0) decreases significantly. That is, a single bit cannot be used for recognizing the presence of a certain pattern in the Bloom filter. This, however, does not exclude the case that the values of two or more bits are correlated. For any choice of distinct indices i, j ∈ {1, …, ℓ_ELD}, we therefore now investigate Pr(E[i] = E[j] | E_I*), that is, the probability that both bit values in the encoded linkage data are equal under the condition E_I*, i.e., the event that the underlying Bloom filter exhibits a pattern given by an index set I*.
From now on, we assume that E_I* does hold and aim to estimate the bias of Pr(E[i] = E[j]). Similarly as discussed before, we can re-write this as Pr(E[i] = E[j]) = Pr((⊕_{a∈I+} B[a]) ⊕ (⊕_{a∈I−} B[a]) = 0), where I := (I_i ∪ I_j) \ (I_i ∩ I_j), I+ := I ∩ I*, and I− := I \ I*.
Similar to the analysis conducted in Sec. 5.2.2, we assume that ⊕_{a∈I+} B[a] = 0 and get Pr(E[i] = E[j] | E_I*) = 1/2 + ε_{I−}. Note that as opposed to the previous situation, it may happen that I− = ∅. This is possible if I_i \ I* = I_j \ I* happens by coincidence: then all the non-pattern bits would cancel out. If we assume again that |I_i| = |I_j| = t ≥ k = |I*|, then the probability of this event decreases with increasing t.
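The cancellation argument can be verified on a toy example. The following Python sketch checks exhaustively that the XOR of two output bits only depends on the symmetric difference of their index sets and shows a case in which no non-pattern position survives. The index sets, the pattern, and the helper name `xor_bits` are hypothetical.

```python
from itertools import product


def xor_bits(bits, idx):
    """XOR of the bits at the given positions."""
    return sum(bits[a] for a in idx) % 2


I_i, I_j, pattern = {0, 1, 4, 5}, {1, 2, 4, 5}, {0, 1, 2}

I = I_i ^ I_j            # symmetric difference: shared positions cancel
I_plus = I & pattern     # surviving positions inside the pattern
I_minus = I - pattern    # surviving positions outside the pattern
print(I, I_plus, I_minus)

# E[i] xor E[j] equals the XOR over the symmetric difference for every B.
identity_holds = all(
    xor_bits(B, I_i) ^ xor_bits(B, I_j) == xor_bits(B, I)
    for B in product((0, 1), repeat=6)
)
print(identity_holds)
```

In this example the two index sets agree outside the pattern, so no random position survives, and the two output bits are equal whenever the pattern bits 0 and 2 are both set.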
Results of the largest bias |ε I − | are shown in Fig. 6. The probability of co-occurrence under the condition of E I * decreases as expected, and when t ≥ 10, the bias is less than 0.1 except for k = 5, ℓ BF = 1000.

Graph Matching Attack
In [1], it was shown that the attack is quite fragile with respect to the overlap rate. As soon as it deviates from 100%, the success rate drops sharply. However, the overlap rate depends on the use case and is out of the control of the PPRL scheme. Therefore, to protect against graph matching attacks, we aim to weaken condition (ii) significantly. That is, the values sim(λ, λ ′ ) and Dice(E, E ′ ) should no longer be strongly correlated, where E is the encoding of λ and, analogously, E ′ the encoding of λ ′ .
In fact, there is a trade-off that needs to be taken into account. To preserve linkage quality, i.e., the correctness of the scheme, a high value of sim(λ, λ ′ ) has to imply a (relatively) high value of Dice(E, E ′ ). However, when sim(λ, λ ′ ) decreases, i.e., the less similar the records are, the value of Dice(E, E ′ ) should not decrease analogously.
In the following, we provide a theoretical analysis showing that this property holds for the BFD scheme for appropriate choices of t. This is also backed by experiments. As an example, the relation between sim(λ, λ ′ ) and Dice(E, E ′ ) in our experiments is shown in Fig. 7. The three graphs show the relationship for t ∈ {5, 10, 20}. We see that for increasing values of t, the correlation between sim(λ, λ ′ ) and Dice(E, E ′ ) gets weaker. More precisely, except for the case of sim(λ, λ ′ ) being high, the value of Dice(E, E ′ ) is close to 0.5. That is, when an attacker observes Dice(E, E ′ ), she cannot deduce the value of sim(λ, λ ′ ).
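The tendency of Dice(E, E ′ ) towards 0.5 for dissimilar records can be reproduced in a small simulation. The following is a sketch under stated assumptions rather than the scheme's reference implementation: each output bit is taken to be the XOR of t Bloom filter bits, the index sets are drawn uniformly at random (Alg. 1 may choose them differently), and the two Bloom filters are unrelated random bit vectors with a hypothetical fill rate of 0.3. All function names are illustrative.

```python
import random


def make_index_sets(l_bf, l_eld, t, rng):
    # Assumption: every index set is a uniformly random t-subset of the
    # Bloom filter positions; the actual Alg. 1 may differ.
    return [rng.sample(range(l_bf), t) for _ in range(l_eld)]


def diffuse(bf, index_sets):
    # E[j] is the XOR (sum modulo 2) of the Bloom filter bits selected by I_j.
    return [sum(bf[i] for i in idx) % 2 for idx in index_sets]


def dice(x, y):
    # Dice coefficient of two bit vectors: 2 * |x AND y| / (|x| + |y|).
    common = sum(a & b for a, b in zip(x, y))
    total = sum(x) + sum(y)
    return 2 * common / total if total else 1.0


rng = random.Random(0)
l_bf = 1000
# Two unrelated "records": random bit vectors with a hypothetical fill rate.
b1 = [1 if rng.random() < 0.3 else 0 for _ in range(l_bf)]
b2 = [1 if rng.random() < 0.3 else 0 for _ in range(l_bf)]

for t in (1, 5, 10, 20):
    layer = make_index_sets(l_bf, l_bf, t, rng)
    print(t, round(dice(diffuse(b1, layer), diffuse(b2, layer)), 3))
```

With the index sets I j = {j}, `diffuse` returns the Bloom filter unchanged, so the plain BF scheme is the special case t = 1.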
As the BFD scheme deploys the Bloom filter encoding as an intermediate step and as we know that sim(λ, λ ′ ) and Dice(B, B ′ ) are strongly correlated, we investigate in the following the relation between Dice(B, B ′ ) and Dice(E, E ′ ). Recall that  Let us start with two Bloom filters B and B ′ that are similar, being expressed by the fact that for all i ∈ {1, . . . , ℓ BF }, it holds that for a sufficiently large value δ . First, we show that under this condition, the probability of is also lower bounded by δ . We know already that In the following, we assume that This is given in practice as the number k of hash functions is significantly smaller than the length ℓ BF of the Bloom filter. This means that the probability of a randomly chosen bit being set to 1 is smaller than the probability that it has been left at zero. Experiments confirm this: Using our parameter choices, we observed that Under this assumption, it follows We can now deduce the following lower bound: Next, we pick some arbitrary index i ∈ {1, . . . , ℓ ELD } and aim to derive bounds for Pr(E ′ Obviously This shows that the Dice coefficient increases for increasing δ and fixed t. The expression in (55) is actually the cumulative distribution function of the Binomial distribution. In particular, (55) tends towards 0.5 for p → 0.5.

Correlation and Re-identification Rate.
To further analyze the security of the BFD scheme, we conducted several experiments on two features: correlation and re-identification rate.
In statistics, correlation refers to the degree to which a pair of variables are related. For the standard Bloom filter based schemes without diffusion, the correlation between the similarity of plaintext records and the similarity between encoded records is large (see also Fig. 2). For our analysis, we computed the sample correlation coefficient.
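Definition 3 (Sample Correlation Coefficient). Given a series of n measurements of the pair (X i , Y i ) indexed by i = 1, . . . , n, the 'sample correlation coefficient' can be used to estimate the population Pearson correlation ρ X ,Y between X and Y . The sample correlation coefficient is defined as

$$\rho^{\mathrm{Est}}_{X,Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{y})^2}},$$

where $\bar{x}$ and $\bar{y}$ are the sample arithmetic means of X and Y .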
It holds for the sample correlation coefficient that ρ Est X ,Y ∈ [−1, 1] where |ρ| ≈ 1 indicates a high correlation and ρ ≈ 0 a low correlation.
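As a minimal sketch, the standard Pearson sample correlation coefficient can be computed as follows. The function name and the example values are hypothetical; in the experiments, the two series would be the plaintext similarities and the Dice coefficients of the corresponding encodings.

```python
import math


def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical similarity pairs (plaintext similarity, Dice of encodings).
plain = [0.1, 0.4, 0.6, 0.9]
encoded = [0.5, 0.5, 0.55, 0.9]
print(sample_correlation(plain, encoded))
```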
In a graph matching attack, the success of the attack is measured by the re-identification rate. It is defined as the number of correctly re-identified records divided by the size of the encoded database. Recall from Section 3.3 that the attacker is assumed to have access to a largely overlapping pair of databases, namely D and [D ′ ]. In [25], the overlap rate of two databases is defined as two times the number of common records divided by the sum of the sizes of the two databases.
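Both measures are simple ratios and can be sketched as follows. The representation is an assumption made for illustration: records are hashable identifiers without duplicates, the attack result is a mapping from encoded-record identifiers to guessed plaintext identifiers, and the function names are hypothetical.

```python
def overlap_rate(db_a, db_b):
    """Two times the number of common records over the sum of both sizes."""
    a, b = set(db_a), set(db_b)
    return 2 * len(a & b) / (len(a) + len(b))


def reidentification_rate(guesses, truth, encoded_size):
    """Correctly re-identified records divided by the encoded database size.

    guesses: encoded id -> plaintext id guessed by the attacker
    truth:   encoded id -> true plaintext id
    """
    correct = sum(1 for enc_id, guess in guesses.items()
                  if truth.get(enc_id) == guess)
    return correct / encoded_size


# Hypothetical toy data.
print(overlap_rate([1, 2, 3, 4], [3, 4, 5, 6]))
print(reidentification_rate({"e1": "a", "e2": "b"},
                            {"e1": "a", "e2": "c", "e3": "d", "e4": "e"}, 4))
```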
We conducted experiments to study the correlation between similarities and the re-identification rates for varying overlap rates and sizes t of the index lists. The results are shown in Fig. 8 and Fig. 9. The first point on the left shows the results for the scheme with Bloom filters but without diffusion, while the remaining ones show the results for the BFD scheme for varying t. Moreover, the graphs display the results for two overlap rates: 100% and 95%. As observed in [1], the re-identification rate decreases sharply when the overlap between the encoded database and the plaintext database drops. Our experiments confirm this effect, where the decrease is stronger for the BFD scheme than for the original BF-based PPRL scheme.

Other Attacks
Our analysis indicates a stronger resilience of BFD compared to the plain BF-based PPRL against pattern mining attacks and graph matching attacks. More precisely, we identified a necessary condition for each attack in Sec. 3.2 and Sec. 3.3, respectively. Afterwards, we showed in Sec. 5.2 and Sec. 5.3, respectively, that introducing a diffusion layer ensures that these conditions no longer hold (or are at least weakened). Therefore, we conclude that BFD is (more) secure against these attacks.
Naturally, two questions arise: (1) Is BFD also secure against variants of these attacks?
(2) Is BFD also secure against other (possibly unknown) types of attacks?
For the reasons explained in Sec. 3.1, we do not aim for a provably secure PPRL scheme. Thus, we cannot give any definitive answer to these two questions. However, for any variant of a pattern mining and graph matching attack that relies on the same condition as identified in Sec. 3.2 and Sec. 3.3, respectively, the same analysis applies. In this sense, we conjecture that the diffusion layer should also protect against these attack variants.
Moreover, one can prove that BFD does not introduce any new weaknesses. More precisely, any attack that is possible against BFD is likewise possible against the underlying BF-based PPRL scheme as well. This can be shown with a standard proof by reduction. Let S BF denote a plain BF-based PPRL scheme and S BFD the BFD scheme that results from S BF by applying a diffusion layer (computed with Alg. 1) to the Bloom filters. Furthermore, let A BFD denote any attacker against S BFD . We use A BFD to construct an attacker A BF against S BF that has exactly the same effort and success probability.
Figure 8: Sample correlation coefficient and re-identification rate of the graph matching attack for varying index set sizes t (overlap: 100%). The first point on the left side shows the result for the plain BF-based PPRL scheme (without diffusion) and the remaining points for the proposed BFD scheme (with diffusion).
A BF works as follows. It gets as input one or several encoded databases, where each entry is a Bloom filter B. First, it runs Alg. 1 to create a diffusion layer. Second, it applies the diffusion layer to each Bloom filter B, getting the corresponding encoded linkage data E. These linkage data are given to A BFD . Note that A BF perfectly simulates S BFD based on S BF . Thus, A BFD can operate normally on these inputs and return some result (to A BF ). Attacker A BF simply uses this result as its own result and is finished.
The consequence is that if there exists an efficient attacker against S BFD , then this is likewise true for S BF . Thus, S BFD cannot be less secure than S BF .
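The structure of this reduction can be written down as a small wrapper. This is only a sketch of the argument: `attacker_bfd`, `generate_layer`, and `apply_layer` are hypothetical callables standing in for A BFD , Alg. 1, and the encoding step, and nothing here models a concrete attack.

```python
def build_attacker_bf(attacker_bfd, generate_layer, apply_layer):
    """Construct A_BF from A_BFD (sketch of the reduction).

    attacker_bfd:   callable taking a list of diffused databases
    generate_layer: callable returning a diffusion layer (stand-in for Alg. 1)
    apply_layer:    callable (layer, bloom_filter) -> encoded linkage data
    """
    def attacker_bf(encoded_databases):
        # Step 1: create a diffusion layer, exactly as S_BFD would.
        layer = generate_layer()
        # Step 2: diffuse every Bloom filter of every database.
        diffused = [[apply_layer(layer, bf) for bf in db]
                    for db in encoded_databases]
        # Step 3: run A_BFD on the simulated BFD data and reuse its output.
        return attacker_bfd(diffused)

    return attacker_bf


# Toy stand-ins: a fixed three-bit layer and an "attacker" that echoes its input.
toy_layer = [[0, 1], [1, 2], [0, 2]]
xor_apply = lambda layer, bf: [sum(bf[i] for i in idx) % 2 for idx in layer]
a_bf = build_attacker_bf(lambda dbs: dbs, lambda: toy_layer, xor_apply)
print(a_bf([[[1, 0, 1], [0, 0, 1]]]))
```

The wrapper adds only the cost of generating and applying the layer, which is why the success probability of A BF equals that of A BFD .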

Evaluation
In the following, we evaluate the BFD scheme for practical applications by investigating its efficiency (Sec. 6.1) and linkage quality (Sec. 6.2).

Efficiency
Bloom filter based PPRL schemes are highly efficient, with only modest time requirements for encoding and linking records. Therefore, large real-world applications linking millions of records are possible. As our proposed BFD scheme builds on that scheme by adding a diffusion layer, the efficiency of our scheme can be evaluated by analyzing the run time of the three procedures described in Sec. 4 and comparing them with the BF-based PPRL scheme. The results of our experiments can be found in Table 2 for a Bloom filter length ℓ BF = 500 and ℓ BF = 1000.
The first thing to observe is that BFD additionally requires the generation of a diffusion layer (Alg. 1). In comparison to the two other procedures, this is significantly more time consuming. However, while encoding and linking need to be done for each record, the diffusion layer needs to be computed only once. Moreover, as the overhead is significantly less than a second, we consider this to be acceptable.
Note that the size of data that needs to be transmitted here is rather modest. A straightforward approach would be to send the ℓ ELD index sets where each index set contains t values from {1, . . . , ℓ BF }, that is ℓ ELD · t · ⌈log 2 (ℓ BF )⌉/8 bytes. For example, if ℓ ELD = ℓ BF = 500 and t = 10 as suggested in Sec. 7, this would sum up to less than 6 kB. If the random choices made in Alg. 1 are derived from some seed, one can save even more data by sending the seed only.
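The size estimate can be checked with a few lines of arithmetic. The function name is hypothetical, and the sketch assumes the straightforward encoding described above, with every index stored in ⌈log 2 (ℓ BF )⌉ bits.

```python
def layer_size_bytes(l_eld, l_bf, t):
    """Bytes needed to send l_eld index sets of t indices each."""
    # (l_bf - 1).bit_length() equals ceil(log2(l_bf)) for l_bf >= 1 and
    # avoids floating-point rounding.
    bits_per_index = (l_bf - 1).bit_length()
    return l_eld * t * bits_per_index / 8


# Parameters from the text: l_eld = l_bf = 500 and t = 10.
print(layer_size_bytes(500, 500, 10))
```

For the parameters above, each index needs 9 bits, giving 500 · 10 · 9 / 8 = 5625 bytes, which is consistent with the stated bound of less than 6 kB.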
As expected, the effort to encode a record increases linearly with the index size t, i.e., the number of BF bits used to compute one bit in the encoded linkage data. However, as the time effort is still in the range of very few milliseconds and as rather small values for t provide the best trade-off between feasibility and security (cf. Sec. 7), we consider this to be acceptable.
Finally, one can see that the time required for linking two records with the BFD scheme is approximately the same as with the BF scheme, as both are of the same order of magnitude, namely 10^−5 milliseconds. This is expected, as ℓ BF = ℓ ELD and the linkage procedure is essentially the same as before.
Note that another factor that impacts the run time is the length ℓ BF of the underlying Bloom filter. However, as we discuss in Section 7, setting ℓ ELD = ℓ BF is a reasonable choice. This has been adopted here as well.

Linkage Quality
We analyze the linkage quality of the BFD scheme by comparing it with the plain BF-based PPRL scheme, as the latter is known to provide good linkage quality. To this end, we adopt the suggestion from [11] to express the linkage quality by MPR, being the mean of precision and recall:

$$\mathrm{MPR} = \frac{1}{2}\left(\frac{tp}{tp + fp} + \frac{tp}{tp + fn}\right).$$

Here, tp refers to the number of true positives, fp the number of false positives, and fn the number of false negatives.
For comparing the linkage qualities of the two schemes, we consider their relative MPR:

$$\text{relative MPR} = \frac{\mathrm{MPR}_{\mathrm{BFD}}}{\mathrm{MPR}_{\mathrm{BF}}}.$$

The linkage quality of BFD is the same as or better than that of the plain BF scheme if the relative MPR is ≥ 1, and comparable if this value is close to 1. Figure 10 shows the relative MPR for varying index set size t and threshold values τ of 0.6, 0.7, and 0.8. Note that the threshold for the plain BF scheme is set to τ BF = 0.8. However, as discussed in Sec. 5.3.1, the Dice coefficient of encoded records can differ before and after applying diffusion. These differences require the threshold τ to be adapted accordingly.
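The two quality measures can be sketched as follows. The sketch assumes that the relative MPR is the ratio of the MPR of BFD to the MPR of the plain BF scheme, which is consistent with the interpretation given above (values ≥ 1 mean BFD is at least as good). The function names and the counts are hypothetical.

```python
def mpr(tp, fp, fn):
    """Mean of precision and recall from true/false positive/negative counts."""
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (precision + recall) / 2


def relative_mpr(bfd_counts, bf_counts):
    """MPR of BFD divided by MPR of plain BF; each argument is (tp, fp, fn)."""
    return mpr(*bfd_counts) / mpr(*bf_counts)


# Hypothetical counts: BFD loses some true matches but has no false positives.
print(relative_mpr((80, 0, 20), (90, 10, 10)))
```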
As expected, the relative MPR decreases for increasing t and/or increasing threshold τ . However, apart from the case where k = 10 and ℓ BF = 500, the relative MPR is above 0.9 when t ≤ 10. As we are going to discuss in more detail in Sec. 7, this allows for a parameter selection that offers a good trade-off between security and feasibility.

Parameter Selection
A record linkage process is usually organized by an authority to fulfill tasks like granting permissions or organizing the agreement on secrets between the database holders (like the deployed hash functions in the case of BF-based PPRL). Generally, this authority is also responsible for recommending specific parameter choices. In practice, these recommendations are based on theoretical analysis and experiments.
In comparison to the plain BF-based PPRL scheme, the BFD scheme comprises a couple of new parameters, namely (1) the length ℓ ELD of the encoded linkage data (Alg. 1), (2) the size t of the index sets, being the number of BF bits used to compute one bit (Alg. 1), and (3) the (adapted) threshold τ , used as the criterion for whether two records shall be linked or not (Alg. 3).
Even if some parameters of existing BF schemes might be re-used, at least some parameters need to be chosen. In the following, we demonstrate this process in a concrete example, using the setup explained in Sec. 5.1. In particular, we build on a BF-based scheme that uses k ∈ {5, 10} hash functions to compute Bloom filters of length ℓ BF ∈ {500, 1000}. The choice of the first of the new parameters, the length ℓ ELD of the encoded linkage data, is a matter of finding an appropriate trade-off. Decreasing ℓ ELD too much (in comparison to ℓ BF ) may result in too substantial a loss of information, negatively impacting the linkage quality. On the other hand, choosing ℓ ELD too large may render the diffusion layer useless, as an attacker may reconstruct the BF bits by solving a system of linear equations. Therefore, to make both schemes comparable in efficiency and linkage quality, we suggest ℓ ELD = ℓ BF .
A wide range of possible choices exist for the other two parameters, the number t of BF bits that define one bit in the encoded linkage data and the threshold τ . In fact, by defining the index sets by I j := {j}, i.e., t = 1, and by keeping the same threshold as before, the BFD scheme is just the original plain BF scheme. Of course, the question is if better variants can be found.
Note that the value t impacts several properties such as security and linkage quality while τ influences just the latter. Therefore, we suggest fixing first t and then adapting τ . In the following, we discuss the choice of t for the case of k = 5 hash functions and a Bloom filter length of ℓ BF = 500. The process works analogously for the other settings as well.
For security, we always want t to be as large as possible. In the security analysis regarding pattern mining attacks in Section 5.2, we assumed that t ≥ k which, in our scenario, results in a requirement of t ≥ 5. However, we stress that this assumption was only made to ensure that certain potential security risks cannot occur. In this sense, it represents a sufficient but not a necessary condition. Our experiments (cf. Fig. 6) indicate that setting t ≥ 10 provides good resistance against pattern mining attacks.
With respect to security against graph matching attacks, our experiments (cf. Fig. 7) show that t = 5 does not sufficiently weaken the relation between sim(λ, λ ′ ) and Dice(E, E ′ ). From Fig. 8, we see that t ≥ 10 reduces the re-identification rate of the graph matching attack and the correlation to half of those of the plain BF scheme.
While security benefits from large values of t, the opposite is true with respect to computational effort and linkage quality. Recall that the purpose of diffusion (see Sec. 4.1) is that small changes in the input result in large changes in the output. That is, by increasing t, it may happen that the encodings of similar (but not equal) records differ strongly and hence will not be linked, so the number of false negatives fn will likely go up. On the other hand, the encodings of non-similar records will remain non-similar. Thus, the number of false positives fp will either remain the same or decrease. This means that by increasing t, precision will remain as it is while recall will decrease. For health-related authorities, precision is usually more important than recall, whereas recall is often more important for criminal systems. Therefore, the choice is up to the interested parties. In summary, t should be ≥ 10 but as small as possible. We therefore suggest using t = 10 as a reasonable compromise between security and feasibility.
It remains to fix the threshold accordingly. Here, experiments (cf. Fig. 10) showed that good linkage quality is achieved for a threshold of τ = 0.6. Concluding, the suggested parameter choices for the setting of k = 5 and ℓ BF = 500 are t = 10 and τ = 0.6. For the other settings, Table 3 displays the suggested choices of t. The process was more or less the same as described above. The only exception is given in the case of k = 10, ℓ BF = 500. In this case, the (sufficient) condition of t ≥ k made for the security analysis against pattern mining attacks could not be met.

Conclusion
Privacy-preserving record linkage schemes based on Bloom filters (BF-based PPRL) are popular in practice and academia due to their high efficiency and linkage quality. However, recent research showed that these schemes are not sufficiently secure. This paper proposed a new PPRL scheme, BFD, using Bloom filters extended by a linear diffusion layer. Extensive theoretical and experimental analysis showed that the BFD scheme is more secure against known attacks than the BF-based PPRL scheme while achieving similar linkage quality and running within tolerable time compared to BF-based PPRL. Thus, BFD could, in the long run, be considered as a replacement for the standard BF-based PPRL scheme.
Besides, as no rigorous security analysis has been conducted on the standard BF-based PPRL scheme (and likewise not for its variants that have been proposed), the analysis methods explained in this paper may have merit on their own. For example, as explained, there seems to be a positive relationship between the maximum re-identification rate of the graph matching attack and the correlation, which could be further investigated. Moreover, other encoding methods than Bloom filters may be analyzed by the methods proposed here.