StyleGAN as a Utility-Preserving Face De-identification Method

Face de-identification methods have been proposed to preserve users' privacy by obscuring their faces. These methods, however, can degrade the quality of photos, and they usually do not preserve the utility of faces, i.e., their age, gender, pose, and facial expression. Recently, GANs, such as StyleGAN, have been proposed, which generate realistic, high-quality imaginary faces. In this paper, we investigate the use of StyleGAN in generating de-identified faces through style mixing. We examined this de-identification method for preserving utility and privacy by implementing several face detection, verification, and identification attacks and conducting a user study. The results from our extensive experiments, human evaluation, and comparison with two state-of-the-art methods, i.e., CIAGAN and DeepPrivacy, show that StyleGAN performs on par with or better than these methods, preserving users' privacy and images' utility. In particular, the results of the machine learning-based experiments show that StyleGAN 0-4 preserves utility better than CIAGAN and DeepPrivacy while preserving privacy at the same level. StyleGAN 0-3 preserves utility at the same level while providing more privacy. In this paper, for the first time, we also performed a carefully designed user study to examine both the privacy- and utility-preserving properties of StyleGAN 0-3, 0-4, and 0-5, as well as CIAGAN and DeepPrivacy, from the human observers' perspective. Our statistical tests showed that participants tend to verify and identify StyleGAN 0-5 images more easily than DeepPrivacy images. All the methods but StyleGAN 0-5 had significantly lower identification rates than CIAGAN. Regarding utility, as expected, StyleGAN 0-5 performed significantly better in preserving some attributes. Among all methods, on average, participants believed gender had been preserved the most while naturalness had been preserved the least.

are concerned about uploading photos on social media, where 53% of the survey participants have refused to upload a photo on social media in the past because of privacy concerns and 81% of them would have used obfuscation methods if they had access to them.
Social media platforms do not provide effective functions for obfuscating or de-identifying faces, mainly because they profit from images, which can be used for identifying relationships and connecting people, and for marketing and advertising.
In addition, in computer vision, creating datasets of, or otherwise using, high-quality images that include people is challenging because data privacy regulations, such as the General Data Protection Regulation (GDPR), require people to consent to the use of their image data. However, many computer vision tasks, such as person detection, gender, race, or emotion detection, or action recognition, do not need to identify the people in the images or videos [41].
For these reasons, there is a new trend in the development of face de-identification methods, which try to: (1) effectively hide the subjects' identity such that neither humans nor machines can re-identify people; (2) preserve the realism of visual data, i.e., make de-identified subjects look realistic, to be appealing to people; and (3) preserve the utility of the visual data, i.e., the people's age, gender, pose, and expression in the data. While these three properties are not required in all applications, providing them might encourage users to employ such methods before sharing their photos, and can still enable computer vision researchers and technologies to perform vision tasks that do not need to identify people in the images.
In particular, existing face de-identification methods have evolved from image filtering to more advanced face de-identification methods. Image filtering modifies the information using common image filters, such as blurring [18,45,46,69] or pixelation [8,34,46], which often produce visually unpleasant occlusions. The more advanced face de-identification methods either make imperceptible changes to the photo to evade recognition by specific recognition algorithms [14,17,50,58,59], or substantially modify faces, thus making them unrecognizable for generic recognition algorithms [27,61]. Some recent work has proposed Generative Adversarial Networks (GANs) [21] for face de-identification, where they generate synthesized objects [9,43,44,52,53,67]. However, these methods may not preserve the characteristics of the original face. The resulting faces may have artifacts from inpainting faces with unfitting poses, expressions, or implausible shapes.
StyleGAN [32] is a GAN designed to create high-resolution, realistic but imaginary images. Unlike traditional GANs, it can control the features of the generated image or face by style mixing and transferring, in which the generated image inherits the styles or features of another image. While StyleGAN was not originally designed for face de-identification, this paper investigates its effectiveness and robustness as a utility-preserving face de-identification method.
We show how StyleGAN can be augmented to generate de-identified faces, transferring the original face's features to the de-identified face. We extensively evaluated the privacy of the generated faces through several automatic machine-based re-identification attacks, comparing its results with those of two state-of-the-art GAN-based face de-identification methods. In addition, we evaluated and compared these methods in terms of their utility-preserving property by employing Face++ on all the de-identified faces. To our knowledge, no other work has done such an extensive evaluation of face de-identification methods. Almost all prior studies evaluated the privacy of these methods against either machines [30,41] or humans [39]. Also, this paper is the first to test their utility-preserving property through both detection methods and a user study.
Our findings show that StyleGAN performs better than the state-of-the-art GAN-based de-identification methods in preserving privacy and utility if certain style-mixing levels are used. Moreover, since the audience for online photos is human beings, we investigated the human (vs. machine) ability to identify people in the photos and their perception of utility preservation and realism. Our findings showed that, in general, human observers are less likely to verify and identify the de-identified images successfully. In addition, StyleGAN models perform on par with or even better than CIAGAN. Moreover, we found that StyleGAN 0-5 and StyleGAN 0-4 preserve utility attributes, such as naturalness, pose, and expression, more than other models. These promising results can inspire the research community to study the properties of StyleGAN and other style-transferring models for developing more advanced utility- and privacy-preserving face de-identification methods. Thus, this paper makes the following contributions:
(1) We utilized StyleGAN to generate de-identified faces based on the latent vectors of the target's face so that the de-identified faces look different but have the same utility features as the target.
(2) We implemented extensive experiments to examine the utility- and privacy-preserving properties of StyleGAN for different style-mixing levels under different attack models against re-identification attacks, including verification and identification.
(3) We compared StyleGAN with two recent GAN-based face de-identification methods.
(4) We carefully designed and conducted a user study to investigate the privacy and utility of de-identified faces from the human perspective. For the first time, we proposed using the concept of the police lineup to create the identification questions.
(5) To the best of our knowledge, we are the first to compare face de-identification methods in terms of both privacy and utility preservation in a human study and using Face++.
(6) We created high-quality datasets for the community to evaluate face de-identification methods.

BACKGROUND AND RELATED WORK

2.1 Traditional and K-same Methods
Traditional methods, including pixelation, blurring, and masking methods, heavily damage the utility of images [36]. Face swapping methods replace a face with another face [7,51], trying to preserve privacy and some face attributes. K-same methods, for example, cluster faces based on some attribute so that similar faces appear in the same cluster; a de-identified face is then generated by, e.g., averaging all the faces in a cluster, and this face replaces all the faces in the cluster [47]. Some approaches [7] create a dataset of usually synthetic faces and then propose an algorithm to replace the target face with a face in the dataset whose attributes are similar to the target's. These de-identified faces suffer in terms of quality and their alignment with the background. Also, since the datasets are small, i.e., they do not contain faces with all attribute combinations, the results might not preserve some of the attributes of the faces [22,47]. The state-of-the-art face-swapping methods, including DeepPrivacy [30] and CIAGAN [41], use GANs to generate and replace faces with similar attributes, trying to fix these problems. They do not need to pre-generate a synthetic dataset, and the de-identified faces can be generated in real time and look more realistic.

GAN-based Face De-identification
Early research on applying GANs for face de-identification began by applying parametric face models [62]. The GAN-based de-identification methods can be divided into two sets: (1) methods that rely on conditional inpainting [30,55,60,61], and (2) methods that manipulate facial representations [10,19,37,38,65,66,68]. Nousi et al. [49] proposed fine-tuning deep auto-encoders to preserve utility and privacy. Sun et al. [63] used a hybrid model consisting of facial modeling, to separate the identity and maintain facial features, and GAN-based inpainting, to add the background and make the result realistic. Agarwal et al. [6] proposed to preserve emotion by extracting facial attributes and feeding the non-biometric vector to the latent vector of an auto-encoder. That work used StyleGAN to build a dataset of proxy faces for different facial expressions and poses; the ones with the most similar features replace the target face. Our work, however, examines whether StyleGAN itself can be used as a face de-identification method. Two state-of-the-art de-identification methods, i.e., DeepPrivacy [30] and CIAGAN [41], leverage conditional generative adversarial networks to remove the identifying characteristics of faces. Since we evaluate and compare the performance of StyleGAN with these methods, these two methods are explained in more detail.
2.2.1 DeepPrivacy. DeepPrivacy aims to generate images that preserve the original pose and image background. In this method, the background photo (with no face) is fed to the generator in each UNet network layer along with a vector containing pose information.
2.2.2 CIAGAN. CIAGAN aims to provide control over the identity of the de-identified face using a vector. Similar to DeepPrivacy, it uses an auto-encoder, which takes background and landmark information as its inputs. The architecture of CIAGAN concatenates a one-hot identity vector to the bottleneck of the generator to ensure that it generates a face with an identity not present in the dataset. Similar to DeepPrivacy and CIAGAN, StyleGAN is a conditional GAN that, with our augmentation, forces the generated face to inherit styles from both the target and an auxiliary face. While DeepPrivacy and CIAGAN attempt to preserve mainly pose, StyleGAN can preserve more attributes, such as gender and expression.

Evaluation of De-identification Methods
Unlike privacy-preserving methods that can be defined and measured by mathematical frameworks, such as differential privacy [16] and information theory [54], to ensure the privacy of individuals in datasets, face de-identification methods cannot provide privacy guarantees. This is because face de-identification is applied per image rather than per dataset, and an image is high-dimensional data rather than aggregated statistics. That is why these methods are empirically tested against face re-identification attacks. Face re-identification is a function that attempts to correctly identify the person associated with a face image. Prior research evaluated face de-identification methods by employing a subset of these three approaches: (1) performing privacy vs. utility analysis [23,24], (2) studying human or viewer experience [29,39], and (3) measuring de-identification robustness against adversarial machine learning attacks [12,28]. This paper, however, examines StyleGAN through all three approaches.

STYLEGAN FOR FACE DE-IDENTIFICATION
StyleGAN and its variants are designed as a combination of progressive GAN with neural style transfer, which generates high-resolution imaginary images that look authentic [33]. Style transfer refers to representing the content of one image in the style of another using Convolutional Neural Networks [31]. Artistic style transfer has been used to create artificial artwork from photographs [13]. In StyleGAN, style transfer can help transfer some features of a face image, such as pose, shape, and gender, to another imaginary face image. Therefore, we propose to utilize this style-transferring property of StyleGAN for utility-preserving face de-identification. Here, we provide some background needed to understand the architecture of StyleGAN for face de-identification.
Latent Vector: In face recognition models, face images are transformed from high-dimensional data (image) to low-dimensional data (latent vector), which uniquely represents the face. Therefore, the face is transformed from the image space to the latent space. GANs act as the inverse of face recognition models; they transform the latent space into the image space. StyleGAN generates an imaginary but realistic-looking face by mixing the styles of two latent vectors. To use StyleGAN for face de-identification, we propose to mix the latent vectors of the target face and an auxiliary face.
Style Mixing for Face De-identification: As shown in Figure 1, StyleGAN generates an image based on the latent vector. The latent vector is first mapped into an intermediate latent space, W, using fully connected (FC) layers. Then, using the affine transform and the AdaIN blocks, the style of the latent vector is applied to the feature maps (the output of convolutional layers). The AdaIN block normalizes the feature maps and then scales and shifts them with the parameters obtained from the affine transform. Noise is added to each block after being scaled by learned per-channel factors. The StyleGAN-1 and StyleGAN-2 generators consist of convolutional layers with increasing output resolution, starting from a learned constant (4×4×512) and continuing to a high resolution (1024×1024). Therefore, coarser features of the face are formed in the earlier layers and finer ones in the last layers.
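To make the AdaIN step concrete, the following is a minimal sketch of the operation for a single feature-map channel, following its definition in the StyleGAN paper [32]: the channel is normalized to zero mean and unit variance and then scaled and shifted by style parameters. The function name `adain`, the flattened-list representation of a channel, and the example values are assumptions made for illustration only; they are not taken from our implementation.

```python
import math
from typing import List


def adain(channel: List[float], scale: float, bias: float, eps: float = 1e-8) -> List[float]:
    """Normalize one feature-map channel, then apply a style scale and bias.

    `scale` and `bias` stand for the per-channel style parameters that the
    affine transform would produce from the intermediate latent vector.
    """
    mean = sum(channel) / len(channel)
    variance = sum((v - mean) ** 2 for v in channel) / len(channel)
    std = math.sqrt(variance + eps)  # eps avoids division by zero on flat channels
    return [scale * (v - mean) / std + bias for v in channel]


# Hypothetical 2x2 feature-map channel (flattened) and made-up style parameters.
styled = adain([1.0, 2.0, 3.0, 4.0], scale=2.0, bias=0.5)
```

After the call, the channel's mean equals the style bias and its standard deviation equals the style scale (up to `eps`), which is how a style overrides the statistics of the feature maps at a given layer.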
Transferring the style of the latent input vector to a specific layer transfers the corresponding features of the corresponding face to the generated face. Since the styles of any latent vector can be transferred to any location, we can consider two latent vectors (vectors 1 and 2) and transfer their styles to the desired locations (e.g., Figure 1 shows styles 0-2 transferred from vector 1 and the rest, 3-17, from vector 2) so that the generated face has certain features, e.g., coarser features of the first face and finer features of the other face. Coarser features are more related to utility and finer ones are more related to identity. For de-identification, we propose to generate a de-identified face that inherits the target face's coarser features and the auxiliary face's finer features so that the generated face has the utility features of the target and the identity of the auxiliary face.

Figure 1: Style mixing using StyleGAN [32]. In our proposed de-identification method, latent vectors 1 and 2 belong to the target and an auxiliary face, respectively. This example shows StyleGAN 0-2, where styles 0-2 are inherited from the target and the rest from the auxiliary face.
Style Layers: In StyleGAN-1 and StyleGAN-2, the level of style mixing can be set by i ∈ {0, 1, ..., 17}. When setting style mixing to i, styles 0 to i are inherited from the target face image, and styles i + 1 to 17 are inherited from the auxiliary face image. We refer to this as style mixing 0-i. Similarly, StyleGAN 0-i refers to the StyleGAN model that de-identifies faces by inheriting styles 0-i from the target and the rest from an auxiliary face.
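The style-mixing rule described above can be sketched in a few lines. The sketch assumes that each face is already represented by 18 per-layer style entries (styles 0 to 17); the function name `mix_styles` and the string placeholders standing in for style vectors are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

NUM_STYLES = 18  # styles 0..17, as described in the text


def mix_styles(target: Sequence, auxiliary: Sequence, level: int) -> List:
    """Return the per-layer styles for style mixing 0-`level`.

    Styles 0..level come from the target face; styles level+1..17 come from
    the auxiliary face.
    """
    if len(target) != NUM_STYLES or len(auxiliary) != NUM_STYLES:
        raise ValueError("each face must provide exactly 18 per-layer styles")
    if not 0 <= level < NUM_STYLES:
        raise ValueError("level must be in {0, 1, ..., 17}")
    return list(target[: level + 1]) + list(auxiliary[level + 1:])


# Placeholder styles: "t3" means style 3 of the target, "a3" style 3 of the auxiliary.
target_styles = [f"t{j}" for j in range(NUM_STYLES)]
auxiliary_styles = [f"a{j}" for j in range(NUM_STYLES)]

mixed = mix_styles(target_styles, auxiliary_styles, level=2)  # StyleGAN 0-2
```

With `level=2`, the mixed list starts with the target's styles 0, 1, and 2 and continues with the auxiliary face's styles 3 to 17, matching the StyleGAN 0-2 example of Figure 1.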
Creating Latent Vectors for Face Images: Since the inputs to StyleGAN are two latent vectors, we first need to obtain the latent vector corresponding to the target face image. Any random latent vector can be used as the auxiliary face. As shown in Figure 2, an enhanced version of the StyleGAN encoder [33] is used to generate the latent vectors. This is a ResNet encoder [3] trained on a set of faces and their corresponding latent vectors, which can in turn generate an estimated latent vector for a face image. To improve this estimate, an optimization algorithm could be used that takes the Euclidean distance between the generated and the target face images as its cost function. However, this approach is too costly. To overcome this limitation, the optimization is enforced on the feature maps instead of the images. The feature maps are generated using a pre-trained VGG network, and the optimization uses an L2 loss. The optimized latent vector is considered a very close estimate of the target latent vector [3].

METHODOLOGY
To examine the effectiveness and robustness of StyleGAN-2 for face de-identification, we first used StyleGAN-2 to create a set of de-identified faces; then, we tried identifying these faces by implementing several attacks under various threat models, ranging from white-box to grey-box to black-box adversarial settings. We then investigated the quality and utility preservation of the de-identified faces generated by StyleGAN-2 by passing them through Face++ [4], a well-known cloud system for face analysis. Moreover, we compared the performance of the attacks and Face++ on faces generated by StyleGAN-2 with those generated by two other state-of-the-art face obfuscation methods, CIAGAN [41] and DeepPrivacy [30].

Threat Model
In practice, the use of face de-identification alone might not be enough to protect the privacy of users because attackers, especially human observers, might be able to use contextual features of the images to infer people's identities. StyleGAN and other face de-identification methods only modify the face images and not the full images. While other face de-identification methods mostly consider only a black-box setting [11,42,56], in this paper, we assume that the adversary can have different levels of capabilities and knowledge. We divide this knowledge into three categories: black-box, grey-box, and white-box. In a black-box setting, the attacker has no knowledge of the method's parameters, while in a white-box attack they know all the parameters. If they have partial knowledge, the attack is grey-box. The adversary's knowledge can concern (1) the StyleGAN style levels that are used for generating the de-identified faces, (2) access to a set of the target's face photos, and (3) access to the set of auxiliary faces that are used during the generation of the de-identified faces. Table 1 shows different grey-box attack assumptions based on the adversary's knowledge. In threat model m1, the attacker is assumed to know the style level and the target photos; however, they do not have access to the set of auxiliary photos used for de-identification. Therefore, during the attack, a different set of auxiliary pictures is used to create datasets to train the neural network. As another example, in threat model m4, the attacker does not know the styles used for de-identification, so they would use another mixing style for training. We then considered a highly capable attacker that has access to a lot of training data. We used the above-mentioned scenarios in our implemented identification attacks. If the adversary does not have access to any of this knowledge, then the attack is called black-box.
The table does not include an adversary who has knowledge about all of the three categories because having all that information, the attacker can easily recognize the identity. Note that if the target photos are known then the attack is targeted, otherwise, it is untargeted. Threat model m7 is the most plausible because it is less probable that the attacker knows any of the photos and style levels beforehand. In addition, since the auxiliary photo can be generated from any random latent vector, the probability that the attacker learns them is negligible. Moreover, the probability that the attacker has access to the exact image of the target is small.

Data Generation
Dataset of Target Faces: The CelebA dataset [40] was used for our experiments. This dataset consists of 10,177 identities with 202,599 face images, i.e., about 20 images for each identity. For each face image, there are 40 binary attributes along with 5 landmark locations. This dataset was collected from various sources on the Internet with different levels of quality, enabling us to evaluate the system on photos of various qualities. Because of processing-time constraints, we randomly selected a subset of 200 identities with at least 20 images each, using an equal number of male and female identities. The total number of images in our dataset is therefore 5,007, approximately 25 images per identity.

Dataset of Auxiliary Faces.
We used StyleGAN2 to generate auxiliary faces. Using the trained StyleGAN2, new faces can be generated based on seed numbers given as input. Each seed represents a latent vector. Using this capability of StyleGAN2, 400 faces were generated using the first 400 seeds of the pre-trained StyleGAN2. Then, 20 of these faces were chosen and split into 2 categories of 10, one for training and the other for validation. In each category, we tried to have images of different genders, ages, ethnicities, hair colors, etc. Each of these categories is used for different experiments. For efficiency and to save computation time, we reduced the resolution of the images from 1024×1024 to 256×256. This size is still larger than the sizes of all images in the dataset.

Generating De-identified Faces.
The de-identified faces were created by using StyleGAN2 to mix the 5,007 target faces with the 20 auxiliary faces at 9 style-mixing levels, 0-0 to 0-8. We generated the de-identified faces using various style-mixing levels because they determine the level of information (utility) that is passed from the targets' faces into the de-identified faces. For example, in 0-0, only the first style, 0, is inherited from the target, and styles 1 to 17 from the auxiliary face, so the output image will be almost identical to the auxiliary face, while for 0-8, styles 0 to 8 are inherited from the target and styles 9-17 from the auxiliary face. In this case, the generated face will have some features similar to the target face. There is a trade-off between preserving the utility of a target's face and protecting their identity. Our initial tests show that if mixing levels above 0-6 are used to transfer the target image's features, the outputs become too similar to the target image, i.e., they do not preserve privacy. Therefore, to limit the number of unnecessary experiments, we only tested style-mixing levels up to 0-8. Each face was mixed with 20 auxiliary faces, generating 20 × 5,007 = 100,140 StyleGAN-generated faces for each style. Therefore, there are 9 × 100,140 = 901,260 StyleGAN-generated faces in total. In our identification attacks, we use some of these de-identified faces for training and some for validation.
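The size of the generated dataset follows directly from combining every target image with every auxiliary face at every style-mixing level. The sketch below reproduces that arithmetic and enumerates the combinations lazily; the constant names and the helper `generation_jobs` are hypothetical and serve only as an illustration of the bookkeeping.

```python
from itertools import product

NUM_TARGET_IMAGES = 5_007
NUM_AUXILIARY_FACES = 20
MIXING_LEVELS = range(0, 9)  # style-mixing levels 0-0 to 0-8

# Faces generated per style-mixing level and in total.
faces_per_level = NUM_TARGET_IMAGES * NUM_AUXILIARY_FACES
faces_total = faces_per_level * len(MIXING_LEVELS)


def generation_jobs(targets, auxiliaries, levels):
    """Lazily yield every (target, auxiliary, level) combination to generate."""
    return product(targets, auxiliaries, levels)


# Toy example: 3 target images, 2 auxiliary faces, 9 levels.
toy_jobs = list(generation_jobs(range(3), range(2), MIXING_LEVELS))
```

Here `faces_per_level` evaluates to 100,140 and `faces_total` to 901,260, in agreement with the counts reported above.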

De-identified Faces by DeepPrivacy and CIAGAN.
We used two GAN-based utility-preserving face de-identification methods, DeepPrivacy [30] and CIAGAN [41], to generate de-identified faces for the 5,007 face images. While DeepPrivacy could generate de-identified images for all the faces, CIAGAN failed to de-identify 831 (17%) images due to face detection failures in its processing phase, where the dataset is prepared for de-identification. Figure 12 shows the generated images of the nine mixing levels (0-0 to 0-8) mixed with an auxiliary photo, along with de-identified photos obtained from DeepPrivacy and CIAGAN, for 4 different target samples chosen from our datasets. The StyleGAN faces change from the most similar to the auxiliary photo (0-0) to the most similar to the target photo (0-8). StyleGAN-generated photos also seem more natural and have better quality than those of DeepPrivacy and CIAGAN. StyleGAN modifies the background of faces, while the other methods do not change it. However, the background can easily be extracted and used to replace that of the generated photo [3].

UTILITY AND PRIVACY ANALYSIS

5.1 Face Detection
Some face de-identification methods cannot even generate a face [25]. Face quality has also been defined as the usability of an image for recognition [25]. Therefore, to measure the effectiveness of StyleGAN in generating recognizable faces, we passed all the de-identified images through the Face++ detection module [4]. Face++ provides a free online API for face detection, verification, and attribute analysis. The face detection function returns the bounding box of faces in an image, and the face verification function takes two faces as inputs and outputs the probability (confidence) score of these faces belonging to the same identity. The attribute analysis determines different face attributes, including landmarks, age, gender, etc. Face++ achieved 99.50% recognition accuracy on the LFW dataset [70] and has been widely used by academics for the analysis of images [48,59,64]. We found that the Face++ detection module could successfully detect a face in all StyleGAN-generated photos, while it failed for 29 (0.6%) and 44 (1.1%) of the images generated by DeepPrivacy and CIAGAN, respectively. Note that we had 20 (auxiliary photos) × 9 (styles) = 180 times more images for StyleGAN compared to the other two methods, and still all faces were detected. We manually checked all the photos in which Face++ failed to detect a face and found that they did not contain any face. Therefore, we conclude that the quality of StyleGAN-generated faces is higher than that of CIAGAN and DeepPrivacy.

Utility Preserving Evaluation
Some works have attempted to develop utility-preserving face de-identification methods [37,55,60,67]. Transferring non-identity-related features, or the utility of the face, to the de-identified face not only increases the quality of the photo and its visual appeal but also allows the de-identified photo to still be analyzed by applications, such as recommendation systems, that provide services based on these attributes. With advances in image recognition, many service providers, including Face++ [4], offer online face analysis services to extract face attributes, such as age, gender, emotion, smile, eye status, head pose, mouth status, blurriness, ethnicity, etc. These attributes are also called utility. Therefore, we ran experiments using the Face++ API to check the utility preservation of StyleGAN2. We did not evaluate the effectiveness of StyleGAN in preserving race because this feature does not exist in Face++.

Metrics.
We computed and compared the mean and standard deviation of differences for age, blurriness, and smile. For the gender, the ratio of times when the genders are the same is presented. For the emotion and the eye and mouth statuses, the state with maximum confidence value is selected, and the ratio of times the original and de-identified faces have the same chosen state is computed. For the head pose, the mean of differences for each angle is calculated, and then, the mean of the three numbers is presented as the final value for the head pose difference.
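As an illustration of the metrics above, the following sketch computes each of them from paired attribute values of original and de-identified faces. It makes several assumptions that the text does not fix: differences are taken as absolute differences, the standard deviation is the population standard deviation, and attribute values are passed in as plain Python lists, tuples, and dictionaries rather than in the Face++ response format. All function names are hypothetical.

```python
from statistics import mean, pstdev


def difference_stats(original, deidentified):
    """Mean and standard deviation of absolute differences (age, blurriness, smile)."""
    diffs = [abs(o - d) for o, d in zip(original, deidentified)]
    return mean(diffs), pstdev(diffs)


def match_ratio(original, deidentified):
    """Fraction of pairs with the same label (e.g., gender)."""
    pairs = list(zip(original, deidentified))
    return sum(o == d for o, d in pairs) / len(pairs)


def top_state(confidences):
    """State with the maximum confidence (emotion, eye status, mouth status)."""
    return max(confidences, key=confidences.get)


def head_pose_difference(original, deidentified):
    """Mean over the three angles of the mean absolute per-angle difference.

    Each pose is a (yaw, pitch, roll) tuple in degrees.
    """
    per_angle = [
        mean(abs(o[a] - d[a]) for o, d in zip(original, deidentified))
        for a in range(3)
    ]
    return mean(per_angle)


# Toy values, made up for illustration.
age_mean, age_std = difference_stats([30, 40], [32, 44])
gender_ratio = match_ratio(["Male", "Female", "Male", "Female"],
                           ["Male", "Female", "Female", "Female"])
emotion = top_state({"happiness": 80.0, "neutral": 15.0, "sadness": 5.0})
pose_diff = head_pose_difference([(0, 0, 0), (10, 10, 10)],
                                 [(2, 4, 6), (12, 14, 16)])
```

For emotion and the eye and mouth statuses, `top_state` would be applied to both the original and the de-identified face, and `match_ratio` to the resulting pairs of states.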

Verification Attack
Face de-identification methods are mostly evaluated against two types of attacks [26]: verification and identification attacks. In a verification attack, the attacker has a suspect and tries to verify whether the de-identified face belongs to that suspect. In practice, in this attack, two face images are given to a face recognition system, and the system reports to what extent they belong to the same person. One of the images contains the unknown or de-identified face, and the other is a face image of the target or suspect. We used Face++ to implement our verification attacks. We define two attack scenarios: (1) the attacker has the exact image that the target used for generating their de-identified face, and (2) the attacker has access to another image of the target.

Experiments:
All the de-identified images, together with their corresponding non-de-identified images, are given to the Face++ compare module, which returns a confidence score for each pair indicating to what extent (from 0 to 100) the identity of the faces is the same. The attack is successful if the system has high confidence that the two images belong to the same person. In our experiments, we specify a threshold variable, ranging from 0 to 100, and for every image pair, if the obtained confidence score is higher than the given threshold, we label that verification attack a success for the attacker and a failure for the de-identification method, i.e., the de-identification is not successful in obscuring the face, and vice versa. Figures 3 and 4 show the average de-identification success rate (or 1 - attack success rate) for different threshold values, for StyleGAN with different style-mixing levels, CIAGAN, and DeepPrivacy, for the first and second scenarios, respectively. The curves shift upwards in scenario 2 compared to scenario 1, confirming our hypothesis that it is harder to re-identify a face when the attacker does not have access to the exact image that the target has de-identified. The results for both scenarios are fairly consistent. The higher the threshold, the higher the de-identification success rate. De-identified faces generated by StyleGAN 0-0 and 0-1 cannot be re-identified, and the de-identified faces generated by higher style-mixing levels are easier to re-identify, especially those generated by 0-7 and 0-8. Interestingly, the performance of DeepPrivacy is better than that of CIAGAN, and the performance of StyleGAN with style-mixing level 0-3 is far better than both. Results for StyleGAN with style-mixing level 0-4 are comparable with those of DeepPrivacy and still far better than those of CIAGAN. Although the de-identification success rates for StyleGAN 0-5 are worse than those of DeepPrivacy, they are near those of CIAGAN.
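The threshold sweep used in this experiment can be sketched as follows. The sketch assumes that the confidence scores (0 to 100) for all image pairs of one method have already been collected in a list; the function names and the toy scores are hypothetical. A pair counts as a de-identification success when its score is not higher than the threshold, which is the complement of the attack-success rule stated above.

```python
def deidentification_success_rate(scores, threshold):
    """Fraction of pairs whose confidence score does not exceed the threshold."""
    return sum(score <= threshold for score in scores) / len(scores)


def success_curve(scores, thresholds=range(0, 101, 10)):
    """De-identification success rate for each threshold value."""
    return {t: deidentification_success_rate(scores, t) for t in thresholds}


# Toy confidence scores for four (de-identified, original) pairs.
toy_scores = [10.0, 35.0, 60.0, 90.0]
curve = success_curve(toy_scores)
```

Because raising the threshold can only move pairs from the "attack succeeded" side to the "de-identification succeeded" side, the resulting curve is non-decreasing in the threshold.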

Discussion:
Recalling the attribute analysis results, StyleGAN with style-mixing levels 0-3 onwards was as good as or better than the state-of-the-art de-identification methods in terms of utility. Here we observe that the de-identification success rates are far better for 0-3. Therefore, based on these results, we argue that StyleGAN, even though it was not designed for face obfuscation, performs on par with or even better than the state-of-the-art de-identification methods. In addition, because the mixing level can be chosen, users can tune the trade-off between privacy and utility. For example, if the user is more concerned about privacy, they can choose StyleGAN with style-mixing level 0-3, and the output image still preserves adequate utility; if the user prefers better-quality and more utility-preserving de-identified faces, they can choose StyleGAN with style-mixing level 0-4 and still obtain de-identified images that preserve privacy as well as other state-of-the-art de-identification methods.

Identification Attack
In the identification attack, which is an untargeted attack, the attacker aims to identify a de-identified face by matching it to a person in a set of known people. In other words, identification is a 1-to-N comparison where the goal is to determine whether the target is one of the N suspects, whereas verification is a 1-to-1 comparison. We used FaceNet [57] to implement the identification attack. FaceNet is a face recognition model, built on the GoogleNet architecture and developed by Google's researchers, which reached a state-of-the-art accuracy of 99.63% by being trained on the Google image set (200 million images for training and 8 million for validation). FaceNet has been used for evaluating face de-identification methods [41,56]. FaceNet outputs a distinct feature vector of size 128, called a face embedding, for each image. Once the embeddings for all the images are obtained, they can be fed to a classifier, e.g., a Support Vector Machine (SVM), to perform the identification (i.e., classification). We implemented a multi-class SVM classifier with majority voting, i.e., for each possible pair of classes (identities), a binary classifier is trained, and the class of a test sample is determined by majority voting.
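The pairwise (one-vs-one) voting scheme can be illustrated with a small sketch. As an explicit simplification, each pairwise decision rule below is a nearest-centroid comparison standing in for a binary SVM, and the toy 2-dimensional vectors stand in for 128-dimensional face embeddings; all names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_pairwise(embeddings_by_identity):
    """Build one binary decision rule per pair of identities.

    Stand-in: each rule is a nearest-centroid comparison, not an SVM.
    """
    centroids = {k: centroid(v) for k, v in embeddings_by_identity.items()}
    return [((a, centroids[a]), (b, centroids[b]))
            for a, b in combinations(sorted(centroids), 2)]


def predict(pairwise_rules, embedding):
    """Each pairwise rule casts one vote; the most-voted identity wins."""
    votes = Counter()
    for (a, ca), (b, cb) in pairwise_rules:
        winner = a if sq_dist(embedding, ca) <= sq_dist(embedding, cb) else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]


# Toy embeddings for three hypothetical identities.
toy = {
    "alice": [[0.0, 0.0], [0.2, 0.0]],
    "bob": [[1.0, 1.0], [1.0, 0.8]],
    "carol": [[0.0, 1.0], [0.1, 0.9]],
}
rules = train_pairwise(toy)
```

With N identities this scheme trains N(N-1)/2 pairwise rules, so the 200 identities used in the experiments would correspond to 19,900 binary classifiers.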

Experiments.
For this attack, we implemented all the threat models listed in Table 1. The number of images for training and validation changed based on the threat model. For example, if the attacker does not know the auxiliary photos, we used half of the dataset for training and the rest for validation. When the attacker does not know the target photo, we used 70% of the target's images for training and 30% for testing; for testing, only the obfuscated versions of the remaining 30% were used. When the attacker knows neither the auxiliary photos nor the target photo, both conditions were applied. For threat models m4 to m7, the attacker does not know the style-mixing level. In that case, we randomly sampled 20% of all images of the other styles for validation, with the other assumptions applied. For CIAGAN and DeepPrivacy, we considered one threat model in which the attacker has access to 70% of the identities, creates de-identified photos for training, and uses the trained model to recognize the identity of the remaining photos. In all the experiments, we used 200 classes (or identities). Table 3 shows the results of the identification attacks for all style-mixing levels and the seven threat models.
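As an illustration of the 70/30 split used when the attacker does not know the target photo, the following sketch splits each identity's images separately. The function name, the fixed seed, and the per-identity rounding are assumptions made for the example; they are not taken from the experimental code.

```python
import random


def split_per_identity(images_by_identity, train_fraction=0.7, seed=0):
    """Split each identity's images into train and test lists.

    Returns two dicts keyed by identity. In the setting described above,
    the test images would then be replaced by their obfuscated versions.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for identity, images in images_by_identity.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train[identity] = shuffled[:cut]
        test[identity] = shuffled[cut:]
    return train, test


# Hypothetical file names for two identities with ten images each.
data = {
    "id_001": [f"id_001_{i}.jpg" for i in range(10)],
    "id_002": [f"id_002_{i}.jpg" for i in range(10)],
}
train_set, test_set = split_per_identity(data)
```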

Results.
As expected, the de-identified faces generated by higher style-mixing levels are easier to re-identify, especially those generated by StyleGAN 0-6 upwards. For example, in threat model m2, where the attacker has knowledge of both the style levels and the auxiliary photos, the accuracy of the identification attack (validation) for style-mixing levels 0-2, 0-3, 0-4, 0-5, 0-6, 0-7, and 0-8 is 2.3%, 3.8%, 6.7%, 18.2%, 67.3%, 82.4%, and 84.4%, respectively. However, there are a few exceptions. For example, in threat models m4 and m5, the accuracy increases until 0-5, reaching 41.2% and 40.2%, respectively, but then starts decreasing from style level 0-6 and drops to 16.2% and 34.2% for style level 0-8. We see a similar but less severe trend for threat models m6 and m7 as well.
Comparing the validation results for different threat models, we observe that if the attacker has knowledge of the style levels, i.e., in m1, m2, and m3, the accuracy is much higher than in the other threat models, i.e., m4, m5, m6, and m7. For example, for style level 0-6, the validation accuracy for m1, m2, m3, m4, m5, m6, and m7 is 87.6%, 67.3%, 88.6%, 39.6%, 38.9%, 27.5%, and 27.3%, respectively. Note that assuming the attacker knows the style levels is a strong assumption and might not be realistic in practice.
Running the identification attack on face images of CIAGAN and DeepPrivacy, we obtained accuracies of 31.2% and 30.1%, respectively, where the threat model is black-box, i.e., equivalent to m7. Comparing these accuracies with those of m7, we see that, no matter the style level, it is harder to re-identify faces generated by StyleGAN, as the best verification accuracy for m7 is 24.8%. Moreover, regardless of the threat model, we observe that the de-identification performance against the attack when StyleGAN 0-3 and 0-4 are used is better than or on par with that of CIAGAN and DeepPrivacy. Overall, through our extensive experiments, we showed that StyleGAN is a better de-identification method, especially if style-mixing levels 0-3 and 0-4 are used; 0-4 preserves utility better, while 0-3 preserves privacy better.

HUMAN EVALUATION
We conducted an IRB-approved experiment to evaluate the effectiveness of StyleGAN and other state-of-the-art de-identification methods through the eyes of human observers in terms of privacy, utility, and overall quality. In particular, to examine privacy, we test the following hypotheses. Hypotheses H1-H3 correspond to the verification attacks, and hypotheses H4-H6 correspond to the identification attacks.
H1: It is less likely for humans to successfully verify a de-identified face.
H2: It is less likely for humans to successfully verify a de-identified face generated by StyleGAN compared to other GAN-based face de-identification methods.
H3: It is less likely for humans to successfully verify a de-identified face generated by StyleGAN 0-3 compared to StyleGAN 0-4 and StyleGAN 0-5.
H4: It is less likely for humans to successfully identify a de-identified face among a set of faces.
H5: It is less likely for humans to successfully identify a de-identified face among a set of faces when generated by StyleGAN compared to other GAN-based face de-identification methods.
H6: It is less likely for humans to successfully identify a de-identified face among a set of faces when generated by StyleGAN 0-3 compared to StyleGAN 0-4 and StyleGAN 0-5.
We examined the utility of the de-identified faces by asking participants whether the original faces and their de-identified versions share any of the following attributes: gender, pose, expression, and age. We also checked the overall quality of the de-identified faces by asking whether they look natural. Our hypotheses for these features are:
H7: The de-identified faces generated by StyleGAN 0-5 are more likely to preserve the utility features compared to those of other de-identification methods.
H8: The de-identified faces generated by StyleGAN 0-5 look more natural compared to those of other de-identification methods.
Moreover, we tried to understand the participants' overall preference among different face de-identification methods. To this end, we de-identified a few sample images using various face de-identification methods, including StyleGAN 0-5, StyleGAN 0-4, StyleGAN 0-3, CIAGAN, and DeepPrivacy, and analyzed the participants' preference rankings and their justifications.

Experimental Design
We used five GAN-based face de-identification methods to examine verification and identification attacks: StyleGAN 0-5, StyleGAN 0-4, StyleGAN 0-3, CIAGAN, and DeepPrivacy. We chose these StyleGAN models because they showed better performance against ML-based attacks. We also tested the baseline condition of no de-identification (as is). Therefore, in total, we had 6 conditions (5 face de-identification methods plus as is).
6.1.1 Metrics. We measured de-identification effectiveness using identification success and confidence, and users' preferences:

Popularity of Face Replacement Compared to Traditional Methods:
We asked, "Assume that you are the woman in the middle of photo X, and you want to upload the photo on social media, but you want to respect the privacy of the others in the photo. You have access to four face obfuscation tools (A, B, C, D) to hide the identity of others. One tool replaces all other faces with new faces, which tries to preserve the age and emotion of people (A), the second one removes the intended people (B), the third one blurs them (C), and the last one replaces their faces with emojis (D). Please order the methods based on your preference (1 will be your top choice). What will you choose to use?"

De-identification Effectiveness: We measure the de-identification effectiveness using the same metrics discussed by Li et al. [39]: hit (the target is present in the choices, and the response is correct), miss (the target is present, but the response is incorrect when a wrong person or "None of above" is selected), correct rejection (the target is absent, and the response is "None of above"), and false alarm (the target is absent, but "None of above" is not selected).
• Verification Success: We measured verification success by asking, "Do you identify the below two faces as the same person?" Two answer choices included "Yes" and "No."
• Identification Success: We measured identification success by asking, "In the people listed below (a-d), please identify the person indicated by X." Five answer choices included four face photos (a-d) and "None of above."
• Confidence: After each verification and identification question, we measured confidence using the question, "How confident do you feel about your answer?" Participants rated their response on a scale from 1 'Completely unconfident' to 7 'Completely confident,' where a higher score meant more confidence.
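The four response categories can be expressed as a small decision rule. The sketch below is illustrative only: the record layout (target_present, response, target_choice) and the option labels are hypothetical, not the survey's actual data format.

```python
def classify_response(target_present, response, target_choice=None):
    """Label one identification response.

    response: the selected option, e.g. "a".."d" or "none".
    target_choice: the option showing the target when it is present.
    """
    if target_present:
        # Any wrong person or "none" counts as a miss.
        return "hit" if response == target_choice else "miss"
    return "correct rejection" if response == "none" else "false alarm"


def hit_rate(records):
    """Hits divided by the number of target-present questions."""
    present = [r for r in records if r[0]]
    if not present:
        return float("nan")
    hits = sum(1 for r in present if classify_response(*r) == "hit")
    return hits / len(present)


# Hypothetical responses: (target_present, response, target_choice).
records = [
    (True, "b", "b"),      # hit
    (True, "none", "c"),   # miss
    (True, "a", "d"),      # miss
    (False, "none", None), # correct rejection
    (False, "c", None),    # false alarm
]
```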
Utility Preserving Effectiveness: We measured each de-identification method's effectiveness in preserving the utility attributes by asking, "What attributes of these two faces are the same?" Five answers were shown as check-boxes and included Gender, Face pose (position), Expression, Age (Below ten years difference), and Looking natural (choose if they both look natural).

Looking Natural or Realistic: We measured and compared the quality of the images generated by the five de-identification methods, in terms of looking realistic, by asking, "In the picture below, X is a real face. If you want to hide the identity of X while having a realistic-looking or natural face, what is your order of preference? (1 will be your best and five will be your worst choice)."

Overall Preference: We measured overall preference by asking, "If you want to hide the identity of face X by using one of the faces below, which one will be your top choice?" The answers are all de-identified versions of face X using each of the five de-identification methods. The answers to the two previous questions might be different because some people might prefer methods that do not generate natural-looking images.
6.1.2 Selecting Faces. We selected faces from the CelebA dataset. CelebA consists of photos of more than 10,000 celebrities. Therefore, knowing these celebrities might make it easier for the participants to identify faces, leading to higher identification success and a lower bound for obfuscation success. In verification questions, when the target is present, two face images of the target are selected, one gets de-identified, and both pictures are shown to the participants. In the second scenario, when the target is absent, one image of the target is de-identified and shown along with another person's photo. In identification questions, when the target is present, another photo of one of the people in the choices is selected and de-identified or untouched (in the as is condition) and shown as X.
In verification questions when the target is absent and in identification questions, we add images of other identities to the questions. This selection can impact the results. For example, if the images are very different from each other and from the target, it could be easier for the participant to select "No" in verification and "None of above" in identification questions. Therefore, we tried using faces with similar attributes, inspired by police lineups [5]. In a police lineup, a crime victim's or witness's putative identification of a suspect is confirmed to a level that can count as evidence at trial. In this process, the suspect, along with several "fillers" or "foils" (people of similar height, build, and complexion), stands side-by-side, facing and in profile. As in a police lineup, in identification questions, we showed the photo of person X and asked if participants could identify X's face in a line of four other faces with similar face attributes. Similarly, we chose two faces with similar face attributes in the verification target-absent questions.
Face Clustering Algorithm: Since we randomly selected the faces to be added to our questions, we clustered faces so that we could automatically find faces with similar attributes to be used along with the target face or as choices. To find face images with similar attributes, we applied clustering to all images of the CelebA dataset using the 40 attribute labels provided with the images, including "Arched_Eyebrows," "Young," "Bald," "Blurry," "Double_Chin," "Wearing_Necklace," etc. In addition, participants might identify people just based on having the same pose, age, and emotion. Therefore, to minimize the impact of these factors on participants' decisions, we also used Face++ to obtain the attributes of the images in each cluster and only retained those with similar attributes. Our criterion regarding age was that the age difference between faces of a cluster should be lower than 10, while for the pose, the absolute difference of each pose angle from the cluster's mean should not be more than half of the standard deviation. Since we needed four choices for each identification question, we excluded clusters with fewer than 4 identities. The final dataset contained 37 clusters, 24 female and 13 male. This double clustering ensured that faces in a cluster have very high similarity in face attributes. In the identification target-present questions, for choices, we used an image of the target and images of three other people from the cluster that includes the target. De-identifying the same image that appears in the choices would produce a de-identified face with the same background as the original image, which could give a hint to the participants. To minimize the impact of this factor on the participants' decisions, we de-identified another image of the target and presented it as image X in the questions.
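One possible reading of the second filtering stage is sketched below. The dictionary keys ("identity", "age", "yaw", "pitch", "roll"), the use of the population standard deviation, and the interpretation of the age criterion as the spread between the youngest and oldest face are all assumptions made for illustration; they do not reflect the Face++ response format or the exact procedure used.

```python
from statistics import mean, pstdev


def within_pose_band(faces, angle_keys=("yaw", "pitch", "roll")):
    """Keep faces whose every pose angle lies within half a standard
    deviation of the cluster mean (statistics taken over the full cluster)."""
    kept = list(faces)
    for key in angle_keys:
        values = [f[key] for f in faces]
        mu, sigma = mean(values), pstdev(values)
        kept = [f for f in kept if abs(f[key] - mu) <= 0.5 * sigma]
    return kept


def age_spread_ok(faces, max_gap=10):
    """True when the oldest and youngest face differ by less than max_gap years."""
    ages = [f["age"] for f in faces]
    return max(ages) - min(ages) < max_gap


def usable_cluster(faces, min_identities=4):
    """A cluster is usable if it has enough distinct identities and a small age spread."""
    identities = {f["identity"] for f in faces}
    return len(identities) >= min_identities and age_spread_ok(faces)


# Hypothetical cluster of five faces; only the yaw angle varies.
cluster = [
    {"identity": i, "age": 30 + i, "yaw": yaw, "pitch": 0, "roll": 0}
    for i, yaw in enumerate([0, 1, 2, 3, 20])
]
kept = within_pose_band(cluster)
```

In this toy cluster the pose filter leaves only two faces, so the filtered cluster would be excluded for having fewer than four identities.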
Consent Form: The consent form contains the investigators' contact information, the study's goal, IRB approval, research procedure, the collected data, possible benefits and risks, compensation, confidentiality, and the consent statement and question. Participants are asked whether they want to continue the survey. They must select "yes, I would like to continue to the survey"; otherwise, their survey ends.

Main Survey Questions:
The main part of the survey includes four clusters of questions: (1) verification, (2) identification, (3) utility-preserving, and (4) method preference. For each condition, we created ten questions, with five females and five males as target faces. For example, to test StyleGAN 0-3 against the verification attack when the target was present, ten questions using different targets were created, and only one of them was randomly shown to each participant. We included three attention-check questions randomly throughout the survey. We carefully designed these attention checks to be fair. For example, since participants in our survey see many verification questions with the exact same question text, they might not read the questions and instead answer them by just inspecting the images. Therefore, in the attention checks, instead of asking participants to choose a specific option, we used images with obvious answers and expected correct answers. For example, for a verification question, we showed the same image as both the original and the de-identified image, expecting an honest user to verify that the images belong to the same person. Having answered all the questions, participants were given a six-digit number to write down and enter in the box provided on Amazon Mechanical Turk to be compensated.

Results
The experiment was completed by 161 participants. We excluded the data of 8 participants who failed at least one attention check question. Therefore, the final sample size is 153.

Compared with Traditional Methods.
We first examined whether participants would find utility-preserving face de-identification methods compelling and would consider using them in the presence of other more common methods, such as blurring, using emojis, or removing the faces (see Figure 14). We asked the participants to put themselves in a situation where they want to upload a photo of themselves at a convention on social media, but they want to respect the privacy of others in the photo. They have access to four face de-identification tools to hide the identity of others, including (A) a face replacement method that preserves the age and emotion of people, (B) a method that removes faces, (C) a blurring method, and (D) replacing faces with emojis. They could order the methods based on their preference. This question was followed by another question asking them to explain the reasons for their top favorite method. The results of the ranking are shown in Figure 5. Face replacement and blurring were the most favored, being chosen 46 and 44 times as the top favorite method, respectively, while body removal and emoji replacement were the least favored, being chosen 31 and 31 times as the best method. We also calculated the mean rank and the standard error of the mean for each of these four methods. Face replacement had the lowest mean (higher ranks) of 2.327 (std err = 0.09), and blurring was in second place with a mean of 2.399 (std err = 0.09). Body removal and emoji replacement were third (mean = 2.392, std err = 0.079) and last (mean = 2.882, std err = 0.095). Even though, on average, face replacement received a better ranking than the other de-identification methods, many participants chose other methods over face replacement. This might show that people have different criteria for selecting these methods. Next, we analyzed the qualitative reasoning provided by the participants.
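For reference, the mean rank and its standard error can be computed as in the following sketch. The example ranks are made up, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from math import sqrt
from statistics import mean, stdev


def rank_summary(ranks):
    """Return (mean rank, standard error of the mean) for one method.

    The standard error is the sample standard deviation divided by sqrt(n).
    """
    n = len(ranks)
    if n < 2:
        raise ValueError("need at least two ranks")
    return mean(ranks), stdev(ranks) / sqrt(n)


# Hypothetical ranks (1 = top choice) given to one method by four participants.
example_ranks = [1, 2, 3, 4]
avg, sem = rank_summary(example_ranks)
```

A lower mean rank indicates that a method was placed closer to the top of participants' orderings.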
Reasons for the Top Favorite Method: We investigated the reasons participants gave for choosing a particular method as their top method. We applied the open coding process [20] to categorize the reasons. Following this process, we defined new categories until no new categories emerged. To improve the quality of the categories, we used an iterative process [15] in which new categories were added or existing ones were reorganized.
Face Replacement: About one-third of the explanations provided by the 46 participants who selected face replacement were unrelated or unclear. Ten participants stated that they chose this method because of its quality and clarity. Being normal, natural, or not attention-grabbing was mentioned 7 times. Similarity to the original photo and requiring fewer edits were also noted 7 times. On the other hand, the reasons of those who disliked the method included "The idea is creepy," "disrespecting other people in the photo," "It's a lie," "less aseptically pleasing," "The faces look too similar to the original," and "Making an alternative-reality is creepy."

Face Removal: Twenty-five of the 32 participants who selected face removal explained their reasons properly. Nine stated that this method can preserve privacy very well as it completely removes the person, as in "It's like that no one has ever been there." Seven participants noted that the picture was aesthetically more pleasing by stating "It looks cool," "nice," or "pleasant." Four noted that it was natural and not noticeable. Three indicated that it was clear and clean looking, and another three participants said the picture is closer to the original picture and retains the form. On the other hand, some participants believed that removal "causes confusion" or "shifts the attention to what is missing" because of the background or "doesn't capture the scene."

Blurring: Of the 43 participants who chose blurring, 30 provided proper reasoning. A variety of reasons were given for this method. The most frequent answer was related to the aesthetics of the images, which was mentioned eight times in phrases like "pleasant," "attractive," "interesting," "cute," and "best looking." The second most common reason was "less distracting or noticeable," "not affecting or altering the image," or "not drastic," which were mentioned six times. One important reason was that blurring conveys to viewers what has been done or what is being hidden from them. Other reasons included "professional" (3 times), "common and normal" (3 times), "not weird," "ethical," "respectable," "least awkward," etc.

Emojis: The emoji method got 31 votes with 22 proper explanations. Privacy was mentioned eight times. Three participants stated they just liked it the most. They said that emojis were big and obvious and made it clear that the change was made for privacy reasons. Other reasons included being "funny" or "cute," "simple," "no tool," "no editing," showing "emotion" or "expression," "natural," "not odd," "not removing anybody from the image," "looks like a person," and "suitable for blanc faces." Emoji was also the most disliked method: 5 participants stated that it is "childish" and "unprofessional" and "disrespecting the photo and people in it." Other terms mentioned regarding emojis were "too distracting," "silly," "goofy looking," "ridiculous looking," and "gets all focus."

In summary, participants gave varied reasons to support their choices. A reason considered negative by one participant might be positive for another. For example, one participant stated that emojis get all the focus and attention of the image, while another said that emojis convey that the identity is hidden for privacy reasons. We believe that some participants might have been confused about face replacement because they might have assumed that the replaced faces belong to real identities, while they could be imaginary ones. Moreover, the participants might not be aware that traditional methods such as blurring are not effective against machine learning-based attacks.

Obfuscation Effectiveness:
We classify verification success using three categories: among all cases, among questions where the target is present, and among questions where the target is absent. Having a lower identification success rate means that the de-identification method is more effective and the attacker is unable to identify faces.
Verification Success: Figure 6 shows verification success, correct rejection, and total correct for As Is and the 5 de-identification methods. There is a significant difference between the hit rate and total correct of As Is (hit = 71%, total 73%) and those of all the de-identification methods, including StyleGAN 0-5 (hit = 12%, total 52%), StyleGAN 0-4 (hit = 20%, total 56%), StyleGAN 0-3 (hit = 12%, total 49%), CIAGAN (hit = 19%, total 55%), and DeepPrivacy (hit = 24%, total 56%). In contrast, the correct rejection rate is lower for As Is than for all the de-identification methods. The reason may be that we intentionally chose images with very similar attributes, which made it less trivial for the participants to determine that a target is absent. We also ran statistical tests to examine H1. Since the questions are answered by the same participants, we used a logistic mixed-effects model while clustering on participant IDs. Running two separate logistic mixed-effects models on hits and all cases shows that the success rate of As Is, in these two conditions, is higher than that of all de-identification methods (all p < 0.0001). Running the logistic mixed-effects model on correct rejection shows that the success rate of As Is is lower than that of all de-identification methods (all p < 0.0001). Therefore, these results support our first hypothesis that it is less likely for humans to successfully verify the identity of a de-identified image. To examine H2, we tested six t-test hypotheses with Bonferroni correction, comparing the verification success rate of the three StyleGAN models with DeepPrivacy and CIAGAN. We chose a significance level of 0.05 for the t-test, which the Bonferroni correction changes to 0.0083. The results show that there is a statistically significant difference between the success rates of StyleGAN 0-5 (hit rate = 0.24) and DeepPrivacy (hit rate = 0.12) for the hit rates (p < 0.007). However, there was no statistically significant difference between the success rates of StyleGAN 0-3 and CIAGAN/DeepPrivacy, StyleGAN 0-4 and CIAGAN/DeepPrivacy, or StyleGAN 0-5 and CIAGAN. This shows that these models have similar performance and that H2 is only partially supported. To examine H3, we tested two t-test hypotheses with Bonferroni correction, comparing the verification success rate of StyleGAN 0-3 with those of StyleGAN 0-4 and StyleGAN 0-5. The Bonferroni correction changes the significance level to 0.025. The results show that there was no statistically significant difference between the success rate of StyleGAN 0-3 and those of 0-4 and 0-5. Therefore, H3 is not supported.
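The corrected significance levels quoted above follow from dividing the nominal level by the number of tests. A minimal sketch, with hypothetical function names and example p-values:

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-corrected level."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Six tests for H2 (three StyleGAN models x two baselines), two tests for H3.
alpha_h2 = bonferroni_alpha(0.05, 6)  # about 0.0083
alpha_h3 = bonferroni_alpha(0.05, 2)  # 0.025
```

For example, a p-value of 0.007 falls below the corrected level of about 0.0083 for six tests, whereas a p-value of 0.02 would not.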
Confidence Scores: Figure 7 shows the means of confidence scores for Target-Present and Target-Absent and their combinations. On average, for all the questions and methods, the participants stated that they were confident when verifying images, since the means are close to 6, with 7 as "Highly Confident." By performing statistical t-tests using Bonferroni correction (significance threshold = 0.0033) on all cases, we found statistically significant evidence that the participants are more confident when verifying CIAGAN images compared to As Is images (μ1 = 5.94, μ2 = 5.61, p < 0.001244), and similarly when answering StyleGAN 0-3 questions compared to DeepPrivacy ones (μ1 = 5.94, μ2 = 5.61, p < 0.00185). Running the t-tests for the target-present or target-absent questions, we found a statistically significant difference between the confidence scores of StyleGAN 0-3 and CIAGAN when the target was present, where people were more confident for StyleGAN 0-3 (μ1 = 5.92, μ2 = 5.48, p < 0.0033). We also examined whether participants were more confident when correctly answering the questions. Even though the difference between the mean values is small (about 0.4), we surprisingly observed that participants were more confident when their responses were inaccurate (μ1 = 5.81, μ2 = 5.40, p < 0.0001).
Figure 6: Verification success evaluation. All the methods perform better than "As Is," especially in the hit rate.
Figure 7: Average confidence scores. We did not find significant differences in the confidence level of survey participants.
Identification Success: Figure 8 shows the identification success rates for different obfuscation conditions for the identification questions. Similar to verification, when the target is present, the hit rate of As Is is far higher than that of the de-identification methods. The chance hit rate here is 1/5 = 0.2, since participants have 5 options to choose from. The hit rate of As Is is far above this chance hit rate, while that of the de-identification methods is near or below it. The correct rejection rate is again lower for As Is. To examine H4, we employed a logistic mixed-effects model while clustering on participant IDs, running three separate logistic mixed-effects models on hits, correct rejection, and all cases. The results of the models on hits and all cases show that the success rate of As Is is higher than that of all de-identification methods (all p < 0.001). Therefore, these results support H4, i.e., it is less likely for humans to successfully identify a de-identified face among a set of faces.
Figure 8: Identification Success. Similar to verification, when the target is present, the hit rate of "As Is" is higher than in de-identification methods.
To examine H5 and H6, we tested eight t-test hypotheses with Bonferroni correction, comparing the identification success rate of the three StyleGAN models with DeepPrivacy and CIAGAN and with each other. The Bonferroni correction changes the significance level to 0.0062. The results show a statistically significant difference between the success rates of StyleGAN 0-4 and 0-3 and that of CIAGAN for hits (p < 0.005), where it is easier for humans to identify faces generated by CIAGAN. We found a statistically significant difference between the success rates of StyleGAN 0-5 and DeepPrivacy for all cases and hits (p < 0.0004), where it is harder for humans to identify faces generated by DeepPrivacy. We found no statistically significant differences in the other comparisons. Therefore, H5 is partially supported: all StyleGAN models perform as well as or better than CIAGAN and DeepPrivacy, except that faces generated by DeepPrivacy are harder to identify than faces generated by StyleGAN 0-5. Interestingly, the results show that there is no statistically significant difference between the hit rates of StyleGAN 0-3 and StyleGAN 0-4, or of StyleGAN 0-3 and StyleGAN 0-5. Therefore, H6 is not supported.
Confidence Scores: Figure 9 shows mean confidence scores and their standard errors for the identification questions. The average scores for all the scenarios are high (near 6 on a scale of 1-7), indicating that the participants felt confident when answering the identification questions. Similarly, performing statistical t-tests using Bonferroni correction for different hypothesis combinations, we found that the total confidence scores of StyleGAN 0-4 and 0-3 were significantly higher than that of CIAGAN (p < 0.0002). There was only one other significant difference, where participants felt more confident when they answered StyleGAN 0-3 questions compared with CIAGAN questions when the target was present (p < 0.00001). Similar to verification, participants felt more confident when their answers were inaccurate (μ1 = 5.79, μ2 = 5.02, p < 2.2e-16).

Utility Preserving.
We created one question for each method asking whether the original photo and its obfuscated version share the same gender, expression, pose, age (less than 10-year difference), and naturalness. We randomly picked 20 paired sample pictures, 10 of males and 10 of females. One random pair was shown to each participant. Participants could select none or all of the facial features. The attribute rates for different methods are shown in Figure 10. The attribute rate here is defined as the number of participants who checked a particular attribute for a certain method divided by the total number of participants. Among all methods, on average, the participants believe gender has been preserved the most, while naturalness has been preserved the least. To examine H7, we employed a logistic mixed-effects model for each of the utility attributes, comparing the performance of StyleGAN 0-5 with that of the other face de-identification methods. The results show that the faces generated by StyleGAN 0-5 are more likely to preserve expression compared to DeepPrivacy (p < 8.41e-05), pose compared to StyleGAN 0-3 (p < 0.002), and naturalness compared to CIAGAN (p < 2e-07) and DeepPrivacy (p < 0.0006). Therefore, H7 is supported.
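The attribute rate defined above can be computed as in this short sketch. The attribute names follow the survey's check-boxes, while the response data and function name are hypothetical.

```python
ATTRIBUTES = ("gender", "pose", "expression", "age", "natural")


def attribute_rates(responses, attributes=ATTRIBUTES):
    """Fraction of participants who checked each attribute for one method.

    responses: one set of checked attributes per participant (may be empty).
    """
    total = len(responses)
    if total == 0:
        raise ValueError("responses must be non-empty")
    return {a: sum(1 for r in responses if a in r) / total for a in attributes}


# Hypothetical check-box selections from four participants for one method.
example_responses = [
    {"gender", "pose"},
    {"gender", "expression", "age"},
    {"gender"},
    set(),
]
rates = attribute_rates(example_responses)
```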

Ranking De-identification Methods.
Participants were asked to rank different faces generated by the 5 de-identification methods, especially if having a natural de-identified image is desired. Figure 11 shows the number of times each method was chosen to have a particular rank. Deep Privacy was chosen the most as the top choice (46 times) followed by StyleGAN 0-5 (35 times) while StyleGAN 0-3 was the least chosen (15 times). The means and standard error were 2.64 and 0.11 for Deep Privacy, 2.92 and 0.12 for StyleGAN 0-5, 3.12 and 0.10 for StyleGAN 0-3, 3.12 and 0.11 for StyleGAN 0-4, and 3.20 and 0.13 for CIAGAN. To examine H8, we ran five t-tests, comparing the mean of StyleGAN 0-5 with other methods, while the Bonferroni corrected significant level is 0.001. Our results showed that there is no statistically significant difference between the mean ranking of StyleGAN 0-5 compared with any other model, rejecting H8. Figure 11: Naturalness of generated faces. Participants found DeepPrivacy-generated faces to be more natural, followed by StyleGAN 0-5. We used the open coding processing method to categorize the explanations. The explanations could be categorized into four major groups: utility preserving (84 times), being natural or normal (40 times), better in privacy (35), quality and aesthetics (10 times), not related or proper reasons (24 times), and opposite reasons (3 times). By opposite, we mean reasons that technically are negative but given to support their choices, like "looking unnatural," "not realistic," and "different races." In 14 explanations, participants mentioned their choice maintained a similar facial expression. The number of mentions of the pose, gender, age, race or skin tone, and hair color were 7, 4, 2, 2, and 4 times, respectively. Compared to StyleGAN 0-5, for which the reasons were focused on "the same, " "the closest to the original," and "the most similar," the reasons for StyleGAN 0-4 were more about "close but not the same, " "almost the same but different race, " etc. 
Thirty-two (64%) of those who selected DeepPrivacy indicated that they preferred it because it was the most similar to the original image and best maintained image features and utility. The corresponding numbers were 15 (68%), 11 (61%), 18 (45%), and 8 (35%) for StyleGAN 0-4, StyleGAN 0-5, CIAGAN, and StyleGAN 0-3, respectively. Nine (50%) of the participants explained that they chose StyleGAN 0-5 because the de-identified faces were natural or normal. The corresponding numbers were 9 (39%), 6 (27%), 9 (22%), and 5 (12.5%) for StyleGAN 0-3, StyleGAN 0-4, DeepPrivacy, and CIAGAN, respectively. Eight (35%) of those who voted for StyleGAN 0-5 stated that they chose it because it was the best for privacy. The corresponding numbers were 7 (32%), 5 (28%), 9 (18%), and 6 (15%) for StyleGAN 0-4, StyleGAN 0-3, DeepPrivacy, and CIAGAN, respectively.
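As a rough illustration of the ranking comparison for H8, the sketch below recomputes pairwise test statistics from the reported means and standard errors alone. It is not the analysis used in the study: it assumes independent samples and uses a normal approximation in place of the t distribution, because the per-participant ranks and the sample size are not given here. The names `ranks`, `ALPHA_CORRECTED`, and `compare` are introduced only for this example.

```python
from math import sqrt
from statistics import NormalDist

# (mean rank, standard error) per method, as reported in the text.
ranks = {
    "DeepPrivacy": (2.64, 0.11),
    "StyleGAN 0-5": (2.92, 0.12),
    "StyleGAN 0-3": (3.12, 0.10),
    "StyleGAN 0-4": (3.12, 0.11),
    "CIAGAN": (3.20, 0.13),
}

# Bonferroni-corrected significance level, taken from the text as reported.
ALPHA_CORRECTED = 0.001


def compare(reference, other):
    """Two-sided test of the difference in mean rank (normal approximation).

    Assumption: the two samples are independent, so the variance of the
    difference is the sum of the squared standard errors.
    """
    mean_ref, se_ref = ranks[reference]
    mean_other, se_other = ranks[other]
    z = (mean_ref - mean_other) / sqrt(se_ref ** 2 + se_other ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


results = {
    name: compare("StyleGAN 0-5", name)
    for name in ranks
    if name != "StyleGAN 0-5"
}

for name, (z, p) in results.items():
    verdict = "significant" if p < ALPHA_CORRECTED else "not significant"
    print(f"StyleGAN 0-5 vs {name}: z = {z:+.2f}, p = {p:.3f} ({verdict})")
```

Under these assumptions, none of the comparisons falls below the corrected threshold, which agrees with the reported absence of a significant difference in mean ranking.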

DISCUSSION AND FUTURE WORK
Our results show that StyleGAN generally performs better than DeepPrivacy and CIAGAN. The lower utility-preserving performance of DeepPrivacy and CIAGAN may be due to their emphasis on preserving specific attributes, such as pose and face shape; in DeepPrivacy, for instance, only pose information is inserted into the generator and discriminator. In StyleGAN, however, there is no emphasis on a specific attribute, and depending on the mixing levels, it is possible to mix attributes of two faces.
In terms of privacy, DeepPrivacy and CIAGAN both wipe out the face rectangle from the image and use the remainder as the input to the generator. This step excludes the identity information, provided that the rectangle is detected properly. When it is not, however, a face recognition system can take advantage of the remaining identity information. Interestingly, even though StyleGAN does not remove the face entirely, our results showed that by inheriting fewer styles from the target, StyleGAN performs on par with or better than DeepPrivacy and CIAGAN in protecting privacy. Face de-identification methods do not hide contextual features; therefore, using these methods alone might not protect users' privacy. Implementing attacks based on context information is out of the scope of this paper.
Replacing all the faces in a dataset with synthetic faces can affect downstream applications both positively and negatively. The uncontrolled addition of synthetic data can cause unwanted behavior in downstream models, e.g., by adding to the bias. However, adding synthetic data in a controlled setting can help mitigate bias. For example, Kortylewski et al. [35] proposed training face recognition models using synthetic faces with different poses to reduce the damage of dataset bias in real data.
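The role of the mixing level discussed above can be made concrete with a minimal sketch of style mixing on per-layer latent codes. It assumes that a label such as 0-3 means that style layers 0 through 3 are inherited from the target face and the remaining layers come from a second, donor face; this reading is an assumption that is consistent with the observation that inheriting fewer styles from the target improves privacy. The generator itself is omitted, and `mix_styles`, `random_latent`, `inherited_layers`, and the small latent dimension are hypothetical choices made for illustration.

```python
import random

NUM_LAYERS = 18  # number of style inputs in a 1024x1024 StyleGAN generator


def random_latent(rng, num_layers=NUM_LAYERS, dim=8):
    """Stand-in for a per-layer latent code; a real code would come from the model."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_layers)]


def mix_styles(target_w, donor_w, last_target_layer):
    """Take layers 0..last_target_layer from the target and the rest from the donor."""
    if len(target_w) != len(donor_w):
        raise ValueError("both latent codes must have the same number of layers")
    if not 0 <= last_target_layer < len(target_w):
        raise ValueError("last_target_layer is out of range")
    return [
        list(target_w[i]) if i <= last_target_layer else list(donor_w[i])
        for i in range(len(target_w))
    ]


def inherited_layers(mixed_w, target_w):
    """Count how many layers of the mixed code are identical to the target's."""
    return sum(m == t for m, t in zip(mixed_w, target_w))


rng = random.Random(0)
target_w = random_latent(rng)  # latent code of the face to de-identify
donor_w = random_latent(rng)   # latent code of a second face

mixed_0_3 = mix_styles(target_w, donor_w, last_target_layer=3)
mixed_0_5 = mix_styles(target_w, donor_w, last_target_layer=5)

print("layers inherited at 0-3:", inherited_layers(mixed_0_3, target_w))
print("layers inherited at 0-5:", inherited_layers(mixed_0_5, target_w))
```

In this sketch, a higher mixing level simply copies more layers from the target, which mirrors the intuition that higher levels keep more of the original face's attributes while lower levels keep fewer.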
In future work, we can examine the use of random vectors in StyleGAN to provide differential privacy and tune the privacy/de-identification dial. We can also explore augmenting StyleGAN by imposing additional noise to improve its privacy-preserving property. StyleGAN-generated faces can also be used by face-swapping methods. Classifiers can be developed to determine the target's face attributes, which helps to select a candidate to replace the target's face. One challenge of this approach is to generate faces for all combinations of features and poses. Moreover, it is not trivial to obtain a new face with the same quality as the original image that can be smoothly blended into the original background.
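One way to picture the idea of imposing additional noise is to perturb the latent code before it is passed to the generator. The sketch below is purely illustrative and carries no differential-privacy guarantee: it adds zero-mean Gaussian noise with a tunable scale to every entry of a per-layer latent code. The functions `add_latent_noise` and `latent_distance` and the parameter `sigma` are hypothetical names introduced for this example, and the generator is again omitted.

```python
import random
from math import sqrt


def add_latent_noise(w_layers, sigma, rng):
    """Return a copy of the latent code with Gaussian noise of scale sigma added."""
    return [[value + rng.gauss(0.0, sigma) for value in layer] for layer in w_layers]


def latent_distance(w_a, w_b):
    """Euclidean distance between two per-layer latent codes."""
    return sqrt(
        sum((a - b) ** 2 for la, lb in zip(w_a, w_b) for a, b in zip(la, lb))
    )


# A toy latent code with 3 layers of 4 values each (assumed sizes).
base_w = [[0.0] * 4 for _ in range(3)]

# The same seed is reused so that only the noise scale differs between runs.
weak = add_latent_noise(base_w, sigma=0.1, rng=random.Random(1))
strong = add_latent_noise(base_w, sigma=0.5, rng=random.Random(1))

print("distance with sigma=0.1:", latent_distance(base_w, weak))
print("distance with sigma=0.5:", latent_distance(base_w, strong))
```

A larger `sigma` moves the code further from the original, so it acts as a dial; how such a dial would relate to formal privacy guarantees or to image quality is exactly the open question raised above.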

CONCLUSION
StyleGAN was initially proposed for generating imaginary faces. Our findings showed that it is also a strong candidate for face de-identification. While most existing works focus on evaluating face de-identification methods against either machine learning attacks or human observers, this work evaluated StyleGAN against both, considering seven different attack models. Moreover, for the first time, we systematically evaluated and compared the utility-preserving properties of StyleGAN, DeepPrivacy, and CIAGAN. Our experiments show that StyleGAN with mixing levels 0-5, 0-4, and 0-3 protects privacy on par with or even better than DeepPrivacy and CIAGAN. There is a trade-off between privacy and utility. When privacy against machine learning models has a higher priority, StyleGAN with lower mixing levels, such as 0-2 or 0-3, can be employed, and when utility is more important, StyleGAN with higher mixing levels, e.g., 0-4 and 0-5, can be used. Interestingly, we found that DeepPrivacy outperformed CIAGAN in almost all the experiments, even though it is the older model. Our user study results illustrated that people have different preferences when picking a face de-identification method. Therefore, providing a face de-identification kit that allows selecting from these methods might encourage and increase data privacy practices among people.

A SURVEY QUESTIONS
A screenshot of the demographic questions is shown in Figure 13. Figure 14 shows the question used to order different obfuscation methods. Figure 15 shows a sample verification question for the As Is condition followed by a question about the confidence level when answering the verification question.
Two attention-checking questions are shown in Figures 16 and 17, which expect obvious "no" and "yes" answers, respectively.
A sample identification question using StyleGAN 0-4 as the de-identification method is given in Figure 18.
The attention-check question of the identification section is shown in Figure 19, which requires choosing option "c".
A sample question in which two photos are shown and participants are asked whether they share the same facial attributes is shown in Figure 20.
A sample question for the natural obfuscation method ranking is shown in Figure 21. Finally, the last question, which shows a target face and asks participants to choose their favorite utility-preserving de-identified version among those generated by CIAGAN, DeepPrivacy, StyleGAN 0-3, StyleGAN 0-4, and StyleGAN 0-5, respectively, is shown in Figure 22.