Detect Your Fingerprint in Your Photographs: Photography-based Multi-Feature Sybil Detection

A darknet market is an online marketplace typically implemented over Tor, where vendors sell illegal products or criminal services. Due to dramatic growth in the popularity of such markets, there is a recognized need for automatic investigation of the market’s ecosystem and identification of anonymous vendors. However, as vendors often create multiple accounts (or Sybil accounts) within or across different marketplaces, detecting Sybil accounts becomes the key to understanding the ecosystem of darknet markets and identifying the actual relationship between the vendors. This study presents a novel Sybil detection method that extracts multiple features of vendors from photographs at a fine-grained level (e.g., image similarity, main category, subcategory, and text data), and reveals their multiple Sybil accounts simultaneously. Each feature is extracted from multiple rich sources using an image hash algorithm, a Deep Neural Network (DNN) classifier, image restoration, and a text recognition tool, and the features are merged using a weighted feature embedding model. The matching score of each vendor is then calculated to identify not only the exact Sybil accounts, but also multiple potential accounts suspected of being associated with a single operator. We evaluate the efficacy of our method using real-world datasets from four large darknet markets (i.e., SilkRoad2, Agora, Evolution, Alphabay) from 2014 to 2015. Because of the anonymity of darknet markets, we construct the ground truth of Sybil accounts by randomly splitting the dataset of vendors into two even parts. We used the first set to train the model, and linked the second set to the original vendor in the first set to evaluate performance. Our experimental results demonstrated that the proposed method outperforms the existing photography-based system with an accuracy of 98%, identifying up to 700% more candidate Sybil accounts than prior work [27]. Additionally, our method detects multiple Sybil accounts for 90% of evaluated test cases, presenting a very different picture of darknet


INTRODUCTION
Over the past decades, there has been a dramatic increase in darknet markets in terms of their volume and activities providing illegal products or criminal services such as drugs, credit cards, pornography, software exploits, and criminal/hacking services [3]. The darknet market is located in the darknet, a subsection of the deep web accessible only through specific browsers such as Tor Browser [1]. Because Tor makes it complicated to track the Internet activities of users by concealing their IP addresses and locations, a number of darknet websites are typically operated anonymously over Tor.
In order to further increase anonymity and avoid tracing of the activities of darknet users, the darknet markets frequently change their IP addresses or domain names [2,6,7], and allow only cryptocurrencies like Bitcoin [4] or Ethereum [5] for anonymous transactions [8][9][10][11][12][13][14][15][16]. In addition, most of the useful information for user identification is hidden by the darknet market operators or by additional anonymity technologies [17][18][19][20][21][22][23]. In particular, many darknet market vendors create multiple accounts, called Sybil accounts, within the same market or across different darknet marketplaces. Sybil accounts refer to accounts controlled by the same person or group to gain an advantage such as manipulating reputation in review systems or promoting social network accounts. Hence, it becomes much more challenging to investigate the darknet markets and identify their users. Linking multiple accounts and detecting Sybil accounts of the same vendors therefore play crucial roles in accurately identifying vendors, understanding the actual transactions of illegal activities, and evaluating the exact volume of vendors and illegal items advertised across the different darknet markets. Unfortunately, due to the explosive growth of the number of markets [2], it is not feasible in practice to manually analyze and connect multiple accounts owned by the same vendor. Thus, in order to understand the ecosystem of real-world darknet markets in the presence of such challenges, developing an efficient and autonomous Sybil detection method is a considerable necessity.
Several approaches have been proposed to address this problem: one is the stylometry-based approach and the other is the photography-based one. In the former [24][25][26], vendors' writing styles have been utilized; while in the latter [27,28], vendors' distinct photography style (e.g., the product images that vendors have uploaded) has been utilized to detect the Sybil accounts of the same vendors.
Unfortunately, stylometry-based approaches [24][25][26] face several limitations. First, since the only available text data in the darknet market is likely to be a product description or vendor profile, which are often short, repetitive, and sometimes even written by using a specific template (see Fig. 1 [62]), the detection accuracy drops significantly when duplicated sentences are removed [27]. Second, it is inherently sensitive to the language of the dataset. Since the current darknet market is becoming increasingly diverse in its languages [59] as shown in Fig. 2, it is becoming more difficult to apply a stylometry-based method to the other markets using different languages. Third, the previous stylometry-based methods [26,28] only utilized drugs, which constitute only a small portion of various categories of products in the real-world darknet market.
On the other hand, the previous photography-based methods [27,28] only regarded photography as a single feature. Specifically, they extracted only an image feature from the photographs as a single vector, and just matched replicated photos as a proof of Sybil accounts. Thus, the performance of their detection was limited when duplicated images were removed.
In this paper, we propose a novel Sybil detection method to extract multi-features from photographs and leverage each feature to link different accounts to the same vendor simultaneously. Our method characterizes the vendors as follows. First, the similarity between two images is evaluated using the image hash function [32,52,53], which is highly efficient in generating the representation of correlated vendors with equivalent or similar photos [31]. Second, the main and subcategories of images are extracted using the Deep Neural Network (DNN) model [33,34], and the category distributions of the products of each vendor are produced. These 'inventory' features are difficult for any vendors to copy, who may simply pirate images from another vendor or the Internet [40], enhancing Sybil accounts detection accuracy. Third, text data is extracted from the product images (if there are any) to figure out additional features that cannot be captured using the image hash. However, as most of the darknet market photos have low image quality in practice, high accuracy cannot be achieved by simply applying the text extraction tool to darknet market photos [35]. We thus apply a super-resolution [37] method to improve the performance of text recognition [36,38]. All of these features are finally merged using a weighted feature embedding model [26]. Then, the possibility of the vendor with the highest likelihood of being the same person (the primary vendor) and those of vendors with lower, yet significant likelihood of being the same person (the candidate vendors) are calculated, and multiple Sybil accounts are linked to each vendor on the basis of their possibility scores.
Our approach is novel in that it can find multiple candidate accounts that are considerably likely to be of the same vendor, as opposed to the previous methods [26][27][28] that can only determine if a single target account is another Sybil account of a specific vendor.
When the most similar account turns out not to be of the same vendor, the previous schemes have to run the model iteratively after removing mismatched vendors until they find the correct one.
Additionally, compared to the previous approaches [26,28] that only dealt with drugs and their paraphernalia, our method achieves much higher accuracy and coverage by categorizing most of the items currently available in the real-world darknet market, training the DNN-based detection model with the multiple features extracted from them at a fine-grained level, and detecting Sybil accounts automatically across all vendors in the darknet market.
We evaluate our method using real-world datasets from four large darknet markets (i.e., SilkRoad 2, Agora, Evolution, Alphabay [51]). A ground-truth evaluation shows that our method achieved an accuracy of 98%, outperforming the previous photography-based Sybil detection scheme [27] in both accuracy and coverage. Our method covered up to 700% more vendors than the image classification approach using ResNet-50 [34]. In addition, our method could identify multiple Sybil accounts for cases in which a vendor has multiple accounts. Specifically, we could find that 90% of all of the corresponding accounts for vendors have at least two additional Sybil accounts. Further case studies show that our method could also detect previously unknown Sybil accounts in the real-world.
Contributions. In summary, this paper makes the following contributions:
- We propose the first method to fingerprint Sybil accounts of darknet vendors by extracting multiple features from their photographs. Experimental results on the SilkRoad2 and Agora darknet markets show that our method outperforms the existing photography-based schemes with an accuracy of 98%.
- Unlike the previous approaches, our method identifies multiple candidate vendors simultaneously based on their similarity scores, which are evaluated based on the multiple features extracted from the categorized items at a fine-grained level.
- Our method has a higher coverage compared to the recent photography-based scheme [27], covering up to 700% more vendors. Additionally, we demonstrate that the proposed scheme can be generically applied to various darknet markets such as Evolution and Alphabay regardless of their languages or categories of items, showing high applicability of the proposed method in practice.
Organization. The remainder of the paper proceeds as follows: Section 2 introduces the background and related works. Section 3 describes the darknet dataset we used for our scheme construction and the experiment. Section 4 proposes our Sybil detection method, consisting of feature extraction, feature embedding, and vendor identification. Sections 5 and 6 show the experimental results of the performance and applicability of our method. Section 7 discusses the practical implications of our findings and future research directions. Section 8 concludes the paper.

BACKGROUND & RELATED WORKS
In this section, we introduce the background of darknet market and the previous approaches for darknet user data analysis.

Anonymity of Darknet Market
The darknet market is a specific type of website in the darknet mostly set up by criminals for the purpose of trading illegal goods or stolen items. Frequently traded items include drugs, credit cards, pornography, software exploits, and criminal/hacking services [3]. With the development of technology for Tor [1], the number of hidden services has continuously increased: the Tor darknet peaked at more than 120,000 sites and currently hosts about 80,000, and the number of users was approximately two million in 2019 [2]. In the early days of the darknet market, SilkRoad2, Agora, and Blackbank flourished, whereas Alphabay, Vice city, Versus market, etc., are the largest darknet markets nowadays [50]. In order to avoid being traced by regulatory agencies, the darknet market frequently replaces the address of sites or uses new domain names, and only allows cryptocurrencies such as Bitcoin [4] or Ethereum [5] for its transactions, making it difficult to trace transactions and identify the users involved in them.

Darknet User Data Analysis
In order to analyze the behaviors of users and identify their hidden relationships in the darknet market, several data analysis methods have been proposed in diverse aspects.
Crawling Darknet User Data. Efficient darknet data crawling is of significant importance as a priori technology to monitor the darknet market or forum-based communities. Ghosh et al. [18] developed an automated tool for crawling and indexing content from onion sites into a large-scale data repository with over 100 million pages. Since the darknet sites often have functions to detect and prevent crawling activities, Campobasso et al. [17] recently developed a data crawling tool for semi-automatically learning the structure of darknet forums to overcome the problem.
Finding Hidden Relationships of Users. In order to find the hidden relationship of darknet users, several studies have been proposed to identify key players of the underground forum [19,20]; while some other works analyzed roles, influence levels, and social relationships of key actors [21]. Sun et al. [22] proposed a method that utilizes uploaded text data from forums to reveal private interactions among users, and Bada et al. [23] explored the prominent use cases of the specific search engine Shodan to highlight hackers' targets and motivations.
Detecting Sybil Accounts. Exact tracing and linking multiple accounts of the same user is a key to the identification of the darknet market's ecosystem by understanding their actual behaviors.
In the darknet market, users, typically vendors, often create multiple accounts within or across different markets. Numerous studies have been proposed to detect these Sybil accounts automatically. The previous Sybil detection methods can be classified into the following two approaches: one is a stylometry-based approach and the other is a photography-based one.
Stylometry-based approaches mainly utilize a technique that captures the distinctive writing style of each user. Some of the existing approaches analyze the stylometry based on datasets revolving around vendor introduction, product description [24,25], as well as meta-information such as product category and shipping location [26]. The stylometry-based methods suffer from two significant problems when applied to the real-world darknet market. First, lengthy text is required for stylometry analysis to be effective. Unlike the underground forums where rich and diverse text datasets exist, the text dataset extracted from the darknet market is limited to product descriptions or vendor introductions that are short and repetitive, often following a certain template. As demonstrated in [27], the accuracy of Sybil detection using text data decreases dramatically as overlapping data is removed. Second, stylometry-based approaches inherently depend on the language used in the target darknet market. Considering that product descriptions are often written in different languages even within a single darknet market in practice, the detection model needs to be trained or generated again when the target language is changed, impeding its applicability.
In order to overcome the limitations of stylometry-based approaches, Wang et al. [27] recently proposed a photography-based analysis scheme using deep learning models, while Zhang et al. [28] suggested a hybrid approach that integrates both the stylometry and photography styles using an Attributed Heterogeneous Information Network (AHIN). However, the previous photography-based methods [27,28] only regarded photography as a single feature. Specifically, they extracted only an image feature from the photographs as a single vector, and just matched replicated photos as proof of Sybil accounts. Thus, the performance of their detection was limited when duplicated images were removed, implying they could hardly capture the unique photography style of the actual vendors in the real world.
Detecting Sybil accounts can be applied to other contexts such as online marketplaces (e.g., Craigslist, social commerce, online shopping malls) or social media (e.g., Facebook, Twitter, YouTube). Some malicious users create multiple accounts to upload fake ratings or reviews for advertising their brand, and mimic other users' accounts or communities to earn a reputation and upload malicious content [46]. Such malicious users' accounts can be identified by the Sybil detection method. Likewise, results demonstrated in the context of a clandestine marketplace may generalize to other contexts where analogous de-anonymization strategies could be leveraged against legitimate users of broader online (social) ecosystems and across social platforms.

DATASET
We use the public archive of darknet market datasets [51] to analyze the linkability of multiple accounts of the vendors. It contains the daily or weekly crawling data of 89 darknet markets from 2013 to 2015. In this paper, we intensively explore all of its crawling data and finally choose four large markets containing many image data with various product categories: SilkRoad2, Evolution, Agora, and Alphabay.
Data Pre-processing. We built a data pre-processing tool to extract meaningful information from the crawled dataset in diverse aspects. Specifically, it extracts the ID, product image, product category, and PGP key of each vendor from the dataset of each market. Since most of the market data were scraped differently with distinct formats and structures, we additionally fine-tuned our pre-processing tool appropriately to each market. Table 1 shows the basic statistics of the darknet market dataset used in our experiment.
Ethics of Data Analysis. Our study follows the standard ethical practices for analyzing our datasets [29,30], and only analyzes the datasets that are made publicly available under the Creative Commons CC0 license [63] as in the previous research [27][28][29][30]. First, the dataset only contains publicly available information without any personally identifiable information. Second, our dataset only covers darknet markets that have been taken down by authorities. Third, our dataset does not contain any form of interaction with human subjects. Finally, our research offers useful tools to researchers or law enforcement for tracing, monitoring, and investigating crimes from darknet markets to make the benefits of our research outweigh the potential risks. The analysis follows the 'beneficence' principle in the Belmont report [43] as also claimed in [42,44].

PROPOSED METHOD
This section describes our multi-feature-based Sybil detection method. We first describe the overview of our scheme and then introduce the detailed approaches to extract features from photographs, embed the features into a model, and determine Sybil accounts.

Model Overview
Let us denote the set of $n$ vendors $V$ as $\{v_1, \dots, v_n\}$. We define a target vendor as the vendor whose multiple accounts we are searching for and compared vendors as the set of vendors who are likely to be the same as the target vendor. Each vendor $v_i$ has a photograph list $P_i = \{p_1, \dots, p_m\}$, where $m$ is the number of photos that $v_i$ has. Our method extracts four main features (image similarity, main category, subcategory, and text data) from photographs, and embeds the merged features into the vendor identification model. We use an image hash tool to evaluate the similarity of the group of photos that the vendor has uploaded. We calculate the category distribution of the products that each vendor offers. We obtain text data from the photos to figure out detailed features that cannot be extracted from image hashing. All of these features are merged using a weighted feature embedding model. We show the connectivity of vendors with different accounts using final similarity scores and suggest candidate vendors having a meaningful possibility of being the same vendor as well. The overview of our method is shown in Fig. 3. The flow of feature extraction and embedding is represented in Fig. 4.
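As a reading aid, the merge step can be sketched as a weighted sum. This is only an assumption about the form of the combination: $f_k$ denotes the similarity score of the $k$-th feature and $w_k$ its weight, while the symbols $v_t$ (target vendor), $v_c$ (compared vendor), and $\mathrm{score}$ are introduced here purely for illustration.

```latex
\[
  \mathrm{score}(v_t, v_c) \;=\; \sum_{k=1}^{4} w_k \, f_k(v_t, v_c)
\]
```

Under this reading, a feature that is measured as a distance (such as an average Hamming distance) would first have to be converted to a similarity before it is combined with the others.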

Feature Extraction
We describe how to capture multiple features (i.e., image similarity, main category, subcategory, text) from the photographs of each product that a vendor has uploaded.

Image Similarity Feature.
We obtain the first feature, image similarity, using image hash algorithms, which tell us whether two images look nearly identical. Unlike cryptographic hash algorithms (e.g., MD5, SHA), where tiny changes in the image give completely different hashes, image hash algorithms can help generate the representation of correlated vendors with equivalent or similar photos, since they generate similar output hashes when the input images are similar. We carefully investigated the Python implementations of Average Hash (AHASH), Difference Hash (DHASH), Perceptual Hash (PHASH), and Wavelet Hash (WHASH) [52] as the candidate image hashing techniques, and compared them to determine which is the most accurate for calculating identical hashes from equivalent or similar images of real-world darknet markets (same-origin-photo similarities). After converting images to image hash values using each hash algorithm, images were grouped by hash and hash type. For each group, we applied the Structural Similarity Index Measure (SSIM) [53] to each possible pair of images belonging to the same group to quantify the level of hash accuracy for image matching in darknet marketplaces. The SSIM is a metric to determine the similarity between two photos (e.g., the SSIM value will be 1.0 for two identical photos and 0 for entirely different photos). After calculating the SSIM value of each image pair in a group, the results were averaged over the number of pairs. We also calculated the weighted average SSIM for each of the four hash types. The weighted average SSIM weights each group's SSIM value by the number of image pairs in the group of the same hash value and hash type [31], such that a group with a large number of images influences the overall weighted average SSIM more than a group with only a small number of images. Table 2 shows the hash analysis results for each hash technique.
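The aggregation described above can be sketched as follows. This is a minimal illustration, assuming the per-pair SSIM values have already been computed; the function name `weighted_average_ssim` and the example group keys and values are hypothetical and not taken from our dataset.

```python
from statistics import fmean


def weighted_average_ssim(groups):
    """Average the per-group mean SSIM, weighting each group by its pair count.

    `groups` maps a hash value to the list of SSIM values of all image
    pairs that share that hash (hypothetical input layout).
    """
    # Groups without any pair (a single image) carry no SSIM information.
    non_empty = [pairs for pairs in groups.values() if pairs]
    total_pairs = sum(len(pairs) for pairs in non_empty)
    if total_pairs == 0:
        raise ValueError("no image pairs to average")
    return sum(len(pairs) * fmean(pairs) for pairs in non_empty) / total_pairs


# Illustrative values only: one group with three pairs, one with a single pair.
example_groups = {
    "hash_a": [1.0, 0.9, 0.8],
    "hash_b": [0.5],
}
weighted = weighted_average_ssim(example_groups)
unweighted = fmean(fmean(pairs) for pairs in example_groups.values())
```

In this example the larger group pulls the weighted value above the plain mean of group means, which is the intended effect of the weighting.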
For analyzing the different-origin-photo similarities of each hash function as well as the same-origin-photo similarities, we also examined image pairs with different hash values. Consequently, there were cases where hash values were the same but the images were different; but there were no cases where hash values were different but the images were the same. Therefore, same-origin-photo similarities differed depending on the hash functions, but different-origin-photo similarities were identical for all hash functions (namely zero). Hence, we only compared the same-origin-photo similarities for choosing the hash function.
Table 2: Hash analysis result. SSIM is a metric to determine the similarity between two photos, which quantifies the level of hash accuracy for image matching. The weighted average SSIM provides each group's SSIM value with a weight determined by the number of image pairs.
To obtain the representation for a vendor $v_i$, we calculated the value of the image hash for all of the photos of the vendor as a list $H_i = \{h_1, \dots, h_m\}$, where each $h_j$ represents the hash value of image $p_j$. Following the results of Table 2, DHASH was determined to be the most effective algorithm for image hashing. After the image hashing results were generated, we calculated the Hamming distance, a well-known measure of the difference between hash values, to compare the similarity of hash results and find the most similar vendor for each $v_i$ [32]. The Hamming distance is a metric to compare the similarity of binary strings, such that the value of the Hamming distance will be 0 for two identical strings. For every position at which the two strings differ, this number is incremented by one. We evaluated the Hamming distance between each image from all compared vendors and target vendors by generating a similarity matrix. For each image from the target vendor, the lowest Hamming distance value is stored when compared to the images from a compared vendor. We then calculate each compared vendor's average Hamming distance value to determine the vendor with the minimum result as a Sybil account.
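The matching procedure above can be sketched as follows. This is a minimal illustration, assuming each image hash is available as an integer bit string; the function names, vendor names, and 8-bit hash values are hypothetical and chosen only to keep the example readable.

```python
def hamming(h1: int, h2: int) -> int:
    """Number of bit positions at which two hash values differ."""
    return bin(h1 ^ h2).count("1")


def average_min_distance(target_hashes, compared_hashes):
    """For each target image keep its lowest distance to any image of the
    compared vendor, then average these minima over the target images."""
    minima = [min(hamming(t, c) for c in compared_hashes) for t in target_hashes]
    return sum(minima) / len(minima)


def closest_vendor(target_hashes, compared_vendors):
    """Return the compared vendor with the smallest average distance,
    together with all per-vendor scores."""
    scores = {
        name: average_min_distance(target_hashes, hashes)
        for name, hashes in compared_vendors.items()
    }
    return min(scores, key=scores.get), scores


# Illustrative 8-bit hashes; real image hashes are typically longer.
target = [0b11110000, 0b10101010]
compared = {
    "vendor_a": [0b11110001, 0b10101010],
    "vendor_b": [0b00001111],
}
best, scores = closest_vendor(target, compared)
```

Here `vendor_a` is selected because one of its photos is identical to a target photo and the other differs in a single bit.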

Main Category Feature.
The main category of products is the second feature representing what types of products the vendor mainly sells. The distribution of product categories of vendors stands for their identity, because sellers tend to highly specialize in some products in order to dominate the market [30]. However, there are several practical challenges to the achievement of automatic and accurate categorization. First, there are too many different names and types of categories depending on the marketplace in practice. Also, categories are sometimes incorrectly chosen since classes are typically self-selected by vendors. Moreover, there are even crawled image data without category labels. Hence, to create a more uniform embedding, we implemented a DNN classifier that was trained on data from SilkRoad2 and Agora, where ground-truth was available via labeling [30].
The DNN classifier was employed to extract features automatically without manually analyzing the feature list. The major challenge of using the DNN model is that it requires a massive amount of training data to be accurate. In the darknet market, the number of photographs per vendor is not enough to train the model, as shown in Table 1. Thus, we adopt transfer learning to pre-train a DNN using a large existing image dataset and fine-tune the last few layers with the darknet market dataset by leveraging the fact that the features of the DNN are more generic in the early layers and more dataset-specific in the later layers. In the fine-tuning procedure of the neural network, we used open tools such as PyTorch. Fig. 5 shows the model training of image categories.
Figure 3: Overview of the proposed method. First, the data is pre-processed to extract meaningful information from the crawled dataset such as the ID, product image, product category, and PGP key of each vendor. Second, four main features (image similarity, main category, subcategory, and text data) are extracted from photos, and similarity scores are calculated between vendors. Third, all of these features are merged using a weighted feature embedding model. Finally, the connectivity of vendors with different accounts is identified using the final similarity scores, and candidate vendors are suggested as well.
Figure 4: Feature extraction and embedding. After extracting features from a photograph, the similarity scores of each feature are used to calculate the final scores for the vendor by applying the weighted feature embedding method, which sets a weight $w_k$ for each feature score $f_k$, where $f_k$ represents the similarity score of the $k$-th feature.
To train the model, we use ImageNet (14 million images) [33], which is the largest annotated image dataset, in the pre-training procedure. We then replace the final softmax layer with a new softmax layer using the darknet dataset, in which each class is defined as a set of photographs uploaded by the same vendor. We then fine-tune all layers with back-propagation using the vendor's images by applying a stochastic gradient descent optimizer to minimize the cross-entropy loss function [60]. ResNet-50 [34] is one of the most common selections for generic image classification tasks, so we adopted the ImageNet pre-trained ResNet-50 model based on its outstanding performance in the results of the previous research [27]. We used datasets from SilkRoad2 and Agora, and unified similar categories into one main category by manually observing them. In the experiment, we randomly split the labeled dataset into training and test sets; the test accuracy of the DNN used for main category classification is around 0.89. As a result, as shown in Fig. 6, we defined 22 main categories (e.g., drug, apparel, book, electronics, jewelry), which were then made into a vector $M = \{m_1, \dots, m_{22}\}$ covering all of the categories. To create the category embedding for vendor $v_i$, we produce a category number list $N_i = \{n_1, \dots, n_{22}\}$, where each $n_j$ represents the number of products from the $j$-th category. We then compute the similarity of category distribution between the target vendor and the compared vendor using Cosine similarity, which measures the similarity between two vectors of an inner product space by the cosine of the angle between them. The Cosine similarity value becomes closer to 1.0 as the match rate between vectors becomes higher [54]. Cosine similarity is suitable for comparing the category distribution between vendors because it captures the relative frequency of each vector component. In addition, Cosine similarity has been demonstrated to be an effective measure for redundant datasets [45]. Therefore, it is adopted to measure the main category distribution considering the fact that category datasets can be inherently redundant. In the proposed method, the vendor with the highest Cosine similarity value to the target vendor is determined as a Sybil account for it.
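The comparison of category count vectors can be sketched as follows. This is a minimal illustration, assuming plain Python lists as count vectors; for brevity it uses five categories instead of 22, and the counts, the function name, and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: a vendor with no categorized
        # products is treated as dissimilar to everyone.
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative product counts per category (hypothetical values).
target_counts = [10, 0, 2, 0, 0]
same_mix_counts = [20, 0, 4, 0, 0]   # same distribution, twice the inventory
other_mix_counts = [0, 5, 0, 1, 0]   # no category in common

sim_same = cosine_similarity(target_counts, same_mix_counts)
sim_other = cosine_similarity(target_counts, other_mix_counts)
```

The first comparison yields a similarity of 1.0 (up to floating-point rounding) even though the inventories differ in size, because only the direction of the count vector matters.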

Subcategory Feature.
We take the subcategory of products as the third feature to capture the detailed characteristics of vendors' product distribution. In most darknet markets, subcategories are defined for some main categories, especially for drugs, while most of the other main categories do not have any (see Fig. 6). Considering the fact that drugs account for a significant portion of trading in the darknet market, detecting Sybil accounts only on the basis of the distribution of main categories may degrade the detection accuracy. Therefore, more fine-grained subcategory information is used (where available) to enhance performance. Specifically, the DNN model is used to classify the subcategory of darknet market images for uniform embedding, and it is trained using the dataset of SilkRoad2 and Agora. The test accuracy of the DNN used for subcategory classification is around 0.90. For drugs, 20 subcategories (e.g., cannabis, ecstasy, paraphernalia, dissociatives) were involved in the model training, while the name of the main category was used for the other categories that do not have subcategories. These were then made into a list $S = \{s_1, \dots, s_{41}\}$ for 21 main categories (except drug) and 20 subcategories for the drug category. We made a category number vector for the whole list and calculated the Cosine similarity between the target vendor and the compared vendor to detect a Sybil account. The approach is precisely the same as that of the main category feature extraction (cf. Section 4.2.2).
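The labeling rule above (drug items use their subcategory, all other items fall back to the main category name) can be sketched as follows. This is a minimal illustration with a shortened label list instead of the full 41 entries; the function names and the example items are hypothetical.

```python
from collections import Counter


def fine_grained_label(main_category, subcategory=None):
    """Use the subcategory for drug items, the main category name otherwise."""
    if main_category == "drug" and subcategory:
        return subcategory
    return main_category


def count_vector(items, labels):
    """Count a vendor's products per label, in the fixed order of `labels`.

    Items whose label is not in `labels` are ignored in this sketch.
    """
    counts = Counter(fine_grained_label(main, sub) for main, sub in items)
    return [counts.get(label, 0) for label in labels]


# Shortened label list for illustration (the full list has 41 entries).
labels = ["cannabis", "ecstasy", "book", "electronics"]
vendor_items = [
    ("drug", "cannabis"),
    ("drug", "cannabis"),
    ("book", None),
    ("drug", "ecstasy"),
]
vector = count_vector(vendor_items, labels)
```

The resulting vector can then be compared between vendors with Cosine similarity in the same way as the main category counts.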

Text Feature.
Darknet market vendors often put text into images, such as the names of drugs or sellers, as shown in Fig. 7. We extract the recognized text from the photos as the fourth representation of vendors. The critical challenge of applying the text extraction technique to darknet market photos is the image quality, as the resolution of images is often quite low. To resolve this problem, after detecting the text area within the picture, we apply a super-resolution technique, which improves the image quality [35]. The text data is then recognized and extracted from the improved image.
For the implementation of text extraction, we adopted popular open source models for each step: TextFuseNet [36] for detecting the text area, SwinIR [37] for applying super-resolution, and SATRN [38] for text extraction, each of which is a state-of-the-art method with high performance [55].
• TextFuseNet is a text detection method introduced by Ye et al. [36]. Our scheme adopts the TextFuseNet model using PyTorch [56] for text detection. Unlike the previous approaches that only perceive text based on limited feature representations, TextFuseNet makes use of richer fused features for text detection. The feature representations are perceived at three levels (character, word, and global) while still maintaining their general semantics. A text representation fusion technique is then applied to help achieve robust arbitrary text detection. The model uses a multi-path fusion architecture to collect and merge the text features from the different levels.
• SwinIR is an image restoration method introduced by Liang et al. [37]. We used the repository that contains the official PyTorch implementation of SwinIR [57] to restore the low resolution of darknet images. SwinIR applies transformers, which show improved results on high-level vision tasks compared to the previous methods based on convoluted neural networks. SwinIR consists of three parts: shallow feature extraction, deep feature extraction composed of several residual Swin Transformer blocks, and high-quality image reconstruction.
• SATRN is a text recognizing method of arbitrary shapes suggested by Lee et al. [38]. In our scheme, their official model [58] was adopted for text extraction. SATRN utilizes a self-attention mechanism to capture the dependency between word tokens in a sentence in order to describe 2D spatial dependencies of characters in a scene text image. It can recognize texts with arbitrary arrangements and large inter-character spacing by making use of the full-graph propagation of self-attention.
After extracting the text from images as the final step, a list T i = {t 1 , · · · , t m } for all of the photos of the vendor v i is created, where each t j represents the text of image p j for 1 ≤ j ≤ m. The Jaccard similarity [39] is then used to compare a vendor v i 's list T i with that of the other vendors. The Jaccard similarity is a statistic for measuring the similarity and diversity of two sets: the size of their intersection divided by the size of their union. If two sets share exactly the same elements, their Jaccard similarity is 1.0; conversely, if they have no elements in common, it is 0. We determine the compared vendor with the highest Jaccard similarity to be a Sybil account for the target vendor. Jaccard similarity considers only the set of unique words, while Cosine similarity operates on the full count vectors. In practice, many texts extracted from the darknet images (e.g., vendor ID) can be duplicated in a set T i for each vendor v i . Unlike Cosine similarity, whose values are significantly affected by such redundancy in words, Jaccard similarity is not affected. Therefore, Jaccard similarity is adopted in the text extraction step.
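The following minimal sketch illustrates why repeated texts do not influence the Jaccard similarity. It assumes that the recognized texts are lowercased and split on whitespace into tokens; this tokenization, the helper names, and the toy vendors are assumptions for illustration and are not specified in the text above.

```python
def vendor_tokens(texts):
    """Collect the unique lowercase tokens from all texts recognized for a vendor.

    Splitting on whitespace and lowercasing are assumptions of this sketch.
    """
    tokens = set()
    for text in texts:
        tokens.update(text.lower().split())
    return tokens

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections; 0.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Hypothetical vendors: the logo text repeats across photos of vendor A.
vendor_a = vendor_tokens(["Drugs4you", "Drugs4you", "cannabis 5g"])
vendor_b = vendor_tokens(["drugs4you", "cannabis 10g"])

text_similarity = jaccard_similarity(vendor_a, vendor_b)
```

Here the two vendors share two of four distinct tokens, so the score is 0.5 regardless of how often the logo text appears in either vendor's photos.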

Feature Embedding
We apply a multi-feature embedding technique [26] to combine each of the extracted features for each vendor. After extracting features from a photograph list P i of a vendor v i , the scores of each feature (i.e., Cosine and Jaccard similarities, Hamming distance) are used to calculate the final scores for the vendor. For the Cosine similarity and the Jaccard similarity, the larger value implies the higher similarity; however, for the Hamming distance, the smaller value implies the higher similarity. In order to match the scale and characteristic of different features, we slightly change the formula of the image similarity feature, which depends on Hamming distance value. The equation of image similarity score can be written as where 16 is the length of the image hash value.
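As a hedged illustration of the rescaling described above, the sketch below maps a Hamming distance to a similarity in [0, 1] so that larger values mean more similar images, matching the direction of the Cosine and Jaccard scores. The normalization 1 - d / 16 is an assumption that is consistent with the stated hash length of 16; representing the hash as a 16-character string and counting differing positions are likewise assumptions of this sketch.

```python
HASH_LENGTH = 16  # length of the image hash value, as stated in the text

def hamming_distance(hash_a, hash_b):
    """Number of positions at which two equal-length hash strings differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(hash_a, hash_b))

def image_similarity(hash_a, hash_b, length=HASH_LENGTH):
    """Assumed normalization: 1 - d / length, so identical hashes score 1.0."""
    return 1.0 - hamming_distance(hash_a, hash_b) / length

identical = image_similarity("a3f0c1d29b7e4410", "a3f0c1d29b7e4410")
near_duplicate = image_similarity("a3f0c1d29b7e4410", "a3f0c1d29b7e0000")
```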
Since not all the features are equally important, we apply a weighted feature embedding method [26] by assigning a positive weight w k to each feature score f k , where f k represents the similarity score of the k-th feature. After combining the features, the weight of each feature is initialized with 1/k and trained using a stochastic gradient descent optimizer with an initial learning rate of 0.01, aiming to minimize the cross-entropy loss function.
For obtaining the aggregated multi-feature representation, we used weighted feature embedding to fuse different features as follows:
• Category-based embedding value (CB) was created by fusing the main category embedding value (MC) and the subcategory embedding value (SC).
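A minimal sketch of how such weights could be trained is given below. It rests on several assumptions that are not stated in the text: the weighted scores of all compared vendors are passed through a softmax, the cross-entropy is taken against the index of the true match, the initial value 1/k is read as one over the number of features, and weights are clipped to stay positive. The function names and the toy training data are hypothetical.

```python
import math

def score(weights, features):
    """Weighted sum of per-feature similarity scores."""
    return sum(w * f for w, f in zip(weights, features))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(weights, candidates, true_idx):
    """Loss for one target vendor given feature scores of all compared vendors."""
    probs = softmax([score(weights, f) for f in candidates])
    return -math.log(probs[true_idx])

def train_weights(examples, num_features, lr=0.01, epochs=200):
    """Plain SGD on the softmax cross-entropy; weights are kept positive."""
    weights = [1.0 / num_features] * num_features  # assumed reading of "1/k"
    for _ in range(epochs):
        for candidates, true_idx in examples:
            probs = softmax([score(weights, f) for f in candidates])
            for k in range(num_features):
                # d(loss)/d(w_k) = sum_j (p_j - y_j) * f_jk
                grad = sum((p - (1.0 if j == true_idx else 0.0)) * f[k]
                           for j, (p, f) in enumerate(zip(probs, candidates)))
                weights[k] = max(weights[k] - lr * grad, 1e-6)
    return weights

# Toy data: feature 0 separates the true match, feature 1 is uninformative.
examples = [
    ([[0.9, 0.5], [0.2, 0.6], [0.1, 0.4]], 0),
    ([[0.3, 0.7], [0.8, 0.2], [0.2, 0.5]], 1),
]
initial = [0.5, 0.5]
trained = train_weights(examples, num_features=2)
```

On this toy data the weight of the informative feature grows while the weight of the uninformative one shrinks, which is the behaviour the weighting is meant to capture.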

Vendor Identification
After embedding all of the features, our method determines if a given pair of vendors is the same individual by evaluating its score. Specifically, the most similar vendor to a target vendor can be found using the final score, obtained by multiplying the score of each feature by its weight as follows:
$$\text{Score} = \mathbf{w}^{\top}\mathbf{f} = \sum_{k} w_k f_k,$$
where w k is the weight of f k and ⊤ denotes the transpose. The vendor with the highest score is evaluated as the same vendor as the target vendor. One major limitation of the previous Sybil detection schemes in the darknet market is that they could not extract multiple Sybil accounts at the same time, each with its own likelihood, even when several accounts were highly likely to be Sybils. This study employs a quantitative methodology to capture meaningful candidate vendors to overcome such a limitation faced by the previous studies. The final score is also used to find candidate vendors. We will demonstrate it in the next section.
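The scoring and ranking step can be sketched as follows, assuming the final score is the weighted sum of the per-feature similarity scores. The weights, the vendor names, and the feature values are hypothetical.

```python
def final_score(weights, feature_scores):
    """Weighted sum of the per-feature similarity scores of one compared vendor."""
    return sum(w * f for w, f in zip(weights, feature_scores))

def rank_vendors(weights, compared):
    """Return (vendor, score) pairs sorted from most to least similar."""
    scored = [(name, final_score(weights, feats)) for name, feats in compared.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical weights for: image similarity, main category, subcategory, text.
weights = [0.4, 0.1, 0.2, 0.3]

# Hypothetical per-feature scores of each compared vendor against one target.
compared = {
    "vendorA": [0.9, 0.8, 0.9, 0.7],
    "vendorB": [0.5, 0.9, 0.6, 0.4],
    "vendorC": [0.1, 0.3, 0.2, 0.0],
}

ranking = rank_vendors(weights, compared)
best_match = ranking[0][0]
```

Keeping the full ranking, rather than only the top entry, is what allows lower-ranked vendors to be inspected as candidates later.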

Methodology
Due to the inherent anonymity of darknet market users, collecting sufficient ground-truth about multiple accounts remains an open problem, a difficulty that most studies in the literature have in common. In order to overcome this limitation and evaluate the efficacy of our method in a real-world setting, we adopt a data synthesis method for the ground-truth, which is a well-known solution to the lack of ground-truth in the literature [26][27][28]. Specifically, we construct the ground-truth of Sybil accounts for training our model based on the original dataset from SilkRoad2 and Agora, in which the vendor ID and PGP key are already known. We then randomly split a given vendor's photos into two even parts, used the first set to train the model and linked the second set to the original vendor in the first set. (Later, in Section 6, we apply our method to identify vendors with multiple accounts using the Evolution and Alphabay datasets to demonstrate its versatility across darknet markets.)

Ground-truth Setting. We treat only the vendors with the same PGP key as an identical vendor and split the dataset of given vendors into two pseudo pairs for constructing the dataset of Sybil pairs. To test the feasibility of re-identifying vendors based on the features extracted by our method, we create two versions of the ground-truth dataset to show that our model does not just match the same photos in a naive way. Each product's picture was counted once, and we allowed different products to use the same image. For the duplicated version, we consider all of the vendor's photos (potentially including duplicates). For the deduplicated version, we remove all of the duplicated photos that are used for different products, using their base64 values for the duplicate check.
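The duplicate check for the deduplicated version can be sketched as below. The sketch assumes that each photo is available as raw bytes paired with a product identifier and that two photos count as duplicates when their base64 encodings are equal; the function name and the toy listing are hypothetical.

```python
import base64

def deduplicate_photos(photos):
    """Keep the first occurrence of each photo, comparing base64 encodings."""
    seen = set()
    unique = []
    for product_id, raw_bytes in photos:
        key = base64.b64encode(raw_bytes).decode("ascii")
        if key not in seen:
            seen.add(key)
            unique.append((product_id, raw_bytes))
    return unique

# Hypothetical listing: two products reuse the same image bytes.
listing = [
    ("item-1", b"\x89PNG-logo"),
    ("item-2", b"\x89PNG-logo"),
    ("item-3", b"\x89PNG-pills"),
]
deduplicated = deduplicate_photos(listing)
```

This only removes byte-identical images; a resized or re-encoded copy has a different encoding and would survive the check.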
The limitation of the ground-truth setting in our experiment is that several adversarial cases (e.g., adversarial vendors intentionally hiding multiple accounts or impersonating other vendors) were not considered for training, since we split the dataset of identical vendors to construct Sybil pairs to overcome the ground-truth problem. Such adversarial vendors may pose a real-world threat, making it difficult to link multiple Sybil accounts in practice. Hence, we will investigate the cases of adversarial vendors in Section 5.5, and discuss how to improve our method in practical applications in Section 7.
Evaluation Workflow. Fig. 8 shows the evaluation workflow of our experiment. We introduce a threshold T v to define the minimum number of photos for each vendor. For vendors with more than 2 × T v photos, we randomly split their photos into two even parts. We add the first set to the training dataset and the second set to the testing dataset. The other vendors, with more than T v images (but fewer than 2 × T v ), are appended to the training set as distractors. We set vendors in the testing group as target vendors and vendors in the training group as compared vendors. While testing the model, we retrieve the most similar training vendor for each testing vendor and check whether the pair of training and testing vendors is correctly matched.
The limitation of the threshold setting in our evaluation workflow is that vendors with fewer photos than the threshold are consistently excluded. We do not include low-activity vendors for practical reasons, but this skews the results such that they apply only to high-activity vendors. Thus, we chose a low threshold to include more vendors in our experiment in Section 5.4.

Figure 8: Workflow of the evaluation. The threshold T v indicates the minimum number of photos for each vendor. For vendors with more than 2 × T v photos, their photos are randomly split into two even parts. The first and second sets are added to the training dataset and the testing dataset, respectively. For the other vendors with more than T v images (but less than 2 × T v ), they are appended to the training set as distractors.

Table 3 displays the detailed results for the performance of each feature. It also shows the accuracy regarding duplicated and deduplicated datasets and the threshold of T v . Considering the design choice of threshold T v , a lower threshold allows us to examine more vendors, but there may not be enough training data for each vendor, and vice versa.
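The split described in the workflow can be sketched as follows. The treatment of vendors with exactly T v or exactly 2 × T v photos is not specified in the text, so the inclusive comparisons here are an assumption, as are the function name, the fixed random seed, and the toy vendors.

```python
import random

def build_evaluation_sets(vendor_photos, t_v, seed=0):
    """Split vendors into training (compared) and testing (target) sets.

    Vendors with at least 2 * t_v photos are split into two even halves;
    vendors with at least t_v photos are kept as training-only distractors;
    vendors below t_v are excluded. Inclusive bounds are an assumption.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for vendor, photos in vendor_photos.items():
        if len(photos) >= 2 * t_v:
            shuffled = list(photos)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            train[vendor] = shuffled[:half]
            test[vendor] = shuffled[half:]
        elif len(photos) >= t_v:
            train[vendor] = list(photos)  # distractor
    return train, test

# Hypothetical vendors with 40, 25, and 5 photos.
vendors = {
    "split_vendor": [f"a{i}.jpg" for i in range(40)],
    "distractor": [f"b{i}.jpg" for i in range(25)],
    "low_activity": [f"c{i}.jpg" for i in range(5)],
}
train_set, test_set = build_evaluation_sets(vendors, t_v=20)
```

With T v = 20, the first vendor contributes 20 photos to each set, the second appears only in the training set, and the third is dropped.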

Feature Extraction
As shown in Table 3, across different markets, the matching accuracy of image similarity is the highest among the features. Image similarity accuracy is 0.93 or higher when the duplicated images are not removed for different products. After deduplicating the images, the accuracy is still around 0.903-0.971 for SilkRoad2. The accuracy for Agora, which contains a variety of products, is lower than that for SilkRoad2, which mainly deals with the drug category. The matching accuracy of the text feature is the second-best, which indicates that text data from images can capture the characteristics of a vendor.
The results of category features are not significantly different for duplicated and deduplicated data; sometimes, the results from the deduplicated data are even better than those from the duplicated data. Since we compare the distribution of categories for each vendor using Cosine similarity and not just simply compare the total number of items in each category, similar results were obtained for deduplicated datasets. Also, when duplicated products were removed, exact product types from vendors were identified, so we could better capture each vendor's characteristics.
To evaluate the performance of each feature, we conduct additional analysis to empirically demonstrate the distribution of similarity scores predicted by our method. The density distribution of the similarity score was drawn by randomly choosing pairwise comparison samples. The number of pairwise comparisons used to plot the Sybil and non-Sybil density functions is equal. Fig. 10 in the Appendix shows the density distribution of the similarity score for each feature. The more clearly the distributions of Sybil and non-Sybil pairs are separated, the higher the accuracy of the feature, as can be seen in Table 3, because the features represent each vendor's characteristics. For instance, in the case of the image hash feature that showed the highest accuracy in Table 3, since the mean values of Sybil and non-Sybil pairs are conspicuously different, Sybil accounts can be effectively identified. While the mean of the total distribution, which was drawn by randomly choosing all pairwise samples (not considering any Sybil/non-Sybil label), is around 0.5, the mean values of the Sybil and non-Sybil distributions are 0.8 and 0.2, respectively.

Feature Embedding
To train the embedding model's weight, we slightly modify our workflow of Fig. 8. Given a set of vendors having more than 2 × T v photos in the training dataset, we randomly put half of the data into the new training set for calculating the weight, and the other half into the new testing set. For the other vendors that have more than T v photos (but fewer than 2 × T v ), we add them to the training dataset as distractors. We then train the weight of our feature embedding model and apply the trained weight to our original dataset built on Fig. 8. Table 4 provides the performance of our model on various metrics using the data from SilkRoad2, Agora, and the two markets combined, when we set T v = 20. Our model manages to outperform most of the other models and shows remarkable improvement on all datasets. A single category feature embedding achieves the lowest performance. Moreover, a single feature of image similarity or text performs better than the category-based feature embedding model.
Furthermore, what stands out in this table is the coverage of the model. Unlike the previous research that mainly dealt with only drugs, most of the other categories currently available in the darknet market are included, and the proposed model shows remarkable outcomes. In addition, it produces high accuracy even when deduplicated data was used. This means that the matching accuracy is high even when there is not enough training data for each vendor, and that our method properly captures the hidden characteristics of vendors, which can hardly be achieved by simply matching identical images.
Single Feature vs Multi-Feature. In order to evaluate the advantage of the multi-feature embedding model, we compare the performance between our multi-feature method and the previous approach proposed by Wang et al. [27], which used photographs as a single feature. Specifically, we used a pre-trained ResNet-50 model in both our scheme and Wang et al.'s scheme as a baseline model, because it showed the best result in [27]. The ground-truth evaluation was conducted in the same way as in Fig. 8. Table 5 shows the comparison results. As shown in the table, the accuracy of our method is remarkably high compared to the single feature case, especially on deduplicated datasets. The single feature model also achieved high accuracy when duplicated images were allowed, but its accuracy dropped when duplicated data were removed. This indicates that the single feature model's high accuracy seems to be the result of matching duplicated images, rather than actually capturing vendors' unique photography styles. Furthermore, the single feature model is more limited in its coverage. In the comparison results, it can be seen that the ResNet-50 model returns its highest accuracy for T v = 40 after removing duplicated images. However, it is still not as accurate as our model with T v = 10. That is, our model with a lower threshold (T v = 10) outperforms the single feature model with a higher threshold (T v = 40). This implies that our model covers up to 700% more vendors than the previous research and achieves higher matching accuracy even though there may not be enough training data for each vendor. The advantage is more meaningful when duplicated images are removed.

Vendor Identification
We matched the most similar training vendor for a given testing vendor in the above evaluation. Since not every vendor has a matched Sybil account in practice, we set a minimal final score threshold T 1 for deciding a Sybil account. Our model will detect a match only if the similarity score between the target vendor and the compared vendor is higher than T 1 . We also set the lowest score threshold T 2 for capturing a candidate vendor. Our model considers the candidate vendor only when the score of the second-best compared vendor is above T 2 . Although we set two threshold scores in this experiment (i.e., for the best and the second-best Sybil accounts detection), it is important to note that our method is more generic such that it can be easily extended to detect more Sybil accounts if needed. Furthermore, our model can also be applied to other marketplaces (e.g., online marketplaces, social media) to detect illicit users who manipulate reviews or upload malicious contents, as described in Section 2. When applied to other marketplaces, Sybil accounts and candidate accounts can be detected by finding the thresholds in the same way.
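The two-threshold decision can be sketched as below, assuming the compared vendors have already been ranked by final score in descending order. The sketch follows a literal reading in which the best vendor is checked against T 1 and the second-best against T 2 independently; whether a candidate should be reported only when the best match also passes is not specified in the text. The threshold values 20 and 16 are those reported later in the paper, while the function name and the scores are hypothetical.

```python
def identify(ranked, t1, t2):
    """Apply T1 to the best compared vendor and T2 to the second-best.

    `ranked` is a list of (vendor, score) pairs sorted by descending score.
    """
    result = {"sybil": None, "candidate": None}
    if ranked and ranked[0][1] > t1:
        result["sybil"] = ranked[0][0]
    if len(ranked) > 1 and ranked[1][1] > t2:
        result["candidate"] = ranked[1][0]
    return result

# Hypothetical ranked scores for two target vendors.
with_match = identify([("vendorA", 23.5), ("vendorB", 17.0), ("vendorC", 4.2)], t1=20, t2=16)
without_match = identify([("vendorX", 12.0), ("vendorY", 3.0)], t1=20, t2=16)
```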
The thresholds T 1 and T 2 were determined considering the trade-off between true positives and false positives. To examine this trade-off, the workflow from Fig. 8 was slightly modified. Given a set of distractors [27], we randomly put half of them in the training set and the other half in the testing set. We generate the Receiver Operating Characteristic (ROC) curves by changing T 1 and T 2 . Fig. 9 displays the ROC curve when T v = 20 for distractors. In this paper, we use the elbow point of the ROC curve. The results confirm the areas under the curves (AUC) are close to 1.0 (a random classifier's AUC would be 0.5, and a higher AUC is better), as the curves reach the top-left corner of the plots for T 1 . Since we could detect most of the Sybil accounts for distractors from T 1 , there were not enough cases for T 2 to find a candidate vendor. Although the graph of T 2 may not look statistically significant, the results are suitable for our purposes of finding T 2 to choose a minimum value for a candidate vendor.

Table 5: Vendor matching accuracy in the single feature model and our multi-feature model. In order to evaluate the advantage of multi-feature embedding model, the performance of the proposed method is compared to that of Wang et al.'s method [27], which used photographs as a single feature. In the evaluation, a pre-trained ResNet-50 model is used in both methods.

Effect of Candidate Vendor. In order to evaluate the effect of candidate vendors on the accuracy of Sybil detection, we measured whether candidate vendors meaningfully contain the correct match. We compared the results with and without a candidate vendor. Table 6 displays the experimental results. As shown in the table, the F-score becomes higher in all of the markets when the candidate vendor is considered, compared to the case without it. We set the threshold T v = 10 in this table to include more vendors to validate the coverage of our model.

Detection of Multiple Accounts.
To examine if our model can detect more than one Sybil account accurately, we slightly adjust the workflow of Fig. 8. For the vendors having more than 3 × T v photos, we split their photos into three parts as pseudo vendors. We add the first and second parts to the training set and the third to the testing set. For the other vendors having more than T v photos (but fewer than 3 × T v ), we add them to the training set. Therefore, each vendor in the testing dataset has two Sybil accounts in the training dataset. In this way, for each testing vendor, we identify whether all multiple accounts could be found while also considering a candidate vendor. We set T v = 10 in our simulation to include more vendors. Table 7 presents the number of multiple accounts that our method detected while including candidate vendors. As shown in the table, our model found almost all Sybil accounts for each pseudo vendor pair, taking a candidate vendor into consideration. In the case of SilkRoad2, we found all of the multiple accounts in 253 pairs (91%) out of 277 pairs with a high precision of about 95%. We also found 134 pairs (55%) out of 242 pairs with a precision of about 77% in the dataset from Agora. However, the limitation of the multiple account setting in our experiment is that fewer vendors were represented in the testing set, since we included only the vendors having more than 3 × T v photos.

Table 7: Performance of detecting multiple accounts (T v = 10, with deduplicated images). Columns: Market, Candidate vendor, Precision, Recall, F-score, Accuracy.

Case Study
We conduct case studies to examine the usage patterns of vendors' Sybil accounts based on the results we observed, in order to gain a deeper understanding of the darknet market ecosystem. The case studies also include an interesting case that achieved a high similarity score in our model but was actually not the same vendor in the ground-truth. We set T v = 20 and examined the number of accounts indicated in Table 3. We found that the percentage of cases showing a similarity higher than the threshold T 1 = 20 but not belonging to the same vendor is less than 1%, both within the same market and across different markets. Among them, we manually investigated accounts with high similarity scores (specifically, we analyzed the 10 pairs with the highest similarity score in each market). Table 8 shows the number of candidate Sybil accounts, where 'Candidate Sybil' includes the pairs whose similarity score is higher than T 1 .
Sybils within the Same Market. Fig. 11 shows an example of a Sybil pair ('salt-pepper' and 'salt-pepperSa') from SilkRoad2. They do not use identical photos, but share the same logo on their product images, which was captured by the text feature extraction in our method. Additionally, our model identified that they sell the same categories of products, namely ecstasy, cannabis, stimulants, and dissociatives. Another example, showing the same pattern as the Sybil pair of SilkRoad2, is 'Hedera' and 'HederalExpress' from Agora.
Sybils across Different Markets. Fig. 12 shows two vendors using similar vendor names ('drugsforyou' and 'Drugs4you') but operating on different markets, namely SilkRoad2 and Agora, respectively. The two vendors do not use exactly the same photos, but both put the same logo, 'Drugs4you', into their product images, and both sell products only in the drug category, especially cannabis. Additionally, 'repaaa' from SilkRoad2 and 'RepAAA' from Agora post similar images, use the same logo, 'REPLICAAAA', and sell products only in the apparel category. Each pair is thus identified as Sybil accounts, demonstrating that our model can detect additional Sybil pairs across different markets.
Adversarial Vendors. Some vendors operating Sybil accounts may not want these accounts to be identified for security or economic reasons. Fig. 13 shows such a case that our method found. Specifically, our model detected a Sybil pair from Agora, namely 'RXChemist' and 'Remedyplus'. They seek to intentionally hide the vendor relationship by sharing few identical photos, using different PGP keys and vendor IDs, and posting vendor descriptions with dissimilar structure. However, our model captured that they sell the same categories of products, such as prescription, benzos, and opioids, and upload images of highly similar items. In order to demonstrate the correctness of the result, we conducted a manual analysis and confirmed that they were actually the same vendor, because they sell 214 items with the same name, supply products from the same regions (e.g., US, UK, and Australia), and include a few identical contents in the introduction. On the other hand, we also verified that there are impersonators copying product images from other vendors. For example, in SilkRoad2, 'kimbe' and 'drugstore' were evaluated as the same vendor by our method, since they use identical photos of drugs with slightly different sizes (specifically, 'kimbe' resizes the images of 'drugstore'). After a careful manual investigation of the case, however, we found that they are not likely to be a Sybil pair, because 'drugstore' only sells drugs shipping from Croatia, but 'kimbe' supplies various types of products shipping from Vatican. Such impersonating cases may result in false positives in our method. How to handle the impersonating case will be discussed in Section 7.

CROSS MARKET ANALYSIS
We apply our method to other darknet market datasets that were not used for our model training to demonstrate the extensibility of the proposed method. We selected datasets from Evolution and Alphabay for the cross market analysis, since a large number of photos across various categories were included in their datasets. Following the evaluation process in Fig. 8, we randomly split the dataset into two parts for vendors with more than 2 × T v photos, adding the first half to the set of target vendors and the second to the set of compared vendors. For the other vendors having more than T v images (but fewer than 2 × T v ), we include them in the set of compared vendors as distractors. We then apply our model with the thresholds T 1 = 20 and T 2 = 16, as obtained from both the SilkRoad2 and Agora datasets. Table 9 presents the results of Sybil detection from those darknet markets on various metrics when applying our trained system. In this case, we used all of the photos of vendors without deliberately removing duplicated images. Also, we set T v = 10 to include more vendors to evaluate the coverage of our method. From the table, it can be seen that our model performs well on various darknet markets. Furthermore, the performance of our scheme is improved when considering candidate vendors.

Table 9: Performance of cross market Sybil detection (T v = 10, with duplicated images). We apply our method to Evolution and Alphabay datasets which were not used for training our model to evaluate the extensibility of the proposed method. Columns: Market, Candidate vendor, Precision, Recall, F-score, Accuracy.

DISCUSSION
Extension of Multi-account Detection. Our method can detect more than one Sybil account simultaneously according to the similarity score. In our experiment, we set two threshold values (T 1 and T 2 ) to find the best and the second-best Sybil accounts and demonstrated their feasibility (cf. Section 5.4). However, it is important to note that our method is more generic such that it can be easily extended to detect more Sybil accounts as the applied system requires. For example, we can also set threshold values T n for capturing the n-th candidate vendors, for n > 2. Our model then considers the candidate vendor only when the score of the n-th best compared vendor is above T n . Because this choice entails an inevitable trade-off between false positives and true positives, however, finding the optimal number of candidates would be an important open problem in practice. In contrast to the case of adversarial vendors, it is reasonable to assume that some vendors seek to leverage positive brand association across marketplaces by the inclusion of textual cues in product photographs. Such cases would provide a strong signal to our proposed methodology. Potential examples are discussed in our highlighted case studies (Section 5.5). As discussed, such cases are difficult to decouple from copy-cat behaviour between vendors. As such, performance evaluation remains a complicated and open problem, and intentionality on behalf of (Sybil) account operators should not be inferred when considering the association of multiple accounts.
Design Choice of Model. We extract four features from pictures, namely image similarity, main category, subcategory, and text data. However, it is important to note that our model can flexibly embed additional features into the vendor identification model. The DNN feature was also considered as a useful feature in our model, because the previous study in [27] showed that it could capture the unique photography style. According to our experiment, however, we observed that the similarity using a DNN feature becomes 'overconfident' toward the most similar vendor with a limited accuracy, which could not be improved even if the other features were combined. Therefore, we determined that the DNN feature is inappropriate for the multi-feature model, and demonstrated that our design choice actually outperforms it, as shown in Section 5.3.
Adversarial Vendors. One of the challenging issues in the literature is how to handle adversarial vendors. Some vendors with Sybil accounts in the darknet market actively reveal their brands by using logos or identical product photos, which can be easily captured by our model. However, adversarial vendors trying to hide their Sybil accounts or mimic other accounts make Sybil detection difficult. There may be many real-world cases such as accounts belonging to the same vendor but having different vendor IDs and PGP keys for hiding their multiple accounts, or impersonators copying other vendors' product images. For the former case, our model can effectively detect the hidden Sybil pairs; for the latter case, however, our model may wrongly detect them, incurring false positives, as shown in the case studies (cf. Section 5.5). However, we found that even if impersonators may be able to easily mimic images, they can hardly mimic the categories of all items that are being sold in practice, since this is costly to manipulate. One of the feasible solutions to mitigate this problem is to include more fine-grained category features in our method. It can be achieved by collecting more datasets covering various categories of photos from the other darknet markets, and refining the classification of categories based on them. In addition, the robustness of our method would be improved if the categories of images can be identified more accurately. For example, we can apply a self-supervised representation learning method as an image classification model in our method. Because it does not need a labeled dataset for training, it would help to enhance the accuracy of image category identification in the darknet market environment, whose datasets are mostly unlabeled. Additionally, our method can also be applied to online marketplaces or social media to detect adversarial users who intentionally hide their accounts or impersonate other accounts for malicious purposes.
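The extension to n-th candidate thresholds discussed above can be sketched as follows, assuming one threshold per rank and a ranking sorted by descending final score. Each rank is checked against its own threshold independently, which is a literal reading of the description; the function name, the third threshold value, and the scores are hypothetical.

```python
def accepted_by_rank(ranked, thresholds):
    """Return vendors whose score exceeds the threshold assigned to their rank.

    `ranked` is a list of (vendor, score) pairs sorted by descending score and
    `thresholds` is [T_1, T_2, ..., T_n]; ranks beyond n are never accepted.
    """
    accepted = []
    for (vendor, score), t_n in zip(ranked, thresholds):
        if score > t_n:
            accepted.append(vendor)
    return accepted

# Hypothetical ranking and thresholds (T_3 = 14 is purely illustrative).
ranked = [("vendorA", 23.5), ("vendorB", 17.0), ("vendorC", 9.0), ("vendorD", 8.5)]
accepted = accepted_by_rank(ranked, thresholds=[20, 16, 14])
```

Lowering the later thresholds admits more candidates per target vendor at the cost of more false positives, which is the trade-off noted in the text.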
However, there is also a possibility of misusing Sybil detection approach for de-anonymizing legitimate individuals' multiple accounts. Legitimate individuals may preserve their privacy by considering the result of our research.

CONCLUSION
We propose a novel approach for Sybil account detection in darknet markets. It leverages multi-feature extraction from photographs of darknet vendors to fingerprint their Sybil accounts within and across markets. Specifically, our method extracts various features such as image similarity, main category, subcategory, and text data to generate a multi-feature embedding at the vendor level. Also, the proposed scheme can find multiple accounts at the same time, which is the first in the Sybil detection literature. For evaluating our model, we construct the ground-truth of Sybil accounts by randomly splitting the dataset of given vendors into two pseudo pairs and including only vendors with more photos than the threshold. Even though we chose a low threshold to include more vendors in our experiment, the results could be skewed toward high-activity vendors. We have shown that our model achieves an accuracy of 98%, and its performance significantly exceeds that of the previous photography-based study leveraging only a single feature. Additionally, our method covers up to 700% more vendors. Furthermore, due to its fine-grained multiple-feature extraction from photographs, our method can be more generically applied to a variety of darknet markets regardless of their languages and item types. Sybil detection in anonymous darknet markets gives a better understanding of the ecosystem by accurately identifying vendors and actual transactions of items among them. In future work, we will investigate how to resolve the adversarial vendor problem in an efficient and accurate manner based on the insight discovered in this study.

A SIMILARITY SCORE DENSITY DISTRIBUTION
Figure 10: The density distribution of similarity score. It was drawn by randomly choosing pairwise comparison samples. The number of pairwise comparisons used to plot the Sybil and non-Sybil density functions is equal. The 'Total' density distribution of similarity score was drawn by randomly choosing all pairwise samples without considering any Sybil/non-Sybil label. We used the same dataset as in Table 3.
Figure 11: Case study of Sybil accounts within the same market.