Undermining CoinJoin Anonymity: Clustering Techniques, Risks & Solutions


How CoinJoin Anonymity Can Be Undermined Using Clustering

The Nature of Anonymity in Privacy Studies

Understanding anonymity is crucial when examining privacy, and de-anonymization can be usefully framed as a game: an adversary holds some amount of information and tries to identify which individual from a group was responsible for a specific action within a system. To thwart the adversary, we must preserve ambiguity, either by restricting the information available to them or by introducing randomness that complicates their guessing.

Exploring “Guess Who?” as an Analogy

Many readers will know the game “Guess Who?”, a simple variant of the broader question-and-answer format of “twenty questions.” In “twenty questions,” one player selects an item from a predetermined set while the opponent tries to identify it using at most 20 yes-or-no questions. “Guess Who?” differs in that players alternate turns, each racing to be the first to identify a character from a fixed pool of 24 cartoon figures with distinct traits such as hair color; each character is uniquely identifiable by name. Every yes-or-no answer can be represented as a binary bit, zero or one. With 20 bits one can encode any integer between 0 and 1,048,575 (2²⁰ − 1), and if a set’s elements can be arranged in a fixed order, each element can be indexed by its position. Twenty bits therefore suffice to pinpoint any one of over a million distinct elements.
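
The correspondence between answers and indices can be sketched in a few lines of Python; the particular answer sequence below is of course hypothetical:

```python
# Sketch: a sequence of yes/no answers is just a binary number, so
# 20 answers can single out one element among 2**20 possibilities.
answers = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
           0, 1, 0, 0, 1, 1, 0, 1, 0, 1]  # hypothetical responses

index = 0
for bit in answers:
    index = (index << 1) | bit  # append each answer as one more bit

print(len(answers), "answers distinguish", 2 ** len(answers), "elements")
print("this particular sequence names element number", index)  # 732469
```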

The Complexity of Real-World Information

However, in practice the questions rarely extract this theoretical maximum: 20 answers usually convey fewer than 20 bits. Many questions do not split the candidates independently, and some answers are correlated or biased. Compare a question like “Does your character wear glasses?” with “Does your character’s name come before [the name of the median remaining character] in alphabetical order?” The latter is a binary search: it maximizes the informative value of each response by halving the remaining options at every step. This logarithmic approach narrows the possibilities far more quickly than a linear method that checks candidates one by one.
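
The contrast between the two strategies can be sketched as follows; the roster names are placeholders, not the actual “Guess Who?” cast:

```python
# Sketch: binary search over an alphabetized roster vs. linear guessing.
roster = sorted(f"character_{i:02d}" for i in range(24))

def questions_binary(secret, candidates):
    """Ask 'is your name before the median remaining name?' each turn."""
    asked = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        asked += 1
        if secret < candidates[mid]:   # answer: yes, it comes earlier
            candidates = candidates[:mid]
        else:                          # answer: no
            candidates = candidates[mid:]
    return asked

def questions_linear(secret, candidates):
    """Ask 'is it X?' for each candidate in turn."""
    for asked, name in enumerate(candidates, start=1):
        if name == secret:
            return asked
    return len(candidates)

secret = roster[17]
print(questions_binary(secret, roster))  # 5 questions, about log2(24)
print(questions_linear(secret, roster))  # 18 guesses for this secret
```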

Winning by Guessing Correctly

When the objective is to win, the goal is not merely to extract maximum information from the opponent but to be the first to guess correctly. Optimizing information gain per answer is therefore not necessarily the best strategy, even in a fair game. Similarly, when using games to reason about privacy, we must account for the adversary behaving rationally: it is easy to optimize for the wrong objective when the adversary is simply intent on winning.

The Implications of Dishonest Play

When players are not assumed to act honestly, answers can be manipulated without detection. Instead of committing to an element and answering each query truthfully, a player can always give whichever answer leaves the larger number of candidates in play. By choosing responses adaptively, a player slows the rate at which the opponent gains the information needed to win. In such a Byzantine context the optimal strategies change: the questioner does best to stick with binary search, since a question that splits the candidates evenly yields one bit whichever answer is given, neutralizing the advantage of adaptive play.
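
A minimal sketch of such adaptive answering: rather than committing to a character, the answerer keeps whichever answer preserves more candidates. A skewed question loses almost nothing to this strategy, while a median split still halves the pool regardless:

```python
# Sketch of adaptive ("Byzantine") answering: never commit to a character;
# always give the answer that keeps more candidates alive.
def adaptive_answer(candidates, predicate):
    """Return the (answer, surviving candidates) that maximizes survivors."""
    yes_group = [c for c in candidates if predicate(c)]
    no_group = [c for c in candidates if not predicate(c)]
    if len(yes_group) >= len(no_group):
        return True, yes_group
    return False, no_group

candidates = list(range(24))
# Skewed question (think "does the character wear glasses?"):
_, survivors = adaptive_answer(candidates, lambda c: c < 4)
print(len(survivors))   # 20 remain: the adversary simply answered "no"
# Balanced, binary-search-style question:
_, survivors = adaptive_answer(candidates, lambda c: c < 12)
print(len(survivors))   # 12 remain whichever answer is given
```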

Understanding Information Through Shannon Entropy

Shannon entropy quantifies the information content of an answer in “Guess Who?” by how unexpected it is. For instance, after determining that an opponent’s character is bald, learning that they do not have black hair adds no new information, since the probability of black hair was already known to be zero. When only two equally likely candidates remain, a single yes-or-no question eliminates all remaining uncertainty: exactly one bit. This can be computed from the probability distribution, which in this binary scenario is a Bernoulli distribution with p = 1/2.
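
The two cases above can be checked with a small entropy function, a minimal sketch of the standard Shannon formula:

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (zero terms contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two equally likely remaining candidates: exactly one bit of uncertainty.
print(shannon_entropy([0.5, 0.5]))   # 1.0

# An impossible outcome (black hair on a known-bald character) adds nothing:
print(shannon_entropy([1.0, 0.0]))   # 0.0
```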

Calculating Entropy in Game Scenarios

For example, opening the game with “Is it [a random character’s name]?” will most likely be answered “no,” but that answer carries very little information, about log₂(24/23) ≈ 0.06 bits, leaving log₂(23) ≈ 4.52 bits of uncertainty over the 23 remaining candidates. A lucky correct guess, by contrast, resolves everything at once, delivering the full log₂(24) ≈ 4.58 bits: nearly five bits are needed to distinguish one character from a pool of 24.
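
These quantities follow directly from the logarithms, assuming a uniform 24-character pool:

```python
import math

# Quantities from the "Is it X?" opening question over 24 characters.
print(math.log2(24))       # ~4.58 bits to pin down one of 24 characters
print(math.log2(23))       # ~4.52 bits of uncertainty left after one wrong guess
print(math.log2(24 / 23))  # ~0.06 bits carried by the likely "no" answer
```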

The Complexity of Identifying Names

Shannon entropy is equally at home with non-uniform distributions, which raises questions such as “How much entropy is in a name?” Estimates suggest that U.S. surnames carry about 15 bits of information, while first names yield approximately 10-11 bits. A common name like John Smith contains less identifying information than these averages, while an uncommon name provides considerably more. Uniquely addressing the entire U.S. population requires around 29 bits of information.

The Challenge of Personal Identifiable Information

As the global population edges toward 8.5 billion, it is worth asking how many bits of information our various personal identifiers carry. The problem is compounded by the fact that, unlike in games or modern cryptography, where secrets can be randomized or periodically renewed, our real-life identifying attributes are static and often leaked unintentionally. We must trust those we interact with not to disclose our personal data, mirroring the trust we place in professionals like doctors or pilots, though the stakes for our personal information are often considerably higher.

Privacy Systems and Anonymity Sets

Privacy-enhancing technologies enable users to blend in with larger crowds. For instance, if a connection is made to a server from a Tor exit node, it might originate from one of many Tor users. When a de-anonymization adversary observes a specific event, a user’s anonymity set refers to the group of potential users to whom that event could be linked. If the receiver of an anonymous message is viewed as the adversary, they would make educated guesses about the sender based on this anonymity set.

Measuring Anonymity Through Entropy

Two influential papers introduced methods for quantifying anonymity through the entropy of the anonymity set: “Towards Measuring Anonymity” and “Towards an Information Theoretic Metric for Anonymity.” Both model how an adversary might guess the correct user from an anonymity set, taking into account varying probability distributions over its members. The size of the anonymity set can be expressed in bits of entropy; for a perfectly uniform set of n users this is simply log₂(n).
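
A minimal sketch of this entropy-based metric, assuming the adversary assigns a probability to each suspect; the skewed distribution is invented for illustration:

```python
import math

def anonymity_bits(probs):
    """Effective anonymity, in bits, of an adversary's probability
    assignment over the suspects in an anonymity set."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A perfectly uniform set of 8 suspects gives the full log2(8) = 3 bits...
print(anonymity_bits([1 / 8] * 8))
# ...but a skewed assignment over the same 8 suspects gives far less,
# even though the set is nominally the same size.
print(anonymity_bits([0.65] + [0.05] * 7))  # ~1.92 bits
```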

The Challenges of Data Attributes

In examining attributes in a dataset, even seemingly innocuous details can reveal significant information. Latanya Sweeney’s exploration of k-anonymity revealed that individuals could often be re-identified through combinations of minimal attributes. For example, a study found that 87% of the U.S. population could be uniquely identified using just five-digit ZIP codes, gender, and date of birth.

Evaluating Information Leakage

Although five digits might suggest an entropy of log₂(10⁵) ≈ 16.6 bits, the number of ZIP codes actually in use is far smaller and their populations uneven, leading to a more accurate estimate of around 13.8 bits. Gender contributes roughly one bit in most scenarios, and the entropy of a date of birth likewise depends on its distribution. Collectively, these measurements show that even a small number of attributes can drastically shrink an anonymity set.
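
A back-of-the-envelope sum makes the 87% figure plausible. The date-of-birth term below assumes roughly 365 birthdays across about 80 plausible birth years; all numbers are rough illustrations, not measurements:

```python
import math

# Rough per-attribute estimates quoted above (assumptions, not measurements).
zip_bits = 13.8                        # effective entropy of a ZIP code
gender_bits = 1.0                      # roughly one bit
dob_bits = math.log2(365 * 80)         # ~14.8 bits for a full date of birth

total = zip_bits + gender_bits + dob_bits
population_bits = math.log2(330_000_000)  # bits to single out one US resident

print(round(total, 1))            # 29.6 bits leaked by the three attributes
print(round(population_bits, 1))  # 28.3 bits needed
```

Since the three attributes together leak slightly more bits than are needed to single out one person, it is unsurprising that most of the population is uniquely identified by them.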

The Risks of Anonymized Datasets

Sweeney’s work highlights the risks inherent in anonymized datasets, demonstrating that attributes in such datasets can unintentionally intersect with publicly available information. The k-anonymity model aimed to ensure that for every attribute combination, there would be at least k entries matching that combination. This approach suggests that log₂(k) bits would be necessary to identify a specific entry among its peers. However, subsequent research has revealed vulnerabilities in this model, as the process of achieving k-anonymity can inadvertently leak information about the original data.
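
The k-anonymity property itself is easy to check mechanically. A minimal sketch, using invented toy records rather than any real dataset:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values occurring
    in the dataset is shared by at least k rows."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

# Hypothetical toy records (not from any real dataset).
rows = [
    {"zip": "02139", "gender": "F", "year": 1970, "diagnosis": "A"},
    {"zip": "02139", "gender": "F", "year": 1970, "diagnosis": "B"},
    {"zip": "02139", "gender": "M", "year": 1971, "diagnosis": "C"},
]

# The third row's quasi-identifier combination is unique, so k=2 fails.
print(is_k_anonymous(rows, ["zip", "gender", "year"], 2))  # False
```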

The Fragility of Privacy

Privacy is a delicate construct, and even minor oversights in data redaction can lead to significant breaches. Counting privacy loss in bits makes the dynamic clear: each leaked bit halves the remaining anonymity set, so privacy decays exponentially as leaked information accumulates over time. The implications of these findings are sobering, particularly as they underscore the ease with which privacy can be compromised.

Intersection Attacks in Bitcoin Transactions

In their research, the authors articulate significant concerns regarding the privacy of Bitcoin transactions. They describe two interrelated attacks that illustrate how privacy leaks can accumulate. In Bitcoin, the anonymity set for a coin is defined by the various wallet clusters into which it could be merged. The effectiveness of this anonymity set diminishes when there is additional information that can link transactions together.

Post-CoinJoin Privacy Concerns

When Alice utilizes a wallet supporting CoinJoin, she aims to enhance her privacy by mixing her coins with those of other users. The premise is that by merging inputs from different users into indistinguishable outputs, it becomes challenging to trace any particular output back to a specific input. However, if external information, such as a payment processor’s involvement, links her transaction history to her on-chain activity, her intended anonymity is compromised.

The Risks of Clustering Transactions

Tracking can reveal connections between transactions that should remain separate, creating a scenario where an adversary can cluster Alice’s transactions together, diminishing her privacy. As she continues to use CoinJoin UTXOs for subsequent transactions, even unrelated purchases can become traceable due to common payment processors or tracking methodologies.

Intersection Attacks and Deanonymization

The authors further illustrate how intersection attacks could lead to the deanonymization of Alice’s CoinJoin transactions. By analyzing the clusters of inputs for each CoinJoin transaction, an adversary can determine the likelihood of a user participating in multiple transactions, effectively merging Alice’s transactions into a single identifiable cluster.
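
The core of such an intersection attack can be sketched as plain set intersection. Assuming the adversary can list the candidate wallet clusters behind each CoinJoin round's inputs (the cluster labels below are invented for illustration), a recurring participant is whoever survives intersecting those candidate sets:

```python
# Sketch of an intersection attack across CoinJoin rounds: candidate
# cluster sets per round, intersected to find the recurring participant.
round_1_candidates = {"cluster_A", "cluster_B", "cluster_C", "cluster_D"}
round_2_candidates = {"cluster_A", "cluster_C", "cluster_E"}
round_3_candidates = {"cluster_A", "cluster_F"}

suspects = round_1_candidates & round_2_candidates & round_3_candidates
print(suspects)  # only cluster_A appears in every round
```

Each observed round shrinks the suspect pool, which is exactly the exponential erosion of anonymity described earlier: a few rounds of partial information suffice to merge Alice's transactions into one identifiable cluster.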

The Need for Improved Privacy Strategies

The findings underscore the challenges of maintaining privacy in the Bitcoin ecosystem, emphasizing the need for greater awareness and improved strategies to protect user anonymity. The complexities of privacy management require a proactive approach, as the potential for privacy breaches remains a significant concern in the evolving landscape of digital transactions.