Motivation: Mail auditing is a necessary process in major web mail services, in which professional staff, referred to as "auditors", access users' personal emails for anti-spam purposes or for qualitative evaluation of new mail features. The paper argues that more than 90% of non-spam emails are machine-generated messages from subscriptions (e.g. flight information, order confirmations, event notifications, and newsletters), which usually contain private information such as a home address, telephone number, order contents, or flight details. User privacy therefore needs to be protected during mail auditing. Note that such automated messages are usually structured documents with HTML formatting and are sent at large scale to many subscribers. By leveraging these two properties of machine-generated messages, the authors aim to preserve recipients' privacy (e.g. a home address or phone number in an email) during the mail auditing process.
Methods: The approach enforces k-anonymity in the mail auditing process to protect privacy. K-anonymity is the property that a user u is indistinguishable from at least k-1 other users; here, it means that emails of the same template are sent to at least k recipients, and within such a batch only the personal information varies, referred to as variable content. Figure 1 illustrates this: the two emails are identical except for the customers (Jessie and Sergio) and the purchased items (Green Mountain Coffee and Bleu de Chanel). Such variable content is easy to hide by replacing it with * in the template. While hiding variable content, the authors also want to maximize content coverage, that is, the privacy-irrelevant information left visible, for the convenience of mail auditors.
|Figure 1. K-anonymity in mail auditing.|
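The k-anonymity condition above can be sketched as a simple filter. This is a minimal illustration, not the authors' implementation: it assumes each message has already been reduced to a (recipient, signature) pair, where the signature is some hypothetical template identifier, and the function name `k_anonymous_classes` is invented for this example.

```python
from collections import defaultdict


def k_anonymous_classes(messages, k):
    """Keep only template classes that reach at least k distinct recipients.

    messages: iterable of (recipient, signature) pairs. The signature is an
    assumed, precomputed template identifier (hypothetical for this sketch).
    Returns a dict mapping each surviving signature to its set of recipients.
    """
    recipients_by_signature = defaultdict(set)
    for recipient, signature in messages:
        # A set counts each recipient once, even if they got several copies.
        recipients_by_signature[signature].add(recipient)
    return {
        signature: recipients
        for signature, recipients in recipients_by_signature.items()
        if len(recipients) >= k
    }
```

Counting distinct recipients rather than messages matters here: many copies sent to a single user would not make that user indistinguishable from anyone else.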
|Figure 2. Identical Mail-Hash signature for mail (a) and mail (b), and a different one for mail (c).|
Once messages are grouped into equivalence classes, the variable content is masked to obtain a list of templates, where each template represents one class. A masked sample of a shopping receipt is shown in Figure 3.
|Figure 3. A masked sample of a shopping receipt.|
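The masking step can be illustrated with a short sketch. It assumes, purely for illustration, that all messages in a class have already been split into the same number of aligned fields; the function name `mask_template` and this alignment assumption are not taken from the paper.

```python
def mask_template(field_lists, mask="*"):
    """Build a masked template from the aligned messages of one class.

    field_lists: one list of text fields per message, all of equal length
    (an assumption of this sketch). A field is kept only if it is identical
    in every message; otherwise it is treated as variable content and masked.
    """
    if not field_lists:
        raise ValueError("a class must contain at least one message")
    width = len(field_lists[0])
    if any(len(fields) != width for fields in field_lists):
        raise ValueError("messages in a class must have aligned fields")
    return [
        column[0] if len(set(column)) == 1 else mask
        for column in zip(*field_lists)
    ]


# The two receipts from Figure 1, split into hypothetical fields.
receipts = [
    ["Hi", "Jessie", "you ordered", "Green Mountain Coffee"],
    ["Hi", "Sergio", "you ordered", "Bleu de Chanel"],
]
template = mask_template(receipts)
```

With the two receipts above, the customer name and the purchased item differ across messages and are replaced by *, while the shared wording stays visible to the auditor.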
When the approach is applied in a real setting, another challenge comes up: the k-anonymity requirement needs to be satisfied over time. Since auditors are exposed to more and more masked email samples over their lifetime, auxiliary release of information may violate k-anonymity; for example, auditors may gradually learn a user's private information if they are exposed to too much related information over time. The authors therefore design a daily template assignment that controls such potential release of information. In their algorithm, auditors never see templates that have been assigned to them on previous days. In addition, once an auditor has been exposed to one of a recipient's email templates, he or she will not be assigned any other template of that user.
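The two assignment constraints can be sketched as a greedy procedure. This is only one possible reading of the description above, not the authors' algorithm: the names `AuditorHistory` and `assign_templates`, the least-loaded-first ordering, and the representation of a template as a set of recipients are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AuditorHistory:
    """What one auditor has been exposed to on all previous days."""
    seen_templates: set = field(default_factory=set)
    seen_recipients: set = field(default_factory=set)


def assign_templates(templates, histories):
    """Greedily assign one day's templates to auditors.

    templates: dict mapping template id -> set of recipients it covers.
    histories: dict mapping auditor name -> AuditorHistory (updated in place).
    Returns a dict mapping template id -> auditor name, or None if no
    auditor satisfies both constraints.
    """
    load = {auditor: 0 for auditor in histories}
    assignment = {}
    for template_id, recipients in templates.items():
        chosen = None
        # Try the least-loaded auditors first (an assumed balancing rule).
        for auditor in sorted(histories, key=lambda name: load[name]):
            history = histories[auditor]
            if template_id in history.seen_templates:
                continue  # constraint 1: never repeat a template
            if history.seen_recipients & recipients:
                continue  # constraint 2: no second template of a seen user
            chosen = auditor
            break
        assignment[template_id] = chosen
        if chosen is not None:
            histories[chosen].seen_templates.add(template_id)
            histories[chosen].seen_recipients |= recipients
            load[chosen] += 1
    return assignment
```

Calling the function once per day with the same `histories` dictionary carries the exposure record forward, so a template left unassigned (`None`) signals that every auditor has already seen either that template or one of its recipients.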
Experiments: The authors run experiments on Yahoo mail traffic and compare Mail-Hash with the Min-Hash approach. Figure 4 (left) depicts the content coverage of an email template as a function of its class size. Large classes have low content coverage, since in a real setting such emails are usually sent out at large scale but contain less useful information. The results show that Mail-Hash keeps more privacy-irrelevant content exposed to auditors for review. Figure 4 (right) depicts the deduplication ratio, which measures how similar the messages within one class are, as a function of class size. It shows that Mail-Hash groups more similar messages into one class.
|Figure 4. Mail-Hash vs. Min-Hash.|
Conclusion: The paper addresses the problem of privacy protection in the web mail auditing process, focusing on machine-generated emails that contain personal information in the message body. Because machine-generated emails share a similar structure and are sent to large numbers of recipients, the paper proposes to extract a general email template by masking the privacy-sensitive variable content. This produces a list of masked templates that are shown to auditors. In the authors' online mail auditing, k-anonymity always holds. Their method not only protects privacy but also keeps a large fraction of the content available for auditors to review. In addition, the messages that Mail-Hash groups into the same class (same template) maintain a high average similarity.