Synthetic Data Set of Human Trafficking Victims could allow big data work without privacy compromises

To prevent human trafficking effectively, those fighting it must first understand it, which these days means data. Unfortunately, rich as this personal information is in some ways, there is for obvious reasons no handy index of trafficking victims. Microsoft and the International Organization for Migration may have found a way forward with a new synthetic database that has all of the important characteristics of genuine trafficking data but is wholly artificial.

While each victim is undeniably unique, basic high-level questions, such as which nations are rapidly becoming a source of or conduit for trafficking, which routes and methods are used, and where the victims end up, are all statistically quantifiable. Thousands of these individual stories contain the evidence needed to detect trends and patterns, which is vital for prevention. In a news release announcing the data set, IOM program coordinator Harry Cook said, “Administrative statistics on known cases of human trafficking represent one of the main sources of data accessible, although such material is highly sensitive.”

“Over the past two years, IOM has been thrilled to collaborate with Microsoft Research to make progress on the essential challenge of sharing such data for study while respecting the safety and privacy of victims.”

For things like crime databases and medical records, the traditional approach has been to redact liberally. However, this method of “de-identifying” has been shown to be ineffective against any serious attempt to reconstruct the data. With numerous databases having been made public or leaked, and with computing power readily available, the redacted information can be reliably filled back in.
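Reconstruction of this kind is commonly called a linkage attack. The following is a minimal sketch in Python of how it can work, assuming a redacted data set that still carries a few indirect attributes and a second, public list that shares them. All records, field names, and the `link_records` helper are made up for illustration and are not drawn from any real data set.

```python
def link_records(redacted, reference, keys):
    """Pair each redacted record with a name when its keys match exactly one reference row."""
    # Index the reference list by the shared attributes.
    index = {}
    for row in reference:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for rec in redacted:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], rec))
    return matches


# Hypothetical toy data: names were redacted, but birth year and birthplace were not.
redacted = [
    {"birth_year": 1990, "birthplace": "Cleveland", "note": "case A"},
    {"birth_year": 1985, "birthplace": "Queens", "note": "case B"},
]
reference = [
    {"name": "Jane Doe", "birth_year": 1990, "birthplace": "Cleveland"},
    {"name": "Person Two", "birth_year": 1985, "birthplace": "Queens"},
    {"name": "Person Three", "birth_year": 1985, "birthplace": "Queens"},
]

reidentified = link_records(redacted, reference, ["birth_year", "birthplace"])
print(reidentified)
```

In this toy example only "case A" is linked back to a name, because its combination of attributes is unique in the reference list. "Case B" matches two people and stays ambiguous, which hints at why merging records into larger groups offers more protection than redaction alone.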

Microsoft Research chose to use the original data as the foundation for a synthetic data set that retains all of the source’s essential statistical relationships but none of the identifying information. It is not simply a matter of changing the name “Jane Doe” to “Janet Doeman” and her birthplace from Cleveland to Queens. Instead, groups of at least 10 people with similar or overlapping data are combined to produce a set of attributes that describes them accurately in a statistical sense but cannot be used to single out any individual.
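The article does not describe Microsoft Research's actual algorithm, so the following Python sketch illustrates only the minimum-group-size idea, using a much simpler technique: counting records per attribute combination and suppressing any combination shared by fewer than 10 people. The threshold of 10 comes from the description above. The field names, the toy records, and the `aggregate_with_threshold` helper are assumptions made for illustration.

```python
from collections import Counter

K = 10  # minimum group size, from the "at least 10" figure in the article


def aggregate_with_threshold(records, fields, k=K):
    """Count records per attribute combination, dropping groups smaller than k."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return {combo: n for combo, n in counts.items() if n >= k}


# Entirely made-up toy records with hypothetical field names.
records = (
    [{"origin": "Country X", "exploitation": "labour"}] * 12
    + [{"origin": "Country Z", "exploitation": "labour"}] * 3
)

releasable = aggregate_with_threshold(records, ["origin", "exploitation"])
print(releasable)
```

Here the group of 12 survives as an aggregate count, while the group of 3 is withheld because a count that small could point to specific individuals. A real synthetic data pipeline would go further and generate artificial records consistent with the surviving aggregates.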

Naturally, this lacks the specificity of the original data, but unlike that sensitive source, this information can actually be used. The point is less that some task force will pore over it and say, “Okay, the next smuggling operation will be based on…”, and more that the data, being based on firsthand information, can be cited as a factual basis for addressing the problem at the policy and diplomatic level. Where before it might have been necessary to state in more general terms that Country X or Government Z was negligent or complicit in these problems, having real figures to back the claim up lets one say, “36 percent of sex trafficking victims travel through your jurisdiction.”