PII removal

PII identification and removal capability: speech, text, images

Link - https://www.jpmorgan.com/technology/technology-blog/use-ml-to-improve-customer-experience

Use ML to Improve Customer Experience Without Data and Privacy Compromise.pdf

PII removal component

I defined, designed, and delivered this capability at JPMC. It was accepted as a firm-wide capability to handle and remove PII data.

Removing PII data has many advantages.

What is Personally Identifiable Information (PII)? 

PII is any information that can be used to distinguish or trace an individual's identity, such as name, passport number, license number, social security number, date and place of birth, mother's maiden name, or biometric records, and any other information that is linked or linkable to an individual, including medical, educational, financial, and employment information.

Why removing PII data is beneficial

This blog describes how you can use machine learning to remove personally identifiable information. There are many scenarios in which PII removal helps.

Here we describe how we remove PII data from text or from speech-to-text transcripts.

How to remove PII data

Protecting customers' PII is a fundamental legal, regulatory, and business requirement for many enterprises. While much PII data sits in structured columns and hence can easily be removed, sources such as customer call transcripts, emails, and messages are unstructured: a customer may disclose PII such as addresses, names, and social security numbers at any point within a conversation. This creates a challenge for data use, as this information must be safeguarded before the data can be used.

Methods to handle PII data

PII handling methods fall into the following categories

The following types of entities should be considered for PII removal:

Numeric Entities

Numeric identifiers, e.g. phone numbers, social security numbers, license numbers, and passport numbers, are PII. These appear as numbers in text or in images.

Approach: Use regular expressions to identify and remove them.
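
As a rough illustration of the regular-expression approach, here is a minimal Python sketch. The two patterns (a US-style social security number and a US-style phone number), the placeholder tags, and the function name redact_numeric are assumptions made for this example; they are not the patterns used by the actual redaction package, and real identifier formats vary by country and data source.

```python
import re

# Illustrative patterns only (assumed US-style formats).
NUMERIC_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
    ),
}


def redact_numeric(text: str) -> str:
    """Replace each matched identifier with a placeholder tag such as <SSN>."""
    for label, pattern in NUMERIC_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact_numeric("My SSN is 123-45-6789 and my phone is (212) 555-0142."))
# prints: My SSN is <SSN> and my phone is <PHONE>.
```

Replacing a match with a tag rather than deleting it keeps the sentence readable and tells downstream consumers what kind of value was removed.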

Challenge: If you are using a transcript produced from speech, this becomes more complex: a number may appear as words ('one', 'two', ...) or as a mix of words and numerals. For example, when a customer says a number aloud on the phone, speech-to-text transcribes it along with the rest of the conversation.
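
One possible way to handle spoken digits is sketched below in Python: find runs of digit words and numerals, convert the run to a digit string, and mask it when it is long enough to look like an identifier. The word list, the min_digits threshold, the <NUMBER> tag, and the sample transcript are all assumptions made up for this illustration; a production system would also need to cope with forms such as "double five" or "twenty one".

```python
import re

# Assumed mapping of spoken digit words to digits ("oh" is often used for zero).
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# One element of a spoken number: a digit word or a run of numerals.
_TOKEN = r"(?:" + "|".join(DIGIT_WORDS) + r"|\d+)"
# A run of such elements separated by spaces, commas, or hyphens.
_RUN = re.compile(rf"\b{_TOKEN}(?:[\s,-]+{_TOKEN})*\b", re.IGNORECASE)


def redact_spoken_numbers(text: str, min_digits: int = 7) -> str:
    """Mask runs of spoken or written digits that contain at least min_digits digits."""
    def _replace(match):
        parts = re.split(r"[\s,-]+", match.group(0))
        digits = "".join(DIGIT_WORDS.get(p.lower(), p) for p in parts)
        # Short runs ("one moment", "two tickets") are left untouched.
        return "<NUMBER>" if len(digits) >= min_digits else match.group(0)
    return _RUN.sub(_replace, text)


print(redact_spoken_numbers("my number is two one two five five five 0 1 4 2 thanks"))
# prints: my number is <NUMBER> thanks
```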


Names

To identify and discover names in text or speech transcripts, one can use a Natural Language Processing (NLP) technique called Named-Entity Recognition (NER). NER identifies named entities in sentences and classifies them by type. For example, if the text is:

Tom Burphy called Google support desk and cancelled online subscription from Feb 1. 

then the NER would label it as follows:

[Tom Burphy]Person called [Google]Organization support desk and cancelled online subscription from [Feb 1]DateTime

One can then mask or remove the PII entities shown in square brackets.
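
The masking step can be sketched in Python as follows. This sketch assumes the NER model has already returned entities as (start, end, label) character offsets; the helper names span and mask_entities, and the choice of which labels count as PII, are hypothetical and only serve the illustration. No NER library is called here.

```python
from typing import Iterable, List, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, label)


def span(text: str, phrase: str, label: str) -> Entity:
    """Helper for the demo: locate a phrase and return it as an entity span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def mask_entities(text: str, entities: List[Entity],
                  pii_labels: Iterable[str] = ("Person",)) -> str:
    """Replace every entity whose label is in pii_labels with a <Label> tag."""
    pii = set(pii_labels)
    out, cursor = [], 0
    for start, end, label in sorted(entities):
        if label not in pii:
            continue
        out.append(text[cursor:start])  # keep the text before the entity
        out.append(f"<{label}>")        # substitute the placeholder
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Tom Burphy called Google support desk and cancelled online subscription from Feb 1."
entities = [
    span(text, "Tom Burphy", "Person"),
    span(text, "Google", "Organization"),
    span(text, "Feb 1", "DateTime"),
]
print(mask_entities(text, entities))
# prints: <Person> called Google support desk and cancelled online subscription from Feb 1.
```

Keeping the set of PII labels configurable lets the same function mask only names in one use case and names, organizations, and dates in another.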

How to build NER

There are many open source tools, such as the Stanford Named Entity Recognizer, which uses a statistical modeling technique called Conditional Random Field (CRF). A CRF uses a graph model to take the context of each word into account.


Physical Addresses

Often the city and state names alone are not PII. However, if you have a requirement to identify and remove addresses, remove each address accordingly and replace it with <address tag>.


Email Addresses

The format of Internet email addresses is defined by a standard [IETF RFC 5322] published by the Internet Engineering Task Force (IETF). Unfortunately, the specification does not lend itself to simple pattern matching. The PII redaction package uses a domain database of about 8000 top-level domains plus a regular expression pattern matcher.
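
The idea of pairing a loose pattern with a domain list can be sketched in Python as below. The seven-entry KNOWN_TLDS set is a tiny stand-in for the much larger domain database, and the candidate pattern, the <EMAIL> tag, and the function name are assumptions for this example; the pattern deliberately does not attempt to implement RFC 5322.

```python
import re

# Tiny stand-in for the domain database (assumption: the real list is far larger).
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "io", "uk"}

# Loose candidate pattern: local part, "@", dotted labels, then a final alphabetic label.
_CANDIDATE = re.compile(r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+([A-Za-z]{2,})")


def redact_emails(text: str) -> str:
    """Mask candidates whose last domain label is a known top-level domain."""
    def _replace(match):
        tld = match.group(1).lower()
        return "<EMAIL>" if tld in KNOWN_TLDS else match.group(0)
    return _CANDIDATE.sub(_replace, text)


print(redact_emails("Reach me at jane.doe@example.com today"))
# prints: Reach me at <EMAIL> today
```

Checking the final label against a domain list is what filters out strings that merely look like addresses, which a pattern alone cannot do.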


Tokenization

Many machine learning models that deal with text data perform an additional step called tokenization. Here we propose a custom tokenization process that converts text to integer sequences in a more secure way:

1. Text tokenization: turning the original text into a sequence of word tokens, for example ['hello', 'world'].

2. Hashing: applying a one-way hashing algorithm such as SHA-256 to each word (token) and then replacing the token with its hashed value. The algorithm is called 'one-way' because it is computationally infeasible to invert, which is why such algorithms are also widely used in cryptography. In practice each hash is unique across the dictionary of words, and each time the same word is seen in the stream the same hash value is produced.


3. Sequencing: mapping each hashed code to a sequential integer. This is achieved by replacing the hashed code with its position in the original text stream. This step saves storage and transmission space by replacing long strings with integers. Sequential integers are also required when constructing a large word embedding matrix for NLP models.
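
The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the simple word-splitting rule, the function names, and the choice to start indices at 1 are ours, and "position in the original text stream" is interpreted here as the order in which each distinct hash first appears, so that a repeated word always maps to the same integer.

```python
import hashlib
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Step 1: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def hash_token(token: str) -> str:
    """Step 2: one-way hash of a single token (SHA-256, hex encoded)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def to_sequence(hashes: List[str], index: Dict[str, int]) -> List[int]:
    """Step 3: map each hash to an integer assigned in order of first appearance."""
    return [index.setdefault(h, len(index) + 1) for h in hashes]


def encode(text: str, index: Dict[str, int]) -> List[int]:
    """Run all three steps; index is the shared hash-to-integer dictionary."""
    return to_sequence([hash_token(t) for t in tokenize(text)], index)


index: Dict[str, int] = {}
print(encode("hello world hello", index))  # prints: [1, 2, 1]
print(encode("world again", index))        # prints: [2, 3]
```

The dictionary holds only hashes and integers, never the original words, so the integer stream can be shared without shipping the vocabulary alongside it.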

A common technique to make it more difficult to reverse engineer the original text from the tokens is to randomize the hashing function. A random seed value, or salt, can be added at the hashing step. The result is a stream of numbers that is close to impossible to reverse engineer without the original dictionary.
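
A salted hashing step might look like the Python sketch below. The function names and the 16-byte salt length are assumptions for illustration; the salt is simply prepended to the token before hashing, and it would need to be kept private and reused for the whole dataset so that the same word still yields the same hash.

```python
import hashlib
import secrets


def new_salt(num_bytes: int = 16) -> bytes:
    """Generate a random salt from a cryptographically strong source."""
    return secrets.token_bytes(num_bytes)


def salted_hash(token: str, salt: bytes) -> str:
    """Hash the salt followed by the token, so the output depends on both."""
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()


salt = new_salt()
same_word_same_hash = salted_hash("hello", salt) == salted_hash("hello", salt)
differs_from_unsalted = salted_hash("hello", salt) != hashlib.sha256(b"hello").hexdigest()
print(same_word_same_hash, differs_from_unsalted)
```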

The combination of PII redaction and tokenization provides a high level of resilience against data attacks such as the rainbow table attack or word frequency analysis. In a rainbow table attack, a table is precomputed over all possible input words in order to reverse a cryptographic hash function. This technique is used, for example, for cracking password hashes. Such tables are usually used to recover passwords (or credit card numbers, etc.) up to a certain length and consisting of a limited set of characters. However, without access to the same dictionary, the mapping is not the same after indexing. The salt provides an additional safeguard.
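
To make the attack concrete, the Python sketch below uses a plain precomputed lookup table, which is a simplification of a real rainbow table (real ones use hash chains to save space). The vocabulary, the message, and the salt value are made up for the example. It shows that unsalted word hashes can be reversed by lookup, while the same table is useless against salted hashes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_lookup(words):
    """Attacker side: precompute hash -> word for a guessed vocabulary."""
    return {sha256_hex(w.encode("utf-8")): w for w in words}


def recover(hashes, lookup):
    """Return the recovered word for each hash, or None if it is not in the table."""
    return [lookup.get(h) for h in hashes]


vocabulary = ["cancel", "my", "subscription", "please"]
lookup = build_lookup(vocabulary)

message = ["cancel", "my", "subscription"]
unsalted = [sha256_hex(w.encode("utf-8")) for w in message]

# Hypothetical salt; a real one would be random and kept private.
salt = b"example-secret-salt"
salted = [sha256_hex(salt + w.encode("utf-8")) for w in message]

print(recover(unsalted, lookup))  # prints: ['cancel', 'my', 'subscription']
print(recover(salted, lookup))    # prints: [None, None, None]
```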

In word frequency analysis, suppose that the tokenized stream is intercepted. Could the original text or meaning be recovered using knowledge of word frequencies in common text? This scenario is likewise unlikely, since PII is either already redacted or very low-frequency in the token stream.

Applying PII removal

Industry Solutions

Recently a public cloud provider announced an extension of its speech transcription service in which PII is redacted from the transcript after transcription is performed, and another offers a data loss prevention service that runs in the cloud. A disadvantage of these approaches is that on-premises data must be moved to the cloud before it can be cleaned.

JPMC Solution

A typical use case is an on-premises data set that includes PII values in an unstructured text stream. Each block of text is redacted using the redaction package. Optionally, the redacted text can then be tokenized.

Conclusion

Using PII removal, PII data can be redacted from a text corpus. Using tokenization, each string of text is replaced with a non-sensitive token. Together, PII removal and tokenization enable teams to make small and secure datasets available to ML teams.