PII removal
PII identification and removal capability - Speech, Text, Images
Link - https://www.jpmorgan.com/technology/technology-blog/use-ml-to-improve-customer-experience
PII removal component
I defined, designed, and delivered this capability at JPMC. It was accepted as a firm-wide capability to handle and remove PI data.
Remove PI data from call center transcripts
Remove PI data from text - email, customer feedback
Remove PI data from check images
Removing PI data has many advantages, e.g.
Reduces risk, so data can be moved to the cloud
Data can be shared with data science and analytics teams
Saves significant cost, because without it the whole infrastructure needs to be PCI compliant.
What is Personally Identifiable Information (PII)?
PII is any information that can be used to distinguish or trace an individual's identity, such as name, passport number, license number, social security number, birth date and place of birth, mother's maiden name, biometric records, and any other information that is linked or linkable to an individual, including medical, educational, financial, and employment information.
Why removing PII data is beneficial
This blog describes how you can use machine learning to remove personally identifiable information. There are many scenarios in which PII removal helps, e.g.
Share data with data scientists after removing PII data
Reduce the risk of exposing personal data in a breach
Publish and share data with a wide audience
Here we describe how we remove PI data from text and from speech-to-text transcripts.
How to remove PII data
Protecting customers' PII is a fundamental legal, regulatory, and business requirement for many enterprises. While much PII data is in structured columns and hence can easily be removed, sources like customer call transcripts, emails, and messages are examples of unstructured data sources in which a customer may disclose PII such as addresses, names, and social security numbers at any point within a conversation. This creates a challenge for data use, as this information must be safeguarded before the data can be used.
Methods to handle PII data
PI handling methods fall into the following categories
discover
classify
mask
remove sensitive elements
The following types of entities should be considered for PII removal
Numeric Entities
Numeric identifiers, e.g. phone numbers, Social Security numbers, license numbers, and passport numbers, are PII. These appear as numbers in text or in images.
Approach: Use regular expressions to identify and remove them
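As a minimal sketch of the regular expression approach, the snippet below redacts two kinds of numeric identifier. The patterns assume US-style formatting for Social Security and phone numbers, and the names PATTERNS and redact_numeric are hypothetical, chosen for illustration only. Real data will need more patterns and more tolerant formats.

```python
import re

# Illustrative US-style patterns only (an assumption); real identifiers
# vary by country and by how they were typed.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
}


def redact_numeric(text: str) -> str:
    """Replace every match with a tag naming the entity type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact_numeric("My SSN is 123-45-6789, call 212-555-0147."))
```

Replacing a match with a typed tag rather than deleting it keeps the sentence readable and tells downstream users what kind of value was removed.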
Challenge: With speech transcripts this becomes more complex, because numbers may appear as words ('one', 'two', ...) or be transcribed as similar-sounding words. For example, a caller may read out a number over the phone and the speech-to-text engine transcribes the whole conversation, e.g.
7 1 to 9 (7 1 2 9)
7 1 to 9 and last digit is 4. (7 1 2 9 4)
7 one to ... (I could not hear, could you repeat). number is 7 1 to and 9 and for. (7 1 2 9 4)
7 on to nine for (7 1 2 9 4)
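One way to handle transcripts like the examples above is to normalize spoken digits before applying the numeric patterns. The sketch below is a heuristic under stated assumptions, not the production method: the word lists are hypothetical and incomplete, and similar-sounding words such as 'to' or 'for' are only treated as digits when they sit in a run that also contains an unambiguous digit.

```python
import re
from typing import List

# Unambiguous digit words.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
# Similar-sounding words and mis-hearings (an assumed, incomplete list).
AMBIGUOUS = {"oh": "0", "on": "1", "to": "2", "too": "2", "for": "4"}


def extract_digit_runs(text: str) -> List[str]:
    """Return the digit strings spoken in the text, one per contiguous run."""
    runs: List[str] = []
    current: List[str] = []
    has_definite = False  # True once the run holds an unambiguous digit
    for token in re.findall(r"[a-z]+|\d", text.lower()):
        if token.isdigit():
            current.append(token)
            has_definite = True
        elif token in DIGIT_WORDS:
            current.append(DIGIT_WORDS[token])
            has_definite = True
        elif token in AMBIGUOUS:
            current.append(AMBIGUOUS[token])
        else:
            # Any other word ends the run; keep it only if it was anchored
            # by at least one unambiguous digit.
            if has_definite:
                runs.append("".join(current))
            current, has_definite = [], False
    if has_definite:
        runs.append("".join(current))
    return runs


print(extract_digit_runs("7 on to nine for"))
```

The heuristic can over-match (for instance "5 for help" reads as two digits), so in practice the recovered runs would still be checked against length and format rules for the identifier in question.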
Names
To identify and discover names in text or speech transcripts, one can use a Natural Language Processing (NLP) technique called Named-Entity Recognition (NER). NER identifies named entities in sentences and classifies the entities by type. For example, if the text is:
Tom Burphy called Google support desk and cancelled online subscription from Feb 1.
then the NER would label it as follows:
[Tom Burphy]Person called [Google]Organization support desk and cancelled online subscription from [Feb 1]DateTime
One can then mask or remove the PII entities marked in square brackets.
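The masking step can be sketched independently of any particular NER tool. In the snippet below the entity spans are written by hand as a stand-in for what a NER model would return, and the function name mask_entities and the choice to treat only Person as PII are assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

# (start, end, label) with end exclusive, as character offsets into the text.
Entity = Tuple[int, int, str]


def mask_entities(
    text: str,
    entities: List[Entity],
    pii_labels: FrozenSet[str] = frozenset({"Person"}),
) -> str:
    """Replace each entity whose label counts as PII with its label in brackets."""
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        if label in pii_labels:
            text = text[:start] + f"[{label}]" + text[end:]
    return text


text = ("Tom Burphy called Google support desk and cancelled "
        "online subscription from Feb 1.")
# Hand-written spans standing in for NER output.
spans = [("Tom Burphy", "Person"), ("Google", "Organization"), ("Feb 1", "DateTime")]
entities = [(text.index(s), text.index(s) + len(s), label) for s, label in spans]

print(mask_entities(text, entities))
```

Which labels count as PII is a policy decision, so it is passed in as a parameter rather than hard-coded.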
How to build NER
There are many open source tools, such as the Stanford Named Entity Recognizer, which uses a statistical modeling technique called Conditional Random Fields (CRF). A CRF uses a graphical model to take the surrounding context into account when labeling each word.
Physical Addresses
City and state names on their own are often not PII. However, if you have a requirement to identify and remove addresses, you can
Use the USPS address dataset to match addresses
Accordingly, remove each matched address and replace it with an <address> tag
Email Addresses
The format of Internet email addresses is defined by a standard [IETF RFC 5322] published by the Internet Engineering Task
Force (IETF). Unfortunately, the specification doesn't lend itself to simple pattern matching. The PII redaction package uses a
domain database of about 8000 top level domains plus a regular expression pattern matcher.
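The idea of combining a pattern matcher with a domain list can be sketched as follows. This is not the redaction package itself: KNOWN_TLDS is a tiny hypothetical stand-in for the domain database, and the pattern is deliberately loose rather than an implementation of RFC 5322.

```python
import re

# Tiny stand-in for the top level domain database (an assumption);
# a real list is far larger.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "io"}

# Loose candidate pattern: local part, '@', dotted domain, final label captured.
EMAIL_CANDIDATE = re.compile(
    r"[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+([A-Za-z]{2,})"
)


def redact_emails(text: str) -> str:
    """Redact candidates whose final domain label is a known top level domain."""
    def replace(match: "re.Match[str]") -> str:
        tld = match.group(1).lower()
        return "<EMAIL>" if tld in KNOWN_TLDS else match.group(0)

    return EMAIL_CANDIDATE.sub(replace, text)


print(redact_emails("Reach me at jane.doe@example.com today"))
```

Checking the final label against a domain list filters out strings that merely look like addresses, at the cost of missing addresses whose domain is absent from the list.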
Tokenization
Many machine learning models that deal with text data perform an additional step called tokenization. Here we propose a
custom tokenization process to convert text to integer sequences in a more secure way:
1. Text tokenization: turning the original text into a sequence of word tokens, for example, ['hello', 'world'].
2. Hashing: applying a one-way hashing algorithm such as SHA-256 to each word (token) and then replacing the token
with its hashed value. The algorithm is called 'one-way' because it is computationally infeasible to invert, which is why such
algorithms are also used to protect stored passwords. In practice each word in the dictionary gets a distinct hash, and each time the
same word is seen in the stream the same hash value is produced.
3. Sequencing: mapping hashed codes to sequential integers. This is achieved by replacing each hashed code with its position
in the original text stream. This step helps to save storage and transmission space by replacing long strings with integers.
Sequential integers are also required when constructing a large word embedding matrix for NLP models.
A common technique to make it more difficult to reverse engineer the original text from the tokens is to randomize the
hashing function. A random seed value or salting can be added at the hashing step. The result is a stream of numbers that is
close to impossible to reverse engineer without the original dictionary.
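The three steps plus salting can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the production tokenizer: whitespace splitting stands in for real tokenization, the salt is simply prepended to each word before SHA-256, and "position" is read as the order in which each hash is first seen. The function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Step 1: split text into lowercase word tokens (whitespace split only)."""
    return text.lower().split()


def hash_token(token: str, salt: bytes) -> str:
    """Step 2: salted one-way hash of a single token."""
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()


def encode(text: str, salt: bytes, vocab: Dict[str, int]) -> List[int]:
    """Step 3: map each hash to a sequential integer, assigned on first sight."""
    ids = []
    for token in tokenize(text):
        digest = hash_token(token, salt)
        if digest not in vocab:
            vocab[digest] = len(vocab) + 1
        ids.append(vocab[digest])
    return ids


vocab: Dict[str, int] = {}
print(encode("hello world hello", b"example-salt", vocab))  # [1, 2, 1]
```

The vocab dictionary holds only hashes and integers, never the original words, so the integer stream can be shared while the salt and the dictionary stay with the data owner.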
The combination of PII redaction and tokenization provides a high level of resilience against attacks such as the rainbow table attack
or word frequency analysis. In a rainbow table attack, a table is precomputed over all likely input words in order to reverse a
cryptographic hash function. This technique is used, for example, for cracking password hashes. Such tables are usually used to
recover a password (or credit card number, etc.) up to a certain length consisting of a limited set of characters. However,
without access to the same dictionary, an attacker cannot reproduce the mapping after indexing. The salt provides an additional
safeguard.
In word frequency analysis, suppose that the tokenized stream is intercepted. Could the original text or meaning be reversed
using knowledge of word frequency in common text? This scenario is likewise unlikely, since PII is either already redacted or
occurs at very low frequency in the token stream.
Applying PII removal
Industry Solutions
Recently a public cloud provider announced an extension of its speech transcription service in which the transcript is PII
redacted after the transcription is performed, and another has a data loss prevention service running on the cloud. A
disadvantage of these approaches is that on-premises data needs to be moved to the cloud before it can be cleaned.
JPMC Solution
A typical use case is an on-premises data set which includes PII values in an unstructured text stream. Each block of text is
redacted using the redaction package. Optionally, the redacted text can then be tokenized.
Conclusion
Using PI removal, PI data can be redacted from a text corpus. Using tokenization, each string of text is replaced with a
non-sensitive token. Together, PI removal and tokenization enable teams to make small and secure datasets available to ML teams.