Ensuring data security and privacy is one of the most important aspects of working with retrospective clinical data and biospecimens. This is particularly true when working with clinical text data using Natural Language Processing (NLP) tools, because clinical text can contain many HIPAA identifiers. Therefore, accurate automated deidentification is essential to strong translational research programs.

The Centers for Disease Control and Prevention (CDC), the Food and Drug Administration (FDA), and the National Cancer Institute (NCI) are working together to improve the nation’s disease surveillance capabilities. They held a joint workshop titled “Natural Language Processing and Machine Learning Workshop for Cancer Surveillance” on December 8, 2016, in Rockville, Maryland.

Rebecca Jacobson, MD, MS, from the University of Pittsburgh and Dr. Guergana Savova from Harvard Medical School / Boston Children’s Hospital gave two talks at the workshop: one on deidentification and data sharing, and a second on cTAKES and DeepPhe.

Both Jacobson and Savova are Co-PIs on the NCI-funded Cancer Deep Phenotype Extraction (DeepPhe) project, which runs from 2014 to 2019 and is part of the NCI Informatics Technology for Cancer Research (ITCR) program. The DeepPhe Project aims to use natural language processing “to associate specific genetic, epigenetic, and systems changes with particular tumor behaviors.” Progress on these research goals requires access to deidentified retrospective clinical text, and Savova and Jacobson recognize the critical importance of accurate deidentification to their work.

Several groups at the forefront of deidentification research were present, including the teams developing the MIST, NLM Scrubber, and BOB deidentification systems. These groups are taking a variety of approaches to the deidentification problem and are generally seeing excellent performance. Advances in automated deidentification are encouraging and should lead academic health centers to recognize the value of using these approaches.

As PI of the TIES Project, Jacobson designed the security and privacy measures that are used by the TIES Cancer Research Network, including but not limited to deidentification. “You absolutely need high quality deidentification,” she says, “but you also have to think of the deidentification as part of an overall strategy and set of protections to ensure privacy and security.” TIES uses a range of approaches including auditing, quarantining, deidentification, role-based access, encrypted communications, and institutional firewalls. Another important aspect of this work is designing the policies and work processes that keep the entire network operating within the established privacy and security safeguards. Because of this holistic approach, Jacobson finds that institutions are more willing to consider participating in a data sharing network.
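To make the idea of layered protections concrete, here is a minimal Python sketch that combines three of the measures named above: deidentification, role-based access, and auditing. It is an illustration only and is not how TIES, MIST, NLM Scrubber, or BOB are implemented. The `ReportStore` class, the `researcher` role, and the three regular expressions are all hypothetical, and real deidentification systems cover many more identifier types than simple patterns like these can catch.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration; they cover only a few identifier types.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

ALLOWED_ROLES = {"researcher"}  # assumed role name, not taken from TIES


def deidentify(text: str) -> str:
    """Replace each matched identifier with a bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


@dataclass
class ReportStore:
    """Toy report store: every access is checked, logged, and deidentified."""
    reports: dict                      # report_id -> raw clinical text
    roles: dict                        # user -> role
    audit_log: list = field(default_factory=list)

    def fetch(self, user: str, report_id: str) -> str:
        allowed = self.roles.get(user) in ALLOWED_ROLES
        # Audit both granted and denied requests before doing anything else.
        self.audit_log.append((user, report_id, "granted" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{user} may not read {report_id}")
        # Callers only ever receive deidentified text.
        return deidentify(self.reports[report_id])


store = ReportStore(
    reports={"r1": "Seen 12/08/2016, MRN: 12345, call 412-555-0100."},
    roles={"alice": "researcher"},
)
print(store.fetch("alice", "r1"))  # Seen [DATE], [MRN], call [PHONE].
```

The point of the sketch is that no single layer carries the whole burden: a request from an unknown user is refused and logged before any text is touched, and an authorized user still sees only the scrubbed version.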

Excitement was high at the NCI/CDC/FDA workshop about the potential of NLP and machine learning methods to improve our understanding of the burden of cancer and to widen access to the data needed to accelerate cancer research. Jacobson and Savova also took away several key ideas about NLP and machine learning processes that can be used to improve the DeepPhe project.

What are some ways you use deidentified data? How can we improve TIES security and privacy?