Written By: Steve Zagieboylo, Senior Architect at Calavista
In this post, I’m going to discuss some approaches to obtain realistic Test Data without compromising the security of any customer’s sensitive data. In the world of health care software, this is referred to as Personal Health Information (PHI), but the concept exists in financial software, document management, … really, everything.
Why is Realistic Test Data Important?
The data you test against must include a lot of completely unrealistic data: the corner cases that you have to make sure you handle correctly, but almost never come up in real life. However, it is equally important to have a good quantity of realistic data, justified by, “You just never know.” I can’t count the times that a special case came up in the real data that we hadn’t considered as a corner case. If your application handles large amounts of real-world data, there’s just no substitute for testing it with large amounts of real-world data.
What Aspects are Important?
Approaches to Create Realistic Test Data
There are three general approaches:
Obfuscation / Data Masking
PHI data is more than just the names and addresses of the patients. It is important that even a determined investigator would not be able to take obfuscated data and reverse the process to figure out whom the data describes. Therefore, not only names but birthdates, dates of visits, locations, and even some specific combinations of symptoms might need to be obscured. However, changing the dates too much might change the analysis of the illness: symptoms of osteoarthritis are much more alarming in a twelve-year-old than the same symptoms in someone who is seventy-three. Proper obfuscation, or data masking, creates characteristically intact, but inauthentic, replicas of personally identifiable or other highly sensitive data, preserving its complexity and unique characteristics. In this way, tests performed on properly masked data will yield the same results as they would on the authentic dataset.
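To make that concrete, here is a minimal T-SQL sketch of the idea, assuming a hypothetical Patients table with FirstName, LastName, BirthDate, and LastVisitDate columns: names are replaced with obviously synthetic values, and all of a patient's dates are shifted by the same random offset so that ages and the intervals between visits stay clinically realistic.

```sql
-- Sketch only: the table and column names are hypothetical.
UPDATE p
SET p.FirstName     = CONCAT('Test', p.PatientID),
    p.LastName      = CONCAT('Patient', p.PatientID),
    p.BirthDate     = DATEADD(DAY, o.DayOffset, p.BirthDate),
    p.LastVisitDate = DATEADD(DAY, o.DayOffset, p.LastVisitDate)
FROM dbo.Patients AS p
CROSS APPLY (
    -- One random offset between -30 and +30 days per row; reusing it for every
    -- date column keeps the patient's age at each visit exactly the same.
    SELECT ABS(CHECKSUM(NEWID())) % 61 - 30 AS DayOffset
) AS o;
```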
Tool Recommendations
SOCR Data Sifter
An open-source tool for obfuscation. It has a setting for exactly how obfuscated the data should be: the lowest setting leaves the data unchanged, and the highest setting leaves it unrecognizable.
Documentation Link
Source Code Link
SQL Server Dynamic Data Masking (DDM)
As of SQL Server 2016, there is a Dynamic Data Masking feature that allows you to specify data masking rules on the database itself, with no need to change the underlying data. Whether a particular user sees the real or the masked data is determined by that user's privileges. Here is the best overview of the feature that I’ve found: Use Dynamic Data Masking to Obfuscate Your Sensitive Data
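As a small illustration of what those rules look like (the table, column, and role names here are hypothetical), masking functions are declared on the columns themselves, and trusted principals are granted UNMASK to see the real values:

```sql
-- Hypothetical table; partial(), default(), and email() are built-in
-- Dynamic Data Masking functions.
CREATE TABLE dbo.Patients
(
    PatientID int IDENTITY(1,1) PRIMARY KEY,
    FirstName varchar(100) MASKED WITH (FUNCTION = 'partial(1, "xxxxxx", 0)') NULL,
    BirthDate date         MASKED WITH (FUNCTION = 'default()') NULL,
    Email     varchar(200) MASKED WITH (FUNCTION = 'email()') NULL
);

-- Ordinary users see masked values; a trusted role can be granted UNMASK.
GRANT UNMASK TO QaLeadRole;
```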
Roll Your Own
I am usually a big advocate of buy over build. The software you imagine is always bug-free and does just what you want; it isn't until you've spent a bunch of time working on it that it acquires the bugs and feature compromises that make you unhappy with existing tools. In this case, however, it is a pretty small bit of code and purely internal, so if it chokes, you haven't embarrassed yourself in front of customers. The biggest risk is that some sensitive data slips through untouched, so you have to verify that nothing does, which is pretty straightforward to check. The biggest task is a serious review of the entire database schema to make sure all the important fields are being modified.
Approaches:
You can randomize the data with SQL alone, through clever use of the RAND function; a brief sketch of this appears after this list. Here is a great article on this approach: Obfuscating Your SQL Server Data.
If the code for your data model is cleanly isolated in a library, you can make a new application, pull in that library, and build your tool with it, using the same technology your developers are already comfortable with. If your code is not so cleanly isolated, a riskier approach is to write the obfuscation code inside the same code base, where it can only be triggered by a completely isolated command. I'd recommend putting what amounts to two-factor authentication in front of this code; just think what a disaster it would be if it somehow got built and then run inside the production system. It should only be included if a TOOLS_ONLY flag is set in the build process, and it should only run if the OBFUSCATION_TOOL flag is set in the application configuration. Of course, neither flag should be on by default. Even so, carefully consider the impact that an error, or even a rogue developer, could have on a production system before using this approach.
This extra code might need additional controls as well. For instance, if you have tables that are flagged as insert-only, you will need to change those settings and add new functions at your lowest data-access layer so that your obfuscation code can write the obfuscated versions of the records. Again, these functions should have two levels of enablement, such that they are not even built into the code that goes to production, and if they accidentally do get built, they do nothing because the OBFUSCATION_TOOL configuration flag is not set. A sketch of such a guard, translated to the database level, also follows this list.
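For the SQL-only approach, the key trick is that a bare RAND() in SQL Server is evaluated once per statement, so every row would get the same "random" value; seeding it per row with CHECKSUM(NEWID()) works around that. A minimal sketch, using hypothetical Phone and SSN columns:

```sql
-- RAND() without a seed returns one value for the whole statement, so seed it
-- per row with CHECKSUM(NEWID()). Phone and SSN are hypothetical columns.
UPDATE dbo.Patients
SET Phone = CONCAT('555-01', RIGHT('00' + CAST(CAST(RAND(CHECKSUM(NEWID())) * 100 AS int) AS varchar(2)), 2)),
    SSN   = RIGHT('000000000' + CAST(CAST(RAND(CHECKSUM(NEWID())) * 999999999 AS int) AS varchar(9)), 9);
```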
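The TOOLS_ONLY / OBFUSCATION_TOOL guard described above lives in the application's build and configuration system, which will look different in every code base. Purely to illustrate the same safety idea at the database level, a hypothetical obfuscation procedure could refuse to run unless an explicit environment marker is present (EnvironmentConfig and the setting names are invented for this sketch):

```sql
-- Hypothetical guard: abort unless this database copy is explicitly marked as
-- an obfuscation target, so the procedure can never run against production.
IF NOT EXISTS (
    SELECT 1
    FROM dbo.EnvironmentConfig
    WHERE SettingName = 'OBFUSCATION_TOOL' AND SettingValue = 'ON'
)
BEGIN
    THROW 50001, 'OBFUSCATION_TOOL is not enabled in this environment; aborting.', 1;
END;
```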
Steps:
These steps should be built into the DevOps tools so they can be run easily, or even automated to run every day. For one Calavista customer, part of the nightly process was to copy all the production data, obfuscate it, and install it in the QA system. It was incredibly helpful in tracking down tricky bugs: even though the names and dates were different, the IDs of the records that triggered the problems were consistent, and those were all that was saved into the logs.
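As a rough sketch of what that nightly job can look like on SQL Server (the database name, file paths, and obfuscation procedure are all hypothetical), the pipeline restores the latest production backup over the QA database and runs the obfuscation script before anyone can query it:

```sql
-- Nightly QA refresh, typically run from a SQL Agent job or a CI/CD pipeline.
-- Names, paths, and usp_ObfuscateSensitiveData are hypothetical.
RESTORE DATABASE QA_Clinical
    FROM DISK = N'\\backups\clinical\clinical_full_latest.bak'
    WITH REPLACE,
         MOVE N'Clinical_Data' TO N'D:\SQLData\QA_Clinical.mdf',
         MOVE N'Clinical_Log'  TO N'D:\SQLData\QA_Clinical_log.ldf';

-- Obfuscate the fresh copy before opening it up to the QA team.
EXEC QA_Clinical.dbo.usp_ObfuscateSensitiveData;
```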
Additional Consideration: Logging
This is more of a general thought for any system with sensitive data, but it intersects with this approach. When logging record-specific information, such as noting that an unusual process is being invoked for a certain record, developers should be conscious of which data is sensitive and never put that data in the log. For most applications, it is acceptable to put naked IDs in the logs, since those do not mean anything to someone without access to the database anyway. Fortunately, the IDs are exactly what you want to hand to developers who will be debugging against the obfuscated QA database: they still match up to the problem records, whereas the names, addresses, etc. do not.
Conclusion
When dealing with sensitive data of any sort, including health care data, it is important for QA systems to have realistic data upon which to operate. Having only small amounts of hand-generated data allows bugs and problems to creep in, and makes them very difficult to track down when they are found in the production system. There is really no substitute for getting the real production data and having a copy of it, properly obfuscated, in the QA environment. Calavista recommends performing this step automatically, as part of the nightly process. Even though getting this set up represents work that has to go on the schedule, it inevitably more than pays for itself in the long run.