Guillermo Lafuente, Security Consultant and Technical Operations Manager
The biggest challenge for big data from a security point of view is the protection of users’ privacy. Big data often contains huge amounts of personally identifiable information, so the privacy of users is a major concern.
Because of the amount of data stored, breaches affecting big data can have more devastating consequences than the ones we normally see in the press. This is because a big data security breach will potentially affect a much larger number of people, with reputational consequences and enormous legal repercussions.
When producing information for big data, organizations have to strike the right balance between the utility of the data and privacy. Before the data is stored it should be adequately anonymized, removing any unique identifier for a user. This in itself can be a security challenge, as removing unique identifiers might not be enough to guarantee the data will remain anonymous. The anonymized data could be cross-referenced with other available data using de-anonymization techniques.
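The cross-referencing risk can be illustrated with a minimal sketch. All records, field names, and the choice of quasi-identifiers below are hypothetical: the "anonymized" set has had names removed but keeps ZIP code, birth year, and sex, and it is joined against an invented public list that carries the same fields alongside names.

```python
# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02138", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical public data set that still carries names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
]

# Fields assumed to be present in both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build the join key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymized_records, public_records):
    """Return (name, record) pairs whose quasi-identifiers match exactly one person."""
    names_by_key = {}
    for person in public_records:
        names_by_key.setdefault(key(person), []).append(person["name"])

    matches = []
    for record in anonymized_records:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches.append((names[0], record))
    return matches


for name, record in reidentify(anonymized, public):
    print(name, "->", record["diagnosis"])
```

In this toy data only the first record is re-identified, because the other two share their quasi-identifiers with more than one person. That ambiguity is the property that techniques such as generalizing or suppressing quasi-identifiers try to create deliberately.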
When storing the data, organizations will face the problem of encryption. With conventional encryption, users can’t send their data encrypted if the cloud needs to perform operations over it. A solution for this is to use “Fully Homomorphic Encryption” (FHE), which allows the cloud to perform operations directly over the encrypted data, producing new encrypted data as a result. When that result is decrypted, it will be the same as if the operations had been carried out over the plain text data. So the cloud will be able to perform operations over encrypted data without knowledge of the underlying plain text data.
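The idea of computing on ciphertexts can be shown with a toy sketch. Note the assumptions: this is the Paillier scheme, which is only additively homomorphic and therefore a simpler relative of FHE rather than FHE itself, and the fixed primes are far too small and predictable for real use. The function names are illustrative.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key generation with fixed, insecure primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under the public key n."""
    n_squared = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext integer from a ciphertext."""
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Add two plaintexts by multiplying their ciphertexts; no key is needed."""
    return (c1 * c2) % (n * n)


n, private_key = keygen()
encrypted_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))  # done "in the cloud"
print(decrypt(n, private_key, encrypted_sum))  # the data owner decrypts the result
```

Here `add_encrypted` stands in for the cloud: it never sees the private key or the plaintext values, yet the owner decrypts the correct sum. FHE extends this property to arbitrary computations, at a much higher computational cost.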
A significant challenge while using big data is establishing ownership of information. If the data’s stored in the cloud, a trust boundary should be established between the data owners and the data storage owners. Adequate access control mechanisms are key in protecting the data. Access control has traditionally been provided by operating systems or applications restricting access to the information, an approach that typically exposes all the information if the system or application is compromised.
A better approach is to protect the information using encryption that only allows decryption if the entity trying to access the information is authorized by an access control policy. An additional problem is that software commonly used to store big data, such as Hadoop, doesn’t always come with user authentication by default. This makes the problem of access control worse, as a default installation would leave the information open to unauthenticated users. Big data solutions often rely on traditional firewalls or implementations at the application layer to restrict access to the information.
Big data’s a relatively new concept so there isn’t a list of best practices that are widely recognized by the security community. However, there are a number of general security recommendations that can be applied to big data:
If you’re storing your big data in the cloud, you must make sure your provider has adequate protection mechanisms in place. Check that the provider carries out periodic security audits, and agree on penalties in case adequate security standards aren’t met.
Create policies that allow access to authorized users only.
Both the raw data and the outcome from analytics should be adequately protected. Encryption should be used accordingly to ensure no sensitive data is leaked.
Data in transit should be adequately protected to ensure its confidentiality and integrity.
Access to the data should be monitored. Threat intelligence should be used to prevent unauthorized access to the data.
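For the recommendation on data in transit, one common way to get both confidentiality and integrity is TLS with certificate verification. The following is a minimal client-side sketch using Python's standard `ssl` module; the function names and the choice of TLS 1.2 as the minimum version are assumptions made for illustration, not requirements from any particular big data product.

```python
import socket
import ssl


def make_tls_context():
    """Build a client TLS context that verifies the server certificate and hostname."""
    context = ssl.create_default_context()  # certificate and hostname checks are on
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy: no older protocols
    return context


def open_secure_connection(host, port):
    """Open a TCP connection wrapped in TLS; the caller is responsible for closing it."""
    raw_socket = socket.create_connection((host, port))
    return make_tls_context().wrap_socket(raw_socket, server_hostname=host)
```

Data sent over the socket returned by `open_secure_connection` is encrypted and integrity-protected by TLS, and the handshake fails if the server certificate does not validate for the given host.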
The main solution to ensuring data remains protected is the adequate use of encryption. For example, Attribute-Based Encryption can help in providing fine-grained access control of encrypted data. Anonymizing the data is also important for addressing privacy concerns. Organizations should ensure that all sensitive information is removed from the set of records collected.
Real-time security monitoring is also a key security component for a big data project. It’s important that organizations monitor access to make sure there’s no unauthorized access. It’s also important that threat intelligence is in place so that more sophisticated attacks are detected and organizations can react to threats accordingly.
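As a rough illustration of access monitoring, the sketch below flags a user once they accumulate a number of denied access attempts within a sliding time window. The class name, the threshold, and the window length are hypothetical; a production system would feed events from real audit logs and combine them with threat intelligence.

```python
from collections import defaultdict, deque


class AccessMonitor:
    """Flag users with too many denied access attempts in a sliding time window."""

    def __init__(self, max_denied=3, window_seconds=60):
        self.max_denied = max_denied
        self.window_seconds = window_seconds
        self._denied = defaultdict(deque)  # user -> timestamps of denied attempts

    def record(self, user, allowed, timestamp):
        """Record one access event; return True if the user should be flagged."""
        if allowed:
            return False
        events = self._denied[user]
        events.append(timestamp)
        # Drop denied attempts that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.max_denied


monitor = AccessMonitor(max_denied=3, window_seconds=60)
for second in (0, 10, 20):
    if monitor.record("eve", allowed=False, timestamp=second):
        print("alert: repeated denied access for eve at t =", second)
```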
Organizations should run a risk assessment over the data they’re collecting. They should consider whether they’re collecting any customer information that should be kept private, and establish adequate policies that protect the data and the right to privacy of their clients.
If the data is shared with other organizations, how this is done must be considered. Deliberately released data that turns out to infringe on privacy can have a huge reputational and economic impact on an organization. Organizations should also carefully consider regional laws around handling customer data, such as the EU Data Protection Directive.
For example, many big data solutions look for emergent patterns in real time, whereas data warehouses often focus on infrequent batch runs. How do these different usage models impact security issues and compliance risk? In the past, large data sets were stored in highly structured relational databases. If you wanted to look for sensitive data such as the health records of a patient, you knew exactly where to look and how to access the data.
Removing any identifiable information was also easier in relational databases. Big data makes this a more complex process, especially if the data is unstructured. Organizations will have to track down what pieces of information in their big data are sensitive and then carefully isolate this information to ensure compliance.
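A first pass at tracking down sensitive values in unstructured text is often pattern matching. The sketch below is a simplified illustration: the two regular expressions (email addresses and US Social Security number formats) are assumptions chosen for the example and would miss many real-world cases, so they are no substitute for a proper data classification tool.

```python
import re

# Hypothetical patterns for two kinds of sensitive values.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive(text):
    """Return a list of (label, matched_text) pairs found in free text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def redact(text):
    """Replace every match with a placeholder naming the kind of data removed."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


note = "Contact jane@example.com, SSN 123-45-6789."
print(find_sensitive(note))
print(redact(note))
```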
Another challenge with big data is that you can have a wide variety of users, each needing access to a particular subset of information. This means the encryption solution you choose to protect the data has to reflect this new reality. Access control to the data will also need to be more granular to ensure people can only access information they are authorized to see.
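Granular access can be pictured as field-level filtering, where each role sees only its own subset of a record. The roles and field names below are hypothetical, and this sketch enforces the policy in application code only; it does not provide the encryption-based enforcement discussed earlier.

```python
# Hypothetical mapping from role to the fields that role may read.
ROLE_FIELDS = {
    "analyst": {"age_band", "region", "purchase_total"},
    "support": {"customer_id", "region"},
}


def visible_view(record, role):
    """Return only the fields of the record that the given role may see."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {field: value for field, value in record.items() if field in allowed}


record = {
    "customer_id": "C-1001",
    "age_band": "30-39",
    "region": "EU",
    "purchase_total": 249.90,
}
print(visible_view(record, "analyst"))
print(visible_view(record, "support"))
```

Denying by default for unknown roles keeps a misconfigured or newly added role from seeing data it was never granted.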
The main challenge introduced by big data is identifying sensitive pieces of information stored within the unstructured data set. Organizations must make sure they isolate sensitive information and are able to prove they have adequate processes in place to achieve it.
Some vendors are starting to offer compliance toolkits designed to work in a big data environment. Anyone using third party cloud providers to store or process data will need to make sure the providers are complying with regulations.
Security is a process, not a product. Therefore organizations using big data will need to introduce adequate processes that help them effectively manage and protect the data.
The traditional information lifecycle management can be applied to big data to guarantee the data isn’t being stored once it’s no longer needed. Also policies related to availability and recovery times will still apply to big data. However, organizations have to consider the volume, velocity, and complexity of big data and amend their information lifecycle management accordingly.
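One piece of lifecycle management, not keeping data longer than needed, can be sketched as a retention check. The categories and retention periods below are invented for illustration; real values depend on the organization's legal and business requirements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category.
RETENTION = {
    "clickstream": timedelta(days=90),
    "billing": timedelta(days=365 * 7),
}


def is_expired(record, now):
    """Return True if the record has outlived the retention period for its category."""
    limit = RETENTION.get(record["category"])
    if limit is None:
        return False  # unknown category: keep it and leave the decision to a reviewer
    return now - record["collected_at"] > limit


def apply_retention(records, now):
    """Split records into (kept, purged) lists according to the retention policy."""
    kept = [r for r in records if not is_expired(r, now)]
    purged = [r for r in records if is_expired(r, now)]
    return kept, purged


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [
    {"category": "clickstream", "collected_at": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"category": "clickstream", "collected_at": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    {"category": "billing", "collected_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
]
kept, purged = apply_retention(records, now)
print(len(kept), "kept;", len(purged), "purged")
```

At big data volumes a check like this would normally run inside the storage or processing layer rather than over in-memory lists, but the policy logic stays the same.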
If an adequate governance framework isn’t applied to big data, the data collected could be misleading and cause unexpected costs. The main problem from a governance point of view is that big data’s a relatively new concept, so there are no widely established procedures and policies to draw on.
The unstructured nature of the information makes it difficult to categorize, model, and map the data when it’s captured and stored. The problem’s made worse by the fact that the data normally comes from external sources, often making it complicated to confirm its accuracy. Organizations need to identify what information is of value for the business. If they capture all the information available, they risk wasting time and resources processing data that will add little or no value to the business.