Researchers at MIT launch the Synthetic Data Vault, a set of open source tools aimed at expanding access to data without compromising privacy.
Every year, the world generates more data than the previous year. In 2020 alone, an estimated 59 zettabytes of data will be “created, captured, copied and consumed,” according to the International Data Corporation, enough to fill about a trillion 64-gigabyte hard drives.
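The hard-drive comparison can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where a zettabyte is 10^21 bytes and a gigabyte is 10^9 bytes; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the "about a trillion drives" figure.
# Assumption: decimal (SI) units, not binary (2**70 / 2**30) units.
ZETTABYTE = 10**21  # bytes
GIGABYTE = 10**9    # bytes

total_bytes = 59 * ZETTABYTE   # IDC estimate quoted above
drive_bytes = 64 * GIGABYTE    # capacity of one drive

drives_needed = total_bytes // drive_bytes
print(f"{drives_needed:,} drives")
```

The result is a little over 920 billion drives, which is consistent with "about a trillion."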
But just because data is proliferating doesn’t mean everyone can use it. Companies and institutions, legitimately concerned about the privacy of their users, often restrict access to data sets – sometimes within their own computers. And now that the Covid-19 pandemic has closed laboratories and offices, preventing people from visiting centralized data warehouses, sharing information securely is even more difficult.
Without access to data, it is difficult to create tools that really work. Enter synthetic data: artificial information that developers and engineers can use as a substitute for real data.
Synthetic data is a bit like diet soda. To be effective, it has to resemble the “real thing” in certain ways. Diet soda should look, taste, and bubble like regular soda. Similarly, a synthetic data set must have the same mathematical and statistical properties as the real-world data set it represents. “It looks alike, and has a similar format,” says Kalyan Veeramachaneni, principal investigator of the Data to AI (DAI) Lab and a principal research scientist at MIT’s Laboratory for Information and Decision Systems. If run through a model, or used to build or test an application, it performs as real-world data would.
But – just as diet soda must have fewer calories than the regular variety – a synthetic data set must also differ from a real one in crucial ways. If it is based on an actual dataset, for example, it should not contain or even imply any information from that dataset.
Threading this needle is difficult. After years of work, Veeramachaneni and his collaborators recently unveiled a set of open source data generation tools – a one-stop shop where users can get as much data as they need for their projects, in formats from tables to time series. They call it the Synthetic Data Vault.
Maximize access while maintaining privacy
Veeramachaneni and his team first attempted to create synthetic data in 2013. They had been tasked with analyzing a large amount of information from the edX online learning platform, and they wanted to bring in some MIT students to help. The data was sensitive and couldn’t be shared with these new hires, so the team decided to create artificial data that the students could work with instead, thinking that “once they wrote the processing software, we could use it in the real data,” says Veeramachaneni.
This is a common scenario. Imagine that you are a software developer hired by a hospital. You have been asked to build a dashboard that allows patients to access their test results, prescriptions, and other health information. But you are not allowed to see any actual patient data, because it is private.
Most developers in this situation will make “a very simplistic version” of the data they need and do the best they can, says Carles Sala, a researcher at the DAI Lab. But when the dashboard goes live, there’s a good chance that “everything will fall off,” he says, “because there are some borderline cases that they weren’t taking into account.”
High-quality synthetic data, as complex as what it is intended to replace, would help solve this problem. Companies and institutions could share it freely, allowing teams to work more collaboratively and efficiently. Developers could even carry it on their laptops, knowing they weren’t putting any sensitive information at risk.
Refine the formula – and handle the constraints
In 2013, Veeramachaneni’s team gave itself two weeks to create a data pool that it could use for that edX project. The timeline “seemed really reasonable,” says Veeramachaneni. “But we completely failed.” They soon realized that if they built a series of synthetic data generators, they could make the process faster for everyone else.
In 2016, the team completed an algorithm that accurately captures correlations between different fields in a real data set – think about a patient’s age, blood pressure, and heart rate – and creates a synthetic data set that preserves those relationships, without any identifying information. When data scientists were asked to solve problems using this synthetic data, their solutions were as effective as those made with real data 70 percent of the time. The team presented this research at the IEEE International Conference on Data Science and Advanced Analytics in 2016.
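To make the idea of preserving relationships between fields concrete, here is a minimal sketch. It is not the team's algorithm: it swaps in a plain multivariate Gaussian fit (mean vector plus covariance matrix) as the simplest stand-in, and the "real" patient table is itself simulated with made-up coefficients purely for illustration. The function names `fit_gaussian` and `sample_synthetic` are hypothetical.

```python
import numpy as np

def fit_gaussian(real):
    """Summarize a numeric table by its column means and covariance matrix."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov

def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw new rows from the fitted Gaussian; no real row is copied."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Hypothetical stand-in for a real table: age, blood pressure, heart rate.
# The coefficients below are invented for the example, not clinical facts.
rng = np.random.default_rng(42)
age = rng.uniform(20, 80, size=2000)
blood_pressure = 90 + 0.6 * age + rng.normal(0, 8, size=2000)
heart_rate = 85 - 0.2 * age + rng.normal(0, 6, size=2000)
real = np.column_stack([age, blood_pressure, heart_rate])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=2000)

# Compare pairwise correlations in the real and synthetic tables.
real_corr = np.corrcoef(real, rowvar=False)
synth_corr = np.corrcoef(synthetic, rowvar=False)
max_corr_gap = np.abs(real_corr - synth_corr).max()
print("largest correlation gap:", round(float(max_corr_gap), 3))
```

A Gaussian fit only captures linear relationships and ignores each column's true shape (sampled ages can fall outside the 20 to 80 range, for instance), which is one reason more expressive generators are needed for realistic tables.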
For the next step, the team delved into the machine learning toolbox. In 2019, PhD student Lei Xu presented a new algorithm, CTGAN, at the 33rd Conference on Neural Information Processing Systems in Vancouver. CTGAN (for “Conditional Tabular Generative Adversarial Networks”) uses GANs to build and refine tables of synthetic data. GANs are pairs of neural networks that “play against each other,” says Xu. The first network, called the generator, creates something – in this case, a row of synthetic data – and the second, called the discriminator, tries to tell whether it is real or not.
“Eventually, the generator can generate perfect [data], and the discriminator can’t tell the difference,” Xu says. GANs are most often used to generate artificial images, but they also work well for synthetic data: CTGAN outperformed classic synthetic data creation techniques in 85 percent of the cases tested in Xu’s study.
Statistical similarity is crucial. But depending on what they represent, data sets also come with their own vital context and constraints, which must be preserved in the synthetic data. DAI Lab researcher Sala gives the example of a hotel ledger: a guest always checks out after checking in. Dates in a synthetic hotel reservation dataset should follow this rule too: “They have to be in the correct order,” he says.
Large data sets can contain a number of different relationships like this, each strictly defined. “Models cannot learn constraints, because constraints are very context-dependent,” says Veeramachaneni. So the team recently finished an interface that allows people to tell a synthetic data generator where those limits are. “The data is generated within those limitations,” says Veeramachaneni.
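The hotel rule gives a simple way to picture generating data within user-supplied limits. The sketch below is not the team's interface: it uses plain reject-and-resample, a deliberately naive generator that draws two independent dates, and hypothetical names (`naive_generator`, `checkout_after_checkin`, `sample_with_constraints`) chosen for this example.

```python
import random
from datetime import date, timedelta

def naive_generator(rng):
    """Stand-in for an unconstrained generator: two independent 2020 dates."""
    start = date(2020, 1, 1)
    checkin = start + timedelta(days=rng.randrange(365))
    checkout = start + timedelta(days=rng.randrange(365))
    return {"checkin": checkin, "checkout": checkout}

def checkout_after_checkin(row):
    """User-supplied constraint: a guest leaves after arriving."""
    return row["checkout"] > row["checkin"]

def sample_with_constraints(generator, constraints, n_rows, seed=0,
                            max_tries=100_000):
    """Keep only generated rows that satisfy every constraint."""
    rng = random.Random(seed)
    rows, tries = [], 0
    while len(rows) < n_rows:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("constraints rejected too many rows")
        row = generator(rng)
        if all(check(row) for check in constraints):
            rows.append(row)
    return rows

reservations = sample_with_constraints(
    naive_generator, [checkout_after_checkin], n_rows=100
)
print(reservations[0])
```

Rejection is easy to reason about but wasteful when a rule is rarely met by chance; an alternative design is to build the rule into the data itself, for example by generating a check-in date plus a positive length of stay.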
Data this precise could help businesses and organizations in many different industries. One example is banking, where increased digitization, coupled with new data privacy regulations, has “sparked growing interest in ways to generate synthetic data,” says Wim Blommaert, team leader for financial services at ING. Current solutions, such as data masking, often destroy valuable information that banks could use to make decisions, he says. A tool like SDV has the potential to sidestep the sensitive aspects of data while preserving these important constraints and relationships.
One vault to rule them all
The Synthetic Data Vault combines everything the group has built so far into “a whole ecosystem,” says Veeramachaneni. The idea is that stakeholders – from students to professional software developers – can come to the vault and get what they need, be it a large table, a small amount of time series data, or a mix of many different types of data.
The vault is open source and expandable. “There are a lot of different areas where we find that synthetic data can also be used,” says Sala. For example, if a particular group is underrepresented in a sample data set, synthetic data can be used to fill in those gaps – a sensitive endeavor that requires a lot of finesse. Companies may also want to use synthetic data to plan for scenarios they have not yet experienced, such as a large increase in user traffic.
As use cases keep popping up, more tools will be developed and added to the vault, Veeramachaneni says. It may occupy the team for another seven years at least, but they are ready: “We are only touching the tip of the iceberg.”
AI would affect all aspects of HR functions, such as how HR professionals onboard and hire people and the way they train them.
Artificial intelligence (AI) is changing every aspect of our lives, and at a rapid pace. This includes our professional lives as well. Experts expect that AI will become a more important part of our careers as companies make progress in adopting the technology. They are using more machines built on AI, which will affect our daily professional activities. Very soon, we will see machine learning and deep learning in HR (human resources) as well. It would affect all aspects of HR, such as the way HR professionals onboard and hire people and the way they train them.
Impact on onboarding and recruiting
Companies are also using machine learning and deep learning in HR to help provide on-the-job training to employees. Getting a job and settling into it doesn’t mean you know everything; you need job-related training to keep improving. This is where experts expect AI to play a major role in the years to come. It will also help one generation of professionals in an organization transfer its skills to its successors, which would help companies avoid skill shortages.
Augmenting the workforce
Robotics in human resources will play an important role in supporting the people who work in organizations whose management applies that technology. One of the main reasons people are so afraid of AI in an organization is that they feel it would do everything they can do now and replace them, leading to job losses. In today’s scenario, however, AI is all about augmenting the workforce: it helps you do your job more efficiently. Contrary to popular opinion, it would not replace you.
Surveillance of the workplace
Companies can also use machine learning and deep learning in HR to improve workplace monitoring. This is uncomfortable for many employees, who feel that such technology would invade their privacy at work. Recently, Gartner conducted a survey that found that more than half of companies with an annual turnover of more than $750 million use digital tools to gather data on their employees’ activities and monitor their overall performance. As part of this, they analyze employees’ emails to find out how engaged and happy they are with their work.
The use of workplace robots
Aside from robotics in HR, companies today also use physical robots that can move by themselves. This is especially true for warehousing and manufacturing companies. Experts expect that this will soon become a common feature in many other workplaces as well. Mobility companies are creating delivery robots that can move around the workplace and deliver items directly to your desk. Tech companies are also developing security robots, which experts believe will become commonplace because they can protect commercial properties against intruders. Companies are also developing software to help you park your car at your office.
By: Kathryn Mayer | August 10, 2020
When COVID-19 first appeared in January, Jo Deal, director of human resources for software company LogMeIn, began meeting daily with the company’s CEO and general counsel about the situation. Her original questions were logistical and scenario-based: Do we let people travel? What about employees returning from a conference?
Things progressed rapidly as the number of cases increased and the World Health Organization declared COVID-19 a pandemic in March. When that happened, Deal began meeting with the CEO and the general counsel about the looming crisis three or four times a day.
“Things were moving very fast at the time,” she says. “We still meet daily, even many months later.”
Although the first conversations revolved around logistics (for example, which employees would work from home and what was the best way to move workers to remote work), the questions quickly evolved to more personal matters: How do we help employees? How do you feel? What can we do?
“We talk a lot about flexibility and empathy and we work with our leaders to train them to try to meet people where they are,” says Deal. “And really, every day, it’s just about surviving.”
Months into the coronavirus pandemic, HR leaders have been a clear and resonant voice for their companies. They are important partners for C-suite executives, leading the way in initiatives such as moving workers to remote work and rethinking benefit offerings.
“HR is playing the role it has always played, but it is playing it exponentially,” says Jill Smart, president of the National Academy of Human Resources and former CHRO of consulting giant Accenture. “And because they are doing so well, I think the HR profession will come out of this [stronger] because they are going to play a key role.”
The pandemic has given HR executives elevated key roles in their organizations and a prominent voice amid the turmoil, but they have also become an important resource on how to treat employees, carry on the culture, and lead the way at a time when employees are collectively experiencing more shock in their personal and professional lives than ever before.
The role of HR leaders in organizations has historically been organization-centric: maintaining compliance, mitigating risk, enforcing policies. Employees traditionally are not comfortable with human resource leaders.
Many CHROs insist that seasoned HR leaders have long walked the line between being the ally of employees and the ally of the organization. But they also recognize that a triple threat of crises – the pandemic, social unrest and the ensuing economic upheaval – is driving them to focus more on employees than ever before. They are focusing on connection, empathy and the mental health of employees. And it’s sink-or-swim time for HR leaders who haven’t prioritized employee wellness in the past.