Data-centric AI: How to Get Better Performance and Fairness in AI
Artificial Intelligence (AI) and data are inseparable. You can’t have AI without data: all AI is centered on and derived from data, so in AI, data is implicit. So, if AI is data-centric by nature, why are we talking about data-centric AI? It’s like saying “electricity-powered light bulb”, as if a light bulb could be powered by something else. And yes:
Data is to AI what electricity is to the light bulb.
Nevertheless, data-centric AI has become an important concept. Attention to it is growing, driven by the need to improve the performance, robustness and fairness of AI models in the real world and to remove bias from them. I first encountered the concept when I attended an online seminar by the renowned Andrew Ng (Founder & CEO of Landing AI, Founder of deeplearning.ai, Co-Chairman and Co-Founder of Coursera). He passionately believes that engineering data is the next big breakthrough in building powerful AI applications.
Data-centric AI is predicated largely on two facts.
- Significant gains in the performance of AI models can be achieved by focusing on engineering high-quality datasets on which existing state-of-the-art models can be further trained (fine-tuning), rather than devoting time to building new models or improving other aspects of existing ones (such as the architecture or the loss function). Unless you are Meta or Google, you should rarely, if ever, try to build a new model from scratch. Your research should instead focus on finding existing models that do what you want, or something similar, and then customizing them by transfer learning on your specialized datasets.
- High-performing existing models do not perform the same across all the real-world situations in which they need to be deployed. A model that performs well in one real-world situation doesn’t always perform well in another. That is why the biggest caution in deploying AI models is this: even if a model does exactly what you want and reports high performance, you should never deploy it as a plug-and-play module.
Why Models Should Not Be Deployed as Plug-and-Play Modules
I work for a company called Trust Stamp. Trust Stamp is a highly innovative company in the biometric identification industry, building privacy-ensuring biometric identity solutions that are powered by high-performance models. Here are some of the issues we encounter very often.
- Face detection/recognition models do not perform the same for white and black people. This is quite common and somewhat expected. Most of the data that researchers use to develop these state-of-the-art models was collected largely in the West and is dominated by images of white people. Hence, the models are better suited to detecting and recognizing white people. It would therefore be disastrous to deploy such models where people with darker skin tones would be subject to this bias. What we do is further train (fine-tune) such models with images of people with darker skin tones. This customizes the model to work for new use cases and environments, and it is much cheaper and easier than training a new model from scratch.
- A face recognition model deployed on mobile devices may perform well on some devices and poorly on others because of the quality of their pictures. We sometimes find that the same model performs better on a recent Samsung phone than on an older version of the same phone, or better on some phone brands and worse on others. This is because phones have different camera sensors. Some phones have better cameras and so produce higher-quality images that are better suited to computer vision tasks.
- The sensor issue is quite common in image and computer vision AI systems. The quality of the sensor generally dictates how well or poorly a model performs, even for images generated by more complex systems like x-ray machines. A model trained to detect a disease from x-ray images may perform well on images from some x-ray machines and much worse on images from others. Imagine taking an AI solution built in a developed country with images from newer-generation x-ray machines and deploying it as is in a developing country that still uses older-generation machines. This could spell death, as the disease can easily go undetected. What should be done? Bring the solution, but first fine-tune it by further training with images from the older-generation machines.
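One way to surface gaps like these before deployment is to report accuracy separately for each group or device instead of as a single overall number. The snippet below is only a minimal Python sketch of that idea and does not describe any production system. The record layout, the device names and the helper functions `accuracy_by_group` and `largest_gap` are hypothetical, and the data is made up for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for an iterable of (group, y_true, y_pred)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def largest_gap(per_group):
    """Accuracy difference between the best- and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical evaluation records: (capture device, true label, predicted label).
records = [
    ("newer_phone", 1, 1), ("newer_phone", 0, 0),
    ("newer_phone", 1, 1), ("newer_phone", 0, 0),
    ("older_phone", 1, 0), ("older_phone", 0, 0),
    ("older_phone", 1, 1), ("older_phone", 1, 0),
]

per_group = accuracy_by_group(records)
overall = sum(t == p for _, t, p in records) / len(records)

print("overall accuracy:", overall)
print("accuracy per group:", per_group)
print("largest gap:", largest_gap(per_group))
```

In this made-up data the single overall figure looks acceptable while one device does far worse than the other, which is the kind of gap that a headline accuracy number hides. The same breakdown works with skin tone, phone brand or x-ray machine generation as the group key.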
Data-centric AI is therefore the means by which existing models that already do what you want, or something similar, can be customized to work better, more fairly and more robustly in situations and environments different from the ones they were originally trained for.
Although we have more data now than ever before, the most limiting bottleneck to building highly usable AI solutions today is still data, and it is a bigger challenge in some places and industries than in others. Getting high-quality, well-labeled data that is relevant to and representative of the real world in which the solution is going to be deployed remains the biggest challenge in developing AI solutions. If you have good-quality data, then with a little fine-tuning (further training) you will be able to customize an existing model to achieve acceptable performance for your unique or new use case.
And because you are not training from scratch, you don’t even need that much data to achieve acceptable performance. Take our x-ray scenario above, for example. We could adapt a solution built to detect disease in images from newer-generation x-ray machines so that it works for images from older-generation machines with as few as a few hundred high-quality, well-labeled images from the older machines, just by further training the model with the new data.
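The idea of continuing training on a small amount of data from the new environment can be illustrated with a toy example. The following is a minimal Python sketch under strong simplifying assumptions, not a real x-ray pipeline: a two-parameter logistic regression stands in for the pretrained model, a single synthetic number stands in for an image, and a constant `shift` stands in for the difference between sensors. All function names, sample counts and hyperparameters are hypothetical choices for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n, shift, seed):
    """Synthetic 1-D 'image score': label 1 is centered 2.0 above label 0.

    `shift` moves the whole distribution, mimicking a different sensor.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = i % 2  # exactly balanced classes
        x = rng.gauss(2.0 * y + shift, 0.3)
        data.append((x, y))
    return data


def train(data, w, b, epochs, lr=0.5):
    """Full-batch gradient descent on the logistic loss, starting from (w, b)."""
    n = len(data)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in data:
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def accuracy(data, w, b):
    hits = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in data)
    return hits / len(data)


# "Pretrained" model: trained from zero on plenty of data from the newer sensor.
w, b = train(make_data(1000, shift=0.0, seed=0), 0.0, 0.0, epochs=200)

# Deployment environment: older sensor, every score shifted down by 2.0.
target_test = make_data(400, shift=-2.0, seed=1)
acc_before = accuracy(target_test, w, b)

# Fine-tuning: continue training from (w, b) on only 200 labeled target samples.
w_ft, b_ft = train(make_data(200, shift=-2.0, seed=2), w, b, epochs=300)
acc_after = accuracy(target_test, w_ft, b_ft)

print(f"target accuracy before fine-tuning: {acc_before:.2f}")
print(f"target accuracy after fine-tuning:  {acc_after:.2f}")
```

The model itself is never redesigned here. Only the data it sees changes, and the training simply resumes from the existing parameters. A real system would fine-tune a pretrained network with a deep learning framework, and the amount of data needed would depend on how large the shift between environments is.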
With data-centric AI, we can address the problem of high-performing models that perform poorly, unfairly or with bias when deployed in new use cases, scenarios or environments. We can also improve model performance and adapt existing models to work for us simply by collecting or engineering new, better and more representative data.
To use an analogy, suppose we are trying to produce the most energy-efficient car. We have devoted a lot of time to improving the engine and now have fairly efficient engines. Improving the engine further will require a lot more work. Data-centric AI simply says: instead of trying to improve fuel efficiency further by focusing on the engine (the model), let us now turn to improving the quality of the fuel (the data) that this improved engine burns. Why? Because a hundred hours of work on the engine might give us a 1% improvement in energy efficiency, while just 10 hours of work on the quality of the fuel could get us that same 1%. The complexity is much smaller and more manageable, and the approach suits companies that just want to develop useful solutions without drowning in research.
Reference
[1]: https://mitsloan.mit.edu/ideas-made-to-matter/why-its-time-data-centric-artificial-intelligence?