How Generative AI Increases the Importance of Data Quality and Governance
Generative AI is a class of artificial intelligence models and algorithms engineered to generate text, images, and other media that resemble a training dataset yet have their own unique qualities. These tools learn the patterns and structure of the data they are fed during training and use them to generate new data with similar but distinct characteristics.
Generative AI arrived with one of the most popular technology products in history when ChatGPT launched in November 2022. The tool's self-service features and accessibility made it an immediate success, and estimates of the resulting productivity gains, as measured by GDP, are staggering. Rough calculations suggest generative AI boosts employee productivity by 40-60% and project that global productivity could increase by 7%, or US$3-4 trillion.
But trust remains a concern as customers grapple with privacy, intellectual property, and even fabricated output, known as hallucinations. Users must weigh the trade-offs of third-party tools and consider what happens when their data is fed into third-party systems that then fold that information back into the product. As firms evaluate whether to build or buy their own GenAI tools, they must think through the costs, risks, and complexities of training models on data in a world where personally identifiable information and individual rights to privacy are critical considerations.
In a recent survey from Canva, most of the 4,000+ respondents said they have a baseline level of trust in generative AI, yet only one-third agreed that they completely trust the technology. Their top three concerns? Customer, company, and personal data privacy. When using third-party generative AI, it's critical to recognize that data one firm inputs into a model may surface in output for other firms. And when building large language models (LLMs), firms need to recognize that if training data is later determined to be private, confidential, or otherwise ineligible for use, the model must be retrained. That's an expensive undertaking.
So if many users struggle to trust AI tools, how can firms building LLMs and AI models improve their processes?
Data quality refers to data accuracy, completeness, consistency, and reliability. Think of it as the foundation for all data-driven applications and AI systems. When the foundation is weak, the entire structure is at risk. Generative AI models need high-quality data to produce coherent and useful output. Data that contains errors or inconsistencies will degrade the tool's output, potentially leading to incorrect or even harmful information.
Data quality is also vital in applications that rely on predictive modeling over historical data. If a model is fed low-quality data, its predictions and recommendations may be inaccurate and lead to subpar decisions.
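To make the dimensions above concrete, here is a minimal sketch of how a pre-training pipeline might screen records for completeness and consistency. The record schema and field names are hypothetical assumptions for illustration, not a specific product's implementation.

```python
from datetime import datetime

# Hypothetical record schema; field names are illustrative only.
REQUIRED_FIELDS = {"id", "text", "source", "updated_at"}

def quality_issues(record: dict) -> list[str]:
    """Return a list of data-quality problems found in one record."""
    issues = []
    # Completeness: every required field must be present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        issues.append(f"missing fields: {missing}")
    # Consistency: timestamps must parse and must not lie in the future.
    ts = record.get("updated_at")
    if ts:
        try:
            when = datetime.fromisoformat(ts)
            if when > datetime.now(tz=when.tzinfo):
                issues.append("updated_at is in the future")
        except ValueError:
            issues.append("updated_at is not a valid timestamp")
    return issues

def filter_training_data(records: list[dict]) -> list[dict]:
    """Keep only records with no detected quality issues."""
    return [r for r in records if not quality_issues(r)]
```

In practice such checks would run continuously as data lands, so that bad records are caught before an expensive training run rather than after.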
Another element of data quality is data freshness. Knowing when source and training data was last updated helps ensure models reflect the latest information and sets expectations about model behavior. LLMs, for example, do not continuously learn from the latest data. According to a Hacker News discussion on the forum run by startup accelerator Y Combinator, ChatGPT's latest training data is from April 2023, and a lot has happened in the world in the eight months since.
If we consider data quality to be the foundation of AI, data governance is the framework that supports the technology. Proper governance ensures AI uses data securely and in compliance with regulations.
Unique data is essential to creating outputs that differentiate you from your competitors. However, AI and ML tooling can miss critical components, particularly data governance and data discoverability. A "good enough" model is not sufficient. To succeed, enterprises must integrate data governance, cataloging, and lineage into the lifecycle of their generative AI and LLM efforts.
Until recently, the science behind AI was often ad hoc, with little oversight. It was only recently that the Executive Office of the President of the United States announced its long-awaited AI safeguards. Focused on safety and security mandates, equity and civil rights guidance, and research into AI's impact on the labor market, the executive order addresses privacy, security, and transparency.
Governance ensures that only the right people and programs can access the data, while lineage provides credible, easy ways to trace a model's sources so unusable data can be identified and mitigated.
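One hypothetical way to pair the two ideas in code (the names, catalog shape, and role policy are illustrative assumptions, not any particular governance product's API) is to attach lineage metadata to each dataset and gate access by role:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetLineage:
    """Illustrative lineage record: where a dataset came from, who may use it."""
    name: str
    sources: list[str]                      # upstream datasets or feeds
    allowed_roles: set[str] = field(default_factory=set)

def can_access(lineage: DatasetLineage, role: str) -> bool:
    """Governance check: only approved roles may read the dataset."""
    return role in lineage.allowed_roles

def upstream_sources(catalog: dict[str, DatasetLineage], name: str) -> set[str]:
    """Lineage check: walk the catalog to find every transitive source."""
    seen: set[str] = set()
    stack = [name]
    while stack:
        entry = catalog.get(stack.pop())
        if entry is None:
            continue
        for src in entry.sources:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen
```

With a catalog like this, tracing a model's training set back to every upstream feed is one function call, which is what makes it feasible to find and remove data later deemed ineligible.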
Putting It All Together
A purpose-built tool that integrates data quality, governance, and lineage into its design can offer a competitive advantage by giving firms the confidence they’re building LLMs and generative AI models from a solid foundation.
The importance of this integrated approach cannot be overstated. As demand for generative AI grows, so do concerns about potential misuse. The risks of cybercrime, fake news, and deepfakes are real and can have significant consequences. By meshing the foundation of data quality with the framework of data governance, firms enhance confidence in their tool's output. Built-in governance and lineage protect sensitive information and help maintain trust with clients, employees, and key stakeholders.
When data is a valuable asset, having the right tools and frameworks in place becomes essential. In a landscape where generative AI has the potential to revolutionize industries, it is crucial to prioritize both data quality and data governance. By doing so, firms can harness the power of generative AI while mitigating the risks associated with its misuse. The right foundation and framework can enable firms to confidently navigate the world of generative AI, opening up new opportunities and driving innovation.
Greg Muecke, Vice President, Product Management
AI Improves Employee Productivity by 66%, July 16, 2023, NN Group.
How Generative AI Can Boost Highly Skilled Workers' Productivity, October 19, 2023, MIT Sloan School of Management.
Boost Your Productivity with Generative AI, July 27, 2023, Harvard Business Review.
Economic Potential of Generative AI, June 14, 2023, McKinsey Digital.
AI Is Supercharging Work and Creativity, November 2023, Canva and Morning Consult.
This blog post is made available for personal informational purposes only. It does not constitute legal, tax, or investment advice and should not be treated as such. Nothing on our blog constitutes an offer to contract or acceptance of contract terms you may offer to us. We contract solely by definitive written agreement reviewed and approved by counsel. Any views or opinions represented in this blog belong solely to the author(s) and do not represent those of Arcesium LLC, its affiliates, or any other individuals, institutions, or organizations associated therewith. Arcesium LLC and its affiliates do not represent, warrant, or guarantee the availability, accuracy, or completeness of the information contained in this blog and shall not be liable for any losses, injuries, or damages resulting from the display or use of such information.