LLMs & Data Generation: A Power Couple Series

Part 1

Author: Ronald Dyer

The advent of LLMs has created several opportunities for organisations to automate routine processes and increase productivity. This has been a boon both for technology-driven AI companies and for their users. Yet while large language models (LLMs) and agents are becoming more prevalent for routine tasks, some opportunities remain missed. One such opportunity is the use of LLMs to create synthetic datasets.

What is synthetic data?

Synthetic data is data that is artificially generated, generally using AI techniques, to mimic real-world data. Its value rests primarily in serving as a placeholder in scenarios where real-world data cannot or should not be used due to GDPR or other data-protection concerns. It can also serve as training data in machine learning where real data is scarce or unavailable. Synthetic data falls into three main categories:

1. Fully Synthetic – AI-generated data that contains no real-world identifiers

2. Partially Synthetic – Partly derived from real-world information, with the portions containing sensitive information replaced

3. Hybrid – Combining real datasets with synthetic ones to create insights without identifying sensitive, organisation-specific information
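The partially synthetic category above can be illustrated with a minimal sketch. The records, field names, and stand-in values below are hypothetical, not from any real project: sensitive identifiers (site and manager) are swapped for synthetic stand-ins while non-sensitive fields are retained.

```python
import random

# Hypothetical project records; "site" and "manager" are treated as
# sensitive identifiers that must not leave the organisation.
real_records = [
    {"task": "Track survey", "site": "Depot North", "manager": "A. Smith", "duration_days": 12},
    {"task": "Signal upgrade", "site": "Junction East", "manager": "B. Jones", "duration_days": 30},
]

SYNTHETIC_SITES = ["Site-A", "Site-B", "Site-C"]      # assumed stand-in labels
SYNTHETIC_MANAGERS = ["PM-1", "PM-2", "PM-3"]

def make_partially_synthetic(records, rng=random.Random(42)):
    """Retain non-sensitive fields; replace sensitive ones with synthetic stand-ins."""
    out = []
    for rec in records:
        out.append({
            "task": rec["task"],                        # retained real field
            "site": rng.choice(SYNTHETIC_SITES),        # replaced identifier
            "manager": rng.choice(SYNTHETIC_MANAGERS),  # replaced identifier
            "duration_days": rec["duration_days"],      # retained real field
        })
    return out

synthetic = make_partially_synthetic(real_records)
for row in synthetic:
    print(row)
```

The same pattern scales to a fully synthetic set by also generating the retained fields rather than copying them.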

Applicability to Project Management/Project Data Analytics

A probable question might be: what is the value of synthetic data creation in project management? The answer rests with the ability to create a “digital twin” of real project datasets that would otherwise present security risks if used wholesale in AI environments. By developing synthetic twins of their datasets, project organisations, and by extension project managers, can leverage synthetic data to model improved delivery scenarios and pre-determine potential outcomes. Moreover, they do so without risking the safety of their proprietary datasets, which could otherwise end up in the wrong hands given historical data-leakage issues with LLMs.

Scenario: NovaStruct Construction is embarking on a smart-rail expansion project as part of UK critical infrastructure development. The project involves several complexities: fluctuating resources, unpredictability, and access to the locations of critical rail infrastructure. NovaStruct wishes to model potential risks to critical rail infrastructure in an AI-based environment as part of scenario analysis, without revealing the location of critical assets. To do so, it uses a custom LLM-based synthetic data generator to simulate realistic project scenarios. Here's how:

1. Data Augmentation for Risk Models: Training the LLM on existing project documentation, timelines, and incident reports using pseudonymised codes and classifications.

2. Scenario Planning: Project managers use these outputs to run “what-if” simulations and stress-test contingency plans for critical assets against events such as severe weather or terrorism, without revealing real locations.

3. Training AI Assistants: NovaStruct uses the synthetic data to train a project assistant chatbot that supports Q&A during actual delivery.
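The “what-if” simulation step can be sketched with a simple Monte Carlo run over synthetic task estimates. The tasks, estimate values, and distribution choice (triangular) below are illustrative assumptions, standing in for the output of NovaStruct's generator.

```python
import random

# Hypothetical synthetic task estimates (days): optimistic / most likely /
# pessimistic, standing in for a generated project twin.
synthetic_tasks = [
    {"name": "Earthworks", "low": 10, "mode": 14, "high": 25},
    {"name": "Track laying", "low": 20, "mode": 28, "high": 45},
    {"name": "Signalling", "low": 15, "mode": 22, "high": 40},
]

def simulate_schedule(tasks, runs=10_000, rng=random.Random(7)):
    """Monte Carlo what-if: sample each task duration from a triangular
    distribution and collect total project duration (tasks run in series)."""
    totals = []
    for _ in range(runs):
        totals.append(sum(rng.triangular(t["low"], t["high"], t["mode"]) for t in tasks))
    totals.sort()
    return totals

totals = simulate_schedule(synthetic_tasks)
p50 = totals[len(totals) // 2]        # median outcome
p90 = totals[int(len(totals) * 0.9)]  # pessimistic planning figure
print(f"P50: {p50:.1f} days, P90: {p90:.1f} days")
```

Because the inputs are synthetic, the simulation can be shared or run in external AI environments without exposing real asset locations or schedules.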

Utilising these approaches, NovaStruct can generate multiple scenario-based outcomes to support delivery: improving project delivery forecasting accuracy, decreasing reliance on incomplete historical data, and speeding up project manager onboarding through simulated scenarios.

Synthetic Data Benefits

Customisation – tailor the data to meet the specification of a project, even before actual delivery, to assess possible outcomes

Efficiency – reduces the time needed to test scenarios compared with real-world data gathering

Increased/maintained data privacy – data resembles the real world but is a facsimile

Richer data – allows for multiple layers of augmentation

Conclusion

Key considerations for starting to utilise synthetic data are:

1. Start small – Utilise smaller projects as test cases for dataset generation

2. Build in ethics and bias reduction from the outset – incorporate lessons learned from previous projects of similar design (remember to anonymise critical components)

3. Verify – check and test the data post-generation to validate it, supported by robust project workflow development
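The verification step above can be as simple as comparing summary statistics of the synthetic set against the real data it mimics. The durations and the 15% tolerance below are illustrative assumptions; real validation would use richer distributional tests.

```python
import statistics

# Hypothetical task durations (days): real data vs its synthetic twin.
real_durations = [12, 30, 18, 25, 9, 40, 22]
synthetic_durations = [11, 28, 20, 27, 10, 38, 24]

def validate(real, synthetic, tolerance=0.15):
    """Post-generation check: flag the synthetic set if its mean or spread
    drifts more than `tolerance` (relative) from the real data."""
    real_mean, synth_mean = statistics.mean(real), statistics.mean(synthetic)
    real_sd, synth_sd = statistics.stdev(real), statistics.stdev(synthetic)
    return {
        "mean_ok": abs(synth_mean - real_mean) / real_mean <= tolerance,
        "spread_ok": abs(synth_sd - real_sd) / real_sd <= tolerance,
    }

report = validate(real_durations, synthetic_durations)
print(report)
```

A failed check signals that the generator should be retrained or its outputs regenerated before the synthetic twin is used for scenario analysis.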

About the Author

Ronald Dyer is a Senior Data Analyst Tutor with a passion for helping others master data-driven decision-making.