AI training: A Backward Cat Picture is Still a Cat Picture

Genes make up only a small fraction of the human genome. Between them lie large stretches of DNA that tell cells when, where, and how much each gene should be used. These biological instruction manuals contain short sequence patterns called regulatory motifs. That sounds complicated, and it is.

The instructions for gene regulation are written in a complicated code that scientists are trying to decipher with artificial intelligence. Deep neural networks (DNNs), which excel at detecting patterns in vast datasets, are being used to learn the rules of DNA regulation. DNNs are also at the heart of well-known AI tools such as ChatGPT. Thanks to a new tool developed by Cold Spring Harbor Laboratory Assistant Professor Peter Koo, genome-analyzing DNNs can now be trained with far more data than experiments alone can provide.

“With DNNs, the mantra is the more data, the better,” says Koo. “We really need to see a variety of genomes for these models to learn robust motif signals. However, in certain cases, biology is the limiting factor since we can’t produce more data than exists inside the cell.”

If an AI learns from only a small number of examples, it may misinterpret how a regulatory motif affects gene activity. The problem is that some motifs are rare: very few examples of them exist in nature.

To overcome this limitation, Koo and his colleagues created EvoAug, a new method for augmenting the data used to train DNNs. EvoAug was inspired by a dataset hiding in plain sight: evolution. The process begins by generating artificial DNA sequences that closely resemble real sequences found in cells. The sequences are altered in the same ways that genetic mutations have naturally altered the genome during evolution.

Next, the models are trained to recognize regulatory motifs using the new sequences, with one key assumption: the vast majority of the changes will not disrupt the sequences’ function. Koo compares this kind of data augmentation to training image-recognition software with mirror images of the same cat. The computer learns that a backward cat picture is still a cat picture.
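The augmentation idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not EvoAug's actual code: the function names (substitute, delete_segment, insert_segment, augment), the mutation rate, and the segment lengths are all hypothetical choices made for this example. Each function perturbs a DNA string the way a common type of mutation would, and augment returns the altered sequence with its original label, which encodes the assumption that most changes leave function intact.

```python
import random

BASES = "ACGT"

def substitute(seq, rng, rate=0.05):
    """Replace each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def delete_segment(seq, rng, max_len=5):
    """Remove a short random segment, then pad with random bases to keep the length."""
    n = rng.randint(1, min(max_len, len(seq)))
    start = rng.randint(0, len(seq) - n)
    shortened = seq[:start] + seq[start + n:]
    padding = "".join(rng.choice(BASES) for _ in range(n))
    return shortened + padding

def insert_segment(seq, rng, max_len=5):
    """Insert a short random segment, then trim the end to keep the length."""
    n = rng.randint(1, max_len)
    pos = rng.randint(0, len(seq))
    segment = "".join(rng.choice(BASES) for _ in range(n))
    return (seq[:pos] + segment + seq[pos:])[:len(seq)]

def augment(seq, label, rng):
    """Apply one randomly chosen mutation type; the label is assumed unchanged."""
    op = rng.choice([substitute, delete_segment, insert_segment])
    return op(seq, rng), label

# Hypothetical example: one measured sequence with an activity label of 1.5.
rng = random.Random(0)
new_seq, new_label = augment("ACGTACGTACGTACGTACGT", 1.5, rng)
print(new_seq, new_label)
```

Keeping the sequence length fixed is a convenience for this sketch, since many sequence models expect inputs of a constant size. The label is copied over untouched, just as a mirrored cat photo keeps the label "cat."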

Of course, some DNA changes do alter function. For that reason, EvoAug includes a second training step that uses only real biological data. This guides the model “back to the biological reality of the dataset,” Koo says.
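The two-step schedule described above can be sketched as a plain Python loop. This is an assumption-laden outline rather than EvoAug's implementation: two_stage_training, point_mutation, and record are hypothetical names, the epoch counts are arbitrary, and train_step stands in for whatever update a real deep learning framework would perform.

```python
import random

def two_stage_training(real_data, augment_fn, train_step,
                       aug_epochs=3, finetune_epochs=1, seed=0):
    """Stage 1 trains on perturbed sequences; stage 2 fine-tunes on real data only."""
    rng = random.Random(seed)
    # Stage 1: every sequence is perturbed, and its label is kept unchanged.
    for _ in range(aug_epochs):
        for seq, label in real_data:
            train_step(augment_fn(seq, rng), label, stage="augmented")
    # Stage 2: only the unmodified, experimentally measured sequences are used.
    for _ in range(finetune_epochs):
        for seq, label in real_data:
            train_step(seq, label, stage="finetune")

def point_mutation(seq, rng):
    """Placeholder augmentation: change one randomly chosen base."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]

# Stand-in for a model update: record what the model would have been shown.
log = []
def record(seq, label, stage):
    log.append((stage, seq, label))

# Hypothetical toy dataset of (sequence, activity) pairs.
two_stage_training([("ACGT", 1.0), ("GGCA", 0.0)], point_mutation, record,
                   aug_epochs=2, finetune_epochs=1)
for entry in log:
    print(entry)
```

Ending on real data means the last updates the model receives come from sequences whose labels were actually measured, so any errors introduced by mutations that did change function have a chance to be corrected.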

Koo’s team found that models trained with EvoAug outperform those trained only on biological data. As a result, scientists may soon gain a better understanding of the regulatory DNA that writes the rules of life itself. Ultimately, this could lead to a completely new understanding of human health.