"Generative AI’s reliance on extensive data has led to the use of synthetic data, which Rice University research shows can cause a feedback loop that degrades model quality over time. This process, called ‘Model Autophagy Disorder’, results in models that produce increasingly distorted outputs, highlighting the necessity for fresh data to maintain AI quality and diversity. Credit: SciTechDaily ((ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)
When we create new large language models (LLMs), we must teach them.
When an AI learns something, it creates a new database. The data in those databases can be fresh, or it can be a new application of old data. Connections between databases allow the system to combine them into a new whole. An AI, and a large language model (LLM) in particular, requires enormous amounts of data.
It needs that data to learn to operate as it should. There are many limitations on the data that developers can take from the network, and AI training often requires permissions. One answer is to use synthetic data for LLM training. Synthetic data means faces and other material that is created artificially, for example by artists, and researchers can then connect those images to certain things.
The problem with training AI on data taken from the network is this: the data that the LLM sees is separated from reality. The abstraction must be connected to the right databases, and that is quite challenging. If we want to make an LLM that follows spoken language, we must connect every way of saying something to the database.
The system must first turn dialect words into the standard, literary language. Then it can connect that normalized text to actions. In other words, the system washes dialects into the literary language, and only then can it follow spoken orders: it simply connects the normalized words to actions that are programmed into its databases, as in the sketch below.
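Here is a minimal sketch of that two-step idea, normalize dialect forms into standard words, then look the normalized command up in an action table. The dictionaries, commands, and function names are hypothetical illustrations, not part of any real system.

```python
# Hypothetical sketch: normalize dialect words, then map the standard form to an action.

DIALECT_TO_STANDARD = {
    "gonna": "going to",
    "lemme": "let me",
    "wanna": "want to",
}

ACTIONS = {
    "open the door": lambda: print("door opened"),
    "turn on the light": lambda: print("light on"),
}

def normalize(utterance: str) -> str:
    """Replace known dialect forms with their standard-language equivalents."""
    words = utterance.lower().split()
    return " ".join(DIALECT_TO_STANDARD.get(w, w) for w in words)

def follow_order(utterance: str) -> None:
    """Run the action connected to the normalized utterance, if any."""
    command = normalize(utterance)
    action = ACTIONS.get(command)
    if action is not None:
        action()
    else:
        print(f"no action connected to: {command!r}")

if __name__ == "__main__":
    follow_order("open the door")       # known command -> "door opened"
    follow_order("lemme see the menu")  # normalized, but no programmed action
```

The point of the sketch is only that the system follows orders by lookup: if the normalized words are not already connected to an action in its databases, nothing happens.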
In the learning process, the system forms a loop that circles over the dataset, and new data can be connected to that loop. The problem is that the system, the LLM, doesn't think. It doesn't distinguish between synthetic data and real data.
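A toy sketch of such a self-consuming loop, not the Rice University experiment itself: a trivial "generative model" (fit a Gaussian, then sample from it) is retrained generation after generation purely on its own samples, so small estimation errors are never corrected by fresh real data.

```python
# Toy self-consuming ("autophagous") training loop, only an illustration of the idea,
# not the Rice University setup: each generation trains only on samples produced
# by the previous generation.
import random
import statistics

def fit(data):
    """'Train' a trivial generative model: estimate mean and std of the data."""
    return statistics.mean(data), statistics.stdev(data)

def sample(model, n):
    """Generate synthetic data from the fitted model."""
    mu, sigma = model
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]   # generation 0: real data

for generation in range(1, 21):
    model = fit(data)
    data = sample(model, 200)         # the next generation sees only synthetic data
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={model[0]:+.3f} std={model[1]:.3f}")

# Because no fresh real data ever enters the loop, the estimation errors of each
# generation compound instead of being corrected, which is the intuition behind
# "model collapse".
```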
When humans use imagination, we use our memory cells to tell whether a memory is true or false. When the vision center in our brain works with synthetic memories, with imagination, we know that the vision comes from imagination, because there is no matching image in the memory cells that handle kinetic senses like touch.
The LLM treats data as real when its source is marked as something like a camera. The system sees that a camera sends the signal and marks the data as real. But here is the problem: synthetic data whose carrier ID says it comes from a camera will not be separated from real information. And if a camera takes images of a screen, the system will not separate those images from real data.
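A hedged sketch of why carrier-based labeling is fragile: if provenance is just a metadata field set at ingestion from the transport channel, anything arriving over the camera channel gets the "real" label, whether or not the content itself is synthetic. The field and channel names below are made up for illustration.

```python
# Hypothetical ingestion step: provenance is decided by the transport channel,
# not by inspecting the content itself.
from dataclasses import dataclass

@dataclass
class Sample:
    payload: bytes
    carrier_id: str      # e.g. "camera-07", "generator-3"
    label: str = ""      # "real" or "synthetic", filled in at ingestion

def ingest(sample: Sample) -> Sample:
    # The only check is the carrier ID, so a camera pointed at a screen
    # showing generated images still yields label == "real".
    sample.label = "real" if sample.carrier_id.startswith("camera") else "synthetic"
    return sample

print(ingest(Sample(b"...", "camera-07")).label)     # "real"
print(ingest(Sample(b"...", "generator-3")).label)   # "synthetic"
print(ingest(Sample(b"...", "camera-02")).label)     # "real", even if the camera filmed a screen
```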
"Richard Baraniuk and his team at Rice University studied three variations of self-consuming training loops designed to provide a realistic representation of how real and synthetic data are combined into training datasets for generative models. Schematic illustrates the three training scenarios, i.e. a fully synthetic loop, a synthetic augmentation loop (synthetic + fixed set of real data), and a fresh data loop (synthetic + new set of real data). Credit: Digital Signal Processing Group/Rice University" (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)
"Progressive transformation of a dataset consisting of numerals 1 through 9 across 20 model iterations of a fully synthetic loop without sampling bias (top panel), and corresponding visual representation of data mode dynamics for real (red) and synthetic (green) data (bottom panel). In the absence of sampling bias, synthetic data modes separate from real data modes and merge." (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)
"This translates into a rapid deterioration of model outputs: If all numerals are fully legible in generation 1 (leftmost column, top panel), by generation 20 all images have become illegible (rightmost column, top panel). Credit: Digital Signal Processing Group/Rice University" (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)
"Progressive transformation of a dataset consisting of numerals 1 through 9 across 20 model iterations of a fully synthetic loop with sampling bias (top panel), and corresponding visual representation of data mode dynamics for real (red) and synthetic (green) data (bottom panel). With sampling bias, synthetic data modes still separate from real data modes, but, rather than merging, they collapse around individual, high-quality images." (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)
"This translates into a prolonged preservation of higher quality data across iterations: All but a couple of the numerals are still legible by generation 20 (rightmost column, top panel). While sampling bias preserves data quality longer, this comes at the expense of data diversity. Credit: Digital Signal Processing Group/Rice University"(ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)
The image above also illustrates the problem with fuzzy logic. True fuzzy logic is not possible to build into computers in this sense; what exists is only a large set of stored descriptions that are connected to certain actions.
When the system reads something like handwritten text, it relies on stored images of things like numbers and letters. Each handwriting style requires its own image, so there are stored images of the many possible ways to write each number. The marks in the last column cannot be connected with numbers even with the best will in the world, yet if the AI makes decisions using those marks, the result can be a catastrophe.
If the AI starts image recognition from generation 20 and tries to fill in what it sees, the problem is that in a natural environment nothing is a perfect match with the stored models.
The system only has a large set of stored images that it can compare with camera images. In the last characters you can see marks that have no match with any number. If there is some dirt on the surface, the system can translate the image into something other than what it should be.
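A minimal sketch of that stored-image comparison, matching against templates with a distance threshold. The tiny 3x3 "images" and the templates are invented to keep the example short; real recognizers use learned features, not raw pixel templates.

```python
# Hypothetical nearest-template digit matcher on tiny 3x3 binary "images".
# Only an illustration of the matching idea, not a real recognizer.

TEMPLATES = {
    "1": [0,1,0,
          0,1,0,
          0,1,0],
    "7": [1,1,1,
          0,0,1,
          0,1,0],
}

def distance(a, b):
    """Count of pixels that differ (Hamming distance)."""
    return sum(x != y for x, y in zip(a, b))

def recognize(image, max_distance=2):
    """Return the closest template label, or None if nothing matches well enough."""
    best_label, best_dist = None, len(image) + 1
    for label, template in TEMPLATES.items():
        d = distance(image, template)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_distance else None

clean_one = [0,1,0, 0,1,0, 0,1,0]
dirty_one = [1,1,1, 0,1,1, 0,1,0]   # "dirt" flips pixels toward the "7" template

print(recognize(clean_one))  # "1"
print(recognize(dirty_one))  # comes out as "7", not "1"
```

The second example is the "dirt on the surface" case: a few flipped pixels move the mark closer to the wrong template, and the system translates the image into something other than what it should be.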
When the AI sees an image, that image activates an action. When an AI reads something like postal codes, the system reads the number in parts. If the code is 921, the first digit is "9", so it routes the letter to the line for section 9. If the second digit is "2", it sends it to sub-delivery line 2. And the last digit, "1", sends it to area number 1.
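The same digit-by-digit routing, written out as a short sketch. The step names are taken from the description above; everything else is an assumption for illustration.

```python
# Hypothetical digit-by-digit routing of a postal code:
# each digit selects the next, narrower routing step.

def route_letter(postal_code: str) -> list[str]:
    """Return the routing steps for a postal code, one step per digit."""
    step_names = ["section", "sub-delivery line", "area"]
    steps = []
    for digit, name in zip(postal_code, step_names):
        steps.append(f"{name} {digit}")
    return steps

print(route_letter("921"))
# ['section 9', 'sub-delivery line 2', 'area 1']
```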
When the AI reacts to something, it requires two databases: the database it uses to compare the situation it sees against known situations, and a second database that the first one activates, which holds the action connected to that situation.
When the LLM and the operating system interconnect databases, they require a routing table or routing map. Large-scale systems that can respond to many things require many connections and many databases. The database connection maps are themselves databases, just like the other action-and-reaction database pairs, as the sketch below suggests.
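A hedged sketch of the observation/action database pair plus a routing map that selects between such pairs. All names and entries are illustrative; no real system is being described.

```python
# Illustrative observation database, action database, and routing map.

OBSERVATION_DB = {          # what the system sees -> an internal situation key
    "smoke detected": "fire_alarm",
    "door sensor open": "intrusion",
}

ACTION_DB = {               # situation key -> the programmed reaction
    "fire_alarm": "start sprinklers and call emergency services",
    "intrusion": "notify security and record video",
}

ROUTING_MAP = {             # which database pair handles which input channel
    "sensor": (OBSERVATION_DB, ACTION_DB),
}

def react(channel: str, observation: str) -> str:
    """Look up the observation in the routed database pair and return its action."""
    obs_db, act_db = ROUTING_MAP[channel]
    situation = obs_db.get(observation)
    return act_db.get(situation, "no programmed reaction")

print(react("sensor", "smoke detected"))
# start sprinklers and call emergency services
```

Note that ROUTING_MAP is itself just another table, which is the point: the connection maps are databases too.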
When data travels in the loop, it increases the number of databases. So data circling in the loop increases the data mass even if there is no new information for the system. A collapse in the system happens when the allocation units on the hard disks are full: there are trillions of allocation units, but each database requires at least one.
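A back-of-the-envelope sketch of that storage argument, under the assumption stated above that every pass of the loop stores its output as a new database that consumes at least one allocation unit. The numbers are arbitrary placeholders, not real disk figures.

```python
# Back-of-the-envelope sketch: each loop pass stores another database and
# consumes allocation units, even when no genuinely new information arrives.

TOTAL_ALLOCATION_UNITS = 1_000_000   # pretend-disk size (arbitrary)
UNITS_PER_DATABASE = 1               # assumption: one allocation unit per database

used_units = 0
passes = 0
while used_units + UNITS_PER_DATABASE <= TOTAL_ALLOCATION_UNITS:
    passes += 1
    used_units += UNITS_PER_DATABASE  # each loop pass adds another database
print(f"disk full after {passes} passes of the loop")
```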