
The problems of large language model (LLM) programming (training).


"Generative AI’s reliance on extensive data has led to the use of synthetic data, which Rice University research shows can cause a feedback loop that degrades model quality over time. This process, called ‘Model Autophagy Disorder’, results in models that produce increasingly distorted outputs, highlighting the necessity for fresh data to maintain AI quality and diversity. Credit: SciTechDaily ((ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)


When we create new large language models (LLMs), we teach them.


When an AI learns something, it creates a new database. Those databases contain data that can be fresh, or they can be a new application of old data. Connections between databases let the system combine them into a new whole. AI systems and large language models require enormous amounts of data so that they learn to operate as they should.

There are many limitations on the data that developers can take from the network, and AI training requires permissions. One answer is to use synthetic data for LLM training. Synthetic data means faces and other things created by artists, and researchers can then connect those images with certain things.

The problem with training AI on data from the network is this: the data that the LLM sees is separated from reality. The abstraction must be connected to the right databases, and that is quite challenging. If we want an LLM that follows spoken language, we must connect every way of saying something to the database.

The system must first turn dialect words into the standard, literary language. Then it can connect the normalized wording to actions. In other words, the system washes dialects into the literary language so that it can follow orders: it simply connects words with actions that are programmed in its databases.
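A minimal sketch of that two-step idea, assuming a hypothetical dialect dictionary and action table (the vocabulary and function names are invented for illustration, not taken from the article):

```python
# Sketch: normalize dialect expressions to standard forms, then map
# the normalized command to a programmed action. All vocabulary here
# is hypothetical and only illustrates the two-step lookup.

DIALECT_TO_STANDARD = {
    "gonna": "going to",
    "lemme": "let me",
    "switch on": "turn on",
}

ACTIONS = {
    "turn on the light": lambda: print("light relay -> ON"),
    "turn off the light": lambda: print("light relay -> OFF"),
}

def normalize(utterance: str) -> str:
    """Replace dialect expressions with their standard-language forms."""
    text = utterance.lower()
    for dialect, standard in DIALECT_TO_STANDARD.items():
        text = text.replace(dialect, standard)
    return text

def execute(utterance: str) -> None:
    """Connect the normalized wording to an action, if one is programmed."""
    command = normalize(utterance)
    action = ACTIONS.get(command)
    if action:
        action()
    else:
        print(f"no programmed action for: {command!r}")

execute("Switch on the light")   # -> light relay -> ON
```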

In the learning process, the system forms a loop that circulates the dataset, and it can connect new data into that loop. The problem is that the system, the LLM, does not think: it does not distinguish between synthetic data and real data.
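A toy sketch of such a fully synthetic, self-consuming loop: each generation fits a simple Gaussian model to the previous generation's samples and then trains the next generation only on its own outputs. Nothing in the loop marks which samples are real and which are synthetic; the model and numbers are purely illustrative.

```python
# Toy sketch of a fully synthetic ("self-consuming") training loop.
# Each generation fits a Gaussian to the previous generation's samples
# and generates the next dataset from that fit. Errors accumulate
# generation after generation because only synthetic data circulates.

import random
import statistics

random.seed(0)

def fit(samples):
    """'Train' a model: estimate mean and standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, stdev, n):
    """'Sample' from the trained model."""
    return [random.gauss(mean, stdev) for _ in range(n)]

data = [random.gauss(0.0, 1.0) for _ in range(200)]  # real data, used only once

for generation in range(1, 11):
    mean, stdev = fit(data)              # train on whatever is in the loop
    data = generate(mean, stdev, 200)    # next generation sees only synthetic data
    print(f"gen {generation:2d}: mean={mean:+.3f} stdev={stdev:.3f}")

# The estimates drift away from the original distribution over the
# generations, since each generation's sampling noise is baked into the next.
```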

When a human uses imagination, we use our memory cells to tell whether a memory is true or false. When the vision center in the brain works with synthetic memories, or imagination, we know the vision comes from imagination, because there is no matching trace in the memory cells that handle kinesthetic senses like touch.

The LLM treats data as real when its source is marked as something like a camera. The system sees that the camera sends the signal and marks it as real. But the problem is this: synthetic data whose carrier ID says it comes from a camera will not be separated from real information. And if a camera takes images of a screen, the system will not separate those images from real data.
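A small sketch of that provenance problem, with hypothetical metadata field names: a "camera" source tag says nothing about whether the photographed scene was real or a monitor showing synthetic images.

```python
# Sketch of the provenance problem described above. The field names are
# hypothetical; the point is that trusting the carrier ID alone lets
# synthetic content photographed from a screen pass as "real".

from dataclasses import dataclass

@dataclass
class Sample:
    pixels: bytes
    source: str        # e.g. "camera", "generator"

def looks_real(sample: Sample) -> bool:
    """Naive check: trust the carrier ID alone."""
    return sample.source == "camera"

# A camera photographed a monitor that was displaying generated images.
synthetic_on_screen = Sample(pixels=b"...", source="camera")
print(looks_real(synthetic_on_screen))  # True -> synthetic data slips in as "real"
```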



"Richard Baraniuk and his team at Rice University studied three variations of self-consuming training loops designed to provide a realistic representation of how real and synthetic data are combined into training datasets for generative models. Schematic illustrates the three training scenarios, i.e. a fully synthetic loop, a synthetic augmentation loop (synthetic + fixed set of real data), and a fresh data loop (synthetic + new set of real data). Credit: Digital Signal Processing Group/Rice University" (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)




"Progressive transformation of a dataset consisting of numerals 1 through 9 across 20 model iterations of a fully synthetic loop without sampling bias (top panel), and corresponding visual representation of data mode dynamics for real (red) and synthetic (green) data (bottom panel). In the absence of sampling bias, synthetic data modes separate from real data modes and merge." (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)

"This translates into a rapid deterioration of model outputs: If all numerals are fully legible in generation 1 (leftmost column, top panel), by generation 20 all images have become illegible (rightmost column, top panel). Credit: Digital Signal Processing Group/Rice University" (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)


"Progressive transformation of a dataset consisting of numerals 1 through 9 across 20 model iterations of a fully synthetic loop with sampling bias (top panel), and corresponding visual representation of data mode dynamics for real (red) and synthetic (green) data (bottom panel). With sampling bias, synthetic data modes still separate from real data modes, but, rather than merging, they collapse around individual, high-quality images." (ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)

"This translates into a prolonged preservation of higher quality data across iterations: All but a couple of the numerals are still legible by generation 20 (rightmost column, top panel). While sampling bias preserves data quality longer, this comes at the expense of data diversity. Credit: Digital Signal Processing Group/Rice University"(ScitechDaily, Could AI Eat Itself to Death? Synthetic Data Could Lead To “Model Collapse”)

The upper image also illustrates the problem with fuzzy logic. True fuzzy logic is not possible to implement in computers; in practice there are only many stored descriptions that are connected with certain actions.
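A sketch of what that claim looks like in code: a finite table of stored descriptions, each tied to one programmed action, standing in for genuinely graded reasoning. The thresholds and labels are invented for illustration.

```python
# Sketch of the claim above: instead of graded ("fuzzy") reasoning, the
# system holds a finite rule table where each description is tied to one
# programmed action. Thresholds and labels are hypothetical.

RULES = [
    (0.0, 0.3, "dark",   "turn lights on"),
    (0.3, 0.7, "dim",    "dim lights to 50%"),
    (0.7, 1.0, "bright", "turn lights off"),
]

def describe_and_act(brightness: float) -> str:
    """Bucket a continuous reading into a stored description, then act."""
    for low, high, label, action in RULES:
        if low <= brightness <= high:
            return f"{label} -> {action}"
    return "no matching description"

print(describe_and_act(0.25))  # dark -> turn lights on
print(describe_and_act(0.65))  # dim -> dim lights to 50%
```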

When the system reads something like handwritten text, it relies on stored images of numbers and letters. Each handwriting style requires its own image, so there are stored images of the different ways a number can be written. The marks in the last generation of the figure above cannot be connected with numbers even with the best will. But if the AI makes decisions using such marks, the result can be a catastrophe.

The AI can start image recognition from a degraded generation-20 image and then try to fill in what is missing. The problem is that in a natural environment, nothing is a perfect match with the stored models.

The system only has a large set of stored images that it can compare with camera images. In the last characters you can see points that have no match with any number. And if there is some kind of dirt on the surface, the system can translate the image into something other than it should.
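A toy sketch of that failure mode, using made-up 3x3 bitmaps as the "stored images": the input is matched to the template with the fewest differing pixels, and enough dirt flips the match to the wrong digit.

```python
# Toy sketch of template-based digit reading. Each digit has a stored
# 3x3 bitmap; an input is matched to the template with the fewest
# differing pixels. Dirt on the surface can flip the match.
# The templates are invented for illustration.

TEMPLATES = {
    "1": "010010010",
    "7": "111001001",
}

def distance(a: str, b: str) -> int:
    """Count of pixels that differ between two bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def read_digit(bitmap: str) -> str:
    return min(TEMPLATES, key=lambda d: distance(TEMPLATES[d], bitmap))

clean = "010010010"          # a clean "1"
dirty = "111011011"          # the same "1" with dirt specks added
print(read_digit(clean))     # -> "1"
print(read_digit(dirty))     # -> "7": the dirt moves it closer to the wrong template
```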

When the AI sees some image, that image activates some action. When the AI reads things like postal codes, the system reads the number in parts. If the code is 921, the first digit is "9", so it routes the letter to the line that holds letters for section 9. If the second digit is "2", it sends the letter to sub-delivery line 2, and the last digit "1" sends it to area 1.
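A minimal sketch of that digit-by-digit routing, with hypothetical sorting-line names:

```python
# Sketch of the digit-by-digit routing described above: each digit of
# the postal code selects the next, more specific sorting line.

def route(postal_code: str) -> list[str]:
    """Return the sequence of sorting decisions for a postal code."""
    levels = ["main line", "sub-delivery line", "area"]
    return [f"{level} {digit}" for level, digit in zip(levels, postal_code)]

print(route("921"))
# -> ['main line 9', 'sub-delivery line 2', 'area 1']
```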

When the AI reacts to something, it requires two databases: one that it uses to compile, or recognize, the situations that it sees, and another, activated by the first, that holds the action connected to the situation the system sees.

When an LLM and the operating system interconnect databases, they require a routing table or routing map. Large-scale systems that can respond to many things require many connections and many databases. The database connection maps are themselves databases, just like the action-and-reaction database pairs.
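A minimal sketch of that pairing: a recognition table maps an observed situation to a key, a routing map says which action database handles that key, and the action database holds the programmed reaction. All names are made up for illustration.

```python
# Sketch of the database-pair idea above: situation table -> routing map
# -> action database. The tables and keys are hypothetical.

SITUATIONS = {            # compile what the system sees -> situation key
    "smoke detected": "fire",
    "door opened": "entry",
}

ROUTING_MAP = {           # situation key -> which action database handles it
    "fire": "safety_actions",
    "entry": "security_actions",
}

ACTION_DATABASES = {      # action database -> programmed reactions
    "safety_actions": {"fire": "start sprinklers"},
    "security_actions": {"entry": "log visitor"},
}

def react(observation: str) -> str:
    key = SITUATIONS.get(observation)
    if key is None:
        return "no matching situation"
    table = ACTION_DATABASES[ROUTING_MAP[key]]
    return table[key]

print(react("smoke detected"))  # -> start sprinklers
```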

When data travels in the loop, it increases the number of databases. So data circulating in the loop increases the data mass even if there is no new data for the system. Collapse happens when the allocation units on the hard disks are full: there are trillions of allocation units, but each database requires at least one.
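A toy illustration of that argument, under the stated assumption that every database created in the loop consumes at least one disk allocation unit; the numbers are entirely hypothetical.

```python
# Toy arithmetic for the argument above, assuming each new database
# consumes at least one allocation unit. Numbers are hypothetical.

ALLOCATION_UNITS = 2_000_000_000_000   # ~2 trillion allocation units on the volume
NEW_DATABASES_PER_LOOP = 50_000        # databases created per training loop

loops_until_full = ALLOCATION_UNITS // NEW_DATABASES_PER_LOOP
print(f"disk is exhausted after about {loops_until_full:,} loops")
```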

https://scitechdaily.com/could-ai-eat-itself-to-death-synthetic-data-could-lead-to-model-collapse/
