Building Unpod
PS

How do people perceive the world, compared to GPT?

I think natural language is not "natural". It is not natural in the sense of how humans perceive the world. People perceive the world frame by frame, second by second, chunk by chunk, serialised as a unified embedding of images, audio and feelings. People see the world as streams of data, arranged by time. Without this understanding you cannot build an "autoregressive" or "autonomous" learning agent at human level.
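A minimal sketch of that "stream of frames" view, with assumed names (Frame, perceive, EMBED_DIM) and a toy update rule that stands in for whatever a real agent would learn: each frame fuses image, audio and feeling into one embedding, and frames are consumed strictly in time order.

```python
from dataclasses import dataclass
from typing import Iterable
import numpy as np

EMBED_DIM = 512  # assumed size of the unified embedding


@dataclass
class Frame:
    timestamp: float       # seconds since the stream started
    embedding: np.ndarray  # fused image/audio/feeling vector, shape (EMBED_DIM,)


def perceive(stream: Iterable[Frame]) -> np.ndarray:
    """Fold a time-ordered stream of frames into a running state, chunk by chunk."""
    state = np.zeros(EMBED_DIM)
    for frame in sorted(stream, key=lambda f: f.timestamp):
        # Toy exponential-moving-average update; a real agent would learn this.
        state = 0.9 * state + 0.1 * frame.embedding
    return state
```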

Language cannot convey meaning beyond language, or at least not efficiently. You can of course represent an image pixel by pixel, as RGB values in plain text, or try to convert every little image into text with location annotations and subtitles, but for both humans and machines this is not a good approach. We don't do this; we handle it with the visual cortex.
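Some rough arithmetic makes the inefficiency concrete. The numbers below are assumptions (224x224 image, ~4 characters per pixel value, 16x16 patches), but the gap is orders of magnitude either way:

```python
# Writing a 224x224 RGB image as comma-separated 0-255 values costs
# hundreds of thousands of characters; a ViT-style patch encoder reduces
# the same image to a few hundred embeddings.
width, height, channels = 224, 224, 3
chars_per_value = 4  # e.g. "255," averages roughly 4 characters
text_chars = width * height * channels * chars_per_value

patch_size = 16
patch_tokens = (width // patch_size) * (height // patch_size)

print(f"plain-text encoding: ~{text_chars:,} characters")  # ~602,112
print(f"patch embeddings:    {patch_tokens} tokens")       # 196
```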

This is not language, this is pure data. I'm thinking of a better way to handle visual, text and audio data in a unified way, and looking forward to unified-modality models like OpenFlamingo, OFA and UniLM.
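One hypothetical shape such unified handling could take (the encoder functions here are placeholders, not any of those models' actual APIs): each modality gets its own encoder, but all of them project into one shared embedding space, so the downstream model sees a single homogeneous token stream regardless of source.

```python
import numpy as np

EMBED_DIM = 512  # shared embedding width (assumed)


def encode_text(tokens: list[int]) -> np.ndarray:
    return np.random.randn(len(tokens), EMBED_DIM)  # placeholder for a text encoder


def encode_image(pixels: np.ndarray) -> np.ndarray:
    return np.random.randn(196, EMBED_DIM)  # placeholder for a vision encoder (e.g. ViT patches)


def encode_audio(waveform: np.ndarray) -> np.ndarray:
    return np.random.randn(100, EMBED_DIM)  # placeholder for an audio encoder


def unify(tokens, pixels, waveform) -> np.ndarray:
    # Concatenate in temporal/reading order: one stream, one embedding space.
    return np.concatenate(
        [encode_image(pixels), encode_audio(waveform), encode_text(tokens)], axis=0
    )
```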

Models with recurrent hidden states, like RWKV, are also preferred, because they can have "infinite" context length while still supporting batched training.
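A toy recurrence (not RWKV's actual formulation) shows why a recurrent hidden state gives effectively unbounded context: the state has a fixed size no matter how many steps have been consumed, so memory does not grow with sequence length.

```python
import numpy as np


def run_stream(inputs: np.ndarray, decay: float = 0.95) -> np.ndarray:
    """inputs: (T, D) stream of embeddings; returns the final (D,) hidden state."""
    state = np.zeros(inputs.shape[1])
    for x in inputs:                              # T can be arbitrarily large
        state = decay * state + (1 - decay) * x   # constant-size update per step
    return state


# A million steps, same memory footprint as ten steps.
final = run_stream(np.random.randn(1_000_000, 64))
```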


from - https://github.com/Significant-Gravitas/Auto-GPT/issues/346#issuecomment-1510572325

