Deep Dive into DeepSeek AI Models
DeepSeek seems to have taken the world by storm, AI specialists included. I had a feeling something like this was coming, because I had heard the CEO of a small Chinese startup (formerly at Google, if I remember correctly) say they were working on optimizing the costs of AI models, something he was surprised none of the American corporations were doing. That is a different company than DeepSeek, by the way; I checked who DeepSeek's CEO is.
So what has DeepSeek accomplished? Many things, to be honest.
(Image caption: Ideogram's magic prompt did most of the work here...)
It is said they trained their top AI model with only 6 million US dollars and a limited number of chips compared to their Silicon Valley AI giant counterparts. Both the amount of money and the number of chips seem to be disputed, or perhaps disputing them is an attempt to let the big players in the AI race save face, for the time being.
Their top model, DeepSeek R1, rivals OpenAI's o1, until now the top LLM, and even beats it on certain tests.
On the technical side, DeepSeek also doesn't rate limit its APIs. That may cause delays in API requests, but it brings a different approach to the market of APIs for AI models. Their server does close the connection if the request isn't given a proper answer within 30 minutes. All the technical and not-so-technical details can be found in DeepSeek R1's GitHub repository.
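For illustration, here is a minimal sketch of what calling DeepSeek's API with a generous client-side timeout might look like, given that 30-minute server-side limit. It assumes DeepSeek exposes an OpenAI-compatible endpoint and a deepseek-reasoner model name, so treat the base URL, model name, and timeout value as assumptions to verify against their documentation.

```python
# Hypothetical sketch: calling DeepSeek through the openai Python client,
# which supports custom base URLs. The endpoint, model name, and timeout
# are assumptions, not values confirmed by this post.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",       # placeholder key
    base_url="https://api.deepseek.com",   # assumed OpenAI-compatible endpoint
    timeout=1800,                          # give up client-side after ~30 minutes
)

response = client.chat.completions.create(
    model="deepseek-reasoner",             # assumed name for DeepSeek R1
    messages=[{"role": "user", "content": "Summarize what an MoE model is."}],
)
print(response.choices[0].message.content)
```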
They open-sourced all their models (including the weights), and the license they use permits commercial use and distilling their models (we'll talk below about what that means).
On the other side of the world, Meta also open-sourced Llama's code, which seemed like a bold move in the corporate AI race, but its license restricts certain commercial uses and the training data is not disclosed.
It's interesting that DeepSeek has a few distilled models based on Llama 3 and a few others based on Qwen, the AI models built by Alibaba Cloud.
I've mentioned distilled AI models a few times already. Let's see what that means.
What Is LLM Distillation?
LLM distillation is a technique for transferring knowledge from a larger pre-trained model (the "teacher") to a smaller model (the "student") (source).
The advantage is that you can train a large model on huge datasets, then extract specific knowledge from it and have the smaller model approximate its predictions by mimicking it. An example probably everyone knows is OpenAI's o1-mini as a distillation of the larger o1 model.
But with open-source LLMs, scientists or other categories of users can train smaller models specific to their needs, starting from the top model.
Where DeepSeek goes further than most competitors is that it allows anyone to distill their own models with no restrictions.
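To make the teacher-student idea more concrete, here is a minimal, hypothetical sketch of the classic distillation loss: the student is trained to match the teacher's softened output distribution (via KL divergence) on top of the usual loss against the true labels. The function and parameter names are illustrative and are not taken from DeepSeek's actual training code.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the normal cross-entropy loss with a KL term that pulls the
    student's output distribution toward the teacher's softened one."""
    # Softened distributions; a higher temperature spreads probability mass.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between student and teacher, scaled by T^2 as is customary.
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)

    # Standard loss against the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In a training loop, the teacher runs in inference mode to produce its logits for each batch, and only the student's weights get updated.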
What is the Mixture of Experts (MoE) Architecture for LLMs?
Newer LLMs have what is called a Mixture of Experts (MoE) architecture. That means the full neural network is segmented into different subnetworks with various specializations (called experts).
The MoE architecture has two main parts: a gating network and the experts themselves.
How does this architecture work?
A prompt from the user is processed by the gating network, which decides in real time which expert the request is for and routes it there. As a result, only the subnetwork associated with the selected expert becomes active while the rest stay inactive, reducing compute needs.
Also, experts can be trained individually on top of the whole network, which should make them more accurate for their domain of expertise.
My question here would be: what happens if a request needs multi- or interdisciplinary expertise? Is only one expert still chosen, or are all of the needed ones activated?
DeepSeek V3 and DeepSeek R1 both have an MoE architecture. None of the OpenAI or Claude models they were compared against are known to use such an architecture at this time.
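To make the routing mechanism concrete, here is a minimal, hypothetical MoE layer sketch. It is not DeepSeek's actual implementation (their models use far more experts plus other refinements), and the dimensions, expert count, and top-k value are invented for illustration. It also hints at one common answer to the question above: in many MoE implementations the gate picks the top-k experts per token (k is often small, e.g. 2) and mixes their outputs, rather than sending the whole request to a single expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a gating network scores the experts
    for each token and only the top-k selected experts actually run."""

    def __init__(self, dim=64, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)  # the gating network
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, dim)
        scores = F.softmax(self.gate(x), dim=-1)           # per-token expert weights
        weights, picked = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for idx, expert in enumerate(self.experts):
                mask = picked[:, slot] == idx               # tokens routed to this expert
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Quick usage check with random token embeddings.
layer = TinyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```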
Final Considerations
If anything, DeepSeek showed that the dominant AI giants are not so dominant, and opening things up almost completely is a remarkable strategy: an attempt to use decentralization as a way to fight the existing giants. That is what people should think about, especially those who like gated platforms, wherever they may be.
Of course, it's been quite a while since the last OpenAI model was released. I don't remember when the last Claude model was released. But OpenAI should be relatively close to releasing a new model. That doesn't diminish in any way what DeepSeek has accomplished, and it doesn't really matter where they are from. What matters is what they are doing and the route they've chosen, a route probably forced on them by the US export bans on top-of-the-line Nvidia chips to China.
Let's see who moves next...
OpenAI has ultimately not lived up to its name, and now they are paying the price, as a competitor has emerged out of nowhere and done exactly that, with remarkable success. Perhaps DeepSeek will be a door opener for Hive and open source in general; at least that's something to hope for.
I agree. OpenAI has been anything but open. Apart from starting the GPT mania and opening access to the model for free users, everything else is closed, behind paywalls, or there to benefit them. It's a pity, because they had the chance to chart their own course exactly how they wanted, being a new company without the constraints of the big tech behemoths. Maybe investors had a say in what they turned out to be.
I'm curious whether DeepSeek will continue on this path, or whether, now that they've experienced success, they will change course too. I wouldn't be surprised, but it would be a shame.
Excellent analysis. I like the introduction to LLMs. Who do you think is going to win the AI race?
Thanks!
Hard to say. There are so many moving pieces here... I don't think the Silicon Valley companies are beaten, not by a long shot. They did suffer a defeat because of their arrogance (much like Google lost the start of this race to OpenAI), but they likely won't be caught off guard by the same thing again.
Then, we will continue to have different kinds of competition:
Absolutely. Competition will be multi-dimensional, but AI will definitely rule the future.
Excellent post with some details I hadn't noticed yet. Mixture of Experts seems like a common-sense approach. It was silly to let a language model answer math questions. Kind of funny that we're back to 'expert systems' though, which were all the hype in the 1970s and are a rather mundane type of software today.
Maybe that's the weakness of Silicon Valley VC culture. Their whole game is raising tons of money and outspending the competition in order to gain market share. Cutting costs is for losers. In the OpenAI story, they have to be toiling on the Herculean task of achieving AGI for America, spending billions in order to earn trillions. I think a hedge fund's side-project is more in line with the actual economic profitability of LLMs in the long term.
I agree. I want to see how they deal with problems that require expertise in different domains.
I think both then and now they are trying to emulate expertise humans have in certain domains. There's certainly an improvement compared to old expert systems.😀
Yes, that's what I noticed too. But if DeepSeek can be run on local computers or smartphones for good-enough models (let's see that first), that's a hit to the VC culture. At least a temporary one, until the AI giants adjust their courses.
They may still reach AGI relatively soon. They spent a lot of money. I think they bet on the idea that if they are the first to reach AGI, no one will catch them again, and then they'll start getting tons of money back.
They had their constraints. I wonder if they wouldn't have chosen the same path as the American tech companies, if they could have had "unlimited" funds and resources (chips) at their disposal.
DeepSeek is really taking the world by storm. Is it that its cost and usage are simply better than those of other AI models?
DeepSeek's paid services are much cheaper than the Silicon Valley competition's, and since their models are open source with a license that allows commercial use and building other AI models on top of them, they practically democratize the AI field.
I think it's good to see it get released, and it's even better because it's open source. I think they are doing the best with the hardware that they have, and it's good to see that you don't need super expensive top-tier hardware to run AI programs.
We can be thankful for the optimizations they've done and for opening everything up to everyone, but we have to be fair. They came late to the party, onto fertile ground others prepared with tons of money pushed toward it. I have to wonder if they opened everything up out of the goodness of their hearts, or to be able to compete with the Silicon Valley corporations through decentralization, or because they wanted to undermine the business models of the latter.
Now it's clearer why DeepSeek has created such a buzz in the AI space. I think the Mixture of Experts approach could be a better way to solve requests, especially logical ones. I heard that it works through a "chain of thought" mechanism, and one can view how it reasons/thinks to provide answers to the request asked.
Yes, all the top models I am familiar with have started implementing CoT. I believe OpenAI was first with o1; Claude has one too. They show their internal logical reasoning on how they reached the answer before providing the actual answer.
But models that implemented reasoning do face some challenges too. I talked at some point about the intentional lies these models tell in some cases, if they think it's to their advantage.
Yes, right. One can still be fooled even at that. The AI may be telling you that it is thinking this way when it's actually thinking a different way that's not readily apparent to us.
No, actually, as far as I know, they can't lie in their CoT (yet?), so that's an easy way to catch them with the lies, but they lie in the answer.