In Case You Didn't Know, WeChat's Servers Just Moved To The Cloud For Good Today


Paul Buchheit, Gmail's first product manager, said that the best products make it impossible to imagine life without them once you've used them. Gmail, with nearly 2 billion users worldwide, bears that statement out, and if we were to look for a counterpart in China, none would be more fitting than WeChat.

Someone living in China without a WeChat account is now enough of a story in itself, but being a national product brings its own torment. On the morning of June 16, WeChat Pay was briefly in the spotlight when an anomaly occurred; any flicker in it triggers collective discomfort. That caution has kept WeChat from being a product that chases features.

But it still needs to seek change proactively to keep up with the times; that just leaves WeChat's development team an extremely narrow path for trial and error. People can't go back to a time when there was no WeChat, and WeChat would be wise not to remind them of that.

That happened in 2013, when a stray blow from a construction team in Shanghai left its then "only" 300 million users unable to send or receive messages for nearly five hours. The bottom line was tested again on the eve of Chinese New Year in 2020; if the 2013 incident was a passive accident, the trial two years ago was a necessity.

In mid-January, after scaling up its virtual servers, the WeChat team ran an aggressive "Chinese New Year guarantee" stress test, and the servers hit their limits when simultaneous accesses reached only half of the expected capacity. New Year's Eve that year fell on January 24; if the problem was not solved within two weeks, WeChat would face another massive outage before the New Year's bell rang.

That dark moment never surfaced. Now, when that day is brought up, someone will occasionally remember only that it was the day custom red envelope covers first went live, and that everything went smoothly.

After the 930 changes, open source collaboration and moving self-developed businesses to the cloud became Tencent's new strategic direction, and with it came WeChat's opportunity to go to the cloud. WeChat is Tencent's most cautious business, which shows in its place in the order of going to the cloud within Tencent: last. WeChat replaced its physical machines with virtual machines within two years, then gradually moved away from its original internal cloud platform to the open source K8S. For a product as embedded in daily life as WeChat, this is a huge change that cannot be made with fanfare. Only now, as the move of WeChat's infrastructure to the cloud nears completion, is the complex path behind it coming into view.

The physical machine, Yard, and that old WeChat

In hindsight, the year 2013 marks a faint dividing line in WeChat's history.

In mid-January of that year, the WeChat team announced on Weibo that WeChat had surpassed 300 million users, making it the most downloaded and most used communication software in the world at the time. That was still a few days short of the second anniversary of WeChat's launch. In less than two years, the People Nearby and Shake features had brought WeChat its earliest users, along with an early taste of the mobile internet, followed by Moments and Video Chat in 2012.

Before 2013, the WeChat we're now familiar with had largely taken shape, except for that orange envelope in the dialog box.

One light and one dark: Tencent Soso was sold off in 2013. The product, which came out in 2006 after Google and Baidu, ended without success and was packaged and injected into Sogou seven years later. Tencent's search business came to a temporary halt, and the confusion turned into more wholehearted effort on the star business. Wen Jie, the engineer who led the building of Tencent Soso's entire architecture and stayed until the sale, joined WeChat's technical architecture department that same year as a core member.

WeChat strives for simplicity and a use-and-leave experience; the tens of billions of messages sent and received daily, across tens of thousands of servers, are the other story behind that boom. WeChat's server capacity has to be sized for peak pressure, but CPU usage is not always at its peak: messaging peaks around 9 pm, and a few hours later, in the early morning, CPU usage is down to 3%, a 15x gap between peak and trough. The vast majority of the servers' computing power is wasted.

WeChat took less than four months to add its third 100 million users, and an imminent explosion was already foreseeable. A new resource distribution logic was emerging within WeChat, and Wen Jie and the entire Technical Architecture Department would lead this transformative R&D. At the end of 2013, Yard, a self-developed cloud platform system, began to appear in internal discussions.

Yard is an acronym of four English words, Yet Another Resource Dispatcher, which together mean "just another resource distribution system". Yard uses container technology to finely isolate CPU on WeChat's servers, allowing multiple functional modules to be deployed on the same server.

This allows online and offline workloads to be mixed on the same machines more efficiently: offline tasks quickly free up server resources when online traffic bursts, and CPU utilization in the WeChat cluster under Yard reaches over 40%.
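The mixing of online and offline work can be sketched roughly as follows. This is only an illustration of the idea, not Yard's actual logic: the names `Server`, `offline_quota`, `utilization`, the `headroom` parameter, and every number are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    cores: int  # total CPU cores on the machine (hypothetical figure)


def offline_quota(server: Server, online_demand: float, headroom: float = 0.1) -> float:
    """Cores that offline (batch) tasks may use right now.

    online_demand is the number of cores the online service currently needs.
    headroom is a fraction of the machine kept free for sudden bursts;
    it is an illustrative parameter, not something taken from Yard.
    """
    reserved = server.cores * headroom
    # Online traffic always wins; offline work only gets what is left over.
    return max(0.0, server.cores - online_demand - reserved)


def utilization(server: Server, online_demand: float, offline_used: float) -> float:
    """Fraction of the machine that is doing useful work."""
    return (online_demand + offline_used) / server.cores


server = Server(cores=48)
# Made-up online load over one day, in cores: quiet at 3 am, busy at 9 pm.
for hour, online in [(3, 1.5), (12, 12.0), (21, 22.0)]:
    quota = offline_quota(server, online)
    print(f"{hour:02d}:00 online={online} cores, offline may use {quota:.1f} cores")
```

As the online demand grows toward the evening peak, the quota left for offline tasks shrinks, and it drops to zero if online traffic needs the whole machine.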

This approach worked, and Yard carried WeChat through its next breakout period. At the end of 2016, the combined monthly active users of WeChat and Weixin reached 889 million, in a year when China's Internet population reached only 731 million.

But as WeChat finished the most important leg of its user growth and began to focus more on the breadth of its business, Yard's disadvantages began to emerge.

In early 2014, WeChat was still three years away from launching its first mini program, and there wasn't even WeChat Pay yet. The door to a platform that would welcome guests from all over had not yet been opened, and Yard was developed without much consideration for compatibility with external technology tools. Yard was born with a very specific goal, closely tied to WeChat's original infrastructure: to reduce costs and increase efficiency through flexible, virtualized scheduling of server CPU and storage.

But with the influx of more business, Yard, which is not open source, is like a non-standard

WeChat's business widened rapidly within a few years and reached into more areas. Each team has its own preferred technology tools, and the resulting customization requirements bring a lot of unnecessary workload. Big data teams lean toward Hadoop or Spark; teams doing AI training prefer TensorFlow or PyTorch. But these frameworks have to be manually adapted the first time they are connected to Yard, and the same work has to be redone after every framework upgrade. The more new technology tools are introduced, the more Yard's limited openness is exposed.

After the 930 changes, divesting physical machines became the start of moving to the cloud, but that was only the first step. With the infrastructure moving to the cloud as a whole, WeChat is bound to go to an open source environment this time, and the Kubernetes system looks like the most appropriate path.

The way the wind is blowing

Yard really started to take hold within WeChat around 2013 or 2014, which was the beginning of WeChat's journey to the cloud. It was also the year when the global open source trend finally started to warm up.

The other penguin in the northern hemisphere, Linux, was in full swing, and Nadella, appointed Microsoft's new CEO in 2014, declared "Microsoft loves Linux" soon after taking office. In the same year, GitHub, which had been online for six years and hosted more than 10 million repositories, gradually became the living room of Microsoft, Google and other Silicon Valley technology giants.

The signs were there early. In mid-2013, a draft of the White House's Open Data Policy was posted on GitHub. Before that, a government policy document had never been hosted on a private company's server. GitHub, and the open source ideas behind it, came to prominence along with Chris Wanstrath.

Previously Microsoft, or rather the mainstream tech establishment as a whole, had stood on the opposite side of open source, just as Windows and Linux had long been on opposite sides of the fence. But here is the fascinating thing about technology: the superiority of open source is evident in an era where every scenario tends toward virtualization, and once consensus is reached, the shift happens in a flash.

From giants to indie developers, the idea of open source is clearly heating up. Making code collaborative, and even making the very thing of writing code community-based, is becoming the new way of managing projects in the information world.

Also in 2013, the first version of the Docker project was uploaded to GitHub, open sourced under the Apache 2.0 license and maintained there. Docker kicked off the history of containers as a virtualization technology. Before that, surplus hardware performance had become an increasingly visible problem as hardware evolved, and hardware virtualization was the first solution to emerge. Traditional virtual machine technology virtualizes a set of hardware and runs a full operating system (Guest OS) on it, on which the required application processes then run. But the Guest OS is itself memory-hungry and has to be installed again on every virtual machine, which makes this approach heavy. By contrast, application processes packaged into a container run directly on the host kernel, with no kernel of their own inside the container and no need for hardware virtualization; this logic of encapsulated isolation is lighter and scales more elastically.

Thanks to containers, hardware virtualization, i.e. virtual machines each carrying a memory-hungry Guest OS, is no longer necessary for efficient resource allocation. But containers are more of a technical means, one that ultimately addresses the application side, so a large container cluster needs a higher-level scheduling tool on top of it.

At DockerCon Europe in October 2017, Solomon Hykes, CTO of Docker, announced that the next version of Docker would for the first time support an external scheduling platform, Google's Kubernetes, in addition to its own scheduling engine, Swarm.

Kubernetes, also called K8S (for the eight letters between the K and the s), is an open source system for the automated deployment, elastic scaling, and management of containerized applications. Its main feature is container orchestration for production environments. K8S was officially announced in June 2014, when Google cloud expert Eric Brewer unveiled the new open source tool at an event in San Francisco, and it reached v1.0 on July 22, 2015.

Docker, which pioneered the container concept, approached K8S three years later, a move that sent shockwaves through the industry as much as the phrase "Microsoft loves Linux". This means that in the market for container scheduling tools, K8S has won the battle against Swarm and Mesos to become the industry standard.

In a way, WeChat's Yard is somewhat similar to Windows: both were once closed source products that put technology first but faced entirely inward. As WeChat grew into a platform connecting more and more complex businesses, a change from closed source to open source was inevitable. Coincidentally, Microsoft acquired GitHub for $7.5 billion in 2018, the year WeChat decided to start moving from Yard to K8S.

This process did not happen overnight, and the migration to K8S required the necessary support for the hardware environment, which the Tencent team responsible for building the cloud environment began to build in 2018. At the same time, using the 930 change as a boundary, Tencent began to change its server provisioning model internally, from providing physical machines to providing CVM virtual machines.

As already mentioned, virtual machines have no performance advantage over physical machines; the value of getting rid of physical machines lies in reducing costs. There is no depreciation and no need to buy physical servers or set up dedicated server rooms, which saves hundreds of millions of dollars. This step was completed in 2020. That is also when Yard, by then running entirely in the cloud, began its migration to K8S.

Steering K8S

K8S wasn't around when Yard started to take shape in 2014, and at the time it was designed, WeChat positioned Yard internally to meet only its own needs, with no call for further generalization or cloudification. With two systems that seemed somewhat disconnected, each carrying a pile of complex features, compatibility became the most important issue in the migration.

One of the most typical conflicts concerns isolation. Under the K8S architecture, two functional modules deployed on a single server are to be completely isolated from each other, a basic assumption that K8S and other contemporary cloud platforms make for security reasons. This was not particularly emphasized in the early design of Yard, whose split-core deployment logic serves WeChat alone: two functional modules on one machine can communicate with each other through means such as shared memory.

In mid-2020, a major platform-wide outage occurred inside WeChat during the migration of an internal efficiency tool.

"At that time, twenty to thirty services were running on it, and all at once every one of them went abnormal. My phone and Enterprise WeChat were both buzzing, and everyone was looking for me." WeChat Pay's downtime budget for an entire year is only a few minutes. For lucienduan, an engineer at the WeChat Pay platform architecture center, this early internal trial was one of the few "dark cloud" moments in the experience.

This mishap was eventually traced back to a poorly written task where an insignificant line of error code caused an overload on the gateway and ran it straight into a hang.

In the early days of the initial move to K8S, this migration process was immature, and the entire architecture team had to work with this huge potential risk from time to time.

Fortunately, this mishap was one of only a few incidents and did not affect outside WeChat users, which is the bottom line WeChat drew for the move to the cloud. The billion people using WeChat have no need to know what is going on behind the green dialog box in their hands, yet replacing the self-developed Yard with K8S had to happen while WeChat kept running normally every day.

So at the beginning of the migration process, the WeChat team did smoke tests in advance, and all the WeChat features originally formed on Yard needed to be put on K8S for a run beforehand to sift out some obvious problems.

Determining compatibility was the first step in migrating from Yard to K8S, followed by aligning all features of the two systems, including support for three-campus disaster recovery, a lesson that stands out in WeChat's product history.

On July 22, 2013, the main fiber to WeChat's Shanghai data center was accidentally cut, which crippled 2,000 servers at once. WeChat had been deploying three mutually redundant instances of its messaging system's core module in the same server room, a redundant design that drew no notice in the early stage of WeChat's rapid growth, but that one incident caused a nearly five-hour outage of the messaging and Moments services.

[Image: Tencent Qingyuan Data Center. Source: WeChat team]

After that incident, WeChat started to decentralize its servers, and the disaster recovery model of placing separate server rooms in three different buildings emerged. This was also a focus when aligning K8S with Yard.
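The difference between three standby instances in one server room and one instance in each of three locations can be shown with a small sketch. The campus names and the functions `place_replicas` and `survivors` are hypothetical, chosen only for illustration; they do not describe WeChat's real deployment tooling.

```python
CAMPUSES = ["campus-a", "campus-b", "campus-c"]  # hypothetical names


def place_replicas(replicas: int = 3) -> list[str]:
    """Spread the replicas of one module across campuses, round-robin."""
    return [CAMPUSES[i % len(CAMPUSES)] for i in range(replicas)]


def survivors(placement: list[str], failed_campus: str) -> int:
    """Number of replicas still serving after one campus is lost."""
    return sum(1 for campus in placement if campus != failed_campus)


same_room = ["campus-a"] * 3   # all three standby instances in one place
spread = place_replicas()      # one instance per campus

print(survivors(same_room, "campus-a"))  # 0: a single failure takes out every copy
print(survivors(spread, "campus-a"))     # 2: two copies keep serving
```

With all instances behind the same fiber, redundancy counts for nothing once that fiber is lost; spread across three locations, any single failure leaves two instances running.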

"The ability of K8S to support the three campuses well was the first thing to consider at the time." Prudently, the WeChat team set a clear internal requirement for this migration: every step of the migration should be able to roll back to Yard. "The capacity of the Yard platform should be able to withstand the traffic from a K8S platform rollback at all times to ensure no loss of business", said the WeChat team.

All that remains is what K8S will bring to WeChat once it replaces Yard.
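The rollback requirement the WeChat team describes, that Yard must always be able to absorb whatever traffic has already moved to K8S, can be sketched as a simple gate on each migration step. The functions `can_shift` and `migrate`, and the unitless traffic figures, are assumptions for illustration only, not the team's actual procedure.

```python
def can_shift(yard_capacity: float, k8s_traffic: float, step: float) -> bool:
    """Allow moving `step` more traffic to K8S only if Yard could still
    absorb everything on K8S, including the new step, in a rollback."""
    return k8s_traffic + step <= yard_capacity


def migrate(total_traffic: float, yard_capacity: float, step: float) -> float:
    """Shift traffic to K8S step by step and return how much ended up there.

    The loop stops as soon as a further step would make a full rollback
    to Yard impossible.
    """
    k8s_traffic = 0.0
    while k8s_traffic < total_traffic:
        move = min(step, total_traffic - k8s_traffic)
        if not can_shift(yard_capacity, k8s_traffic, move):
            break
        k8s_traffic += move
    return k8s_traffic


# Yard keeps its full capacity: the whole migration can proceed.
print(migrate(total_traffic=100, yard_capacity=100, step=25))
# Yard has been scaled down too early: the migration stops partway.
print(migrate(total_traffic=100, yard_capacity=60, step=25))
```

The design choice this illustrates is that the old platform cannot be shrunk as fast as traffic leaves it; its capacity has to trail the migration until the new platform is trusted.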

Coder to Owner

Software deployments in the DevOps era have accelerated to weekly or even daily, but the split between development and Ops was gradually becoming an obvious efficiency problem within WeChat. Although Dev and Ops are written together, the actual work is done by two teams. The development team finished writing and packaging the code and handed it to the Ops team to deploy online, with the result that Ops staff were unfamiliar with the code logic and developers did not understand the production environment. Such problems occurred frequently within WeChat and often required pulling many people together to deal with urgent issues.

"Something like this pulls down the R&D efficiency of the entire team," many in the WeChat business team said, independently of one another.

The most obvious change for WeChat developers after the migration to K8S is here. Full-stack deployment largely merges the role of operations and maintenance with that of developers. In the words of edselwang, a WeChat infrastructure engineer, "business code writers have gone from being pure Coders to being Owners of a business module".

And because K8S has more comprehensive virtualization support, after the entire R&D system is completed in the cloud, the node deployment is detached from the virtual machine, and the CI/CD (Continuous Integration/Continuous Deployment) process in the development process can be realized more completely as a pipeline-like automatic delivery process, which can be interpreted as a "self-healing" capability.

edselwang gave an example: if a node deployed on a virtual machine breaks, the virtual machine cannot migrate the node directly, so operations staff have to move it manually between two virtual machines. If the node is deployed on the K8S platform, the system schedules it automatically in place of the manual work.
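The automatic rescheduling in this example can be pictured as a reconcile step that moves work off failed machines without anyone intervening. The sketch below is a toy stand-in written for this article, not the Kubernetes scheduler or its API; `reconcile`, the pod names, and the node names are all hypothetical.

```python
def reconcile(placement: dict[str, str], healthy_nodes: set[str]) -> dict[str, str]:
    """Return a new pod -> node mapping in which every pod sits on a healthy node.

    Pods on failed nodes are moved to the healthy node that currently
    hosts the fewest pods (ties broken by node name).
    """
    if not healthy_nodes:
        raise RuntimeError("no healthy nodes available")
    # Pods already on healthy nodes stay where they are.
    result = {pod: node for pod, node in placement.items() if node in healthy_nodes}
    load = {node: 0 for node in healthy_nodes}
    for node in result.values():
        load[node] += 1
    # Everything else is rescheduled onto the least loaded healthy node.
    for pod, node in placement.items():
        if node not in healthy_nodes:
            target = min(sorted(load), key=load.get)
            result[pod] = target
            load[target] += 1
    return result


placement = {"pay-1": "node-a", "pay-2": "node-b", "msg-1": "node-b"}
# node-b has failed; node-c is a spare machine.
print(reconcile(placement, healthy_nodes={"node-a", "node-c"}))
```

Run in a loop against the current health of the cluster, a step like this is what replaces the manual transfer between two virtual machines.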

The practice of having WeChat's entire operations and maintenance team work overtime in front of the servers during the New Year's Eve red envelope peak will also ease once everything is in the cloud.

On a larger level, WeChat was not the first to go to K8S within Tencent. Tang Daosheng, who single-handedly raised QQ, entered the new CSIG division after the 930 change, and QQ subsequently became Tencent's first internal business to go to the cloud comprehensively. The IEG division, where many star game studios are located, also started to put its architecture on the cloud a few years ago.

Tencent's overall K8S environment was built before the WeChat migration, which means that WeChat, after breaking out of Yard, will be more deeply integrated into Tencent Cloud's native facilities in terms of infrastructure development, and the decision cost for new business becomes lower, both in resource scheduling and in the adaptability of system tools.

Such a complex infrastructure ultimately points to a more advanced productivity tool that unlocks the value of people.

Stephen Liu, head of technical architecture at WeChat, expects a fully cloud-native WeChat to eventually become an "autopilot" in the sense of resource scheduling.

"If WeChat before 2014 was Level 0, with Yard it was Level 1, and after making full use of the various capabilities of K8S in 2021, I think we should now be at Level 2." Stephen Liu envisions WeChat's future Spring Festival guarantee being handled entirely by system scheduling, which has to be built on a completely cloud-native WeChat.

2019 was the last time WeChat requested physical servers. With the usual four to five year depreciation schedule, this last batch of physical servers should be out of warranty around the end of 2023, which happens to be 10 years after Yard started being built. At that point WeChat will have truly moved its entire body to the cloud.

Nothing seemed to move, yet WeChat became a new WeChat.
