Summer 2023 Internship Reviews
By Dongjin Park
I applied to CUop, a collaboration between universities specializing in science and technology, and worked as an intern at Lablup for 8 weeks.
I wrote about my experiences through onboarding, developing Backend.AI, and attending PyCon.
Motivation for applying
I first learned about Lablup through a PyCon presentation session I stumbled upon. I could tell that the company had a lot of technically deep and passionate members. I applied to Lablup because I was interested in Python and asynchronous programming.
During the first two weeks, we conducted an onboarding process.
We went through the Realtime Web Chat implementation, Backend.AI development environment, and code base seminar.
Realtime Web Chat
This assignment is meant to familiarize you with Python asyncio. You develop a real-time chat app using aiohttp, an asynchronous web framework, and Redis, an in-memory database. The task also includes setting up Docker Compose to bring up Python and Redis together. For more information, see the GitHub Readme.
To broadcast messages, we used Redis pub/sub, which delivers messages to subscribers without storing them. That was a good fit because we had no requirement to store messages. We also registered the Redis pub/sub handler as a task with asyncio.create_task() so that it runs in the event loop.
Realtime Web Chat Launch Screen
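The broadcast pattern described above can be sketched without a Redis server. This is a minimal illustration, not the assignment's actual code: `InMemoryPubSub`, `relay`, and `demo` are hypothetical names, and the in-memory class only stands in for Redis pub/sub so the example runs on its own.

```python
import asyncio


class InMemoryPubSub:
    """Hypothetical stand-in for Redis pub/sub: delivers messages, stores nothing."""

    def __init__(self) -> None:
        self._subscribers = []

    def subscribe(self) -> asyncio.Queue:
        # Each subscriber gets its own queue, like one pub/sub connection per client.
        queue = asyncio.Queue()
        self._subscribers.append(queue)
        return queue

    def publish(self, message: str) -> int:
        # Fan the message out to current subscribers only; later subscribers miss it.
        for queue in self._subscribers:
            queue.put_nowait(message)
        return len(self._subscribers)


async def relay(queue: asyncio.Queue, inbox: list, expected: int) -> None:
    # Plays the role of the background task that forwards channel
    # messages to one connected chat client.
    for _ in range(expected):
        inbox.append(await queue.get())


async def demo() -> dict:
    bus = InMemoryPubSub()
    inboxes = {"alice": [], "bob": []}
    # Register one relay per client as a task so it runs in the event loop.
    tasks = [
        asyncio.create_task(relay(bus.subscribe(), inbox, expected=2))
        for inbox in inboxes.values()
    ]
    bus.publish("hello")
    bus.publish("world")
    await asyncio.gather(*tasks)
    return inboxes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real deployment the queue would be replaced by a subscription on a Redis channel, and the relay would write to each client's WebSocket instead of a list.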
I was able to understand the basic behavior of asyncio. I was able to ask questions and solve the difficult parts. I think Lablup has a great environment for interns and junior developers to grow because they can freely ask questions through Microsoft Teams.
Build the Backend.AI development environment
I installed Backend.AI on a VM Farm and a local VM and tried to run it myself. I read the official documentation, but the process was not smooth. I encountered various errors, which I solved by sharing with my fellow interns and asking questions on Teams.
💡 A VM farm is an environment where virtual machines are managed and run. Lablup has its own VM Farm.
This was my first experience of developing with a VM Farm and VSCode connected via SSH. To develop Backend.AI, you need to run multiple processes (Manager, Agent, Storage proxy, Web server, etc.) and Docker containers, which drains a laptop battery quickly. With the VM Farm, the laptop carries very little load because all it needs is an SSH connection. In fact, I was able to develop for a long time using the VM Farm when I was out of the office and couldn't charge my laptop.
Code Base Seminar
After looking at the code, focusing on the difficult parts of Backend.AI, I prepared a seminar based on my understanding. I was in charge of presenting the Manager, Agent, and GraphQL parts.
Since Backend.AI is open source, the official documentation is well written. I studied the architecture of Backend.AI by looking at the official documentation to understand the overall structure and asking the employees directly if I had any questions. Since the topic of the presentation was session/kernel creation and scheduling control of Backend.AI Manager, I studied the manager code and analyzed the logs of the manager process.
A sequence diagram I drew in preparation for a seminar presentation.
Analyzing the logs, I found a bug that caused the session state to change from Preparing back to Pulling. It was rewarding to analyze the logs one by one. It was difficult to analyze the logs in order because of the asynchronous code base, but drawing a call graph and a sequence diagram was very helpful.
After onboarding, I started working on Backend.AI. I worked on GitHub issues, either volunteering for issues that were raised or finding and resolving issues myself.
I chose high-priority issues that I was confident in addressing. I wanted to make as many contributions as I could in the short time frame of two months.
I created a PR to fix a bug I found while preparing for the seminar. It was an easy PR that fixed an API parameter. Still, it was a good chance to learn the branch naming convention, the commit convention, and news fragments, to go through the CI (Continuous Integration) process with GitHub Actions, and to work through some Git-related trial and error early on.
💡 A news fragment is a one-sentence Markdown description of what the branch created by the PR is trying to do. You want to keep it simple and clear so that if you see this PR again in the future, you'll know what it was trying to do.
PRs for vfolder
When an issue suitable for an intern was posted on Teams, I jumped at the chance and volunteered. I had to learn a new concept called vfolder, but I knew it would be important for getting to know the product.
Only admin can create a project type vfolder. It should be possible to create a vfolder regardless of the max_vfolder_count of the keypair resource policy, but if the user type vfolder exceeds the max_vfolder_count, the project type vfolder cannot be created. At first, I was confused by the terminology, but by analyzing the code and asking questions, I was able to interpret the terminology.
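The intended rule can be written down as a small check. This is only a sketch of the behavior described above, not Backend.AI's actual code: the function name, its parameters, and the assumption that the limit means "fewer than max_vfolder_count user vfolders" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeypairResourcePolicy:
    # Only the field mentioned in the text; the real policy has more settings.
    max_vfolder_count: int


def can_create_vfolder(
    ownership_type: str,
    is_admin: bool,
    user_vfolder_count: int,
    policy: KeypairResourcePolicy,
) -> bool:
    """Hypothetical check for the intended behavior described in the text."""
    if ownership_type == "project":
        # Project vfolders: admin only, and NOT limited by the keypair's quota.
        return is_admin
    # User vfolders: limited by the keypair resource policy (assumed semantics).
    return user_vfolder_count < policy.max_vfolder_count


policy = KeypairResourcePolicy(max_vfolder_count=10)
# The reported bug: this case was rejected because the user quota was exceeded.
print(can_create_vfolder("project", True, user_vfolder_count=12, policy=policy))
```

The buggy behavior corresponds to applying the user-vfolder quota check before looking at the ownership type.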
Fixed new bugs discovered while addressing PR (1).
I found another issue related to the one behind PR (1). DB migration and GraphQL were new to me, but I wanted to try them out, so I volunteered. I used a DB migration tool called Alembic, studied GraphQL schema concepts, and modified the query and mutation code to remain backward compatible. I tried to test the modified code with cURL, but GraphQL requests are much longer than REST API requests, which made this cumbersome. With help from interns and employees who are familiar with GraphQL, I wrote Python code that tests the modified query and mutation from a CLI to make testing easier.
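A CLI helper of the kind described above might look like the following. It is a sketch under assumptions, not the script from the PR: the flag names are made up, the example query and its fields are hypothetical, and the helper only builds the request body (the standard `query` plus `variables` JSON object) without sending it anywhere.

```python
import argparse
import json


def build_payload(query: str, variables: dict) -> str:
    # A GraphQL request over HTTP carries the query text and its variables as JSON.
    return json.dumps({"query": query, "variables": variables})


def parse_args(argv: list) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Build a GraphQL request body")
    parser.add_argument("--query", required=True, help="GraphQL query or mutation text")
    parser.add_argument(
        "--var",
        action="append",
        default=[],
        metavar="KEY=VALUE",
        help="variable for the query; may be repeated",
    )
    return parser.parse_args(argv)


def main(argv: list) -> str:
    args = parse_args(argv)
    # Split each KEY=VALUE once so values may themselves contain '='.
    variables = dict(item.split("=", 1) for item in args.var)
    return build_payload(args.query, variables)


if __name__ == "__main__":
    # Hypothetical query; the field names do not come from the real schema.
    print(main(["--query", "query($name: String!) { vfolder(name: $name) { id } }",
                "--var", "name=demo"]))
```

Keeping the payload builder separate from argument parsing makes it easy to reuse the same function in test code.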
WSProxy related PRs
I volunteered for an issue posted in Teams. There was a bug in the WebUI where a session could not be deleted if the wsproxy address of its resource group was invalid. I also wanted to get some experience with WebUI development.
I read through the WebUI code to troubleshoot the issue, but I couldn't quite grasp the concept of wsproxy. I realized that wsproxy has v1 and v2, but it was not easy to understand the difference between the two, so I asked the employees. The main difference between v1 and v2 is the path of the traffic: v1 goes through the manager to communicate with the container, while v2 can communicate directly with the container without going through the manager, which is faster. Once I understood what wsproxy does and the difference between v1 and v2, it was easier to understand how the code worked, and I realized that a lot of people didn't know the difference. I realized that questions that seemed easy to ask might not have been asked in the organization.
Because the WebUI needs to stay backward compatible with Backend.AI 22.09, our CEO pointed out in review that it should also handle HTTP status 404, so we made it handle both 404 and 500.
However, after the code was merged, a bug occurred: when setting up v1 wsproxy, the return value for wsproxy-version disappeared. This happened because I was modifying the core code and didn't handle all the branches. I fixed the code in a hurry, but it was a simple mistake, and I realized that I should write test code to prevent such mistakes.
PRs related to Manager
In preparation for the seminar, I came across an issue with manager that I had been studying. With my internship coming to an end, I thought I could contribute by resolving the issue on the code I knew best.
This PR changed substantially through code review. Initially, we designed the scheduler health check and the scheduler trigger as a single API. After review, I split them into separate APIs to separate responsibilities. We originally stored health information only for the schedule function, but to get a complete picture of the scheduler's health we extended this to all three functions that run periodically on the scheduler's global timer: schedule, prepare, and scale_services. The trigger API was built on top of that. We also changed the design to store the scheduler's health per manager ID, because there may be multiple manager processes.
The scheduler's storage was also reviewed. Initially, we looked at the code in the existing manager state API to store manager state in etcd and did the same for scheduler state. etcd is good for storing configuration information that needs to be kept consistent, but it is slow to write. redis is a volatile database, but it performs well even with many reads/writes. Since the scheduler state is read/written periodically and does not need to be consistent, we switched to storing it in redis.
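The design that came out of the review can be illustrated roughly as follows. This is a sketch, not the merged implementation: `FakeRedis` is a dict-backed stand-in so the example runs without a server, and the key layout and the stored fields are assumptions. Only the three function names come from the text.

```python
import json
import time


class FakeRedis:
    """Dict-backed stand-in for a Redis client (only get/set are modeled)."""

    def __init__(self) -> None:
        self._data = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)


# The three functions driven by the scheduler's global timer.
PERIODIC_FUNCS = ("schedule", "prepare", "scale_services")


def health_key(manager_id: str, func_name: str) -> str:
    # Hypothetical key layout: one entry per manager process and function.
    return f"manager.{manager_id}.{func_name}"


def record_health(store, manager_id: str, func_name: str, now=None) -> None:
    # Called each time a periodic function finishes a run.
    timestamp = time.time() if now is None else now
    store.set(health_key(manager_id, func_name), json.dumps({"last_run": timestamp}))


def read_health(store, manager_id: str) -> dict:
    # Collect the latest record of every periodic function for one manager.
    result = {}
    for name in PERIODIC_FUNCS:
        raw = store.get(health_key(manager_id, name))
        result[name] = json.loads(raw) if raw is not None else None
    return result


store = FakeRedis()
record_health(store, "manager-1", "schedule", now=100.0)
print(read_health(store, "manager-1"))
```

Because the records are overwritten on every timer tick and losing them is harmless, a fast volatile store suits them better than a consistent but slow-to-write one.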
Now that I had a good understanding of the Manager part of Backend.AI, I wanted to understand another important component: the Agent. I came across an issue about the Agent, so I took a look.
While Backend.AI was running, there was a bug where the internal state of the Agent did not match the state of the actual working container. As a result, when creating a session, an InsufficientResource Error was thrown during the resource allocation process, even though there were actually enough resources. When the error occurred, we needed to improve the logging to understand what went wrong during the resource allocation process.
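The kind of logging that helps here can be sketched as follows. This is an illustration under assumptions, not the Agent's code: the `allocate` function, the slot names, and the log format are hypothetical, and only the exception name comes from the text.

```python
import logging

log = logging.getLogger("agent.alloc")  # hypothetical logger name


class InsufficientResource(Exception):
    pass


def allocate(available: dict, requested: dict) -> dict:
    """Return the remaining slots, or log the shortfall and raise."""
    shortfalls = {
        slot: (amount, available.get(slot, 0))
        for slot, amount in requested.items()
        if amount > available.get(slot, 0)
    }
    if shortfalls:
        # Log what the agent *believes* is available next to what was requested,
        # so a mismatch with the real container state becomes visible in the logs.
        for slot, (want, have) in shortfalls.items():
            log.warning(
                "allocation failed: slot=%s requested=%s available=%s", slot, want, have
            )
        raise InsufficientResource(", ".join(sorted(shortfalls)))
    return {slot: amount - requested.get(slot, 0) for slot, amount in available.items()}


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(allocate({"cpu": 4, "mem": 8}, {"cpu": 2, "mem": 4}))
```

With the believed availability in the log line, an error raised while the host clearly has free resources points directly at stale internal state.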
It took a while to figure out the resource allocation process. The concurrency issues were difficult, and it took a lot of Q&A with the CTO to get a general idea of the flow and what to log.
A few weeks after my internship ended, the CTO merged the PR after adding more than 10 commits of refactoring and test code. What impressed me was that he wrote test code to reproduce the error. I had gone through a complex manual process (see PR) to reproduce the error, which took a lot of development time. I could see the difference in productivity here. I had considered writing test code, but I thought the implementation would be too complicated and that my internship would end before I finished it. In the future, I shouldn't be so intimidated by writing test code; I should just try it and learn as I go. The refactoring also seemed to focus on code readability. The functions in the part I modified were too long and hard to read, but after refactoring they were shorter and the logging was cleaner. I realized that I shouldn't stop at a working implementation but should try to write good code.
On August 12 and 13, Lablup had a booth at PyCon. Companies that sponsor PyCon are given the opportunity to run a booth. I was an intern, but I wanted to participate in the booth activities and listen to some of the talks. The company had some PyCon tickets left over, so I was able to participate.
At the Lablup booth, we ran an event that challenged attendees to get Llama2 to write a 10-line pyramid from a prompt. The challenge wasn't that easy, and it was important to explain the task in a way that Llama2 could understand. Two people were drawn from those who submitted a correct answer to win a Nintendo Switch and a Logitech mouse. My role at the booth was to help direct PyCon attendees to the event, and since there were a lot of employees at PyCon, I was free to attend any talks I wanted to hear. Since Lablup is an open source company, it encourages people to contribute to open source and participate in conferences. In fact, four of our members presented at PyCon, which shows how much the company values participating in conferences.
Lablup Inc. booth
During the RustPython session, a tool called ruff was introduced as a replacement for the Python lint tools flake8 and isort. Since ruff is written in Rust, it is about 100x faster than flake8. At Backend.AI, we were using flake8 and isort for linting, but after our CTO reviewed ruff, I watched him adopt it for the Backend.AI project right on the stairs of COEX. I saw how skilled he was at the whole process around coding: he applied a new tool to the project in a short time and even revised the official documentation. It made me want to become that proficient a developer someday. After PyCon, I read the updated official documentation, applied ruff to my Backend.AI development environment, and experienced the 100x faster linting myself. If I hadn't participated in PyCon, I wouldn't have gotten up to speed on these great tools so quickly. I hope to continue participating in developer conferences in the future.
Group photo with members of Lablup Inc.
End of internship
During my internship, I tried to get as much experience as possible, and I wanted to contribute a lot. In the end, I was able to experience a lot because I tried to contribute a lot. I was quick to volunteer for issues that came up in Teams, so I was able to understand the core components of Backend.AI: vfolder, wsproxy, web-ui, manager, and agent. I also learned new concepts like DB Migration, GraphQL, and etcd. Although it was a bit physically demanding to attend the conference from morning to evening on the weekend, it was fun to listen to more than 10 presentation sessions, get inspired, and meet various people through booth activities.
During my internship, I actively asked questions about things I didn't understand, which helped me solve issues quickly. I think I was able to ask so many questions because Lablup has a horizontal culture and many people were kind enough to answer them. I would like to take this opportunity to thank the members for their support.
I was able to experience a variety of things, including asynchronous programming, GitHub collaboration, presenting seminars in English, and attending conferences. I feel like I've grown a lot as a developer through this program. I recommend the Lablup internship to anyone who is thirsty for growth.
This post is automatically translated from Korean
22 November 2023
2023 Lablup DevOps Summer Retrospect
By Gyubong Lee
In this post, I'll share my experience as a developer at Lablup over the past 9 months.
Table of Contents
- Motivation to apply
- From Intern to DevOps!
- rraft-py Development
- Open Source Contribution Academy Regional Sprint Backend.AI Mentoring
- Attending various conferences
- 2023 Open Source Contribution Academy
- Presenting at PyCon
Motivation to apply
Even before I joined Lablup, I knew that I wanted to have a career where I could continue to help others through the programs I develop, whether as a hobby or during work hours.
Open source was particularly appealing to me because it meant that not only could my code help others, but that they could freely modify and utilize it if they wanted to.
One of the things I realized after working on my own project, Arvis, for my graduation project, is that it's not really easy to keep a project going simply because it's something I love to do, as it keeps growing in size. I tried to plan and execute the project carefully from the beginning, but in the end, I realized that I underestimated the time and effort required to maintain the project.
In that regard, Lablup, which actively encourages and supports open source-related activities and even develops core parts of its source code as open source, was the company of my dreams.
From Intern to DevOps!
The last three weeks of my internship at Lablup were spent studying and researching distributed systems, specifically implementing the Raft algorithm. Although my job title changed from intern to DevOps, I still felt like I was building on what I learned during my internship, including Raft, to solve the issues I had worked on as an intern.
I've been involved in a variety of other activities that I'll mention below, but my main work at the company to date has been writing rraft-py, a Python binding of a Raft implementation intended to replace the existing distributed lock structure, and thinking about how to integrate it with Backend.AI.
rraft-py Development
rraft-py is a Python binding implementation of tikv/raft-rs, and you can read more about it in the GitHub Readme / Wiki. I'll also be presenting some technical details on the topic in my PyCon KR 2023 talk next month, if you're interested.
For now, I'm going to focus on my experience as a Lablup developer, leaving aside the technical details of what I learned while developing rraft-py.
I had to think a lot about rraft-py because it was not just about fixing an issue in Backend.AI, but also about creating a separate project and integrating that project with Backend.AI.
Overall, there were several milestones in the project, and I feel like I was able to move forward with a little more stability after each one. There was definitely a high sense of accomplishment each time, but there were also many times when I was frustrated because I realized later that the code I had initially written didn't work the way I intended. But Lablup allowed me the time for this kind of trial and error, and I think I've gotten to where I am today because of things I learned from it that I would otherwise have dismissed as wasted effort.
Results of running the rraft-py example code
There's still a long way to go to integrate rraft-py into Backend.AI, but the bottom line is that it's great to have the experience of thinking for yourself and making your own decisions as you continue to evolve your project, and for developers who like this kind of experience, Lablup could be one of the best options out there.
Open Source Contribution Academy Regional Sprint Backend.AI Mentoring
While rraft-py development was my main focus, as it required more time than I had anticipated, I also had the opportunity to work on a variety of other projects.
One of the most memorable experiences was participating in the 1st Daegu Open Source Contribution Academy Regional Sprint as a Backend.AI mentor.
In fact, I participated as a mentor without a deep understanding of Backend.AI, and to make matters worse, the sprint period was only 2 days, so I was worried about many things.
To make sure that the mentees learned at least one thing and went home as satisfied as possible, I had to prepare a lot. I had to think about how to explain Backend.AI to people who didn't know it at all, and how to build a development environment on different platforms. Personally, I usually develop only on macOS with Docker Desktop, but some of the mentees were working on Windows, so I had to struggle through setting up the development environment there.
In the end, I learned much more than I expected because these processes were unfamiliar to me, and the mentees followed along better than I expected. Everyone created at least one PR, so I think it was a meaningful time for all of us.
The 1st Daegu Open Source Contribution Academy Regional Sprint
Attending various conferences
We had the opportunity to participate in various conferences and exhibitions such as AI Expo, AWS Summit, and Next Rise. It was great to learn how to explain Backend.AI to different types of people, and it was also interesting to see the different technologies of other companies.
AI EXPO KOREA 2023
2023 Open Source Contribution Academy
As a company with an open source culture, Lablup actively participates in the Open Source Contribution Academy every year. Lablup encourages participation in projects other than Backend.AI as well, so this year I joined the Open Source Contribution Academy as a mentee and have been working on GlueSQL.
I think this culture of freedom is very attractive to developers with a strong desire to grow.
(In addition to myself, there are two other people involved in other projects in the 2023 Contribution Academy).
Presenting at PyCon
Based on my experience developing rraft-py at Lablup, I was also given the opportunity to present at PyCon KR 2023.
Personally, I'm a bit nervous because it's my first time presenting in public, but I'm doing my best to prepare. For those who are interested in the talk, I am looking forward to sharing not only the presentation materials but also the source code and work history through GitHub.
Lablup is a company with a strong open source culture, encouraging participation in various open source and community-related events such as the Open Source Contribution Academy (https://www.oss.kr/contribution_academy) and PyCon, and giving developers the opportunity to take initiative in their work.
I hope to continue to participate, learn, grow, and contribute to open source activities of various nature at Lablup.
This post is automatically translated from Korean
18 July 2023
Learning from history
By Sanghyeon Seo
ChatGPT and the other large language models that have gained traction in the last half-decade didn't just fall out of the sky. We've seen it many times throughout history: a cumulative development of technology reaches an inflection point and rapidly transforms society. Sometimes, the path to that inflection point is strikingly similar for technologies that evolved in very different times and contexts.
Mankind has long wanted to fly - it's no coincidence that Civilization 6 has "Sogno di Volare" ("The Dream of Flight") as its theme song.
How did the Wright brothers realize their dream? In 1899, Wilbur Wright writes to the Smithsonian Institution and asks them to send him what is known about airplanes. After three months of reviewing the material he receives, Wilbur concludes that not much is known about flight, except that it's a problem. Plausible theories have turned out to be false and improbable theories have turned out to be true, he writes, so he can't believe anything he hasn't seen for himself.
What Wilbur wanted to know from his literature review was this: what do we need to know about flying? What of it is already known? What are the remaining problems that need to be solved? Surprisingly, Wilbur was able to answer all of these questions from his literature review, something his competitors were unable to do.
In a 1901 lecture, Wilbur summarized his conclusion: "There are three problems with flying. You need wings to make the airplane float, you need an engine to make the airplane go, and you need a way to control the airplane."[^1]
[^1]: Some Aeronautical Experiments (Wilbur Wright, 1901).
Wilbur saw that the wing problem and the engine problem had been solved to a certain extent, so what remained was the control problem. But this was a chicken-and-egg situation: to solve the control problem he needed an airplane, and to build an airplane he needed to solve the control problem. Wilbur concluded that he could solve the problem of controlling an airplane by first tackling the problem of controlling a glider.
To test the glider, he needed high hills and strong winds, and the hills had to be sand dunes for the safety of the experimenters. In 1900, Wilbur requested data from the Weather Bureau to review the windiest places in the United States. The staff at the Kitty Hawk Weather Station wrote back that the beach next to the station was unobstructed and would be suitable for the experiment.[^2]
[^2]: Letter from J. J. Dosher, Weather Bureau, to Wilbur Wright, August 16, 1900.
The 1901 experiment was a disappointment: the wing didn't have enough lift. The Wright brothers had used data from Otto Lilienthal to calculate the area of the wing, and they became suspicious of its accuracy.
After analyzing their experimental data, they concluded that John Smeaton's value of the proportionality constant, which had been used without question for over 100 years, including by Otto Lilienthal, was incorrect.
In order to systematically analyze the lift of wings without the time-consuming and laborious glider experiments, the Wright brothers built a wind tunnel. Their analysis showed that Lilienthal's data was correct apart from the incorrect value of the proportionality constant, but that the wing Lilienthal had used was inefficient.
The 1902 glider that resulted from this analysis had a larger wing area to offset the revised value of Smeaton's constant and a flatter shape to increase efficiency. (They changed the camber of the wing from 1/12 to 1/24.) The new glider flew very well.
That's how the Wright brothers were able to make their historic first flight at Kitty Hawk in 1903.
Humans have long wanted to talk to machines - countless science fiction novels and movies bear witness.
To create AI, OpenAI had to solve three problems. Computing infrastructure, models, and data. You can think of the computing infrastructure as the engine, the models as the wings, and the data as the controls.
To manage the compute infrastructure, OpenAI used Kubernetes, but it wasn't something they could just grab and go. When they hit 2,500 nodes, they ran into problems with the Linux kernel ARP cache overflowing,[^3] and when they hit 7,500 nodes, they had to fix a bug in Kubernetes to enable anti-affinity.[^4]
[^3]: Scaling Kubernetes to 2,500 nodes, January 18, 2018.
[^4]: Scaling Kubernetes to 7,500 nodes, January 25, 2021.
(Advertisement: Lablup's Backend.AI has already been used in practice on large clusters for AI training and inference, solving a number of scaling problems and implementing its own scheduling algorithms to support features like affinity and anti-affinity.)
The scaling law for AI is the equivalent of the lift equation for an airplane. Just as the lift equation describes the lift of a wing in terms of the area of the wing, the lift coefficient, and the proportionality constant, the scaling law describes the loss of an AI model in terms of the size of the model, the size of the data, and the power law exponent.
Just as the Wright brothers discovered that John Smeaton's constant of proportionality was 0.003, not 0.005, the power law exponent of scaling law was initially thought to be 0.73,[^5] but was actually found to be 0.50.[^6] The incorrect value was calculated because the learning rate was not adjusted for the size of the data.
[^5]: Scaling Laws for Neural Language Models, January 23, 2020.
[^6]: Training Compute-Optimal Large Language Models, March 29, 2022.
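To make the two corrections above concrete, here is a small numeric sketch. It assumes the Wright-era form of the lift equation, L = k · S · V² · C_L, and it assumes the exponents quoted above describe how the compute-optimal model size grows with training compute (N ∝ C^a), which is how the two cited papers report them. The function names and the sample numbers are illustrative only.

```python
def lift(k: float, area: float, velocity: float, lift_coefficient: float) -> float:
    """Lift equation in the form L = k * S * V^2 * C_L."""
    return k * area * velocity**2 * lift_coefficient


def area_for_lift(target_lift: float, k: float, velocity: float, lift_coefficient: float) -> float:
    """Wing area needed to reach a target lift, solved from the equation above."""
    return target_lift / (k * velocity**2 * lift_coefficient)


def model_growth(compute_factor: float, exponent: float) -> float:
    """Factor by which the compute-optimal model size grows when compute
    is multiplied by compute_factor, assuming N_opt is proportional to C**exponent."""
    return compute_factor**exponent


# Illustrative inputs, not historical measurements.
target, speed, c_l = 200.0, 20.0, 0.5
old_area = area_for_lift(target, 0.005, speed, c_l)  # Smeaton's accepted value
new_area = area_for_lift(target, 0.003, speed, c_l)  # value quoted in the text
print("wing area ratio:", new_area / old_area)

# With 10x more compute, the two exponents recommend very different model sizes.
print("model growth at 0.73:", model_growth(10, 0.73))
print("model growth at 0.50:", model_growth(10, 0.50))
```

A smaller constant means the same wing produces less lift, so the wing has to grow by the ratio of the two constants. In the same way, the smaller exponent means a much smaller model (and correspondingly more data) for the same increase in compute.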
OpenAI knew that control of the model was an important issue, so before they trained their first GPT, they were already working on reinforcement learning from human preferences,[^7] which they applied to the control of robots. This is reminiscent of the Wright brothers studying bird flight to learn how to control airplanes and applying it first to gliders instead of airplanes.
[^7]: Deep reinforcement learning from human preferences, June 12, 2017.
To apply this research to language models, human preference data was collected, resulting in InstructGPT.[^8] It's hard to know exactly what OpenAI does because it hasn't published its research since GPT-4, but other research shows that models can learn not only from explicit feedback but also from implicit feedback, such as users regenerating a response or continuing or stopping a conversation.[^9] If so, OpenAI could create a positive feedback loop in which better models attract more users, and more users provide the data to improve the models.
[^8]: Training language models to follow instructions with human feedback, March 4, 2022.
[^9]: Rewarding Chatbots for Real-World Engagement with Millions of Users, March 10, 2023.
In this article, we've compared how humans learned to fly with how they learned to talk to machines, and we've seen some very similar patterns in the evolution of technology throughout history.
What other examples will we see in the future as AI technology advances and we work to make it accessible to more people? Can Lablup and Backend.AI help accelerate that process, allowing people to experiment and apply what we've learned from history more quickly? We're in the middle of this inflection point.
This post is automatically translated from Korean
12 July 2023
Developer Advocate in Lablup
By Jongmin Kim
The content of this post is based on my own personal experience, which is not necessarily true in all cases, and opinions may vary.
What is DevRel?
As the name implies, DevRel's primary purpose is to create relationships with developers. This includes not only the developers who write the code, but also the planners, designers, and everyone else involved in creating services and products. The specific role of a DevRel varies a lot depending on the nature of the organisation and the industry, but the objectives often converge on promotion (of a product or technology) or recruiting.
The primary roles of a DevRel are to:
- Promote the product and evangelize the technology
- Build, run, and support the community
- Gather user input and feed it back into product development
- Provide a variety of resources to help users better use the product
Developer Advocate can be seen as a role within DevRel that is a little more focused on technical evangelism and resource provision.
In order to make their products and services known to the public, companies conduct marketing and public relations activities through various channels and events. These activities are called PR - Public Relations, which literally means creating a relationship with the public. In companies that develop or service IT products, the main consumers are engineers (developers). Products aimed at engineers often need to provide more detailed and technical information, which requires different resources and strategies than general PR activities. This is DR - Developer Relations, which, as the name suggests, means creating a relationship with developers.
The community is arguably the most important driver of an IT ecosystem, especially for open source products like Backend.AI. Moreover, when new technologies of a similar kind come along, the size of the community is often a factor in choosing and adopting them, in addition to performance or maturity. One of the important roles of DevRel is to help and engage with these communities.
A great analogy I like to use to explain the importance of community is the relationship between an artist and their fan club. An artist's primary role is to create music and perform, but it's through their fanbase that viral and secondary creations are born, and it's an important driver of their career and the entertainment business. Nowadays, entertainment companies are well aware of the importance of fandom and support various activities.
Image created by Midjourney
Similarly, the DevRel role in an IT organisation is to support and communicate with the various communities and ensure that the product and organisation don't lose focus.
Developer Advocate in Lablup
Lablup has long been active in the Korean developer community through Wooyoung Yoo, and our CEO, Jeongkyu Shin, is also active in various developer communities, so we have already done a lot of DevRel activities for an organisation of our size. Backend.AI is a great piece of software, and it's an important tool, especially in this era of large-scale AI. However, as a tool for AI training and inference, it requires a lot of preparation and resources to set up, which can be daunting, especially if you're new to AI (which I am 😅).
Since Backend.AI does not have a large user base yet, we have been conducting various community activities such as the [Open Source Contribution Academy](https://www.oss.kr/contribution_academy) in connection with open source culture. For our own community to grow, information needs to be produced and shared among users, in addition to the one-way transmission of technology from creators to users. To make that possible, my main task is to keep creating and improving resources and opportunities for exchange that anyone can easily use. I will continue to strive to communicate with you through various opportunities and channels.
This post is automatically translated from Korean
24 March 2023