Is Korean really a low-resource language?

Engineering

Jul 1, 2026

Wonik Cho
Staff Engineer @Samsung Electronics
Youngsook Song
Researcher

Imagine a researcher who just started working on Korean natural language processing. Training a parser or building a sentiment analysis model requires a dataset, so the researcher starts looking for one. Not much usable material turns up. Colleagues offer the standard explanation: Korean is a low-resource language, and there simply isn't enough data available.

The researcher decides to trace the history of Korean language resources to find a dataset. The findings don't match the common belief. KAIST built tree-tagged corpora and morpho-syntactic annotation corpora back in the 1990s. The National Institute of Korean Language assembled the 300-million-word Sejong Corpus. The government still funds hundreds of training datasets through AI Hub today. Korean data is not scarce in any absolute sense.

The real problem lies elsewhere. The data exists, but it's hard to find. Research felt difficult not because resources were missing, but because they were invisible.

To work through this problem, this article draws on the recently updated Open Korean Corpora: A Practical Report, first published in 2020. The report collects Korean datasets into a single list with usage details. Based on that report, this article looks at what Korean data is actually usable right now.

Is Korean genuinely resource-poor, or does it only look that way because resources are scattered and closed off?

What 'Low-Resource' Leaves Out

According to Ethnologue's 2026 figures, Korean has about 82 million speakers. Grouped with Chinese and Japanese as CJK, Korean plays a growing role in multilingual research, and industry demand for it is substantial. Yet compared to the steady stream of Korean NLP papers coming out of international venues, follow-up work building on those papers is noticeably thin. Recent NLP work has stressed the importance of clean, well-organized evaluation data for benchmarking. Follow-up research depends on that kind of data being clearly documented and accessible. This is where Korean falls short. Without a clear map of what exists and under what terms it can be used, the data itself becomes hard to use, regardless of how much of it there is.

Good Data Exists, But the Door Is Narrow

Institutional producers account for a large share of the Korean resource ecosystem. Here are the main organizations and what they've built.

KAIST

KAIST produced many of the earliest resources in Korean computational linguistics. Its tree-tagged corpus, morpho-syntactic annotation corpus, transliteration and translation evaluation sets, and Korean-Chinese multilingual corpus served as reference material for parser training and shared tasks worldwide, and later fed into derived resources like Universal Dependencies treebanks. But as the hosting websites were redesigned over the years, many original distribution links broke, making it hard for new researchers to find the material.

Linguistic Data Consortium (LDC)

The LDC has distributed standardized Korean resources worldwide since it was founded in 1992. Its holdings span text and speech, including Korean Newswire, Korean Treebank, Korean Propbank, and the more recent Penn Korean Universal Dependency Treebank. While academic hosting sites tend to fade over time, LDC's model of permanent archiving and standard licensing keeps citation and access stable.

National Institute of Korean Language

The National Institute of Korean Language sets the standards for Korean while also building large datasets. Its best-known outputs are the Korean dictionary and the Sejong Corpus, and it recently released a roughly 300-million-word corpus covering sentence-level tasks like similarity and entailment. As of February 2026, about 169 datasets are maintained and updated based on user reports and academic feedback.

ETRI

ETRI has spent years collecting and refining language processing and speech training data. Through the Exobrain project, it provides semantic analysis and question-answering databases, part-of-speech tagging and semantic role data, plus construction guidelines.

National Information Society Agency (NIA)

The NIA runs AI Hub, a large-scale data platform. At the government level, it collects labeled and parallel corpora across real-world domains: law, patents, common sense, open dialogue, machine reading comprehension, and machine translation. This includes roughly 1,000 hours of speech corpora and wellness and emotional dialogue data. As of July 2026, AI Hub hosts 968 AI training datasets, 182 of which are Korean-language datasets.

Resources from these institutions are built under clear guidelines by trained annotators, which keeps quality high. But they often come with restrictions. Many are limited to domestic researchers, and even the ones open internationally usually require an application. Modification and redistribution are frequently restricted too, which makes it hard for follow-up research to build on the data.

It helps to narrow the focus here. Rather than counting every Korean resource that exists, looking specifically at what researchers and developers can actually use freely gives a clearer picture of where things stand.

Sorting Out What's Actually Usable

Anyone trying to use a dataset in practice tends to ask three questions: where to get it, whether it can be used commercially, and whether it can be modified and redistributed. The Open Korean Corpora report organizes publicly accessible resources around these three questions and tags each one accordingly.

Documentation

This marks whether public documentation exists explaining how a dataset was built and what it's for. A dataset is tagged int'l if it has documentation an international researcher could check directly, like an English paper, blog post, or GitHub README. It's tagged dom. if only a domestic site or Korean-language guide exists, and none if no official explanation exists at all.

Usage

Each dataset comes with usage restrictions. It's tagged all if both academic and commercial use are allowed, academic if restricted to academic use, and unknown if the terms aren't clear.

Redistribution

A dataset is tagged rd if redistribution with modification is allowed, rd/mod-x if only unmodified redistribution is allowed, none if redistribution isn't allowed, and unknown if the terms aren't specified.

The Open Korean Corpora report attaches a tag like [int'l, all, rd] to each resource, so researchers can quickly identify what fits their needs.
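
As a rough illustration, the three-part tag can be modeled as a small record that validates its values and answers the practical questions above. This is a minimal sketch, not the report's own tooling: the names `CorpusTag`, `parse_tag`, and `allows_commercial_remix` are hypothetical, and only the tag values themselves come from the definitions above.

```python
from dataclasses import dataclass

# Allowed values for each field, taken from the tag definitions above.
DOCUMENTATION = {"int'l", "dom.", "none"}
USAGE = {"all", "academic", "unknown"}
REDISTRIBUTION = {"rd", "rd/mod-x", "none", "unknown"}


@dataclass(frozen=True)
class CorpusTag:
    """Hypothetical record for one [documentation, usage, redistribution] tag."""

    documentation: str
    usage: str
    redistribution: str

    def __post_init__(self) -> None:
        # Reject values that are not part of the tagging scheme.
        checks = (
            ("documentation", self.documentation, DOCUMENTATION),
            ("usage", self.usage, USAGE),
            ("redistribution", self.redistribution, REDISTRIBUTION),
        )
        for field_name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"invalid {field_name} tag: {value!r}")

    def __str__(self) -> str:
        return f"[{self.documentation}, {self.usage}, {self.redistribution}]"


def parse_tag(text: str) -> CorpusTag:
    """Parse a string such as "[int'l, all, rd]" into a CorpusTag."""
    parts = [part.strip() for part in text.strip().strip("[]").split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three fields, got {len(parts)}: {text!r}")
    return CorpusTag(*parts)


def allows_commercial_remix(tag: CorpusTag) -> bool:
    """True when commercial use and modified redistribution are both allowed."""
    return tag.usage == "all" and tag.redistribution == "rd"


tag = parse_tag("[int'l, all, rd]")
print(tag, allows_commercial_remix(tag))
```

Validating on construction means a mistyped tag fails loudly instead of silently dropping out of a later filter.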

Sorting Open Datasets by Category

Resources gathered under this framework fall into 10 categories covering nearly every area of Korean NLP work. If you already know what you want to build, the table below points you to the right data. Sentiment analysis calls for NSMC, hate speech filtering calls for KOLD or K-HATERS, and Korean LLM evaluation calls for KMMLU.

Category	Count	Representative Datasets
Benchmark research	8	KLUE, KoBEST, KMMLU, HAE-RAE Bench
Parsing and tagging	6	UD Korean KAIST, OpenKorPOS, KoNEC
Entailment, similarity, paraphrase	7	KorNLI/KorSTS, ParaKQC, StyleKQC
Intent understanding and sentiment	11	NSMC, 3i4K, KOTE, KPoEM
Hate speech and bias detection	15	BEEP!, KOLD, K-HATERS, KoBBQ
Question answering and dialogue	12	KorQuAD, CLIcK, KorNAT, K-Viscuit
Summarization, translation, transliteration	10	Korean Parallel Corpus, XL-Sum
Korean within multilingual corpora	9	PAWS-X, TyDi-QA, MASSIVE
Speech corpora	9	KSS, Zeroth, ClovaCall, OLKAVS
Other domains	13	LBox Open, KorMedMCQA, KoCHET

Hate speech and bias detection leads with 15 datasets. The "other domains" category, covering fields like medicine, law, and cultural heritage, comes second with 13, followed by question answering and dialogue with 12 and intent understanding and sentiment analysis with 11. Korean resources have moved well beyond basic tasks into many specialized directions.

A chart showing releases by year, distribution by category, and usage license and documentation status by category.

The full landscape in one view. (a) shows releases by year and (b) shows the distribution by category, together tracking when and where resources grew. (c) shows usage licenses and (d) shows documentation status, indicating whether each resource can actually be used. Panels (c) and (d) also let you check, by category, whether commercial use is allowed and whether English documentation exists.

What the Numbers Show

This survey identified 100 corpora in total: 82 Korean-only text corpora, 9 multilingual corpora that include Korean, and 9 speech corpora. Looking at accessibility, 53 percent allow commercial use, 86 percent provide public documentation an international researcher could check directly, and 81 percent allow some form of redistribution.

These numbers show both strengths and weaknesses. The 86 percent documentation rate marks real progress compared to earlier surveys, when many Korean resources lacked an English paper or public README of any kind. Benchmark, multilingual, and speech corpora all had accessible documentation, likely because evaluation resources need to be internationally accessible for reproduction studies to happen. Parsing and tagging lagged at 50 percent, probably because much of that material came out of domestic competitions or institutional projects before open science practices took hold.

On redistribution, 68 percent allow full redistribution with modification and 13 percent allow unmodified redistribution only. Just 5 percent explicitly prohibit redistribution, mostly parsing and tagging resources bound by licensing terms from their source treebanks. Another 14 percent leave redistribution status unclear, meaning some datasets labeled as open still have vague or missing license terms.
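
The redistribution figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this section and assumes the percentages are exact rather than rounded. Because the survey covers exactly 100 corpora, each whole-number percentage then doubles as a corpus count. The variable names are illustrative.

```python
# Figures quoted in the text. With exactly 100 corpora in the survey,
# each whole-number percentage is also a corpus count (assumed exact).
corpus_types = {"Korean-only text": 82, "multilingual": 9, "speech": 9}
redistribution = {"rd": 68, "rd/mod-x": 13, "none": 5, "unknown": 14}

total = sum(corpus_types.values())

# The four redistribution tags should account for every corpus.
assert sum(redistribution.values()) == total

# "Some form of redistribution" = modified plus unmodified-only.
some_redistribution = redistribution["rd"] + redistribution["rd/mod-x"]

print(total, some_redistribution)
```

The sum of the two permissive tags matches the 81 percent reported for "some form of redistribution".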

Looking at the timeline, Korean resources grew in three phases. The early period (2015-2017) saw only four resources released, when Korean NLP work was largely confined within institutions. The growth period (2018-2021) brought foundational resources like parsing corpora, sentiment datasets, and speech corpora as pretrained language models like BERT spread, reaching 35 cumulative corpora by the end of 2021. The acceleration period (2022-2025) added 65 new datasets in just four years, with notable peaks of 21 in 2022 and 25 in 2024.

The 2022 surge came from comprehensive benchmarks like KoBEST and KLUE spreading widely, along with growing hate speech detection datasets as concern about online toxicity increased. The 2024 peak reflects how quickly the Korean community responded to the large language model era, with LLM-specific benchmarks like Ko-H5/Open-Ko-LLM, HAE-RAE Bench, and KMMLU, plus culturally grounded evaluation resources like CLIcK and KorNAT.

A chart showing new releases by year (left) and cumulative releases from 2015 to 2025 (right).

Time distribution of open Korean corpora. The left panel shows new releases by year, with two clear peaks in 2022 (21 datasets) and 2024 (25 datasets). The right panel shows cumulative growth, rising from 1 dataset in 2015 to 100 in 2025.
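
The three phases can be reconciled with the cumulative total using a running sum. This sketch relies only on figures quoted in this section. The growth-period count (31) and the combined 2023 and 2025 count (19) are not stated in the text and are derived here by subtraction.

```python
from itertools import accumulate

# Figures quoted in the text.
early_releases = 4          # 2015-2017
cumulative_by_2021 = 35     # total at the end of the growth period
acceleration_releases = 65  # 2022-2025
peaks = {2022: 21, 2024: 25}

# Derived, not quoted: releases during the growth period (2018-2021).
growth_releases = cumulative_by_2021 - early_releases

phases = [
    ("2015-2017", early_releases),
    ("2018-2021", growth_releases),
    ("2022-2025", acceleration_releases),
]

# Running total after each phase.
running_totals = list(accumulate(count for _, count in phases))

# Derived, not quoted: releases in 2023 and 2025 combined.
off_peak_releases = acceleration_releases - sum(peaks.values())

for (label, count), running in zip(phases, running_totals):
    print(f"{label}: +{count} (cumulative {running})")
print("2023 and 2025 combined:", off_peak_releases)
```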

Recent Growth by Task Type

A stacked area chart with colors separated by task category. The hate speech and bias, and question answering and dialogue areas grow noticeably thicker after 2022.

Breaking the cumulative growth down by task type shows how priorities in the Korean research community have shifted. Hate speech and bias detection stands out as the largest single category at 15 datasets, most released after 2020, tracking growing social awareness of online hate speech on Korean platforms and rising demand for content moderation. It also scores well on accessibility, with 66.7 percent allowing commercial use and 86.7 percent allowing redistribution.

Benchmarks changed in character as well as growing in number. Starting around 2021, benchmark work shifted from evaluating discriminative tasks on encoder-based models (KLUE, KoBEST) to evaluating generative capabilities on decoder-based models (KMMLU, HAE-RAE Bench). Interest has moved toward factuality, cultural appropriateness, and complex reasoning, beyond simple pattern matching. Several recent benchmarks also use held-out test sets or contamination detection to guard against test data leaking into model training data.

Classic pipeline tasks like parsing and tagging, by contrast, have stalled since 2022, as community attention moved toward higher-level language understanding tasks closer to modern LLM applications.

The Challenge of Keeping a List Current

Keeping a resource list current is inherently hard. New datasets keep coming out, and links and licenses for existing ones keep changing. Even the third and most recent version of this report doesn't cover every benchmark released across the Korean community in 2025 and 2026.

That makes a list like this closer to a living document than a paper written once and left alone. When the paper text changes, Open Korean Corpora posts revisions on arXiv, but it also maintains a separate, community-edited public version alongside it. A dedicated data file for the resource list lets researchers filter by the type of work they're doing, and an acknowledgments section keeps track of new contributors as updates come in.

That data file includes each corpus's name, task category, license, and documentation status in a consistent format. This lets other resource-tracking sites like nlpprogress.com update their listings automatically instead of copying tables by hand. Other language communities can build on the same approach to collect and update their own resources, and formats like this are already making datasets easier to access through repositories like Koco, Korpora, and Hugging Face Datasets.
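
To show what filtering such a file might look like, here is a minimal sketch using Python's standard `csv` module. The column names, the CSV layout, and the three `example-corpus-*` rows are assumptions made for illustration; the actual data file in the repository may be structured differently.

```python
import csv
import io

# Hypothetical layout: the real data file's format and column names may differ.
# The rows are placeholders, not entries from the report.
SAMPLE_CSV = """name,category,documentation,usage,redistribution
example-corpus-a,Speech corpora,int'l,all,rd
example-corpus-b,Parsing and tagging,dom.,academic,none
example-corpus-c,Hate speech and bias detection,int'l,all,rd/mod-x
"""


def load_corpora(handle):
    """Read every row of the file into a list of dictionaries."""
    return list(csv.DictReader(handle))


def filter_corpora(rows, **criteria):
    """Keep rows whose columns equal every given criterion."""
    return [
        row
        for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]


rows = load_corpora(io.StringIO(SAMPLE_CSV))

# Corpora usable in a commercial product and redistributable after modification.
open_for_products = filter_corpora(rows, usage="all", redistribution="rd")
print([row["name"] for row in open_for_products])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle; the filter itself stays the same.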

Two Things Left Out: Raw Text and Synthetic Data

What gets excluded matters as much as what gets included. Open Korean Corpora deliberately leaves out two types of material. The first is raw, unprocessed web text. Large-scale pretraining text differs from annotated task data in cleanliness, format, and purpose, and it's better covered by reports on Korean pretraining projects like Polyglot.

The second is fully synthetic datasets, meaning data generated entirely by language models without meaningful human curation. Synthetic data offers a fast way to build datasets, but it raises specific concerns for Korean. Most models used to generate synthetic data are trained primarily on English, so they may not capture Korean linguistic and cultural nuance well, and can carry over bias and errors in the process. Quality and diversity are also hard to verify at scale. Mixing synthetic data in with human-made resources also makes it harder to tell how much of the Korean data landscape is actually human-curated.

Hybrid resources like CareCall and KoCoSa, where LLM-generated content goes through strict human filtering and annotation, are included, since human review provides some quality assurance. As methods for validating synthetic data mature, future revisions may reconsider this exclusion for resources with transparent generation methods and rigorous quality checks.

So, Is Korean Really Low-Resource?

Most papers are written in English, in an academic register. But the people who actually need Korean resources are students and industry practitioners starting research in Korean. Pulling scattered resources together matters, but presenting them in a form people can actually read matters just as much. Organizing public corpora by category, and marking each one by documentation level, usage scope, and redistribution terms, gives researchers a much clearer view of the Korean data landscape. A single table can show what's usable and under what conditions, including the two things researchers care about most: whether commercial use is allowed and whether modified redistribution is permitted.

Calling Korean a low-resource language is only half true. The data isn't missing. It's scattered, closed off, and not well known. High-quality institutional resources exist behind a narrow door, and freely usable public resources are spread thin across the web with little visibility. What's missing is something to map the space between them. The government continues to invest real money in building these databases. For that investment to pay off and reach international use, the work needs to start with updating these lists regularly, standardizing how resources are described, and setting license terms clearly.

References

Paper: Open Korean Corpora: A Practical Report
Repository: ko-nlp/Open-korean-corpora
Ethnologue: Korean
