A corpus is a large collection of texts; the plural form is corpora. The term typically refers to digitized collections stored on computers, with the texts organized according to established formats and labels.
Broadly defined, a corpus is a collection of texts, audio, images, and videos stored on computers in specific formats and with specific labels.
A parallel corpus is a collection of texts in two languages whose contents correspond in meaning, such as source texts paired with their translations.
A corpus is also known as a "corpus collection," and the term is often used interchangeably with "dataset."
A multimodal corpus is a multimedia corpus covering the language, sound, images, and actions involved in the full range of speech activities. It takes speech activities as its research object, uses information and knowledge extraction from raw data as its method, and is driven by contextual models (e.g., text-image corpora and text-video corpora).
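For illustration, the sketch below shows what single entries in a parallel corpus and in a text-image multimodal corpus might look like when stored as JSON Lines. The field names (src_text, tgt_text, image_path, caption, and so on) are hypothetical examples, not a fixed delivery format.

```python
# Illustrative sketch only: field names and values are hypothetical,
# not a standardized corpus format.
import json

# One aligned sentence pair from a Chinese-English parallel corpus.
parallel_entry = {
    "id": "pair_000001",
    "src_lang": "zh",
    "tgt_lang": "en",
    "src_text": "语料库是大量文本的集合。",
    "tgt_text": "A corpus is a large collection of texts.",
}

# One text-image entry from a multimodal corpus.
multimodal_entry = {
    "id": "img_000123",
    "image_path": "images/img_000123.jpg",  # path to the image file
    "caption": "A translator reviews a bilingual document at a desk.",
    "language": "en",
    "labels": ["person", "desk", "document"],
}

# Such corpora are commonly distributed as JSON Lines: one entry per line.
for entry in (parallel_entry, multimodal_entry):
    print(json.dumps(entry, ensure_ascii=False))
```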
eCorpus Inc., founded in 2022, is a research and development company specializing in natural language processing (NLP) technologies. It holds the most comprehensive Chinese-centered parallel corpus resources in the industry and is recognized as China's premier parallel corpus supplier.
With the rise of data-driven methods, traditional linguistic approaches alone are no longer sufficient to support artificial intelligence (AI) research. Corpora have become essential materials for modern linguistics, machine learning, NLP, machine translation, and AI research. Using up-to-date data labeling and organization techniques, we have collected a large volume of bilingual text from professional translators and translation teams, forming parallel corpora suitable for a wide range of computational research purposes. We have also gathered substantial amounts of text, audio, image, video, and multimodal data (e.g., video-text data) from recording teams, filming crews, government agencies, non-governmental organizations, and other sources of raw data. This data has undergone structured processing and labeling to produce corpora and datasets that can be used directly for AI development.
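As a minimal sketch of what "used directly" can mean in practice, the snippet below reads a labeled parallel corpus from a JSON Lines file and keeps only complete sentence pairs before they are passed to a downstream NLP or machine translation pipeline. The file name and field names are assumptions for illustration, not an actual eCorpus delivery format.

```python
# Minimal sketch: load and lightly validate a parallel corpus stored as
# JSON Lines, assuming hypothetical "src_text"/"tgt_text" fields.
import json

def load_parallel_corpus(path):
    """Yield (source, target) sentence pairs, skipping empty or malformed lines."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            src = record.get("src_text", "").strip()
            tgt = record.get("tgt_text", "").strip()
            if src and tgt:  # keep only complete, non-empty pairs
                yield src, tgt

if __name__ == "__main__":
    # "zh_en_parallel.jsonl" is a hypothetical file name used for illustration.
    pairs = list(load_parallel_corpus("zh_en_parallel.jsonl"))
    print(f"Loaded {len(pairs)} aligned sentence pairs")
```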


Off-the-Shelf Datasets
Our off-the-shelf datasets include a 200,000-hour speech dataset, an 800 TB computer vision dataset, approximately 2 billion natural language processing (NLP) data entries, and a 5 TB unlabeled text dataset for large language model (LLM) training. Their quality has been tested and trusted by AI companies worldwide.

Data Service
Equipped with professional data collection equipment, tools, and environments, and supported by project managers experienced in data collection and quality control, we can meet data collection requirements across a wide range of scenarios and data types.

Industry Experience
Most of our employees have years of experience in data processing and an in-depth understanding of data needs across different scenarios. eCorpus maintains reliable data collection and labeling tools as well as automated data processing capabilities. We provide multi-scenario data solutions for foreign language education, machine translation engine development, large language model research, chatbot corpus creation, automotive, smart home, AR/VR, and more.

Core Value
Through high-quality data services, we strive to provide a solid infrastructure for the rapid development of the artificial intelligence industry.