Data Products



eCorpus Inc., founded in 2022, is a research and development company specializing in natural language processing (NLP) technologies. It is also recognized as holding the most comprehensive Chinese-centered parallel corpus resources, which has earned it a reputation as China's premier parallel corpus supplier.


With the rise of data-driven technology, traditional linguistic methods alone are no longer sufficient to support artificial intelligence (AI) research. Corpora have become essential material for modern linguistics, machine learning, NLP, machine translation, and AI research. Using up-to-date data labeling and organization techniques, we have collected a large volume of bilingual text produced by human translators and translation teams and built it into parallel corpora suited to a wide range of computational research. We have also gathered substantial amounts of text, audio, image, video, and multimodal data (e.g., video-text data) from recording teams, film crews, government agencies, non-governmental organizations, and other raw-data sources. This data has been structured and labeled to produce corpora and datasets that can be used directly for AI development.


Off-the-Shelf Datasets

Our off-the-shelf datasets include a 200,000-hour speech dataset, an 800 TB computer vision dataset, approximately 2 billion NLP data entries, and a 5 TB unlabeled text dataset for large language models (LLMs). The data quality has been tested and trusted by AI companies worldwide.



Data Service

With professional data collection equipment, tools, and environments, and with project managers experienced in data collection and quality control, we can meet collection requirements across a wide range of scenarios and data types.



Industry Experience

Most of our employees have years of experience in data processing and an in-depth understanding of data needs across different scenarios. eCorpus offers reliable data collection and labeling tools, as well as automated data processing capabilities. We provide multi-scenario data solutions for foreign language education, machine translation engine development, large language model research, conversational robot corpus creation, and the automotive, smart home, and AR/VR sectors, among others.


Core Value

We work to provide a solid foundation for the rapid development of the artificial intelligence industry through high-quality data services.



Custom data collection and labeling tailored to your requirements

Serving millions of customers, we respond promptly to diverse needs and support complex collection tasks and specialized data annotation.


Copyright ecorpus.cn, eCorpus China