OLAC Record
oai:www.ldc.upenn.edu:LDC2025T03

Metadata
Title:The Xi’an Multi-Language Learner Corpus
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Zhang, Xiao, et al. The Xi’an Multi-Language Learner Corpus LDC2025T03. Web Download. Philadelphia: Linguistic Data Consortium, 2025
Contributor:Zhang, Xiao
Zhang, Ling
Dang, Tian
Feng, Yuanzhao
Ji, Yujing
Jiang, Xiaohui
Kang, Zhewen
Lu, Yan
Nie, Wen
Ren, Hanyu
Wang, Canjun
Wang, Jiayi
Wang, Yu
Wu, Chen
Wu, Mei
Xu, Tingting
Yang, Ruhai
Zhao, Kai
Zhao, Ran
Zhou, Quanjie
Zhu, Lei
Date (W3CDTF):2025
Date Issued (W3CDTF):2025-03-17
Description:*Introduction* The Xi’an Multi-Language Learner Corpus was developed by Xi’an International Studies University (XISU). It comprises 526 argumentative essays in 15 languages written by Chinese L1 university students studying second languages, along with student metadata and writing prompts. It was developed to support second language learner research and to provide a database for cross-linguistic comparison of second languages.
*Data* The essays were produced by undergraduate students at XISU and Yunnan Minzu University (YMU) in response to writing prompts prepared by the corpus development team. Data was collected in 2023 and 2024. Participating students were linguistics majors or were studying one of the foreign languages available at XISU and YMU. Off-topic essays and incomplete texts were excluded. All texts were cleaned and formatted. The texts were not corrected for grammatical tense or accuracy of phrasing.
Text and token counts by language are as follows:
Language     Texts   Tokens
Arabic           8    1,762
English        107   32,822
Filipino        10    1,371
French         129   39,944
German          78   10,941
Hindi           16    2,972
Indonesian      14    2,630
Korean          24    2,630
Malay           36    5,208
Persian         12    1,751
Russian         33    8,018
Swahili         10    1,840
Thai            12    1,661
Turkish         22    3,719
Urdu            15    3,645
LancsBox X 4.0 was used to count tokens for Swahili, Persian, French, Urdu, and Hindi. AntConc 4.2.4 was used to count tokens for the other languages.
The essays and writing prompts are stored in UTF-8 encoded plain text files. Metadata is presented in .csv files.
*Samples* Sample text file (French)
*Updates* None at this time.
Extent:Corpus size: 4735 KB
Identifier:LDC2025T03
https://catalog.ldc.upenn.edu/LDC2025T03
ISLRN: 615-404-265-320-6
DOI: r333-vr13
Language:Arabic
Filipino
English
French
German
Hindi
Indonesian
Korean
Malay (macrolanguage); Malay
Persian
Russian
Swahili (macrolanguage); Swahili
Thai
Turkish
Urdu
Language (ISO639):ara
fil
eng
fra
deu
hin
ind
kor
msa
fas
rus
swa
tha
tur
urd
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2025T03
Rights Holder:Portions © 2025 Xi’an International Studies University, © 2025 Trustees of the University of Pennsylvania
Type (DCMI):Text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2025T03
DateStamp:  2025-03-17
GetRecord:  OAI-PMH request for simple DC format
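The GetRecord entries above correspond to OAI-PMH requests for this record in the OLAC and simple Dublin Core formats. A minimal sketch of how such a request URL could be assembled is shown here. The base URL is a placeholder, since this record does not state the repository endpoint; `olac` and `oai_dc` are assumed to be the metadata prefixes for the two formats, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the record does not give the
# repository's OAI-PMH base URL, so substitute the real one before use.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2025T03"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for OLAC format, "oai_dc" for simple DC.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by `urlencode`, which keeps the identifier intact inside the query string.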

Search Info

Citation: Zhang, Xiao; Zhang, Ling; Dang, Tian; Feng, Yuanzhao; Ji, Yujing; Jiang, Xiaohui; Kang, Zhewen; Lu, Yan; Nie, Wen; Ren, Hanyu; Wang, Canjun; Wang, Jiayi; Wang, Yu; Wu, Chen; Wu, Mei; Xu, Tingting; Yang, Ruhai; Zhao, Kai; Zhao, Ran; Zhou, Quanjie; Zhu, Lei. 2025. The Xi’an Multi-Language Learner Corpus. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_DE country_FR country_GB country_ID country_IN country_KR country_PH country_PK country_RU country_TH country_TR dcmi_Text iso639_ara iso639_deu iso639_eng iso639_fas iso639_fil iso639_fra iso639_hin iso639_ind iso639_kor iso639_msa iso639_rus iso639_swa iso639_tha iso639_tur iso639_urd


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025T03
Up-to-date as of: Tue Mar 18 1:01:12 EDT 2025