Jinbiao Yang

Publications

Displaying 1 - 4 of 4
  • Yang, J. (2024). Rethinking tokenization: Crafting better tokenizers for large language models. International Journal of Chinese Linguistics, 11(1), 94-109. doi:10.1075/ijchl.00023.yan.

    Abstract

    Tokenization significantly influences language models (LMs)’ performance. This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while controlling complexity. Despite subword tokenizers like Byte Pair Encoding (BPE) overcoming many word tokenizer limitations, they encounter difficulties in handling non-Latin languages and depend heavily on extensive training data and computational resources to grasp the nuances of multiword expressions (MWEs). This article argues that tokenizers, more than mere technical tools, should drawing inspiration from the cognitive science about human language processing. This study then introduces the “Principle of Least Effort” from cognitive science, that humans naturally seek to reduce cognitive effort, and discusses the benefits of this principle for tokenizer development. Based on this principle, the paper proposes that the Less-is-Better (LiB) model could be a new approach for LLM tokenizer. The LiB model can autonomously learn an integrated vocabulary consisting of subwords, words, and MWEs, which effectively reduces both the numbers of tokens and types. Comparative evaluations show that the LiB tokenizer outperforms existing word and BPE tokenizers, presenting an innovative method for tokenizer development, and hinting at the possibility of future cognitive science-based tokenizers being more efficient.
  • Jin, H., Wang, Q., Yang, Y.-F., Zhang, H., Gao, M. (., Jin, S., Chen, Y. (., Xu, T., Zheng, Y.-R., Chen, J., Xiao, Q., Yang, J., Wang, X., Geng, H., Ge, J., Wang, W.-W., Chen, X., Zhang, L., Zuo, X.-N., & Chuan-Peng, H. (2023). The Chinese Open Science Network (COSN): Building an open science community from scratch. Advances in Methods and Practices in Psychological Science, 6(1): 10.1177/25152459221144986. doi:10.1177/25152459221144986.

    Abstract

    Open Science is becoming a mainstream scientific ideology in psychology and related fields. However, researchers, especially early-career researchers (ECRs) in developing countries, are facing significant hurdles in engaging in Open Science and moving it forward. In China, various societal and cultural factors discourage ECRs from participating in Open Science, such as the lack of dedicated communication channels and the norm of modesty. To make the voice of Open Science heard by Chinese-speaking ECRs and scholars at large, the Chinese Open Science Network (COSN) was initiated in 2016. With its core values being grassroots-oriented, diversity, and inclusivity, COSN has grown from a small Open Science interest group to a recognized network both in the Chinese-speaking research community and the international Open Science community. So far, COSN has organized three in-person workshops, 12 tutorials, 48 talks, and 55 journal club sessions and translated 15 Open Science-related articles and blogs from English to Chinese. Currently, the main social media account of COSN (i.e., the WeChat Official Account) has more than 23,000 subscribers, and more than 1,000 researchers/students actively participate in the discussions on Open Science. In this article, we share our experience in building such a network to encourage ECRs in developing countries to start their own Open Science initiatives and engage in the global Open Science movement. We foresee great collaborative efforts of COSN together with all other local and international networks to further accelerate the Open Science movement.
  • Frank, S. L., & Yang, J. (2018). Lexical representation explains cortical entrainment during speech comprehension. PLoS One, 13(5): e0197304. doi:10.1371/journal.pone.0197304.

    Abstract

    Results from a recent neuroimaging study on spoken sentence comprehension have been interpreted as evidence for cortical entrainment to hierarchical syntactic structure. We present a simple computational model that predicts the power spectra from this study, even
    though the model's linguistic knowledge is restricted to the lexical level, and word-level representations are not combined into higher-level units (phrases or sentences). Hence, the
    cortical entrainment results can also be explained from the lexical properties of the stimuli, without recourse to hierarchical syntax.
  • Yang, J., Zhu, H., & Tian, X. (2018). Group-level multivariate analysis in EasyEEG toolbox: Examining the temporal dynamics using topographic responses. Frontiers in Neuroscience, 12: 468. doi:10.3389/fnins.2018.00468.

    Abstract

    Electroencephalography (EEG) provides high temporal resolution cognitive information from non-invasive recordings. However, one of the common practices-using a subset of sensors in ERP analysis is hard to provide a holistic and precise dynamic results. Selecting or grouping subsets of sensors may also be subject to selection bias, multiple comparison, and further complicated by individual differences in the group-level analysis. More importantly, changes in neural generators and variations in response magnitude from the same neural sources are difficult to separate, which limit the capacity of testing different aspects of cognitive hypotheses. We introduce EasyEEG, a toolbox that includes several multivariate analysis methods to directly test cognitive hypotheses based on topographic responses that include data from all sensors. These multivariate methods can investigate effects in the dimensions of response magnitude and topographic patterns separately using data in the sensor space, therefore enable assessing neural response dynamics. The concise workflow and the modular design provide user-friendly and programmer-friendly features. Users of all levels can benefit from the open-sourced, free EasyEEG to obtain a straightforward solution for efficient processing of EEG data and a complete pipeline from raw data to final results for publication.

Share this page