I am a PhD student in Computational Linguistics at Stony Brook University (since Fall 2022) and a junior researcher at the Institute for Advanced Computational Science (IACS). I am lucky to be advised by Dr. Owen Rambow.
Over the years, I have worked on projects that study actual language use at scale, data augmentation, and text clustering, as well as how and how well neural networks learn and generalize. I am currently working on the evaluation of Large Language Models (LLMs) from a human-centric perspective. I consider myself a lifelong learner.
Bio
I was born and raised in Fuqing, a small town in southeastern China. Prior to coming to Stony Brook, I completed a bachelor's degree in Chinese Language and Literature at Hunan University and a master's degree in Applied Linguistics at the University of Saskatchewan.
I am a proud self-taught and self-motivated programmer. I started learning programming in 2020 and have since made it first relevant to, and then part of, my daily life. Looking back, I am glad to find that my experiences with NLP align well with the three major phases of the field: rule-based (symbolic) methods, statistical machine learning, and deep learning. I also feel fortunate to have witnessed the rapid development of the field over the past few years.
I am actively looking for collaborators and like-minded people. Feel free to reach out if you find my research interesting and relevant to yours and would like to brainstorm with me.
When I am not doing research, I enjoy reading and watching random stuff, playing sports, and exploring new things.
CV
Here is my Curriculum Vitae.
Research
For a full and up-to-date list of publications, please check my Google Scholar page.
Clustering Document Parts: Detecting and Characterizing Influence Campaigns from Documents
Zhengxiang Wang, Owen Rambow,
the 6th workshop on NLP+CSS at NAACL 2024
 
 
Learning Transductions and Alignments with RNN Seq2seq models
Zhengxiang Wang, ICGI 2023
 
 
Developing literature review writing skills through an online writing tutorial series: Corpus-based evidence
Zhi Li, Veronika Makarova, Zhengxiang Wang, Frontiers in Communication, 2023
Random Text Perturbations Work, but not Always
Zhengxiang Wang, AACL-IJCNLP 2022 Workshop Eval4NLP
Thirty-Two Years of IEEE VIS: Authors, Fields of Study and Citations
Hongtao Hao, Yumian Cui, Zhengxiang Wang, Yea-Seul Kim,
IEEE Transactions on Visualization and Computer Graphics, 2022
 
Linguistic Knowledge in Data Augmentation for Natural Language Processing: An Example on Chinese Question Matching
Zhengxiang Wang, ICNLSP 2022
 
Resources
Deep Learning
- PyTorch Tutorial : a simple PyTorch tutorial from a guest lecture I gave, suitable for beginners
- RNN Seq2seq transduction : customized pipelines to model language transduction tasks using RNN seq2seq models
- RNN transduction : customized pipelines to model language transduction tasks using RNNs
- Text matching explained & Text classification explained : building and training deep learning models for text (matching) classification tasks from scratch using paddle, PyTorch, and TensorFlow.
- Notes for Stanford CS224N : Natural Language Processing with Deep Learning.
- Hands-on gradient derivation tutorials for common machine learning loss functions.
- Deep-learning-based Natural Language Processing using paddlenlp : covering a wide range of essential NLP tasks (both classification and non-classification), with industry-oriented and SOTA practices.
- Word embedding resources, applications, visualization, and training (word2vec in Python).
Text Processing
- Text augmentation techniques : from random text-editing perturbations and back translation to model-based transformations. Also see: data augmentation programs (plus an n-gram language model).
- Historical English Language Processing Toolkit : An efficient toolkit and a general framework for early modern & modern English Language Processing (multi-label annotation) in XML.
- Linguistic Feature Extractor : A corpus-linguistic tool to extract and search for linguistic features (with 95 built-in features), which generates both feature statistics and the extracted instances.
- Unfilled Pause Classifier : a rule-based syntactic parser classifying unfilled pauses in the British Academic Spoken English corpus.
Miscellaneous
- Lstar Python : Python Implementation of the Lstar Algorithm by Angluin (1987).
- Google Scholar Analyzer : Auto-aggregating academic profiles of researchers on Google Scholar.
- YouTube Info Collector : An interface to scrape information (video titles, post dates, view counts, like counts, comments, etc.) from YouTube videos based on queries, video links, or channel links.
Chinese-related
- Gender predictor : Predicting the gender of given Chinese names with over 93% (up to 99%) test set accuracy using Naive Bayes, multi-class logistic regression, and neural network models.
- CCNC : A Comprehensive Chinese Name Corpus (3.65M unique name samples).
- Chinese Ngrams Counts : character-based and word-based n-gram counts from large-scale corpora.
- Corpus of Chinese synonyms : from multiple reputable sources with over 70k base examples.
- Corpus of Chinese fixed phrases and idioms : rich dictionary-like entries for 30,310 instances.