Charin Polpanumas

Tokyo, Japan
1,556 followers · 500+ connections


About

Data scientist with a track record in retail and healthcare. Delivered data products that…

Activity


Experience & Education

  • Amazon


Licenses & Certifications

Volunteer Experience

  • Data Science BKK

    Organizer

    Data Science BKK

    – Present · 7 yrs

    Education

    Data Science BKK is a no-nonsense, no-agenda meetup for data science and data engineering practitioners in Thailand. We welcome speakers and participants from all companies and industries. We believe in a place where we can truly share ideas and best practices without commercial interests--data people to data people.

  • pyThaiNLP

    Main Developer

    pyThaiNLP

    6 yrs 1 mo

    Science and Technology

    * Finetuned a Thai ASR model (XLSR-Wav2Vec2-large-53-th) that achieves WER (0.136) and CER (0.028) comparable to Thai speech-to-text services from Microsoft (WER 0.126; CER 0.050), Google (WER 0.137; CER 0.074) and Amazon (WER 0.219; CER 0.071).
    * Pretrained WangchanBERTa, a large, monolingual Thai language model with state-of-the-art downstream performance in sequence and token classification.
    * Built a state-of-the-art English-Thai machine translation model using transformers, with BLEU scores of 29.0 (th->en; vs 17.93 by Google Translate) and 17.77 (en->th; vs 15.36 by Google Translate).
    * Implemented the first NLP transfer learning for Thai text classification. The ULMFit model outperformed Google's BERT on the 5-class classification dataset wongnai-corpus (F1-score of 0.60925 vs BERT's 0.57057). The encoder was trained on a Thai Wikipedia dump using pyThaiNLP's newmm tokenizer and transferred to other domains.
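    For context on the WER and CER figures above, a minimal sketch of how these metrics are typically computed as normalized edit distance over a reference/hypothesis pair (the example strings are placeholders, not real ASR outputs; for Thai, texts would first be word-segmented, e.g. with pyThaiNLP's newmm tokenizer, since Thai has no spaces):

    ```python
    # Minimal sketch: WER and CER as normalized Levenshtein (edit) distance.
    # The reference/hypothesis strings below are placeholders for illustration only.
    def edit_distance(ref, hyp):
        # Classic dynamic-programming Levenshtein distance over token sequences.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)]

    def wer(reference, hypothesis):
        # Word error rate: edit distance over word tokens, normalized by reference length.
        ref_words, hyp_words = reference.split(), hypothesis.split()
        return edit_distance(ref_words, hyp_words) / len(ref_words)

    def cer(reference, hypothesis):
        # Character error rate: edit distance over characters, normalized by reference length.
        return edit_distance(list(reference), list(hypothesis)) / len(reference)

    print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
    print(cer("the cat sat on the mat", "the cat sit on mat"))
    ```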

Publications

  • WangChanGLM🐘 — The Multilingual Instruction-Following Model

    WangChanGLM is a multilingual, instruction-finetuned Facebook XGLM-7.5B using open-source, commercially permissible datasets (LAION OIG chip2 and infill_dbpedia, DataBricks Dolly v2, OpenAI TL;DR, and Hello-SimpleAI HC3; about 400k examples), released under CC-BY-SA 4.0. The models are trained to perform a subset of instruction-following tasks we found most relevant, namely reading comprehension, brainstorming, and creative writing. We provide the weights for a model finetuned on an English-only dataset (wangchanglm-7.5B-sft-en) and another checkpoint further finetuned on a Google-Translated Thai dataset (wangchanglm-7.5B-sft-enth). We perform Vicuna-style evaluation using both humans and ChatGPT (in our case, gpt-3.5-turbo, since we are still on the waitlist for gpt-4) and observe some discrepancies between the two types of annotators. All training and evaluation code is shared under Apache 2.0 in our GitHub, as well as datasets and model weights on HuggingFace. In a similar manner to Dolly v2, we only use open-source, commercially permissive pretrained models and datasets, so our models are restricted neither by a non-commercial clause, like models that use LLaMA as a base, nor by a non-compete clause, like models that use self-instruct datasets from ChatGPT.



    From this set of experiments, we have learned that large language models pretrained with enough data contain the capability to become instruction followers. The granularity of subword tokens does not hinder language understanding, but may limit the model’s ability to generate texts for languages whose subwords are too granular. On the other hand, if the subword tokens are well-shaped, we might be able to observe cross-lingual knowledge transfer in a similar manner to other zero-shot tasks.
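    As a rough illustration of how such an instruction-finetuned checkpoint might be used for generation with the Hugging Face transformers library (a minimal sketch; the Hub ID and prompt wording below are assumptions for illustration, not taken from the text above):

    ```python
    # Minimal sketch: generating from an instruction-finetuned XGLM-based checkpoint.
    # The Hub ID and prompt are illustrative assumptions, not confirmed by the text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "pythainlp/wangchanglm-7.5B-sft-en"  # assumed Hub ID
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "Summarize the following passage in one sentence:\nLarge language models can follow instructions after finetuning."
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```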

  • AI Builders: Teaching Thai Students to Build End-to-End Machine Learning Projects Online

    IEEE TALE 2021

    We organized a nine-week online summer school, called AI Builders, aiming to teach ML and AI to Thai students. We combined existing ML and AI curricula with end-to-end ML projects. We provided recap classes in the students' native language, locally collected datasets, project mentorship, guest lectures, and career discussions in our program. Students were able to satisfactorily understand ML concepts and produced meaningful projects, even though some might not have fully had the necessary background at the start of the program. We discussed possible improvements for future iterations.

  • “Worse Than What I Read?” The External Effect of Review Ratings on the Online Review Generation Process: An Empirical Analysis of Multiple Product Categories Using Amazon.com Review Data

    Sustainability

    In this paper, we study the online consumer review generation process by analyzing 37.12 million online reviews across nineteen product categories obtained from Amazon.com. This study revealed that the discrepancy between ratings by others and consumers’ post-purchasing evaluations significantly influenced both the valence and quantity of the reviews that consumers generated. Specifically, a negative discrepancy (‘worse than what I read’) significantly accelerates consumers to write negative reviews (19/19 categories supported), while a positive discrepancy (‘better than what I read’) accelerates consumers to write positive reviews (16/19 categories supported). This implies that others’ ratings play an important role in influencing the review generation process by consumers. More interestingly, we found that this discrepancy significantly influences consumers’ neutral review generation, which is known to amplify the effect of positive or negative reviews by affecting consumers’ search behavior or the credibility of the information. However, this effect is asymmetric. While negative discrepancies lead consumers to write more neutral reviews, positive discrepancies help reduce neutral review generation. Furthermore, our findings provide important implications for marketers who tend to generate fake reviews or selectively generate reviews favorable to their products to increase sales. Doing so may backfire on firms because negative discrepancies can accelerate the generation of objective or negative reviews.

  • WangchanBERTa: Pretraining transformer-based Thai Language Models

    arXiv

    Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices of models are limited to training a BERT-based model on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features for Thai. To overcome these limitations, we pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which are important chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and SentencePiece tokenization with a smaller dataset to explore the effects of tokenization on downstream performance. Our model wangchanberta-base-att-spm-uncased trained on the 78.5GB dataset outperforms strong baselines (NBSVM, CRF and ULMFit) and multi-lingual models (XLMR and mBERT) on both sequence classification and token classification tasks in human-annotated, mono-lingual contexts.
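    As a rough illustration of how such a pretrained checkpoint might be loaded for masked-token prediction with the Hugging Face transformers library (a minimal sketch; the Hub ID and the example Thai sentence are assumptions, not taken from the text above):

    ```python
    # Minimal sketch: loading a WangchanBERTa-style checkpoint for fill-mask prediction.
    # The Hub ID below is an assumption for illustration only.
    from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

    model_name = "airesearch/wangchanberta-base-att-spm-uncased"  # assumed Hub ID
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    # Spaces in the input are preserved before subword tokenization, as the abstract describes.
    print(fill_mask(f"อากาศวันนี้{tokenizer.mask_token}มาก"))
    ```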

  • scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

    arXiv

    The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. The methodology for gathering data, building parallel texts and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance is comparable to that of the Google Translation API (as of May 2020) for Thai-English, and our models outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
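    The comparison against the Google Translation API implies corpus-level BLEU scoring of each system's outputs against shared references. A minimal sketch of how such a comparison might be computed with sacrebleu (the sentences below are placeholders, not drawn from the corpus):

    ```python
    # Minimal sketch: corpus-level BLEU for comparing two MT systems on the same references.
    # The sentences are placeholders; a real evaluation would use held-out test data.
    import sacrebleu

    references = [["The cat sits on the mat.", "It is raining in Bangkok today."]]
    system_a = ["The cat sits on the mat.", "It rains in Bangkok today."]
    system_b = ["A cat is on a mat.", "Today Bangkok has rain."]

    for name, hyps in [("system_a", system_a), ("system_b", system_b)]:
        bleu = sacrebleu.corpus_bleu(hyps, references)
        print(name, round(bleu.score, 2))
    ```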

  • Performance of cytokine models in predicting SLE activity

    Arthritis Research & Therapy

    The heterogeneity of SLE pathogenesis leads to different signaling mechanisms mediated through several cytokines. The monitoring of cytokines increases the sensitivity and specificity to determine SLE disease activity. IL-18 predicts the risk of active renal SLE, while IL-6 and IL-8 predict the risk of active non-renal SLE. The sensitivity and specificity of these cytokines are higher than those of anti-dsDNA or C3. We propose to use the serum levels of IL-18, IL-6, and IL-8 to monitor SLE disease activity in clinical practice.

  • The Impact of Word of Mouth via Twitter On Moviegoers' Decisions and Film Revenues

    Journal of Advertising Research

    This study drew on the existing decision process theory to empirically examine the effect of word of mouth (WOM) generated by social media. Particularly, the prospect theory was adopted to illustrate how both the volume and valence of WOM influence a person’s decision to watch a movie through the movie quality evaluation stage. Hypotheses were formulated according to the prospect theory’s core assumptions, and these hypotheses were tested using U.S. movie industry data and online post data from the microblog service Twitter, commonly known as “tweets.” The findings strongly support the hypothesis that the effect of online WOM is well explained by the prospect theory. The findings imply the importance of managing social media at the initial stage for marketing managers. They also suggest that intensively advertising a movie before its release to attract moviegoers could backfire by raising the moviegoers’ expectations.

    Other authors
    • Yeujun Yoon
    • Yong Joon Park
  • The Data-Driven Guide to Bangkok Prostitutes

    Several Thai National Channels

    I got on national television thanks to a fun pet project on Bangkok prostitutes (no field research conducted).


Languages

  • English

    Native or bilingual proficiency

  • Japanese

    Native or bilingual proficiency

  • Chinese

    Limited working proficiency

  • Thai

    Native or bilingual proficiency
