11institutetext: HI3 Tech Lab, University of South Carolina, Columbia SC, USA 11email: dezhiwu@cec.sc.edu
22institutetext: School of Medicine, Washington University, St. Louis MO, USA 33institutetext: Center of Translational AI Excellence and Applications in Medicine, University of Texas Health Science Center, Houston TX, USA
33email: ming.huang@uth.tmc.edu

Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach

Sai Krishna Revanth Vuruma 11 0009-0009-3741-9343    Dezhi Wu 11 0000-0002-3554-1136    Saborny Sen Gupta 11 0009-0008-0937-5366    Lucas Aust 11    Valerie Lookingbill 11 0000-0003-1453-2633    Wyatt Bellamy 11 0009-0000-4616-9715    Yang Ren 11 0000-0002-6128-5826    Erin Kasson 22    Li-Shiun Chen 22 0000-0001-6762-5054    Patricia Cavazos-Rehg 22 0000-0003-3352-1198    Dian Hu 33 0000-0003-1277-142X    Ming Huang 33 0000-0001-7367-3626
Abstract

In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users’ quit-vaping intentions. Leveraging OpenAI’s latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users’ subtle intentions that may elude human detection.

Keywords:
Vaping Cessation Large Language Models GPT-4 Annotation Social Media Analytics Natural Language Processing Reddit Data

1 Introduction

Studies indicate that epidemic levels of consumption was observed among adolescents and young adults during the last decade, with a massive increase in sale and usage of e-cigarettes and other disposable vaping products [5, 24] leading to the EVALI outbreak in 2019. With the nation’s youth emerging as the high risk population, research suggests that many e-cigarette users are unaware of the potential dangers of vaping such as Vape Dependence[23] and Stealth-vaping[30], with vape frequency directly associated with perceived satisfaction while being indirectly associated with perceived danger[13]. Vaping products contain cancer-causing agents, toxins, heavy metals, and other harmful particles that are substantially linked to lung, heart, and brain damage [18]. Recent efforts towards educating the young populace about the negative impacts of vaping have resulted in a large number of e-cigarette users intending to quit vaping [25], with about 45% of young vapers reporting interest in quitting, while 25% attempted to quit in 2020-2021 [22]. The goal is to now identify these users and help them quit vaping by proving the necessary resources to support them along the way.

Contemporary research studies have leveraged popular social media platforms such as Twitter and Reddit for public surveillance of health topics. Approximately, more than 70% of people use at least one social media platform and the number of new users in any of these popular platforms is increasing everyday especially among users aged 18-29[14]. The utilization of social media data emerges as a nascent source of public health information, offering novel insights into public health trends and enhancing the capabilities of public health surveillance.

Previous vaping studies such as [29, 10] used topic modeling and sentiment analysis along with clinical insights on social media data to show users on these platforms might benefit from digital intervention programs for vaping cessation. For clinicians to potentially employ proactive outreach strategies to engage vaping patients for education and treatment on social media platforms, it is imperative to conduct further research into the analysis of vaping discourse on these platforms [12], aiming to develop Artificial Intelligence (AI)-based approaches to more efficiently and accurately identify these users’ vaping behaviors and develop targeted vaping prevention and intervention programs for the youth population.

In this preliminary study, we aim to employ and evaluate OpenAI’s GPT-4 model against layman and clinical expert annotators on a sentence-level annotation task to identify vaping cessation interests among Reddit users. Our preliminary findings indicate that the GPT-4 model performs impressively, but it still has a ways to go before replacing human annotators.

2 Literature Review

Interpretation of natural language data extracted from social media platforms requires deep contextual knowledge and understanding, lack of which can lead to incorrect labeling and annotations [16]. Manual annotations of these texts can be challenging for humans as they are often short, informal and contain different socio-cultural opinions and perceptions [17]. Care must be taken while using state-of-the-art Machine Learning (ML) algorithms and Natural Language Processing techniques in tasks requiring complex inferences as shown in [27, 4]. Traditional ML and Deep Learning models like CNNs, RNNs and pretrained Language Models like BERT require a high-quality annotated corpus to develop an effective model for sentiment analysis. On the other hand, advanced and intuitive Large Language Models (LLMs) such as OpenAI’s Generative Pre-Trained Transformer models GPT-3 [3] and GPT-4 [1] among others, allow zero-shot learning, one-shot learning, and few-shot learning which could be used for detecting quit vaping intention without intensive training. These LLMs have shown proficiency in in-context learning where they outperformed traditional methods [2, 6] and can generate quick results while not being susceptible to some of the limitations observed in human annotation [20].

On Data Annotation tasks, studies have shown that ChatGPT’s performance is promising in classifying and generating explanations for implicit sentiment analysis such as hate speech detection [9], zero-shot sentence-level annotation of legal documents [21], political tweet labeling [27] and identifying adverse events about a cannabis-derived product [15]. Although works such as [7] reiterates ChatGPT is emerging as a potential alternative to human annotation as it is faster and cheaper, some researchers advise caution and argue that human-in-the-loop validation must be maintained to guarantee the reliability of its results [26]. In contrast, ChatGPT-generated Natural Language Explanations (NLEs) can influence human perceptions and can result in a risk of misleading common people in case of incorrect predictions [9].

OpenAI’s GPT-4 model has shown remarkable capabilities in a multitude of domains, even clearing the bar exam according to a recent study[11]. Research indicates that GPT-4 can act as an alternative to layman annotation in many diverse areas[27, 4].

3 Methodology

The workflow adopted for this study is illustrated in Figure 5. Each stage of the pipeline will be discussed in detail in the respective subsections. First the data is extracted from Reddit and cleaned, then it is sent to the annotators: layman, expert and the GPT-4 model for annotation. The performance of all three annotators is compared at the end to draw conclusions. With the expert annotated dataset as the ground truth, we will use two types of metrics: qualitative and quantitative for formulating the results.

3.1 Data Collection & Preparation

In the popular social media platform Reddit, r/QuitVaping is the largest subreddit dedicated to help users quit vaping and other tobacco products with around 40,000 subscribers. Using Reddit’s Async PRAW API, we extracted a total of 1000 posts from the aforementioned r/QuitVaping subreddit. These posts ranged from users talking about their progress towards quitting vaping to users looking for help or motivation to quit or reduce vape use. Out of these 1000 posts, approximately 120 were randomly selected to form a sample dataset. From each post in the sample dataset, two columns, namely title and body were extracted and broken down into sentences using the Sentence Tokenizer from the NLTK library [19]. Any sentence that had less than 3 tokens was dropped and so were duplicates. A total of 1059 sentences were available for annotation.

3.2 Human Annotation

3.2.1 Layman Annotation

Two layman annotators were tasked with labeling the cleaned sentences as ’YES’ if the speaker explicitly mentions their idea, desire, decision, plan, or action to quit vaping. And to label them as ’NO’ otherwise. For a sentence to be labeled as ’YES’, there must be a clear indication that the speaker intends to quit vaping. Discrepancies (n=28) were resolved internally with an Inter-coder Reliability score (ICR) of 0.78.

3.2.2 Expert Annotation

Two clinical experts from the School of Medicine, Washington University were asked to perform the expert annotation by following the same guidelines mentioned above. The coders independently reviewed the dataset and coded all the sentences. The second coder resolved discrepancies (n=22).

3.3 GPT-4 Annotation

Interaction with the GPT-4 model can be done via prompts that must be carefully constructed to get the best performance out of the model. Each prompt let’s you assign a ’role’ which indicates who the sender of that message (prompt) is. Taking inspiration from the prompt templates used in [8, 31] we devised the prompts for our study using approaches like zero-shot, one-shot, few-shot and chain-of-thought prompting.

Given a sentence, the model was tasked to return a label, a numerical confidence score and its reasoning for choosing that label for that sentence. Figure 6 shows the system prompt that we used to introduce the context of the task to the GPT-4 model, while Figure 1 contains a sample user prompt that passes the input data along with instructions on how the model should respond.

As shown in Table 1, we developed 8 prompts using different prompting strategies plus another variable called ’detail’. The low detail prompts (P1-P4) have the structure shown in Figure 1(a) with the question phrased using simpler language, i.e., "Does the speaker intend to quit vaping?", while the high detail prompts (P5-P8) use a more directed question as shown in Figure 1(b). The one-shot and few-shot variants include examples in the user prompt, while the chain-of-thought prompts include the phrase "think step-by-step" in the question.

Table 1: Prompts Used. Here, detail column determines how vague (Figure 1(a)) or specific (Figure 1(b)) the question is phrased in the user prompt.
Prompt ID Strategy Detail
P1 zero-shot low
P2 zero-shot, chain-of-thought low
P3 one-shot low
P4 few-shot low
P5 zero-shot high
P6 zero-shot, chain-of-thought high
P7 one-shot high
P8 few-shot high
Refer to caption
(a) Low Detail
Refer to caption
(b) High Detail
Figure 1: Sample User Prompts

4 Results

Figure 2 shows the class distribution after annotation by all three annotators: layman, expert and GPT-4. Here, P1-P8 denote which prompt was sent to the GPT-4 model, while Layman and Expert denote which human annotator annotated the records. From the figure, we can infer that while both the human annotators were more conservative in assigning the YES label to a sentence, GPT-4 was more sensitive across all 8 prompts. Another key observation is that the model sensitivity goes down with increased detail in the prompt, while the number of examples provided did not have a significant impact.

Considering the clinical expert annotated dataset as the ground truth or baseline, we perform two types of evaluation to compare the performance of GPT-4 against layman annotators using qualitative and quantitative metrics. In addition, we also make comparisons between the 8 prompts that were used.

Refer to caption
Figure 2: Class Distribution afer Annotation

4.1 Qualitative Evaluation

We calculated the Cohen’s Kappa and Jaccard’s similarity scores for the layman and GPT-4 annotated datasets for each label individually. As shown in Figure 3, the layman annotators’ labels were much closer to those of the experts with a Jaccard Similarity score of 0.95 and Cohen’s Kappa of 0.8. In contrast, GPT-4 had weak agreement with the expert annotators with the best performing prompt getting scores of 0.71 and 0.22 respectively.

Comparing the individual prompts, all four high detail prompts (P5-P8) scored higher on both similarity metrics than their low detail counterparts (P1-P4).

Refer to caption
(a) For label ’YES’
Refer to caption
(b) For label ’NO’
Figure 3: Qualitative Results

4.2 Quantitative Evaluation

For quantitative evaluation, we used standard classification metrics namely accuracy, precision, recall and f1 score to compare the performance of each annotator. From Figure 4, we can infer that the layman annotators’ annotations were closest to the ground truth with an overall F1 score of 0.97, while the best performing prompt for GPT-4 had an overall F1 score of 0.84. Breaking down the classification label-wise, although both annotators made false annotations (predictions), GPT-4 predicted more False Positives (FPs) than the layman annotator resulting in the low precision scores seen in Figure 4(a).

Looking at the prompts, although the high detail prompts (P5-P8) perform better in terms of accuracy and f1 score across both the labels, their recall values are lower than that of the low detail prompts (P1-P4) for the positive (YES) case as seen in Figure 4(a). This is in contrast to the negative (NO) cases (Figure 4(b)) where the high detail prompts outperform the low detail ones on accuracy, recall and f1 score. This indicates that the high detail prompts are predicting more FPs than the low detail prompts.

Refer to caption
(a) For label ’YES’
Refer to caption
(b) For label ’NO’
Figure 4: Quantitative Results

4.3 Discussion

Although the results from GPT-4 aren’t up to the mark of the layman or expert annotators, there are positives that we can build on. As shown in Figure 1, along with its prediction for each sentence, GPT-4 is tasked to return a numerical confidence score and its reasoning for that prediction. Observing the annotated dataset in the context of these two columns provides some valuable insights. In addition, the prompting strategy employed has also affected model performance as discussed in the previous sections.

4.3.1 Prompting Strategy

From our earlier experiments, we noticed that GPT-4 performs best when it has more data to work with and this is supported in the fact that all of the high detail prompts (P5-P8) that we employed did better than the low detail variants (P1-P4). However, too much data can also hurt the model as seen in Figure 4(a) where the one-shot (P7) and few-shot (P8) prompts have a better recall but a similar F1 score to the zero-shot prompts (P5, P6). Chain-of-Thought prompting is known to improve LLM performance on analytical tasks by asking the model to think step-by-step [28]. Given that our task was sentence-level, the model didn’t have enough context to fully exploit the benefits of this prompting strategy.

4.3.2 Model Confidence & Reasoning

Whenever the model is not confident about the context of the sentence, it makes certain assumptions to arrive at a conclusion (or annotation in this case). And this is reflected in the confidence score attached with that annotation plus the reasoning the model provides. For example, in Table 2 we can see how the model assigns a low confidence score while making assumptions about the context of a sentence and also mentioning the same in the reasoning. The rich data from these two columns can be used to optimize the performance further.

4.3.3 Error Analysis

Evaluating the performance of GPT-4 on the best performing prompt so far, i.e., P5 (high detail, zero-shot), the model predicted 149 false positives which greatly decreased its precision and f1 score. Upon careful observation, of the sentences that GPT-4 falsely predicted as YES instances, the speaker:

  • Has Already Quit Vaping or

  • Is talking about Negative Health Outcomes, side effects of vaping or

  • Is planning on Reducing Vaping

These sub-categories don’t fit into the hypothesis of this study for identifying users that are actively trying to quit vaping. However, this presents an interesting dynamic of the discourse on vaping and quitting in general. Users have different quitting behaviors - some choose to quit outright while others may prefer a more gradual approach.

5 Conclusion

Through this preliminary study, we compared the performance of OpenAI’s GPT-4 model against layman and clinical experts on a sentence-level annotation task to identify users that are trying to quit vaping on Reddit. We found that although GPT-4’s performance doesn’t match that of either human annotator, the results are promising.

In the future, we plan to expand this study by building a larger and more diverse dataset with posts and comments from popular vaping subreddits and randmoized data from unrelated subreddits to make the model more robust. As mentioned in the Discussion section, different users have different quitting behaviors. Expanding the task to a multi-label or multi-layer classification will provide more granular insights and help identify users that are at different stages of their quitting journey. In addition, to address hallucinations by the GPT-4 model, post-level annotation can be used to give more context to the model through in-context learning and thus improve its performance.

{credits}

5.0.1 Acknowledgements

This research was generously funded by a NIH R34 grant (Grant No: CL040 155600 F1000 202 USCSP 10012339 1) and another research grant by the University of South Carolina (USC) (PI: Dr. Dezhi Wu, Grant No: 80002838).

5.0.2 \discintname

The authors have no conflict of interests to declare that are relevant to the content of this article.

References

  • [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  • [2] Alhamed, F., Ive, J., Specia, L.: Using large language models (llms) to extract evidence from pre-annotated social media data. In: Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024). pp. 232–237 (2024)
  • [3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [4] Cheng, L., Li, X., Bing, L.: Is gpt-4 a good data analyst? arXiv preprint arXiv:2305.15038 (2023)
  • [5] Dai, H.: Prevalence and Factors Associated With Youth Vaping Cessation Intention and Quit Attempts. Pediatrics 148(3), e2021050164 (09 2021). https://doi.org/10.1542/peds.2021-050164, https://doi.org/10.1542/peds.2021-050164
  • [6] Deng, X., Bashlovkina, V., Han, F., Baumgartner, S., Bendersky, M.: Llms to the moon? reddit market sentiment analysis with large language models. In: Companion Proceedings of the ACM Web Conference 2023. pp. 1014–1019 (2023)
  • [7] Ding, B., Qin, C., Liu, L., Chia, Y.K., Joty, S., Li, B., Bing, L.: Is gpt-3 a good data annotator? arXiv preprint arXiv:2212.10450 (2022)
  • [8] Han, Z., Zhou, G., He, R., Wang, J., Xie, X., Wu, T., Yin, Y., Khan, S., Yao, L., Liu, T., et al.: How well does gpt-4v (ision) adapt to distribution shifts? a preliminary investigation. arXiv preprint arXiv:2312.07424 (2023)
  • [9] Huang, F., Kwak, H., An, J.: Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. In: Companion proceedings of the ACM web conference 2023. pp. 294–297 (2023)
  • [10] Kasson, E., Singh, A.K., Huang, M., Wu, D., Cavazos-Rehg, P.: Using a mixed methods approach to identify public perception of vaping risks and overall health outcomes on twitter during the 2019 evali outbreak. International Journal of Medical Informatics 155, 104574 (2021). https://doi.org/https://doi.org/10.1016/j.ijmedinf.2021.104574, $https://www.sciencedirect.com/science/article/pii/S1386505621002008$
  • [11] Katz, D.M., Bommarito, M.J., Gao, S., Arredondo, P.: Gpt-4 passes the bar exam. Philosophical Transactions of the Royal Society A 382(2270), 20230254 (2024)
  • [12] Ketonen, V., Malik, A.: Characterizing vaping posts on instagram by using unsupervised machine learning. International journal of medical informatics 141, 104223 (2020)
  • [13] Kozlowski, L.T., Homish, D.L., Homish, G.G.: Daily users compared to less frequent users find vape as or more satisfying and less dangerous than cigarettes, and are likelier to use non-cig-alike vaping products. Preventive medicine reports 6, 111–114 (2017)
  • [14] Kwon, M., Park, E., et al.: Perceptions and sentiments about electronic cigarettes on social media platforms: systematic review. JMIR public health and surveillance 6(1), e13673 (2020)
  • [15] Leas, E.C., Ayers, J.W., Desai, N., Dredze, M., Hogarth, M., Smith, D.M.: Using large language models to support content analysis: A case study of chatgpt for adverse event detection. Journal of Medical Internet Research 26, e52499 (2024)
  • [16] Liyanage, C., Gokani, R., Mago, V.: Gpt-4 as a twitter data annotator: Unraveling its performance on a stance classification task. Authorea Preprints (2023)
  • [17] Maceda, L.L., Llovido, J.L., Artiaga, M.B., Abisado, M.B.: Classifying sentiments on social media texts: A gpt-4 preliminary study. In: Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval. pp. 19–24 (2023)
  • [18] McKay, F., Chan, L., Cerio, R., Rickards, S., Hastings, P., Reakes, K., O’Brien, T., Dunn, M.: Assessing the quality and behavior change potential of vaping cessation apps: Systematic search and assessment. JMIR mHealth and uHealth 12, e55177 (2024)
  • [19] NLTK Contributors: NLTK sentence tokenizer, $https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html$, accessed: March 19, 2024
  • [20] Pangakis, N., Wolken, S., Fasching, N.: Automated annotation with generative ai requires validation. arXiv preprint arXiv:2306.00176 (2023)
  • [21] Savelka, J.: Unlocking practical applications in legal domain: Evaluation of gpt for zero-shot semantic annotation of legal texts. In: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law. pp. 447–451 (2023)
  • [22] Smith, T.T., Nahhas, G.J., Carpenter, M.J., Squeglia, L.M., Diaz, V.A., Leventhal, A.M., Dahne, J.: Intention to quit vaping among united states adolescents. JAMA pediatrics 175(1), 97–99 (2021)
  • [23] Soule, E.K., Lee, J.G., Egan, K.L., Bode, K.M., Desrosiers, A.C., Guy, M.C., Breland, A., Fagan, P.: “i cannot live without my vape”: Electronic cigarette user-identified indicators of vaping dependence. Drug and alcohol dependence 209, 107886 (2020)
  • [24] Stalgaitis, C.A., Djakaria, M., Jordan, J.W.: The vaping teenager: understanding the psychographics and interests of adolescent vape users to inform health communication campaigns. Tobacco Use Insights 13, 1179173X20945695 (2020)
  • [25] Struik, L., Yang, Y.: e-cigarette cessation: content analysis of a quit vaping community on reddit. Journal of Medical Internet Research 23(10), e28303 (2021)
  • [26] Thapa, S., Naseem, U., Nasim, M.: From humans to machines: can chatgpt-like llms effectively replace human annotators in nlp tasks. In: Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media (2023)
  • [27] Törnberg, P.: Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588 (2023)
  • [28] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models (2023)
  • [29] Wu, D., Kasson, E., Singh, A.K., Ren, Y., Kaiser, N., Huang, M., Cavazos-Rehg, P.A.: Topics and sentiment surrounding vaping on twitter and reddit during the 2019 e-cigarette and vaping use–associated lung injury outbreak: Comparative study. J Med Internet Res 24(12), e39460 (Dec 2022). https://doi.org/10.2196/39460, https://www.jmir.org/2022/12/e39460
  • [30] Yingst, J.M., Lester, C., Veldheer, S., Allen, S.I., Du, P., Foulds, J.: E-cigarette users commonly stealth vape in places where e-cigarette use is prohibited. Tobacco Control 28(5), 493–497 (2019)
  • [31] Zhang, C., Wang, S.: Good at captioning, bad at counting: Benchmarking gpt-4v on earth observation data. arXiv preprint arXiv:2401.17600 (2024)

Appendix

Refer to caption
Figure 5: Workflow
Refer to caption
Figure 6: System Prompt
Table 2: False Positive Examples
Sentence Annotation Confidence Reasoning
Today at 6 days clean, my resting heart rate is now in the low 50’s. YES 0.6 Though the speaker does not directly mention vaping, the statement ’6 days clean’ suggests they have quit some sort of substance. However, without more context, it can’t be said with certainty that the substance was vaping specifically.