Background: The use of artificial intelligence (AI) in peer review presents both opportunities
and challenges. In neurosurgery, where advancements are rapid, the traditional peer
review process can lead to delays in disseminating critical findings. This study evaluates
the effectiveness of AI in predicting the acceptance or rejection of neurosurgical
manuscripts, offering insights into its potential integration into the peer review
process.
Methods: Preprints.org and medRxiv.org were queried for skull base articles on the following
topics: adjuvant therapy, anterior skull base/orbit, basic science, cavernous sinus/middle
fossa, clivus/craniocervical junction, meningioma, paraganglioma, pediatrics, training
and education, value based care/quality of life, vestibular schwannoma, head and neck
tumors-nonsinonasal, malignancy, lateral skull base/CPA/jugular foramen, pituitary
adenoma, sinonasal malignancy, surgical approaches and technology, and vascular. Preprints
that were later published were compared with preprints that had remained unpublished on the
preprint servers for at least 12 months and were therefore presumed to have been rejected.
Each article was uploaded to ChatGPT 4o in an independent query with the following prompt:
“Based on the literature up to the date this article was posted, will it be accepted
or rejected for publication following peer review? Please provide a yes or no answer.”
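As a minimal sketch of how such a query might be issued programmatically, the snippet below uses the OpenAI Python SDK with the gpt-4o model; the study itself uploaded each preprint through the ChatGPT interface, so the helper name, message format, and passing of manuscript text directly in the prompt are illustrative assumptions, not the authors' procedure.

```python
# Illustrative sketch only: approximates the study's per-article query via the
# OpenAI Python SDK; function and variable names are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Based on the literature up to the date this article was posted, "
          "will it be accepted or rejected for publication following peer review? "
          "Please provide a yes or no answer.")

def predict_outcome(manuscript_text: str) -> str:
    """Return the model's raw yes/no verdict for one preprint (sketch)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{manuscript_text}"}],
    )
    return response.choices[0].message.content.strip()
```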
The impact factor and CiteScore of the publishing journals at the time of publication were
collected for the accepted articles. T-tests were used to compare these journal metrics between
accepted articles that ChatGPT predicted would be accepted and those it predicted would be
rejected. Chi-square analysis was used to compare ChatGPT's assessment of acceptance or
rejection with the actual outcome (published or presumed rejected).
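For illustration, a minimal sketch of these two comparisons is given below, assuming standard SciPy routines and using placeholder values rather than the study's actual data.

```python
# Minimal sketch of the reported comparisons; all numbers are placeholders.
from scipy.stats import ttest_ind, chi2_contingency

# Journal impact factors of published articles, split by ChatGPT's verdict (placeholders)
if_chatgpt_accepted = [5.1, 3.8, 4.6, 2.9]
if_chatgpt_rejected = [4.4, 3.2, 6.0]
t_stat, p_ttest = ttest_ind(if_chatgpt_accepted, if_chatgpt_rejected)

# 2x2 contingency table: rows = actual outcome (published, presumed rejected),
# columns = ChatGPT verdict (accept, reject); counts are placeholders
table = [[10, 5],
         [4, 7]]
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"t-test p = {p_ttest:.3f}, chi-square p = {p_chi2:.3f}")
```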
Results: A total of 31 preprints, 18 published and 13 presumed to be rejected, were included
in the analysis. The average impact factor and CiteScore of the journals publishing the
accepted articles were 4.36 ± 2.07 and 6.38 ± 3.67, respectively. The impact factor and
CiteScore of journals corresponding to accepted articles that ChatGPT also predicted would be
accepted were not significantly different from those of accepted articles that ChatGPT
predicted would be rejected (p = 0.932 and p = 0.490, respectively). ChatGPT showed
significantly limited performance, correctly predicting acceptance for 66.67% of published
articles and rejection for 61.54% of presumed-rejected articles (p < 0.001).
Discussion: This study’s findings indicate that while ChatGPT currently demonstrates only moderate
accuracy in predicting peer review outcomes, there is significant potential for improvement.
Current generative AI models are limited to publicly available data for natural language
processing and model training. This presents a notable challenge in appraising the
utility of AI in the peer review process as rejected manuscripts are not publicly
disclosed for confidentiality and privacy purposes. Future iterations of generative AI models,
developed in collaboration with journals and with the consent of submitting authors, may draw
on a more balanced pool of training data for models designed to facilitate a more efficient
peer review process.
Conclusion: ChatGPT shows moderate accuracy in predicting peer review outcomes, but with continued
refinement, AI has the potential to assist in streamlining the peer review process.