Efficacy of artificial intelligence-based skin analysis for calculating wrinkle improvement and skin firmness after simultaneous radiofrequency and high-intensity focused ultrasound therapy: a retrospective clinical study
Abstract
Background
Quantitative skin assessments have transitioned from subjective evaluations to objective approaches. However, clinical application has remained limited due to high costs and reliance on specialized equipment. High-intensity focused ultrasound and radiofrequency are the two most widely used noninvasive modalities for skin tightening and wrinkle improvement. This study investigated the efficacy of artificial intelligence (AI)-based skin analysis as a more accessible and cost-effective tool for assessing skin firmness and wrinkle improvement.
Methods
A retrospective analysis was conducted on 34 patients treated simultaneously with high-intensity focused ultrasound and bipolar radiofrequency between January and February 2025. AI-based skin assessments, evaluating firmness and wrinkle scores, were obtained pre-treatment, immediately post-treatment, and at a 2-month follow-up. Standardized clinical photographs were independently evaluated by two blinded human raters. Logistic regression and correlation analyses were conducted to determine alignment between AI and human evaluations.
Results
AI analysis showed significant improvements in both firmness and wrinkle scores immediately after treatment and at the 2-month follow-up (P<0.05). Human evaluations demonstrated high inter-rater agreement (Cohen’s κ=0.72–0.91). Logistic regression analyses indicated that changes in AI scores significantly predicted human-rated treatment effectiveness at both time points (area under the curve [AUC] for firmness=0.86; AUC for wrinkles=0.73–0.93). Spearman correlation coefficients and the Mann-Whitney U test further supported strong alignment between AI and human assessments.
Conclusions
This study validates the clinical utility of AI-based skin analysis as a reliable quantitative measure for evaluating wrinkle improvement and skin tightening following energy-based rejuvenation treatments. Its predictive validity aligns well with expert human judgment, particularly at delayed follow-up.
INTRODUCTION
Facial aging is a complex, multifactorial process characterized by progressive loss of skin elasticity, wrinkle formation, and tissue sagging. These changes result from intrinsic factors, such as chronological aging and genetic predisposition, as well as extrinsic factors, including ultraviolet radiation, pollution, and lifestyle habits [1,2]. The desire to restore a youthful appearance has driven continuous advancements in aesthetic treatments, with surgical interventions traditionally considered the gold standard for addressing significant age-related changes. However, despite their effectiveness, surgical procedures have considerable drawbacks, including high costs, potential complications, prolonged recovery times, and patient reluctance due to their invasive nature. Thus, there is a growing demand for noninvasive and minimally invasive alternatives that provide effective rejuvenation with minimal downtime [3].
Responding to this demand, various energy-based technologies have emerged, among which high-intensity focused ultrasound (HIFU) and radiofrequency (RF) are two of the most widely adopted modalities for nonsurgical skin tightening and collagen remodeling. HIFU uses focused ultrasound waves to penetrate deeply into the dermis and superficial musculoaponeurotic system, generating thermal coagulation points that stimulate neocollagenesis and tissue contraction. This targeted energy delivery allows significant lifting effects without damaging the epidermis [4-6]. RF, meanwhile, employs electromagnetic waves to produce controlled thermal energy within dermal and subdermal layers, inducing collagen denaturation and subsequent remodeling. This process improves skin elasticity and reduces the appearance of fine lines and wrinkles [7,8]. While each modality independently offers substantial skin-tightening benefits, recent advancements have explored their simultaneous application, hypothesizing superior and more comprehensive rejuvenation effects from their combined use [9,10].
Despite their increasing popularity and clinical use, treatment outcomes remain highly variable and subjective. Factors such as patient age, skin type, baseline collagen levels, and individual biological responses contribute to inconsistent results. Traditionally, treatment efficacy has been assessed through clinical observation, patient self-reporting, and photographic documentation. Recently, objective evaluations have been facilitated by advanced facial skin analysis systems, such as Mark-Vu (PSI Plus Corp.) or Morpheus 3D (Morpheus Co., Ltd.), which quantitatively measure skin texture, elasticity, and wrinkle depth [11]. However, these systems are costly, require specialized equipment, and are often restricted to high-end clinics and research facilities, limiting their widespread adoption.
With advancements in artificial intelligence (AI), there is increasing interest in using AI-based skin analysis as a cost-effective and accessible alternative for evaluating treatment outcomes. AI-powered analysis tools leverage deep learning algorithms and image processing techniques to assess skin parameters such as texture, tone, elasticity, and wrinkle severity [12]. These systems have the potential to standardize assessments, reduce inter-observer variability, and provide quantitative metrics comparable to those obtained from specialized imaging systems. However, the reliability and validity of AI-generated scores remain largely unexplored in the context of facial rejuvenation treatments [13].
This study investigated the feasibility of AI-based skin analysis as an objective assessment tool by evaluating its correlation with human evaluation. Specifically, we sought to determine whether AI-generated scores align with clinical assessments, establishing a reliable and scalable method for measuring treatment effectiveness. By validating AI-driven assessments, this research may contribute to broader adoption of AI technologies in aesthetic medicine, enhancing clinical decision-making and patient satisfaction, and increasing access to objective treatment evaluations across diverse practice settings.
METHODS
A retrospective review was conducted using electronic medical records and clinical photographs of patients who underwent simultaneous treatment with HIFU and bipolar RF (V-RO Advance, Hironic Co.) for facial skin sagging and laxity at the Department of Plastic and Reconstructive Surgery, Dongguk University Gyeongju Hospital, Republic of Korea, between January and February 2025. Patients with local skin diseases, connective tissue disorders, or those who had undergone other skin treatments, including energy-based devices, laser therapy, or botulinum toxin injections within 6 months prior to treatment, were excluded. Patients lost to follow-up were also excluded. Clinical photographs and AI skin analysis (Perfect Corp.) scores were obtained pre-treatment, immediately post-treatment, and at a 2-month follow-up. Demographic data (sex, age, and race), underlying diseases, previous energy-based device and cosmetic treatment history, and procedure details (HIFU and bipolar RF parameters, shots) were collected. Treatments followed the manufacturer’s recommended protocol.
Pre-treatment preparation
Topical anesthetic ointment (EMLA, Wells Pharmtech) was applied to the face for 30 minutes before treatment and then thoroughly removed with soap and water immediately prior to the procedure.
Treatment protocol
Ultrasound gel was evenly applied to the skin. The transducer was securely positioned against the targeted skin area and evenly pressed to ensure optimal contact. For the neck, chin, and cheeks, focused linear HIFU transducers of 3.0 mm, 7 MHz and 4.5 mm, 4 MHz were used. Simultaneous pen-type transducers combining HIFU (3.0 mm, 7 MHz and 4.5 mm, 4 MHz) with bipolar RF (2 MHz) were utilized for the neck, chin, cheeks, and mid-face areas. Additionally, a 1.5 mm, 7 MHz HIFU with a 2 MHz bipolar RF transducer was used for periorbital regions, temples, and forehead. Complete facial treatment required approximately 10 to 15 minutes.
Evaluation
The iOS application of the AI facial skin analysis system (Perfect Corporation) was installed on a mobile phone (iPhone 15, iOS 18.1). The phone was secured on a stand at face level, and patients were seated facing the device on a height-adjustable stool. Photographs were consistently taken from a distance of approximately 50 cm against a plain jade-colored background to minimize distractions. Standardized ambient lighting was maintained using ceiling-mounted LED lights at a color temperature of 5,500 K in a windowless room to eliminate variability from natural lighting. Prior to analysis, patients removed makeup, glasses, and face coverings, and wore headbands. With the patient's eyes open and expression neutral, the application automatically detected the face, confirmed adequate lighting, and captured an image using the rear camera without flash. Raw scores were recorded at baseline, immediately after treatment, and at a 2-month follow-up for total wrinkles, firmness, and six specific wrinkle subtypes: “crow’s feet,” “forehead,” “glabellar,” “marionette,” “nasolabial,” and “periocular” (Fig. 1).

Fig. 1. The artificial intelligence (AI) facial skin analysis application (Perfect Corporation) captures patient photographs and automatically calculates scores for various skin parameters, including firmness, wrinkles, eyebags, radiance, spots, texture, dark circles, droopy upper eyelids, pores, droopy lower eyelids, tear trough, acne, redness, moisture, and oiliness.
Independent human evaluators, who were not involved in the treatment or image acquisition, assessed the procedure’s effectiveness. Grayscale pre- and post-treatment photographs from AI analyses were used to minimize bias from post-treatment redness. Evaluators assessed treatment efficacy using the Global Aesthetic Improvement Scale, a validated 5-point clinical scale. Each patient’s pre- and post-treatment photographs (immediately post-treatment and at 2-month follow-up) were scored as –1 (worsened), 0 (no change), 1 (improved but additional correction needed), 2 (significant improvement but incomplete correction), or 3 (optimal cosmetic results). For analysis, scores ≥1 were classified as “effective,” while scores ≤0 were classified as “non-effective.” Evaluations were performed blinded and randomized, without access to patient identity or clinical data.
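As a minimal sketch of the dichotomization described above (not the authors' code), the following Python snippet maps Global Aesthetic Improvement Scale scores to the binary effective/non-effective label. The function name and the sample scores are illustrative assumptions, not study data.

```python
# Scale labels as described in the text; the dictionary name is hypothetical.
GAIS_LABELS = {
    -1: "worsened",
    0: "no change",
    1: "improved but additional correction needed",
    2: "significant improvement but incomplete correction",
    3: "optimal cosmetic results",
}


def is_effective(gais_score: int) -> bool:
    """Return True when a GAIS score counts as 'effective' (score >= 1)."""
    if gais_score not in GAIS_LABELS:
        raise ValueError(f"GAIS score must be one of {sorted(GAIS_LABELS)}")
    return gais_score >= 1


# Illustrative scores from one rater (made up for this sketch).
rater_scores = [-1, 0, 1, 2, 3]
binary = [is_effective(score) for score in rater_scores]
print(binary)
```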
Statistical analysis
All statistical analyses were performed using Python (v3.11). AI skin analysis scores were collected at baseline, immediately after treatment, and at 2-month follow-up. Differences in AI scores (post-treatment minus pre-treatment) for firmness and wrinkles were calculated. Binary classifications from two independent, blinded human evaluators served as the reference standards. Inter-rater agreement between human evaluators was assessed using Cohen’s kappa statistic.
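The inter-rater agreement statistic can be illustrated with a short sketch. The study does not state which Python libraries were used, so this example relies only on the standard library; the ratings are made-up values, not study data, and the function name is an assumption.

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters who labeled the same cases."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("Both raters must rate the same non-empty set of cases")
    n = len(rater1)
    # Observed agreement: share of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    categories = set(rater1) | set(rater2)
    expected = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return float("nan")  # kappa is undefined when only one label is used
    return (observed - expected) / (1.0 - expected)


# 1 = effective, 0 = non-effective (illustrative ratings).
rater1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
rater2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(rater1, rater2)
print(round(kappa, 3))
```

In this toy example the raters agree on 8 of 10 cases, but the chance-corrected agreement is noticeably lower than 0.8, which is why kappa rather than raw agreement is reported.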
To evaluate alignment between AI scores and human assessments, binary logistic regression analyses were performed, with AI score differences as independent variables and human evaluations (individually or combined using OR logic) as dependent variables. Receiver operating characteristic analysis and area under the curve (AUC) values were calculated to determine predictive performance.
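The modeling steps in this paragraph can be sketched as follows, under stated assumptions: a single-predictor logistic model fitted by Newton-Raphson, the OR-logic combination of two raters, and an AUC computed from pairwise comparisons. This is a standard-library illustration with made-up data, not the authors' implementation; all function and variable names are hypothetical.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # gradient with respect to the intercept
            g1 += (yi - p) * xi   # gradient with respect to the slope
            h00 += w              # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative AI score changes (post minus pre) and two raters' binary labels.
delta = [-1, 0, 0, 1, 1, 2, 2, 3, 4, 5]
rater1 = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
rater2 = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
# OR logic: effective if at least one rater judged the treatment effective.
combined = [int(a or b) for a, b in zip(rater1, rater2)]

b0, b1 = fit_logistic(delta, combined)
odds_ratio = math.exp(b1)  # odds multiplier per 1-point increase in score change
probs = [1.0 / (1.0 + math.exp(-(b0 + b1 * d))) for d in delta]
auc_value = auc(probs, combined)
print(round(odds_ratio, 2), round(auc_value, 3))
```

The toy data are deliberately not perfectly separable; with complete separation the maximum-likelihood estimate does not exist and the iteration above would not converge.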
Spearman rank correlation coefficients were computed to assess monotonic relationships between AI improvement scores and human evaluations. Mann-Whitney U tests compared AI score differences between groups categorized as effective versus non-effective by human evaluators. Statistical significance was set at a two-tailed P<0.05, and 95% confidence intervals were reported.
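A rough sketch of the two rank-based statistics follows, using only the standard library and made-up data. It computes the Spearman coefficient as the Pearson correlation of average ranks, and the Mann-Whitney U statistic as a count of pairwise wins for the effective group. P-values are omitted here and would normally come from a statistics package; the helper names are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    ordered = sorted(values)
    return [
        (ordered.index(v) + 1 + ordered.index(v) + ordered.count(v)) / 2
        for v in values
    ]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a > b, with ties counted as 0.5."""
    return sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)


# Illustrative AI score changes and combined binary human ratings.
delta = [-1, 0, 0, 1, 1, 2, 2, 3, 4, 5]
effective = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

rho = spearman_rho(delta, effective)
u_stat = mann_whitney_u(
    [d for d, e in zip(delta, effective) if e == 1],
    [d for d, e in zip(delta, effective) if e == 0],
)
print(round(rho, 3), u_stat)
```

Dividing U by the number of pairs (here 7 x 3 = 21) gives the same quantity as the AUC of the score change, which is one reason the two analyses tend to agree.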
RESULTS
A total of 40 patients were initially screened for the study. Three patients were excluded because they had undergone skin-related procedures, including energy-based device treatments, laser therapy, or botulinum toxin injections, within the preceding 6 months. Three additional patients were lost to follow-up, resulting in 34 participants included in the final analysis. Among these participants, five were male and 29 were female. Patient ages ranged from 29 to 68 years, distributed as follows: 20–29 (n=1), 30–39 (n=4), 40–49 (n=9), 50–59 (n=10), and 60 and older (n=10). Nineteen patients (56%) had no previous history of cosmetic procedures. Among the remaining 15 patients, prior treatments included HIFU (n=12), RF (n=4), botulinum toxin injections (n=10), dermal fillers (n=11), skin boosters (n=2), and thread lifting (n=1); some patients had undergone multiple treatments. All prior procedures were performed more than 6 months before study participation (Table 1).
Quantitative assessments of skin firmness and wrinkles were obtained using AI-based skin analysis at three time points: baseline (pre-treatment), immediately after treatment, and at 2-month follow-up. Higher scores indicated better skin status, representing fewer wrinkles and reduced facial sagging. The median AI firmness score improved from 83 (interquartile range [IQR], 76–84) at pre-treatment to 84 (IQR, 82–88) immediately post-treatment and 85 (IQR, 83–88) at 2 months. Median AI wrinkle scores similarly improved from 75 (IQR, 70–77) at pre-treatment to 77 (IQR, 73–78) immediately post-treatment and 78 (IQR, 75–80) at 2 months (Fig. 2). Wilcoxon signed-rank tests showed statistically significant increases in both firmness and wrinkle scores at immediate and 2-month follow-ups compared to baseline (P<0.05), indicating measurable improvements based solely on AI analysis (Table 2). Descriptive statistics for individual wrinkle subtypes across all time points are provided in Supplementary Table 1.
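For readers unfamiliar with the paired test used here, the sketch below computes the Wilcoxon signed-rank statistic on made-up pre/post scores. It assumes zero differences are dropped and uses a tie-corrected normal approximation without continuity correction for the two-sided P-value; the study does not specify these choices, and a statistics library may use an exact method for small samples.

```python
import math
from collections import Counter
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    ordered = sorted(values)
    return [
        (ordered.index(v) + 1 + ordered.index(v) + ordered.count(v)) / 2
        for v in values
    ]


def wilcoxon_signed_rank(pre, post):
    """Return (W+, W-, approximate two-sided P) for paired scores."""
    # Assumption: pairs with no change are discarded before ranking.
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation with a correction for tied absolute differences.
    mean = n * (n + 1) / 4
    tie_term = sum(t**3 - t for t in Counter(abs(d) for d in diffs).values())
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, w_minus, p_value


# Illustrative firmness scores for eight patients (not study data).
pre = [83, 76, 84, 80, 78, 82, 75, 81]
post = [84, 82, 88, 83, 77, 85, 80, 81]
w_plus, w_minus, p_value = wilcoxon_signed_rank(pre, post)
print(w_plus, w_minus, round(p_value, 3))
```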

Fig. 2. Artificial intelligence (AI)-derived scores for firmness and wrinkles across three evaluation points: pre-treatment, immediately post-treatment, and at the 2-month follow-up. Each boxplot illustrates AI-measured changes per time point, with group means indicated by black “X” markers inside each box. Both parameters demonstrated significant improvements sustained through the 2-month follow-up.
Two independent blinded human evaluators assessed treatment effectiveness using grayscale photographs. Inter-rater reliability, calculated via Cohen’s kappa (κ), demonstrated substantial to almost perfect agreement across all domains: firmness (κ=0.82 [immediate], κ=0.72 [2-month follow-up]) and wrinkles (κ=0.84 [immediate], κ=0.91 [2-month follow-up]). These results support the consistency and reliability of human assessments (Fig. 3).

Fig. 3. Confusion matrices illustrating inter-rater agreement between two independent human evaluators assessing treatment effectiveness for firmness and wrinkles at both immediate and 2-month follow-up evaluations. Each cell presents the absolute number of cases and corresponding percentages relative to total evaluations. X- and Y-axes represent binary classifications (effective or non-effective) made by Rater 1 and Rater 2, respectively. High concentration along the diagonal reflects substantial to almost perfect agreement across all domains, confirming the reliability and consistency of human evaluations used for comparison with artificial intelligence (AI)-derived outcomes.
The degree of alignment between AI assessments and human judgments was analyzed using binary logistic regression. The dependent variable was the binary classification of treatment effectiveness (effective vs. non-effective), and the independent variable was the change in AI score (post-treatment minus pre-treatment). Regression models were constructed for each evaluator individually as well as a combined human evaluation, using OR logic, classifying treatment as effective if at least one evaluator rated it as such.
Immediately post-treatment, improvements measured by AI in both firmness and wrinkle scores significantly predicted human-rated effectiveness. For firmness, the logistic regression yielded an odds ratio (OR) of 2.90 (95% confidence interval [CI], 1.26–6.65; P=0.012) and an AUC of 0.86, indicating strong predictive performance. For wrinkles, the OR was 1.79 (95% CI, 1.00–3.21; P=0.048) with an AUC of 0.73 (Table 3, Fig. 4A). These results suggest meaningful alignment between AI-derived scores and immediate human clinical perceptions.
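To make the odds ratio concrete, the short sketch below converts the reported firmness OR into the implied logistic coefficient and shows how the odds scale with larger score changes. It assumes the OR is expressed per 1-point increase in the AI score difference, which the text does not state explicitly; the function name is hypothetical.

```python
import math


def odds_multiplier(odds_ratio: float, score_change: float) -> float:
    """Multiplicative change in the odds of an 'effective' rating.

    Assumes the odds ratio applies per 1-point increase in the AI
    score difference (an assumption for this sketch).
    """
    return odds_ratio ** score_change


firmness_or = 2.90  # reported immediate post-treatment OR for firmness
beta = math.log(firmness_or)  # implied logistic regression coefficient
two_point = odds_multiplier(firmness_or, 2)
print(round(beta, 3), round(two_point, 2))
```

Under that assumption, a 2-point gain in the firmness score would multiply the odds of an effective rating by 2.90 squared, or about 8.4, rather than doubling the 1-point effect.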

Table 3. Logistic regression model results predicting treatment effectiveness using AI-measured skin improvement scores

Fig. 4. Logistic regression curves illustrating the relationship between artificial intelligence (AI)-derived score changes and the probability of treatment effectiveness based on combined human evaluations (OR logic). (A) Immediate post-treatment; (B) 2-month follow-up. Dots represent individual human-rated binary outcomes. Solid lines depict fitted logistic regression models. Wrinkle improvement at the 2-month follow-up exhibited the strongest predictive performance.
At the 2-month follow-up, the predictive performance of AI assessments was maintained for firmness and improved for wrinkles. AI firmness score changes strongly predicted human-rated effectiveness (OR, 2.28; 95% CI, 1.94–4.38; P=0.013) with an AUC of 0.86, the same AUC as at the immediate evaluation. For wrinkles, the logistic regression showed even stronger predictive power (OR, 5.34; 95% CI, 1.23–9.05; P=0.019) with an AUC of 0.93, up from 0.73 immediately post-treatment (Table 3, Fig. 4B). These findings suggest increased alignment between AI assessments and human judgment as clinical improvements become more pronounced over time. Collectively, the results support AI-based skin analysis as a valid and reliable surrogate for human clinical assessment, particularly at delayed follow-up intervals.
To supplement the logistic regression results, Spearman rank correlation analysis assessed the monotonic relationship between AI score changes and binary human evaluations. Statistically significant positive correlations were observed for both firmness and wrinkle assessments at immediate and 2-month follow-ups (correlation coefficients ranging from 0.36 to 0.62; P<0.05) (Fig. 5), reinforcing directional agreement between AI assessments and clinical impressions.

Fig. 5. Heatmap displaying Spearman correlation coefficients between artificial intelligence (AI)-derived score differences and human-rated treatment effectiveness at two evaluation points (immediate and 2-month follow-up), for both firmness and wrinkle assessments. Each cell indicates the correlation coefficient (r) and the associated P-value. Positive and statistically significant correlations were observed across all domains, with generally stronger correlations at the 2-month follow-up.
Additionally, Mann-Whitney U tests evaluated whether AI score changes differed significantly between groups classified as effective versus non-effective by human evaluators. AI-measured score changes for firmness were significantly higher in the “effective” group immediately post-treatment (U=233.5, P=0.001) and at the 2-month follow-up (U=179.0, P=0.002). Similarly, AI wrinkle score changes differed significantly between effective and non-effective groups at both immediate (U=165.0, P=0.039) and 2-month follow-up evaluations (U=155.5, P=0.001) (Fig. 6). These findings reinforce the logistic regression results, confirming that AI-measured treatment outcomes are both predictive and statistically distinguishable based on human clinical classifications.

Fig. 6. Violin plots demonstrating the distribution of artificial intelligence (AI)-derived score changes for firmness and wrinkles, stratified by evaluation time point (immediate and 2-month follow-up) and binary human-rated treatment effectiveness (OR logic). Each panel represents a specific condition, categorizing human evaluations as “non-effective” or “effective.” Violin shapes illustrate the density distribution of AI score changes. Individual patient data points are represented as jittered black “X” markers overlaying each violin. AI score improvements were notably higher in the effective group across all domains, aligning with Mann-Whitney U test outcomes.
DISCUSSION
The findings of this study demonstrate that AI-based skin analysis closely aligns with human evaluations. Our statistical analyses revealed a significant correlation between AI-measured improvements in wrinkle and firmness scores and human evaluations of treatment effectiveness. Additionally, logistic regression analyses indicated that AI-generated firmness scores were robust predictors of human evaluation outcomes. These results suggest that AI-driven assessments can serve as objective and scalable tools for evaluating the effectiveness of noninvasive facial rejuvenation treatments.
The ability of AI to provide quantitative, standardized measurements holds important implications for clinical practice, particularly in the growing field of precision and personalized medicine within aesthetics [14]. Given the inherent subjectivity in clinical evaluations and patient-reported outcomes, the objective and quantifiable metrics provided by AI represent a significant advancement [15].
Several sophisticated 3D imaging systems currently exist on the market, including Mark-Vu (PSI Plus Corp.), Morpheus 3D (Morpheus Co., Ltd.), VECTRA WB360 (Canfield Scientific), and the VISIA Skin Analysis System (Canfield Scientific). These systems offer high-precision evaluations and automated analyses of features such as texture, pigmentation, and vascularity [16]. Moreover, advanced systems combining optical coherence tomography, confocal microscopy, and AI can diagnose and monitor skin cancers such as basal cell carcinoma and melanoma [17-20].
However, the high cost and limited accessibility of these sophisticated devices hinder their widespread adoption. Conversely, AI-based analysis using standard mobile devices offers an affordable and convenient solution, expanding the availability of objective skin assessments beyond specialized clinics. Studies by Kontzias et al. [21] and Cook et al. [22] demonstrated that AI skin analysis (Perfect Corporation) yields comparable results to traditional facial skin analysis systems like VISIA Skin Analysis (Canfield Scientific). Integrating AI into clinical practice has the potential to standardize outcome measurements, facilitate treatment monitoring, and enhance patient communication through visual and numerical feedback on treatment progress [23].
Despite promising results, this study has several limitations. First, the AI system used in this study relies exclusively on 2D anterior-posterior image analysis, while high-end imaging systems like 3D facial scanners provide more detailed assessments of skin volume and topography. Future research should include comparative analyses between 2D AI-based systems and 3D imaging systems to evaluate their relative accuracy. Second, unlike conventional facial analysis systems that use standardized positioning guides and controlled environments, the AI-based system in this study permits greater flexibility in facial positioning. This flexibility introduces variability in image capture conditions, potentially affecting measurement consistency. Third, the AI analysis system (Perfect Corporation) is a proprietary commercial platform with undisclosed internal algorithms. Although the manufacturer was contacted for technical details, disclosure was restricted due to intellectual property considerations. Thus, specific computational principles, such as feature weighting, decision thresholds, or image preprocessing techniques, remain unknown. Furthermore, automatic software updates and algorithm changes may influence future outputs, limiting reproducibility and transparency in a scientific context. Future validation studies involving different AI-based skin analysis systems, multiple human raters, and larger sample sizes would help establish broader generalizability.
Despite these limitations, this study significantly advances AI-based facial skin evaluation. By demonstrating alignment between AI-generated assessments and human evaluations, this research highlights AI’s potential to reduce subjectivity in treatment outcome measurements. This could be particularly beneficial in clinical trials and routine practice, where standardization of outcome assessments is essential. Additionally, traditional high-end imaging devices require specialized equipment and trained personnel, limiting their broader adoption. AI-based skin analysis using commercially available mobile applications presents an accessible, cost-effective alternative beneficial for both clinicians and patients. As AI technology continues evolving, it may be integrated with predictive modeling tools that optimize treatment protocols based on individual patient characteristics. The ability to monitor treatment responses in real-time could further refine personalized approaches for noninvasive facial rejuvenation.
This study provides a foundational step toward validating AI-based skin analysis as a reliable assessment tool in aesthetic medicine. Future research involving larger datasets, different AI-based platforms, and comparative studies with multiple 3D imaging systems will be crucial for refining AI-driven evaluation methods. Although AI-based skin analysis is still in its early stages, its ability to provide standardized, quantitative, and objective assessments represents a promising advancement in aesthetic medicine. With further validation and technological improvements, AI-driven evaluation systems could play a pivotal role in improving the precision and accessibility of noninvasive facial rejuvenation treatments.
Notes
No potential conflict of interest relevant to this article was reported.
Ethical approval
The study was approved by the Institutional Review Board of Dongguk University Hospital (IRB approval No. 110757-202503-HR-01-02) and was performed in accordance with the principles of the Declaration of Helsinki. The requirement for written informed consent was waived by the IRB.
Patient consent
The patient provided written informed consent for the publication and use of her images.
Supplemental material
Supplementary materials can be found via https://doi.org/10.14730/aaps.2025.01340.
Supplementary Table 1. AI-measured wrinkle improvement by subtype at immediate and 2-month follow-up