Advancing User Profiling: A Comprehensive Analysis of 200k+ Physicians Using LDA Topic Extraction
Agliata, Antonio; Ciaramella, Angelo; Di Nardo, Emanuel
2026-01-01
Abstract
This research examines user profiling on a large dataset of 200,000 healthcare professionals, considered in both business and sociological settings. Our main objective is to construct detailed profiles from essential data points such as gender, age, place of residence, type of medical facility, and area of specialization. The approach is data-driven. To go beyond the basic correlations typically observed in user activity, we use Latent Dirichlet Allocation (LDA) to analyze textual data. This technique extracts significant topics that, once integrated with the medical registry, reveal specific interests and trends prevalent among physicians. We further employ neural network-based clustering methods to group these professionals into well-defined categories, which helps identify behavior patterns linked to demographic factors and reading preferences. Our dataset originates from a relational database and includes records of health-related articles accessed through a web interface over three years. These data support the construction of a term-frequency matrix used in the subsequent analyses. By linking personal data to article consultations through a many-to-many relationship, we reconstruct each physician's reading habits at a granular level. Throughout, we apply rigorous data control and preprocessing to ensure the integrity of the dataset and the validity of the analyses. This approach both validates the accuracy of our machine learning techniques and demonstrates their practical effectiveness in interpreting and using user profile data in real-world scenarios.
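As a rough illustration of the data preparation described in the abstract, the sketch below aggregates a many-to-many consultation table into one term-frequency row per physician, which is the kind of matrix an LDA model would take as input. The table contents, the identifiers, and the `term_frequency_matrix` helper are hypothetical; the abstract does not give the actual schema, tokenization, or preprocessing steps, so this should be read as a minimal sketch and not as the authors' pipeline.

```python
from collections import Counter

# Hypothetical toy tables; the real schema is not given in the abstract.
articles = {
    1: "diabetes insulin therapy",
    2: "cardiac surgery recovery",
    3: "insulin dosage in diabetes",
}

# Many-to-many link table: one (physician_id, article_id) row per consultation.
consultations = [(10, 1), (10, 3), (11, 2), (11, 1)]


def term_frequency_matrix(articles, consultations):
    """Return (physician_ids, vocabulary, matrix) with one row per physician.

    Tokenization is a plain lowercase whitespace split (an assumption);
    stop-word removal and other preprocessing are deliberately omitted.
    """
    counts = {}
    for physician_id, article_id in consultations:
        tokens = articles[article_id].lower().split()
        counts.setdefault(physician_id, Counter()).update(tokens)

    physician_ids = sorted(counts)
    vocabulary = sorted({term for c in counts.values() for term in c})
    # A Counter returns 0 for terms a physician never encountered.
    matrix = [[counts[p][term] for term in vocabulary] for p in physician_ids]
    return physician_ids, vocabulary, matrix


physician_ids, vocabulary, matrix = term_frequency_matrix(articles, consultations)
```

Each row of `matrix` can then be passed to a topic model, and the resulting topic weights joined back to demographic fields through the physician identifier.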


