
    NC-scientific research is a perfect example for clinical use

    • Last Update: 2021-03-23
    • Source: Internet
    • Author: User
    Artificial intelligence is at the frontier of contemporary computer applications, and combining it with clinical diagnosis is a hot topic.
    Today I want to share an article published in January this year in the journal Nature Communications (IF: 12.121) on the application of artificial intelligence to the early prediction and diagnosis of sepsis.

    Artificial intelligence for early prediction and diagnosis of sepsis using unstructured data in medical care

    1. Research background

    Sepsis is one of the main causes of death in hospitals, and its treatment is highly time-sensitive, so early prediction and diagnosis are essential to reduce mortality.

    However, since the signs and symptoms of many patients are similar to other less serious conditions, the early prediction and diagnosis of sepsis is a great challenge.

    Most existing methods for the diagnosis and early prediction of sepsis use only structured data stored in an electronic medical record (EMR) system.

    However, studies have shown that approximately 80% of the clinical data in EMR systems is unstructured.

    The article introduced here therefore develops artificial intelligence algorithms that use both structured data and unstructured clinical records to predict and diagnose sepsis.

    2. Data and methods

    1. Data sample: The study used MySQL to extract patient data from the EMR system and build the prediction algorithm.

    The sample included 5317 patients admitted from 2015 to 2017, including 114,602 clinical records.

    The authors divided the sample into a training and validation sample and an independent hold-out test sample.

    The training and validation sample included 3722 patients (80,162 clinical record entries), while the independent hold-out sample included 1595 patients (34,440 clinical record entries).

    The study uses each patient consultation instance (clinical record) as the unit of analysis.

    2. Text mining technology: In addition to the structured data used in the predictive model, the authors also used unstructured free-form text from the clinical records.

    Before unstructured free-form text can be analyzed and used in a predictive model, it must be encoded as numerical values; the authors used the latent Dirichlet allocation (LDA) method for this.

    LDA is an unsupervised technique that derives topics empirically from patterns of words found in the analyzed documents.

    The authors therefore used LDA to generate topics from the clinical records, with each topic represented by a vector.

    Next, the author uses the following five steps to process unstructured free-form text.

    1) Following the Health Insurance Portability and Accountability Act (HIPAA) guidelines and data anonymization standards, delete all possible identifiers in the clinical records, including names, geographic regions, postal codes, and other elements.

    2) Identify the words contained in each document by deleting all punctuation marks, and use part-of-speech tagging to remove definite articles and prepositions.

    The authors also deleted a long list of medically related words or phrases that are common in these texts but carry no practical meaning, such as "report" and "progress".

    After these two steps, a term-document matrix is created, in which each row represents the occurrences of a term and each column represents a document.

    3) Apply a text filter to the processed text to reduce the number of words by eliminating rare words and weighting words that appear multiple times.

    4) Use the LDA topic clustering algorithm to determine different topics.

    According to the existing literature, the number of topics to derive from a text is a highly subjective choice, and it usually depends on the number of observations or the expected diversity of topics in the data set.

    As a robustness check, the authors tried five configurations with 25, 50, 75, 100, and 150 topics.

    Subsequent analysis found that the five configurations produced similar results; for brevity, only the 100-topic model is reported.

    The 100 topics from this model were given to three researchers, who grouped them into seven categories.

    All text mining was performed with SAS Enterprise Miner 14.2.

    5) The topics only need to be developed during the model development and validation phase.

    When the text mining algorithm evaluates a new clinical record, the 100 developed topics serve as benchmarks, and the algorithm calculates how closely the new record matches each of them.

    A high fit metric indicates a high degree of similarity between the new clinical record and the benchmark topic.
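    Steps 2 and 3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the authors worked in SAS Enterprise Miner, the stop list `STOP_WORDS` and the `min_docs` threshold are hypothetical, and a fixed stop list stands in for the part-of-speech tagging described in the text.

```python
import re
from collections import Counter

# Hypothetical stop list: articles, prepositions, and uninformative
# clinical boilerplate such as "report" and "progress".
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "with", "report", "progress"}


def tokenize(note):
    """Lowercase the note, drop punctuation and digits, remove stop words."""
    words = re.findall(r"[a-z]+", note.lower())
    return [w for w in words if w not in STOP_WORDS]


def term_document_matrix(notes, min_docs=2):
    """Return (terms, matrix) with one row per term and one column per note.

    Terms that appear in fewer than `min_docs` notes are treated as rare
    and filtered out.
    """
    counts = [Counter(tokenize(n)) for n in notes]
    doc_freq = Counter(term for c in counts for term in c)
    terms = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


notes = [
    "Progress report: patient febrile, low blood pressure.",
    "Patient febrile with cough; blood culture sent.",
    "Family updated on the plan. Patient comfortable.",
]
terms, matrix = term_document_matrix(notes)
print(terms)   # ['blood', 'febrile', 'patient']
print(matrix)  # [[1, 1, 0], [1, 1, 0], [1, 1, 1]]
```

    A topic model such as LDA would then be fitted on a matrix of this shape; that step is omitted here.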

    3. Machine learning algorithm: An ensemble method is a machine learning algorithm that combines the (possibly weighted) votes of multiple classifiers to determine the final prediction.

    These methods generally perform better than any single classifier.

    The authors used a voting ensemble in the study.

    Voting combines the predictions of several other models.

    Two base classifiers are used in the research: logistic regression trained with stochastic gradient descent (SGD) and a random forest.

    The combination rule is the average of probabilities; that is, the average of the probabilities from the two base models is used as the voting probability.

    SGD is an optimization algorithm that seeks to minimize prediction errors by learning iteratively from previous fitted estimates.

    The method repeatedly draws random samples from the training data to estimate the parameters of the model, which is used to classify patients as having or not having sepsis.

    After each sampling iteration it evaluates the accuracy of the classification and adjusts the parameter estimates, until further improvement in the predictions is minimal.

    In each iteration, the prediction parameter β is calculated and the model is updated by a gradient step in which β is the parameter being optimized, lr is the learning rate, ŷ is the prediction made by the current coefficients, and x is the input value.

    The input variables are structured variables and 100 topics extracted during the text mining process.
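    As a rough illustration of this kind of update, the sketch below trains a logistic regression with SGD in plain Python. It assumes the common log-loss update β ← β + lr · (y − ŷ) · x, which may differ from the exact equation used in the paper; the toy data, the learning rate, and the number of epochs are all made up.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(beta, x):
    """Probability of the positive class; beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(z)


def fit_sgd(X, y, lr=0.1, epochs=200, seed=0):
    """Fit logistic regression by visiting one random sample at a time."""
    rng = random.Random(seed)
    beta = [0.0] * (len(X[0]) + 1)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = y[i] - predict_proba(beta, X[i])  # y - y_hat
            beta[0] += lr * error                     # intercept (x = 1)
            for j, xij in enumerate(X[i]):
                beta[j + 1] += lr * error * xij       # assumed update rule
    return beta


# Toy data: two features, label 1 when both are large.
X = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
beta = fit_sgd(X, y)
```

    In the study the inputs would be the structured variables plus the 100 topic scores rather than two toy features.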

    The second classifier used for voting is the random forest classifier, with sepsis status as the target variable.

    The probabilities of the two classifiers are averaged to get the final probability used in the voting ensemble model.
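    The averaging rule is simple enough to show directly. The sketch below is a minimal illustration; the probability values and the 0.5 decision threshold are hypothetical and are not taken from the paper.

```python
def soft_vote(prob_a, prob_b):
    """Average the class probabilities of two base models, case by case."""
    return [(a + b) / 2.0 for a, b in zip(prob_a, prob_b)]


def classify(probs, threshold=0.5):
    """Turn ensemble probabilities into 0/1 predictions."""
    return [int(p >= threshold) for p in probs]


# Hypothetical sepsis probabilities for three consultations.
p_logreg = [0.125, 0.75, 0.5]   # logistic regression (SGD)
p_forest = [0.25, 0.25, 1.0]    # random forest

ensemble = soft_vote(p_logreg, p_forest)
print(ensemble)            # [0.1875, 0.5, 0.75]
print(classify(ensemble))  # [0, 1, 1]
```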

    4. Alternative estimators: To facilitate comparison, the study also used two alternative estimators, namely dagging and gradient boosted trees (GBT).

    Dagging is an ensemble method that is widely used in the existing machine learning literature, especially when the data is "noisy".

    In dagging, the training data is divided into a set of disjoint stratified samples.

    Logistic regression with SGD is then chosen as the base classifier, and one copy of it is trained on each disjoint sample.

    Next, the ensemble applies the trained base classifiers to the validation data, averages the outputs across all sub-samples, and predicts the result based on the votes of all sub-samples.
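    The dagging procedure can be sketched as follows. This is an illustration only: a toy one-feature threshold learner (`train_threshold`, a hypothetical helper) stands in for the SGD logistic regression used in the study, and the data is made up.

```python
import random
from collections import defaultdict


def disjoint_stratified_folds(y, k, seed=0):
    """Split row indices into k disjoint folds with similar class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def train_threshold(xs, ys):
    """Toy base learner: cut halfway between the two class means."""
    pos = [x for x, t in zip(xs, ys) if t == 1]
    neg = [x for x, t in zip(xs, ys) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1.0 if x > cut else 0.0


def dagging_fit(xs, ys, k=3):
    """Train one base learner on each disjoint stratified fold."""
    models = []
    for fold in disjoint_stratified_folds(ys, k):
        models.append(train_threshold([xs[i] for i in fold],
                                      [ys[i] for i in fold]))
    return models


def dagging_predict(models, x):
    """Average the votes of all base learners."""
    return sum(m(x) for m in models) / len(models)


xs = [0.1, 0.2, 0.3, 0.15, 0.25, 0.35, 0.7, 0.8, 0.9, 0.75, 0.85, 0.95]
ys = [0] * 6 + [1] * 6
models = dagging_fit(xs, ys)
```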

    GBT combines multiple trees to create a more powerful classification and regression model.

    The key idea is to build a series of trees, each of which is trained to correct the errors of the previous trees in the series.

    The authors either add a fixed number of trees, or stop training when the loss reaches an acceptable level or performance on the validation data set no longer improves.
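    The residual-correcting idea behind boosting can be illustrated with one-split trees ("stumps") on a single feature. This sketch makes several simplifying assumptions that are not from the paper: it uses squared error on 0/1 labels rather than a classification loss, it stops on training loss rather than validation loss, and the data, learning rate, and tolerance are made up.

```python
def fit_stump(xs, residuals):
    """Find the single split that best fits the residuals (least squares)."""
    best = None
    for cut in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda x: lm if x <= cut else rm


def boost(xs, ys, n_trees=50, lr=0.5, tol=1e-6):
    """Add stumps one by one, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    prev_loss = float("inf")
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        loss = sum(r * r for r in residuals) / len(xs)
        if prev_loss - loss < tol:   # no meaningful improvement: stop
            break
        prev_loss = loss
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(t(x) for t in trees)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
```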

    All ensemble machine learning was performed on the KNIME analytics platform.

    5. Other model diagnostics: The authors also examined the trade-off between PPV (precision) and sensitivity (recall), as well as the calibration curve of the SERA algorithm.

    6. Deploying the SERA algorithm in the clinical environment: The authors propose two possible modes in which the SERA algorithm can operate in a clinical environment.

    1) Background mode: The SERA algorithm is designed to run in the background.

    Specifically, it is configured to run on the latest clinical data available for each patient at key events, such as ward handovers.

    If the risk score exceeds the specified threshold, the EMR system will alert the doctor.

    If there are more computing resources available, the hospital can choose to run it at regular hourly intervals.

    For a large hospital with 500 beds, if all cases are run individually, the SERA algorithm will take about 90 seconds to score all 500 patients.

    This method ensures that patients undergo continuous and regular sepsis risk assessments in the hospital.

    2) Ad hoc mode: The algorithm can also be designed to run immediately after the doctor updates the clinical record in the EMR system.

    In this case, the SERA algorithm operates on an ad hoc basis, because the score is calculated only after the doctor has updated the patient's status.
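    The two modes differ only in what triggers scoring, which the sketch below tries to make concrete. Everything in it is hypothetical: the `PatientRecord` type, the `toy_score` stand-in for the trained model, and the alert threshold of 0.5 are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PatientRecord:
    patient_id: str
    features: Dict[str, float]  # latest structured data plus topic scores


ALERT_THRESHOLD = 0.5  # hypothetical; the source does not state a value


def background_run(patients: List[PatientRecord],
                   score: Callable[[PatientRecord], float],
                   threshold: float = ALERT_THRESHOLD) -> List[str]:
    """Background mode: score every inpatient, e.g. at ward handover."""
    return [p.patient_id for p in patients if score(p) > threshold]


def on_record_update(patient: PatientRecord,
                     score: Callable[[PatientRecord], float],
                     threshold: float = ALERT_THRESHOLD) -> bool:
    """Ad hoc mode: score one patient right after the record is updated."""
    return score(patient) > threshold


def toy_score(p: PatientRecord) -> float:
    # Stand-in for the trained model: reads a precomputed risk value.
    return p.features["risk"]


ward = [PatientRecord("A", {"risk": 0.2}), PatientRecord("B", {"risk": 0.8})]
alerts = background_run(ward, toy_score)
print(alerts)  # ['B']
```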

    It can be seen from the research that the algorithm is superior to doctors in the early prediction of sepsis, so it can be used as an important early warning indicator to help doctors diagnose and care for patients.

    3. The main results of the study

    1. Data and data processing: This study investigated patients in a public hospital in Singapore.
    The authors used ICD-10 codes to construct the dependent variables of sepsis, severe sepsis, septic shock, and ICU admission.

    The hospital transfers patients diagnosed with sepsis to the ICU; therefore, patients with at least one of these ICD-10 codes who were admitted to the ICU were assigned to the sepsis case cohort.

    All other patients who did not meet these criteria were assigned to the non-sepsis control group.

    There were 240 patients with sepsis in the training and validation samples, and 87 patients with sepsis in the test samples.

    The author defined the time of onset of sepsis as the time of admission to the ICU ward.

    Because the data is imbalanced, the authors used the Synthetic Minority Oversampling Technique (SMOTE) to achieve a 1:1 balance of sepsis cases and non-sepsis controls.

    Previous literature suggests that oversampling leads to more accurate models, and earlier studies have used SMOTE when developing machine learning classifiers for other clinical conditions.

    The authors also developed, tested, and reported the model without any oversampling, to show that the algorithm can operate in a normal clinical environment where the incidence of sepsis is relatively low.
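    The core of SMOTE is to create synthetic minority cases by interpolating between a real minority case and one of its nearest minority neighbours. The sketch below illustrates that idea in plain Python; the data, the neighbour count `k`, and the class sizes are made up, and it is not the implementation used in the study.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (squared distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy example: 3 sepsis cases versus 8 controls, balanced to 1:1.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]
majority_count = 8
new_points = smote(minority, majority_count - len(minority))
balanced_minority = minority + new_points
```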

    2. Sepsis Early Risk Assessment (SERA) algorithm

    Figure 1 outlines the steps of the SERA algorithm.

    Since the algorithm uses clinical records and structured data to evaluate the risk of sepsis, the analysis unit of the algorithm is the consultation instance of each patient.

    The author uses this analysis unit to ensure that the algorithm can operate in a typical clinical environment where clinicians consult, evaluate, and diagnose patients.

    The SERA algorithm is composed of two interrelated algorithms: the diagnosis algorithm and the early prediction algorithm.

    The diagnosis algorithm determines whether the patient has sepsis during the consultation.
    If not, the early prediction algorithm will assess the patient's risk of sepsis in the next 4, 6, 12, 24, and 48 hours.

    For the diagnosis algorithm, the author combines the clinical records of the EMR system of each consultation with the latest structured variables available in the EMR system, as shown in Table 1.

    These data are used to classify whether the patient has sepsis during the consultation.

    The data structure of the early prediction algorithm is similar to that of the diagnosis algorithm.
    The difference is that the early prediction algorithm excludes from the sample any consultation at which the patient has already been diagnosed with sepsis (i.e., transferred to the ICU).

    Because consultations with a positive sepsis diagnosis have characteristics closely tied to the condition itself, excluding them prevents the algorithm's predictive ability from being biased.
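    One way to picture the early prediction data set is as one row per pre-onset consultation, with a 0/1 label for each horizon. The sketch below is an assumption about how such labels could be built, not the authors' code: it treats a label as positive when onset falls within the next h hours, and the consultation times are made up.

```python
HORIZONS_H = (4, 6, 12, 24, 48)


def early_prediction_labels(consult_times_h, onset_time_h):
    """Label each pre-onset consultation for every prediction horizon.

    consult_times_h: consultation times in hours since admission.
    onset_time_h: ICU transfer time (the proxy for sepsis onset), or None
                  for a patient who never developed sepsis.
    """
    rows = []
    for t in consult_times_h:
        if onset_time_h is not None and t >= onset_time_h:
            continue  # consultation at or after diagnosis: excluded
        labels = {
            h: int(onset_time_h is not None and onset_time_h - t <= h)
            for h in HORIZONS_H
        }
        rows.append((t, labels))
    return rows


# A patient seen at hours 0, 30, 50, and 60, transferred to the ICU at hour 60.
rows = early_prediction_labels([0, 30, 50, 60], onset_time_h=60)
for t, labels in rows:
    print(t, labels)
```

    For this patient the consultation at hour 60 is dropped, the one at hour 50 is positive for the 12, 24, and 48 hour horizons, and the one at hour 30 is positive only for the 48 hour horizon.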

    Figure 1 SERA algorithm steps

    Table 1 List of structured variables used in the prediction algorithm

    3. Processing clinical records and machine learning

    Before using unstructured clinical notes as predictors, the authors first applied natural language processing (NLP) to these notes, specifically the LDA topic modeling algorithm.

    The main topics in the progress notes are identified, extracted, given numerical weights, and combined with the structured variable data in Table 1.

    By applying the LDA model to the clinical note data set, the authors identified 100 common text topics, which were divided into the following seven categories: (1) clinical status; (2) communication; (3) laboratory testing; (4) non-clinical state; (5) social relations; (6) symptoms; (7) treatment.

    In both the diagnosis algorithm and the early prediction algorithm, the numerical loadings of the 100 topics and the structured data are used as predictors.

    The diagnosis algorithm applies the voting ensemble machine learning algorithm to these data.

    For comparison, dagging and GBT are also used as two alternative classifiers.

    When a patient is classified as not having sepsis, the early prediction algorithm uses the voting ensemble method to predict whether the patient is at risk of sepsis in the next 4, 6, 12, 24, and 48 hours.

    The dependent variables of the early prediction algorithm are whether the patient will develop sepsis in the next 4, 6, 12, 24, and 48 hours.

    The trained and validated model is then tested using independent test samples.

    The authors then report the test results of the SERA algorithm on oversampled and non-oversampled data.

    The SMOTE model gives the results of a typical machine learning prediction model, in which sepsis cases are made as frequent as non-sepsis cases (Table 2).

    The non-SMOTE model shows the performance of this model in a typical clinical setting with a low incidence of sepsis (Table 3).

    The analysis found that for the diagnostic algorithm, the AUC on the test sample was 0.94, the sensitivity was 0.89, the specificity was 0.85, and the positive predictive value (PPV) was 0.85.

    The AUC of the diagnostic algorithm is higher than in previous studies.
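    For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity, PPV, and AUC are computed from labels and scores. The data is made up for illustration and has no relation to the study's results; the 0.5 cut-off is also an assumption.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and PPV from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # 1 - false positive rate
        "ppv": tp / (tp + fp),          # precision
    }


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

metrics = confusion_metrics(y_true, y_pred)
print(metrics)              # all three values are 0.75
print(auc(y_true, scores))  # 0.8125
```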

    In addition, the algorithm can predict whether a patient is at high risk of developing sepsis before a doctor makes the diagnosis.

    The AUC for predicting sepsis 48 hours before onset is 0.87.

    The AUC increases to 0.90 at 24 hours before sepsis and 0.94 at 12 hours before sepsis.

    The algorithm's prediction at 12 hours before onset also has higher AUC, sensitivity, specificity, and PPV than previous studies.

    The early prediction with a 24-hour lead time has a high AUC of 0.90, with sensitivity, specificity, and PPV all at least 0.80.
    This additional lead time is essential to improving clinical outcomes.

    Table 2 Statistics of the diagnosis and early prediction algorithms (SMOTE)

    Table 3 Statistics of the diagnosis and early prediction algorithms (low prevalence conditions, no SMOTE)

    4. SERA algorithm and human prediction

    Next, the authors compared the performance of the algorithm in the clinical environment with other predictive scoring systems and with the standard clinical practice of diagnosing sepsis or predicting mortality from infection.

    Clinical practice usually uses standardized scoring systems such as SIRS, SOFA, qSOFA and MEWS to predict sepsis or mortality caused by infection.

    A meta-analysis found that, 4 hours before the onset of sepsis, the typical AUC of these four scoring systems is 0.50 to 0.78, the true positive rate (TPR) is 0.56 to 0.8, and the false positive rate (FPR) is 0.16 to 0.50.

    The author drew the ROC curve of the early prediction algorithm based on the reported TPR and FPR scores (Figure 2a).

    In addition, the author also evaluated the TPR and FPR for sepsis diagnosed by hospital doctors.

    In Figure 2a, it can be seen that 4 hours before the onset of sepsis, the algorithm is superior to hospital doctors in predicting sepsis in the test sample.

    In addition, the accuracy of the SERA algorithm exceeds the human-based scoring methods reported in previous studies, such as qSOFA, MEWS, SIRS, and SOFA.

    Figure 2b shows the ROC curve of the early prediction algorithm at 48, 24, 12, 6, and 4 hours before sepsis.

    The area under the ROC curve increased from 48 hours to 12 hours before sepsis and remained at a similar level after 12 hours.

    In all time periods, the algorithm's ROC performance exceeded that of the hospital doctors.

    The authors further compared the predictions of the algorithm at 48, 24, 12, 6, and 4 hours with the prediction accuracy of MEWS, qSOFA, SIRS, and SOFA reported in the literature at the 4-hour mark.

    The SERA algorithm generally outperforms all four scoring systems (MEWS, qSOFA, SIRS, and SOFA), even though those scores were assessed only 4 hours before sepsis.

    In order to test the usefulness of the early prediction algorithm, the author also calculated the TPR and FPR of the early prediction algorithm at the 48, 24, 12, 6 and 4 hour marks (Figure 3).

    The authors observed that 48 hours before the onset of sepsis, the algorithm can detect 0.78 of all patients who eventually develop sepsis, and from 12 hours before onset this proportion (TPR) improves to at least 0.86.

    Hospital doctors, on the other hand, detected only about 0.53 of patients who eventually developed sepsis at 48 hours before onset, and this proportion increased only slightly, to 0.58, at 6 hours.

    At 4 hours, the doctors' TPR rose more noticeably, to 0.65.

    In all periods, the algorithm's TPR for sepsis is 0.21 to 0.32 higher than that of hospital doctors.

    This shows that the algorithm has the potential to increase early sepsis detection compared with relying only on hospital doctors' evaluation.

    The FPR of the SERA algorithm fell from 0.23 at 48 hours to 0.13 at 12 hours (Figure 3b).

    The FPR of hospital doctors is considerably higher than that of the algorithm, ranging from 0.34 to 0.27 between 48 hours and 4 hours.

    Therefore, it is possible to reduce false alarms by using the SERA algorithm.

    Figure 3 shows that the performance of hospital doctors only reaches its peak at 4 hours, which means that doctors have a shorter lead time for medical intervention.

    In contrast, the SERA algorithm already reaches a fairly high prediction rate 48 hours before the onset of sepsis, which means that doctors can be warned of the onset of sepsis earlier and thus have more time for more effective intervention.

    Figure 2 ROC curves of early prediction

    Figure 3 Comparison of the performance of the SERA algorithm and physicians

    5. The benefit of unstructured clinical text in early sepsis prediction

    To quantify the predictive value of unstructured clinical records in the SERA algorithm, the authors first used only structured variables to diagnose and predict sepsis, and then compared the performance of the two versions of the algorithm (Figure 4).

    Adding clinical text does improve the performance of the SERA diagnosis and early prediction algorithms, but close to the onset of sepsis the improvement is minimal.

    In the time frame of 12 to 48 hours before sepsis, however, clinical text brings considerable benefits to early prediction: (1) AUC increased by 0.10 to 0.15, (2) sensitivity increased by 0.07 to 0.13, and (3) specificity increased by 0.08 to 0.14.

    The results also showed that in the time period close to the onset of sepsis, the measurable symptoms of sepsis show up in the structured variables, for example as a drop in blood pressure.

    In this case, adding clinical text provides only a small predictive benefit to the SERA algorithm, because the structured variables capture most of the sepsis symptoms.

    Further from onset, this suggests that the doctor's judgment and qualitative input on the patient's prognosis provide additional key data that can be used to predict sepsis.

    Figure 4 Comparison of the performance of the SERA algorithm and the model without clinical text

    6. The application of the SERA algorithm in a low sepsis prevalence environment

    The authors cite a retrospective cohort study of 409 US hospitals from 2009 to 2014, which found that sepsis affected between 1.8% and 12% of all hospitalized patients, with an average prevalence of 6%, and that this rate remained relatively stable over time.

    Although the natural prevalence of sepsis in the clinical setting is low, most studies develop sepsis prediction algorithms on oversampled data sets, in which the prevalence is significantly higher, at about 50%.

    In this study, therefore, the authors developed a model that is suitable both for oversampled environments and for low prevalence situations (as seen in typical clinical environments).

    For the purpose of clinical application, the authors simulated how changes in the prevalence of sepsis affect PPV (Table 4).

    For example, for the 12-hour early prediction, the estimated PPV of the algorithm in a hospital with a sepsis prevalence as low as 1.8% is 8.3%.

    If the prevalence of sepsis increases to 12%, the estimated PPV increases to 40.25%.

    The simulation results illustrate the ability to apply the SERA algorithm in a natural clinical environment, where the prevalence of sepsis depends on the type of clinical specialty and/or the location of the institution.
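    The dependence of PPV on prevalence follows from Bayes' rule, and a short calculation makes it concrete. In the sketch below the sensitivity (0.84) and specificity (0.83) are hypothetical values, chosen only because they give PPVs close to the two figures quoted above; the text does not state the values the authors used in their simulation.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point (not stated in the source).
SENSITIVITY = 0.84
SPECIFICITY = 0.83

for prevalence in (0.018, 0.06, 0.12):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {value:.1%}")
```

    With a fixed sensitivity and specificity, PPV rises steadily with prevalence, which is why the same model looks far more precise on an oversampled data set than in a typical ward.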

    Table 4 Simulated PPV under different prevalence levels of sepsis

    This concludes the main content of the article.

    The article addresses the early prediction and diagnosis of sepsis.
    It uses structured and unstructured data from clinical records, combined with machine learning methods, to develop a prediction and diagnosis model for sepsis, and it analyzes and evaluates the model from multiple perspectives and in multiple settings.

    Whether for sepsis or for other complex diseases, as computing and medicine develop, analyses that combine artificial intelligence with medical care may become a very important trend.

    Doctors and other researchers alike can learn from this approach of combining clinical data resources with artificial intelligence methods to develop practical algorithms or models.
