
On the importance of collaboration with SMEs in Data Science/AI projects.



During these strange days of emergency we have seen, among many other initiatives from academia and industry, individuals and groups of Data Scientists using publicly available data sets to make predictions about the evolution of the COVID-19 pandemic, or building other healthcare-related applications to help diagnose the symptoms of the virus where test kits are not available. Every little bit helps, and it is wonderful to see such a high level of genuine interest in this matter; I am one of those who encourage people to be curious and altruistic. But, as others have already started warning, any personal initiative, to be effective, needs to be evaluated by subject matter experts. In this post I am going to provide a concrete example of why this kind of collaboration is so important.
At the end of 2019 I got a very bad flu which ended up in acute bronchitis, from which I recovered very slowly. After my childhood I rarely got the flu, and when I did it never lasted more than three days, with no aftermath. After this experience I became curious to learn more about lung diseases and, my job being in Data Science, to try to address a specific use case as well. I found on Kaggle a data set of chest X-ray images to play with: I built a first simple model (a CNN) to perform binary classification (normal patient or pneumonia) of those images, intending to use it to continue my experimentation with XAI techniques. The first release of the model, in spite of a 90% accuracy, showed through the confusion matrix and other metrics, such as precision and recall, that it was very good at recognizing X-ray images from normal patients as normal, but not as good at classifying pneumonia X-ray images as such. This behavior was confirmed when I started using the model to make predictions on test images.
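The original code isn't part of this post, but a minimal sketch of this kind of binary CNN classifier and its evaluation in Keras could look like the following (the image size, directory layout and hyperparameters here are illustrative assumptions, not the exact ones I used):

```python
# A minimal sketch (not the original code) of a binary CNN classifier
# for the Kaggle chest X-ray data set; paths, image size and
# hyperparameters are illustrative assumptions.
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix, precision_score, recall_score

IMG_SIZE = (150, 150)

gen = ImageDataGenerator(rescale=1.0 / 255)
train_flow = gen.flow_from_directory(
    "chest_xray/train", target_size=IMG_SIZE, color_mode="grayscale",
    class_mode="binary", batch_size=32)
test_flow = gen.flow_from_directory(
    "chest_xray/test", target_size=IMG_SIZE, color_mode="grayscale",
    class_mode="binary", batch_size=32, shuffle=False)

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=IMG_SIZE + (1,)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of pneumonia
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_flow, epochs=10)

# Accuracy alone hides the problem: the confusion matrix, precision
# and recall reveal how pneumonia images are actually handled.
y_pred = (model.predict(test_flow).ravel() > 0.5).astype(int)
y_true = test_flow.classes  # ground-truth labels, in order (shuffle=False)
print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```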
Before going further, I applied some XAI techniques, such as SHAP (SHapley Additive exPlanations), and then asked a friend of mine, an experienced radiologist, for support. This collaboration gave me insights that I couldn't have figured out by myself (I am a Biomedical Engineer with extensive experience in Software Engineering and Data Science in other sectors, such as Biotech Manufacturing, Healthcare Insurance and Cloud Operations, but I am not a Medical Doctor nor an X-ray expert), and it led me to find better solutions. First of all, even after manually reviewing the training, validation and test data sets, I had started from the wrong assumption that all of the X-ray images were taken using the same projection. In fact, most of the normal ones, such as the one in figure 1,

Figure 1

are in PA (posteroanterior) projection, while others, related to pneumonia, were taken in AP (anteroposterior) projection (the patient most probably not being in a condition to stand), which results in a different contrast and a slightly different position of the lungs compared to PA views. With reference to figure 2 (a patient affected by bacterial pneumonia),

Figure 2


the radiologist pointed out to me several details (among others, a clear asymmetry of the rib cage and a distortion of the mediastinal structures) that give evidence it was taken in AP projection. He also gave me suggestions on how to improve the contrast of the images, which proved useful in the pre-processing phase before training the model.
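The exact enhancement technique isn't detailed here, but as a sketch of this kind of pre-processing, a common choice for X-ray images is CLAHE (Contrast Limited Adaptive Histogram Equalization) via OpenCV (the file path below is hypothetical):

```python
# Hypothetical pre-processing step: the exact technique suggested by the
# radiologist isn't detailed in the post; CLAHE is a common way to
# enhance local contrast in X-ray images.
import cv2

def enhance_contrast(path):
    """Load a grayscale X-ray image and boost its local contrast."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(img)

# Example with a hypothetical file from the Kaggle data set layout:
enhanced = enhance_contrast("chest_xray/train/PNEUMONIA/person1_bacteria_1.jpeg")
cv2.imwrite("person1_bacteria_1_clahe.jpeg", enhanced)
```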
Furthermore, with reference to the image in figure 2, which was wrongly classified as normal by the first version of the model, I submitted to the radiologist the SHAP values for the model prediction (figure 3).

Figure 3
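For reference, a minimal sketch of how SHAP values like those in figure 3 can be produced with the shap library's GradientExplainer (assuming the `model` from the earlier sketch, and hypothetical NumPy arrays `x_train` and `x_test` of pre-processed images):

```python
# Sketch of computing SHAP explanations for the CNN above with shap's
# GradientExplainer; `model` comes from the earlier sketch, while
# `x_train` and `x_test` are hypothetical NumPy arrays of pre-processed
# images with shape (n, 150, 150, 1).
import shap

# A small background sample approximates the expected model output.
explainer = shap.GradientExplainer(model, x_train[:100])
shap_values = explainer.shap_values(x_test[:4])

# Overlay the attributions on the images: red pixels push the model
# output higher, blue pixels push it lower.
shap.image_plot(shap_values, x_test[:4])
```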

Those SHAP values always showed a red area below the right lung; a red area represents a group of features (pixels) in the input image that pushed the model away from the predicted value. The radiologist noticed something there that my eyes, not being trained to analyze X-ray images, had missed: that red area highlights the presence of something external, such as a plastic tube.

These are just a few examples, but the continuous feedback from a radiologist taught me a lot about this subject and helped me achieve better results before moving on to something more complex, such as a multi-class classifier that could also detect COVID-19; but that's a story for another time.

This post is a reminder for Data Scientists to always try to improve their knowledge of the specific sector and problems they focus on by starting a collaboration with SMEs. This way, any ongoing effort to solve COVID-19 related problems can be really productive, and not just a stylistic exercise or a surrogate for a Kaggle competition.
