
On the importance of collaboration with SMEs in Data Science/AI projects.



During these strange days of emergency we have seen, among many other initiatives from academia and industry, individuals and groups of Data Scientists using publicly available data sets to make predictions about the evolution of the COVID-19 pandemic, or building other healthcare-related applications to help diagnose the symptoms of the virus where test kits are not available. Every little bit helps, and it is wonderful to see such a high level of genuine interest in this matter; I am one of those who encourage people to be curious and altruistic. But, as others have already started warning, any personal initiative, to be effective, needs to be evaluated by subject matter experts. In this post I am going to provide a concrete example of why this kind of collaboration is so important.
At the end of 2019 I got a very bad flu which ended up in acute bronchitis, from which I recovered very slowly. After my childhood I rarely got the flu, and when I did it never lasted more than three days, with no aftermath. After this experience I became curious to learn more about lung diseases and, my job being in Data Science, to try to address a specific use case as well. I found on Kaggle a data set of chest X-ray images to play with: I built a first simple model (a CNN) to perform binary classification (normal patient or pneumonia) of those images, intending to use it to continue my experimentation with XAI techniques. The first release of the model, in spite of a 90% accuracy, showed through the confusion matrix and other metrics, such as precision and recall, that it was very good at recognizing X-ray images from normal patients as normal, but not as good at classifying pneumonia X-ray images as such. This behavior was confirmed when I started using the model to make predictions on test images.
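The original code isn't part of this post, but a minimal sketch of this kind of binary CNN classifier and its evaluation in Keras could look like the following (the image size, directory layout and hyperparameters here are illustrative assumptions, not the exact ones I used):

```python
# A minimal sketch (not the original code) of a binary CNN classifier
# for the Kaggle chest X-ray data set; paths, image size and
# hyperparameters are illustrative assumptions.
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import confusion_matrix, precision_score, recall_score

IMG_SIZE = (150, 150)

gen = ImageDataGenerator(rescale=1.0 / 255)
train_flow = gen.flow_from_directory(
    "chest_xray/train", target_size=IMG_SIZE, color_mode="grayscale",
    class_mode="binary", batch_size=32)
test_flow = gen.flow_from_directory(
    "chest_xray/test", target_size=IMG_SIZE, color_mode="grayscale",
    class_mode="binary", batch_size=32, shuffle=False)

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=IMG_SIZE + (1,)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of pneumonia
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_flow, epochs=10)

# Accuracy alone hides the problem: the confusion matrix, precision
# and recall reveal how pneumonia images are actually handled.
y_pred = (model.predict(test_flow).ravel() > 0.5).astype(int)
y_true = test_flow.classes  # ground-truth labels, in order (shuffle=False)
print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```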
Before going further, I applied some XAI techniques, such as SHAP (SHapley Additive exPlanations), and then asked a friend of mine, an experienced radiologist, for support. This collaboration gave me insights that I couldn't have figured out by myself (I am a Biomedical Engineer with extensive experience in Software Engineering and Data Science in other sectors, such as Biotech Manufacturing, Healthcare Insurance and Cloud Operations, but I am not a Medical Doctor nor an X-ray expert), and it led me to find better solutions. First of all, even after manually reviewing the training, validation and test data sets, I had started from the wrong assumption that all of the X-ray images were taken using the same projection. In fact, most of the normal ones, such as the one in figure 1,

Figure 1

are in PA (posteroanterior) projection, while others, related to pneumonia, were taken in AP (anteroposterior) projection (the patient most probably not being in a condition to stand), which results in a different contrast and a slightly different position of the lungs compared to PA views. With reference to figure 2 (a patient affected by bacterial pneumonia),

Figure 2


the radiologist pointed out to me several details (among others, a clear asymmetry of the rib cage and a distortion of the mediastinal structures) that give evidence it was taken in AP projection. He also gave me suggestions on how to improve the contrast of the images, which proved useful in the pre-processing phase before training the model.
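The exact enhancement technique isn't detailed here, but as a sketch of this kind of pre-processing, a common choice for X-ray images is CLAHE (Contrast Limited Adaptive Histogram Equalization) via OpenCV (the file path below is hypothetical):

```python
# Hypothetical pre-processing step: the exact technique suggested by the
# radiologist isn't detailed in the post; CLAHE is a common way to
# enhance local contrast in X-ray images.
import cv2

def enhance_contrast(path):
    """Load a grayscale X-ray image and boost its local contrast."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(img)

# Example with a hypothetical file from the Kaggle data set layout:
enhanced = enhance_contrast("chest_xray/train/PNEUMONIA/person1_bacteria_1.jpeg")
cv2.imwrite("person1_bacteria_1_clahe.jpeg", enhanced)
```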
Furthermore, with reference to the image in figure 2, which was wrongly classified as normal by the first version of the model, I submitted to the radiologist the SHAP values for the model prediction (figure 3).

Figure 3
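For reference, a minimal sketch of how SHAP values like those in figure 3 can be produced with the shap library's GradientExplainer (assuming the `model` from the earlier sketch, and hypothetical NumPy arrays `x_train` and `x_test` of pre-processed images):

```python
# Sketch of computing SHAP explanations for the CNN above with shap's
# GradientExplainer; `model` comes from the earlier sketch, while
# `x_train` and `x_test` are hypothetical NumPy arrays of pre-processed
# images with shape (n, 150, 150, 1).
import shap

# A small background sample approximates the expected model output.
explainer = shap.GradientExplainer(model, x_train[:100])
shap_values = explainer.shap_values(x_test[:4])

# Overlay the attributions on the images: red pixels push the model
# output higher, blue pixels push it lower.
shap.image_plot(shap_values, x_test[:4])
```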

Those SHAP values always showed a red area below the right lung; a red area represents a group of features (pixels) in the input image that pushed the model away from the predicted value. The radiologist noticed something there that my eyes, not being trained to analyze X-ray images, had missed: that red area highlights the presence of something external, such as a plastic tube.

These are just a few examples, but the continuous feedback from a radiologist taught me a lot about this subject and helped me achieve better results before moving on to something more complex, such as a multi-class classifier that could also detect COVID-19; but that's a story for another time.

This post is a reminder for Data Scientists to always try to improve their knowledge of the specific sector and problems they focus on by starting a collaboration with SMEs. This way, any ongoing effort to solve COVID-19 related problems can be really productive, and not just a stylistic exercise or a surrogate for a Kaggle competition.
