
On the importance of collaboration with SMEs in Data Science/AI projects.



During these strange days of emergency we have seen, from academia and industry alike, many initiatives by individual Data Scientists or groups using publicly available data sets to predict the evolution of the COVID-19 pandemic, or to build other healthcare applications that could help diagnose the symptoms of the virus in case test kits are not available. Every little helps, and it is wonderful to see this level of genuine interest: I am one of those who encourage people to be curious and altruistic. But, as others have started warning lately, any personal initiative needs to be evaluated by subject matter experts (SMEs) to be effective. In this post I am going to provide a concrete example of the importance of this kind of collaboration.
At the end of 2019 I got a very bad flu, which turned into acute bronchitis from which I recovered very slowly. Since my childhood I had rarely caught the flu, and when I did it never lasted more than 3 days and left no aftermath. After this experience I became curious to learn more about lung diseases and, since my job is in Data Science, to address a specific use case as well. I found a data set of chest X-ray images on Kaggle to play with: I built a first simple model (a CNN) to perform binary classification (normal patient or pneumonia) on those images, planning to use it to continue my experiments with XAI techniques.

The first release of the model, in spite of a 90% accuracy, showed through the confusion matrix and other metrics such as precision and recall that it was very good at recognizing X-ray images from normal patients as normal, but much less so at classifying pneumonia X-ray images as such. This behavior was confirmed when I started using the model to make predictions on test images. Before going further, I applied some XAI techniques, such as SHAP (SHapley Additive exPlanations), and then asked a friend of mine, an experienced Radiologist, for support.

This collaboration gave me insights that I couldn't have figured out by myself and that led me to better solutions: I am a Biomedical Engineer with extensive experience in Software Engineering and Data Science in other sectors (Biotech Manufacturing, Healthcare Insurance and Cloud Operations, just to mention a few), but neither a Medical Doctor nor an X-ray expert. First of all, after manually reviewing the training, validation and test sets, I had started from the wrong assumption that all of the X-ray images had been taken using the same projection, while most of the normal ones, such as the one in figure 1





Figure 1

are in PA (posteroanterior) projection, and others related to pneumonia were taken in AP (anteroposterior) projection (the patient most probably being in no condition to stand), which results in a different contrast and a slightly different position of the lungs compared to PA views. With reference to figure 2 (a patient affected by bacterial pneumonia)



Figure 2


several details were pointed out to me (such as, among others, a clear asymmetry of the rib cage or a distortion of the mediastinum structure) that give evidence that it was taken in AP projection. I also received suggestions on how to improve the contrast of the images, which proved useful in the pre-processing phase before training the model. Furthermore, still with reference to the image in figure 2, which was wrongly classified as normal by the first version of the model, by submitting to the Radiologist the SHAP values for the model prediction (figure 3)




Figure 3 

which consistently showed a red area (representing a group of features, i.e. pixels, in the input image that pushed the model away from the predicted value) below the right lung, I was made aware of something that my eyes, not being trained to analyze X-ray images, didn't catch: that red area highlights the presence of something external, such as a plastic tube. These are just a few examples, but the continuous feedback from a Radiologist led me to learn a lot on this subject and to start achieving better results before moving on to something more complex, such as a multi-class classifier that could also detect COVID-19, but that's another story. This post is a reminder for Data Scientists to always try to improve their knowledge of the specific sector and related problems they focus on by starting a collaboration with SMEs. This way any ongoing effort to solve COVID-19 related problems can be really productive, and not just a simple exercise in style or a Kaggle competition surrogate.
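As a side note on the metrics mentioned earlier, here is a minimal sketch of how a 90% accuracy can coexist with a weak ability to recognize pneumonia images. The label counts are purely hypothetical, chosen only to reproduce that pattern: they are not the results of my model, and the function names are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and specificity."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical labels (assumption): 0 = normal, 1 = pneumonia.
# 800 normal images, 790 of them predicted as normal;
# 200 pneumonia images, only 110 of them predicted as pneumonia.
y_true = [0] * 800 + [1] * 200
y_pred = [0] * 790 + [1] * 10 + [1] * 110 + [0] * 90

metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up numbers the accuracy looks healthy while the recall on the pneumonia class is close to a coin flip, which is why the confusion matrix deserves a look before trusting a single score.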
