Data Application Development for Earthquake and Breast Cancer Datasets
Abstract-This report is a general study of two datasets. The first contains data from the earthquake that occurred in the Marche region of Italy in 2016; the second is mammography data, with mean values of measurements and structures of tumors found in patients. For both studies, data-science techniques were applied with the intention of revealing conclusions that are impossible to visualize a priori.
Keywords-Italy Earthquake, Mammography studies, MapReduce algorithm, Python.
With the high processing power that modern computers have acquired, one of the scientific branches that has developed most is data science, which consists of the generalized extraction of knowledge from information and data. Unlike statistical analysis, data science is more holistic and global, using large volumes of data to extract knowledge that adds value to an organization of any kind.
In this project, the breast cancer dataset contains information on the geometry, size, and texture of tumors found in approximately 5,100 patients. The main idea with this database is to construct a predictive model able to detect when a tumor is cancerous; in other words, to predict whether the cancer is benign or malignant from the tumor's description. On the other hand, the second dataset contains information about the earthquake that occurred in Italy in 2016, including all the aftershocks that occurred in the three days after it, all geotagged. With this dataset the main idea is to do data mining and visualize the information in an innovative way, applying geospatial theory and statistical techniques specific to data science.
A. Italy 2016 Earthquake Dataset
This database is open source, accessible to the community, and part of the extensive catalog offered free of charge by the Kaggle website. Its structure is as follows:
It has 8086 records with full data history, each row represents an earthquake event. For each event, the following properties are given:
the exact timing of the event in the format “Y-m-d hh:mm:s.ms”
the exact geographical coordinates of the event, in latitude and longitude
the depth of the hypocenter in kilometers
the magnitude value on the Richter scale
The dataset was collected from the real-time updated list published by the Italian Earthquakes National Center. From now on we will call this dataset A.
B. Breast Cancer (Diagnostic) Data Set
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The separating plane in the three-dimensional space is that described in [1].
1) ID number
2) Diagnosis (M = malignant, B = benign)
3) Ten real-valued features are computed for each cell nucleus: (a) radius (mean of distances from center to points on the perimeter) (b) texture (standard deviation of gray-scale values) (c) perimeter (d) area (e) smoothness (local variation in radius lengths) (f) compactness (perimeter^2 / area – 1.0) (g) concavity (severity of concave portions of the contour) (h) concave points (number of concave portions of the contour) (i) symmetry (j) fractal dimension (“coastline approximation” – 1)
4) The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
5) All feature values are recorded with four significant digits.
This database was obtained from the Kaggle website. It belongs to their repository and is open to scientists around the world who want to study it. From now on we will call this dataset B.
Knowledge extraction is mainly related to the discovery process known as Knowledge Discovery in Databases (KDD), which refers to the non-trivial process of discovering knowledge and potentially useful information within the data contained in some information repository [2]. It is not an automatic process but an iterative one that exhaustively explores very large volumes of data to determine relationships. It is a process that extracts quality information that can be used to draw conclusions based on relationships or models within the data.
A. Data selection
Both databases were carefully chosen based on the following details:
A reliable source or repository, which guarantees the reliability of the data; for this report the source is Kaggle, which maintains databases open to the public on which users can comment.
Data without an excessive number of empty fields, since having to fill these with 0 can cause distortions in the model, invalidating the predictions or conclusions of the studies.
That they contain at least 5,000 rows, to make the study substantial and the conclusions measurable.
B. Information Preprocessing
For both datasets, some simple statistical tests were performed with the intention of filling the missing data in the most effective way. For example, for dataset B the mean and standard deviation were calculated, and a frequency histogram was plotted to check that the data followed a Gaussian distribution. Since the data is indeed distributed this way, the missing entries were completed with values drawn randomly from a normal distribution with the data's mean and standard deviation; this ensures that the imputed data does not introduce incorrect information.
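This imputation step can be sketched as follows. It is a minimal example on a toy column (the real file is not bundled here), and it assumes a modern NumPy/pandas stack rather than the Python 2.7 environment listed later:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy column standing in for one of dataset B's numeric features.
col = pd.Series([14.1, 13.5, np.nan, 15.2, 12.8, np.nan, 14.9])

# Fit a normal distribution to the observed values (pandas skips NaN here).
mu, sigma = col.mean(), col.std()

# Replace each missing entry with a random draw from that distribution,
# so the imputed values follow the same Gaussian shape as the rest.
n_missing = col.isna().sum()
col.loc[col.isna()] = rng.normal(mu, sigma, size=n_missing)
```

The key point is that random draws preserve the column's spread, whereas filling with a constant (0 or even the mean) would artificially shrink the variance.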
For the data of A, average values were computed, and the latitude and longitude of each point where an earthquake occurred were rounded off so that each event could be given a geospatial label matching an Italian province or region.
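The rounding step can be sketched like this, on a hypothetical slice of dataset A's coordinate columns (column names are assumptions, not the file's actual header):

```python
import pandas as pd

# Hypothetical sample rows of dataset A (coordinates only).
quakes = pd.DataFrame({
    "Latitude":  [42.6983, 42.7123, 43.0011],
    "Longitude": [13.2335, 13.1987, 12.9876],
})

# Round coordinates to one decimal place (roughly 11 km in latitude),
# so nearby events share a coarse key that can then be joined against
# a province/region lookup table.
quakes["lat_bin"] = quakes["Latitude"].round(1)
quakes["lon_bin"] = quakes["Longitude"].round(1)
```

After this step, the first two events fall into the same bin and would receive the same province label.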
C. Data Transformation
For both datasets, the MapReduce algorithm was applied; it is based on the HDFS data architecture. The idea is to map key/value pairs from each datum and its header, so that access to them is efficient; this gives robustness to the data, in addition to reducing processing times. The main idea of this type of algorithm is to keep the data in distributed systems, although for this project only a single node was configured.
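On a single node, the map/reduce pattern can be illustrated in pure Python (the region/magnitude pairs below are toy values, not the real records):

```python
from collections import defaultdict

# Toy event records: (region, magnitude).
events = [("Marche", 4.2), ("Lazio", 3.1), ("Marche", 5.0), ("Umbria", 2.8)]

# Map phase: emit a key/value pair per record (here, one count per region).
mapped = [(region, 1) for region, _mag in events]

# Shuffle + reduce phase: group by key and aggregate the values.
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value
```

In a real Hadoop deployment the map and reduce phases run on separate DataNodes and the shuffle moves data between them; here both phases simply run in one process.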
D. Data Mining
At this stage of the process it is already clear how the data is distributed, and it is where we decide which Machine Learning or Data Mining algorithms to apply. For dataset B, we chose a Machine Learning algorithm based on logistic regression, starting from the following arguments:
It was verified that the data follow a linear distribution and are correlated with each other.
As the result is a decision, benign or malignant (1 or 0), the most intuitive choice is to apply logistic regression to predict the diagnoses.
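A minimal sketch of such a classifier follows; since the report's CSV is not bundled here, the similar Wisconsin breast-cancer table shipped with scikit-learn stands in for dataset B:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for dataset B: 30 numeric features, binary benign/malignant target.
X, y = load_breast_cancer(return_X_y=True)

# Hold back 20% as a validation set, as described later in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

# Logistic regression maps the features to a probability of each class.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
```

Because the output of the sigmoid is a probability, thresholding it at 0.5 yields the benign/malignant decision directly.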
For the second dataset, the technique used is an a posteriori study of the cataclysm with the intention of revealing conclusions about the earthquake, focused on the geospatial area. Starting with the WGS84 labeling and the coordinates of each earthquake, it is possible to construct a density of earthquakes by region. With this data it is possible to determine which region was most affected, where the epicenter of the earthquake was, and whether there is a correlation between the depth of an earthquake and its magnitude.
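The depth-magnitude correlation test can be sketched with pandas; the values below are hypothetical stand-ins for the real catalog:

```python
import pandas as pd

# Hypothetical slice of dataset A (the real file has 8086 rows).
df = pd.DataFrame({
    "Depth/Km":  [10.0, 8.5, 10.0, 22.3, 9.1, 10.0],
    "Magnitude": [3.2, 2.9, 4.1, 3.0, 2.7, 6.0],
})

# Pearson correlation coefficient between hypocenter depth and magnitude:
# values near 0 mean no linear relationship, near +/-1 a strong one.
r = df["Depth/Km"].corr(df["Magnitude"])
```

The same one-liner applied to the full catalog answers the question posed above: whether deeper events tend to be stronger or weaker.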
The implementation was made in Python version 2.7. A few key libraries are used. Below is the list of Python SciPy-stack libraries required to implement the algorithms for B: SciPy, NumPy, Matplotlib, Pandas, scikit-learn, Patsy, and StatsModels.
A few more are required to implement A: Pandas, NumPy, Matplotlib, Basemap, Shapely, PySAL, Descartes, Fiona, Pylab, and StatsModels. The architecture for storing and reading the data is the Hadoop Distributed File System (HDFS), the primary storage system used by Hadoop applications.
HDFS is built to support applications with large data sets, including individual files that reach into the terabytes. It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.
The next image, Fig. 1, shows the workflow diagram for the Machine Learning algorithm applied to dataset B.
Figure 1: Workflow for Machine Learning algorithm
The second one, Fig. 2, shows the workflow for dataset A. This workflow was constructed from the selected methodology; the idea is to follow this pattern of work to increase the productivity of the research, since these frameworks have been thoroughly tested by qualified researchers in the area.
Figure 2: Workflow for Data Mining research
For dataset B, a recursion stage is considered in case the final predictions are not satisfactory; this would entail rethinking the model and obtaining all the values again. For dataset A, the diagram focuses on maximum representation of the data, to extract a substantial number of conclusions from graphs.
A. Dataset A
The first result obtained is a map of the central region of Italy with each of the roughly 8,000 points where earthquakes occurred.
Figure 3: Scatter plot with administrative subdivisions
We have drawn a scatter plot on the map of Italy (Fig. 3), containing a point with a 50-meter diameter for each record of dataset A.
This is a first step, but it doesn't really tell us anything interesting about the density per region – merely that there were more earthquakes in the Marche region of Italy than in the surrounding areas.
Figure 4: Density plot with administrative subdivisions
Now we can see the distribution of the earthquakes (Fig. 4). It is clear on the map that the regions most affected were Lazio, Marche, and Umbria.
Figure 5: Magnitude rolling mean
Most of the earthquakes occurred at a depth of 10 km. This can be seen in the frequency histogram of depth in Fig. 6.
Figure 6: Frequency Histogram
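The histogram behind Fig. 6 can be reproduced in spirit with NumPy; the depths below are synthetic values centered on 10 km, since the real catalog is not included here:

```python
import numpy as np

# Synthetic depths clustered around 10 km, mimicking the pattern in Fig. 6.
rng = np.random.default_rng(1)
depths = rng.normal(loc=10.0, scale=2.0, size=1000)

# np.histogram computes the same bin counts matplotlib's plt.hist would draw.
counts, edges = np.histogram(depths, bins=20)

# Left edge of the fullest bin: the modal depth of the catalog.
modal_depth = edges[counts.argmax()]
```

On the real data this modal bin is what places the bulk of the events at roughly 10 km depth.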
The following table shows the 5 earthquakes with the greatest impact and the regions where they occurred.
Table II: Greatest-magnitude earthquakes
B. Dataset B
We are going to look at two types of plots:
Univariate plots to better understand each attribute.
Multivariate plots to better understand the relationships between attributes.
1) Univariate Plots: We start with some univariate plots, that is, plots of each individual variable. Given that the input variables are numeric, we can create box-and-whisker plots of each.
Figure 7: Box-and-whisker plots
Fig. 7 gives a much clearer idea of the distribution of the input attributes.
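Such box plots can be produced with pandas in a few lines; here the scikit-learn breast-cancer table stands in for dataset B, and only its first four features are drawn to keep the grid readable:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# First four attributes only, for a readable 2x2 grid.
df = pd.DataFrame(data.data[:, :4], columns=data.feature_names[:4])

# One box-and-whisker plot per input attribute, as in Fig. 7.
df.plot(kind="box", subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.tight_layout()
plt.savefig("whisker_plots.png")
```

Swapping `kind="box"` for `kind="hist"` produces the frequency histograms of the following figure with the same one-liner.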
It looks like most of the input variables have a Gaussian distribution. This is useful to note, as we can use algorithms that exploit this assumption; it can also be seen in Fig. 8.
Figure 8: Frequency histogram
2) Algorithm Evaluation: In this step we evaluated the most important Machine Learning algorithms in search of the one best adapted to the data.
We used statistical methods to estimate the accuracy of the models on unseen data. We also wanted a more concrete estimate of the accuracy of the best model by evaluating it on actual unseen data.
That is, we held back some data that the algorithms would not get to see, and used this data to get a second, independent idea of how accurate the best model might actually be.
We split the loaded dataset in two: 80% was used to train our models, and 20% was held back as a validation dataset.
We evaluated 6 different algorithms:
Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Gaussian Naive Bayes (NB)
Support Vector Machines (SVM)
This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB, and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits; this makes the results directly comparable.
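The comparison above can be sketched as follows. Scikit-learn's bundled breast-cancer table again stands in for dataset B, and a fixed `random_state` plays the role of the reset seed, so every model is scored on identical folds:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for dataset B

models = {
    "LR": LogisticRegression(max_iter=5000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

results = {}
for name, model in models.items():
    # Same seed for every algorithm -> identical folds -> comparable scores.
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = scores.mean()
```

Each entry of `results` corresponds to one mean-accuracy line in the list that follows (the exact numbers depend on the data and folds used).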
Figure 9: Algorithm comparison
LR: 0.658580 (0.027300)
LDA: 0.661676 (0.026534)
KNN: 0.606749 (0.023558)
CART: 0.569616 (0.041578)
NB: 0.621194 (0.032784)
SVM: 0.641823 (0.025195)
The LR algorithm was the most accurate model that we tested. Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.
We can run the LR model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.
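All three summaries are available in scikit-learn's metrics module. The sketch below uses hypothetical labels (chosen so the accuracy happens to come out at 0.75), not the report's actual validation split:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Hypothetical true labels and LR predictions on a tiny validation set
# (1 = malignant, 0 = benign).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)            # fraction correct
cm = confusion_matrix(y_true, y_pred)           # rows: true, cols: predicted
report = classification_report(y_true, y_pred)  # precision/recall/F1 per class
```

The off-diagonal cells of the confusion matrix are exactly the errors the text counts: false positives in the top-right, false negatives in the bottom-left.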
The accuracy is 0.75 or 75%. The confusion matrix provides an indication of the 25 errors made.
As we can see, data science has a wide field of work, in areas as diverse as, in the case of this report, medicine, cartography, and seismology. This report makes evident how important Machine Learning algorithms are in cancer diagnosis. Although this small case study is not perfect, there are more advanced tools and more sophisticated algorithms that allow this field to be explored in amazing ways; the author recommends a degree project in which Deep Learning algorithms and deep neural networks are applied to the diagnosis of diseases. It is certainly a prominent field.
On the other hand, with the first dataset it was possible to explore tools for the management of maps and the placement of large amounts of data on them, with the main idea of exposing results that are impossible to observe by looking at the raw data. This allows us to find new points of view about phenomena that have already happened, and to learn from them to improve infrastructures or tools.
In short, data science is a field in full swing that will give us much to talk about in the coming years; we live in an age where information is power, and manipulating and understanding information are the tools of the future.
[1] K. P. Bennett and O. L. Mangasarian, “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets,” Optimization Methods and Software, vol. 1, pp. 23-34, 1992.
[2] G. J. Williams and Z. Huang, “A case study in knowledge acquisition for insurance risk assessment using a KDD methodology,” in Proceedings of the Pacific Rim Knowledge Acquisition Workshop, Dept. of AI, Univ. of NSW, Sydney, Australia, Oct. 1996, pp. 117-129.