1. GLOBAL DESCRIPTION OF THE TECHNOLOGICAL AREA
Big Data Analytics (BDA) is a discipline whose final goal is the extraction of knowledge and decision making based on information processing over massive datasets.
BDA can provide new descriptions or patterns within the data domain, predict data features, or even generate new, simulated data.
Several stages build up towards these goals: the extraction of raw data from its primary sources, its processing into a machine-understandable format, its data quality (DQ) assessment, and the automatic description of patterns by means of clustering algorithms.
The role of these methods is twofold. First, they provide new insights and knowledge that constitute the core of BDA, e.g., discovering new subgroups of patients in the analysis of complex diseases such as diabetes mellitus. Second, their outputs are crucial to enable further predictive modelling and ensure its reliability, e.g., by integrating medical imaging data from different hospitals while ensuring their homogeneity, which reduces confounding factors or statistical fragility that may harm further learning.
1.1 DATA INTEGRATION
Data is not usually accessible in a format ready for applying analytics methods. In addition, data can be generated from different heterogeneous sources.
Creating a target data format for data analysis and using heterogeneity control methods are good practices to address these issues. Tabular data can be mapped to these formats with conventional database mapping or Extract, Transform and Load (ETL) processes. In the case of complex data such as images or signals, reference templates can be used for co-registration or intensity homogenization. For textual data, we can use stemming and lemmatization, resort to lexical databases such as WordNet, or use general-purpose embeddings to achieve an initial homogeneity.
In any case, it is important to establish proper integration protocols and data dictionaries to ensure reliable integration and comprehension. Data variability control methods can help delineate unexpected heterogeneity problems that could harm further reuse.
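As an illustration of mapping heterogeneous tabular sources to a single target format, the following is a minimal sketch of an Extract, Transform and Load step guided by a data dictionary. The source names, field names and value codings (`hospital_a`, `AgeMonths`, and so on) are hypothetical and chosen only for the example.

```python
# Hypothetical data dictionary for the target format: field name -> type.
TARGET_SCHEMA = {"patient_id": str, "age_years": int, "sex": str}

# Hypothetical per-source mappings: target field -> (source field, converter).
SOURCE_MAPPINGS = {
    "hospital_a": {
        "patient_id": ("id", str),
        "age_years": ("age", int),
        "sex": ("gender", lambda v: v.strip().upper()[:1]),
    },
    "hospital_b": {
        "patient_id": ("PatientID", str),
        "age_years": ("AgeMonths", lambda v: int(v) // 12),
        "sex": ("Sex", lambda v: {"0": "F", "1": "M"}[str(v)]),
    },
}


def transform(record, source):
    """Map one raw record from `source` into the target format."""
    mapping = SOURCE_MAPPINGS[source]
    row = {
        target: convert(record[field])
        for target, (field, convert) in mapping.items()
    }
    # Check the result against the data dictionary before loading it.
    for name, expected in TARGET_SCHEMA.items():
        if not isinstance(row[name], expected):
            raise TypeError(f"{name} should be of type {expected.__name__}")
    return row


def integrate(sources):
    """Extract, transform and load the records of every source into one table."""
    return [
        transform(record, source)
        for source, records in sources.items()
        for record in records
    ]


raw = {
    "hospital_a": [{"id": 17, "age": "54", "gender": " female"}],
    "hospital_b": [{"PatientID": "B-3", "AgeMonths": "400", "Sex": 1}],
}
integrated = integrate(raw)
```

Keeping the mappings as data rather than code makes them easy to review alongside the data dictionary, which is one possible way to document an integration protocol.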
1.2 DATA GENERATION
Big data can be generated from diverse and distributed sources and formats, including tabular data, sensor streams, text or structured web data, images, video, etc.
The domain of application also brings its own particularities, as in Internet of Things (IoT) and biomedical data. IoT data is generally characterized by streaming sensor readings, high heterogeneity, and a small ratio of effective to total data. Biomedical data is characterized by multi-modal information, largely coded, categorical data, or very large individual records such as those from high-throughput omics sequencing.
Lastly, transforming data into lower-dimensional embeddings, which optimizes processing time while extracting latent data layouts, can sometimes be part of the data generation process. This includes the application of linear and non-linear dimensionality reduction methods. Some of them allow the generation of transformation models that can be applied to new data (such as Principal Component Analysis), while others specialize in neighbor embedding (such as t-SNE) or rely on Deep Learning autoencoders, which show large potential for clustering or community detection.
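To illustrate how a method such as Principal Component Analysis yields a transformation model that can be reused on new data, here is a minimal sketch that estimates only the first principal component by power iteration. The function names and the toy data are assumptions made for the example; a real project would normally rely on an established library.

```python
import math


def fit_first_component(rows, iterations=200):
    """Return (mean, direction) of the first principal component of `rows`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(row, mean, direction):
    """Apply the fitted model: 1-D embedding of a (possibly new) data point."""
    return sum((x - m) * u for x, m, u in zip(row, mean, direction))


# Toy training data lying on the line y = 2x.
train = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, direction = fit_first_component(train)

# The fitted model is reused on a point that was not in the training data.
new_score = project((2.5, 5.0), mean, direction)
```

The pair `(mean, direction)` is the reusable model: unlike neighbor-embedding methods, it can embed unseen points without refitting.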
1.3 DATA HANDLING
Handling massive datasets requires not only the obvious infrastructure with large storage and access capacity, but also specific methods for data cleaning or curation, transformation, and secure access.
In data analytics projects these methods should be addressed through specific protocols and supported by technology. Scalable storage and processing capacities can be provided by Cloud services and computing clusters. Data curation can rely on DQ assurance protocols backed by DQ metrics and tools, e.g., for outlier detection or consistency checks. Most BDA programming languages, like R and Python, have specific packages for parallel and cluster computing. However, some data formatting and transformations, e.g., constructing minable tables from multiple-cardinality data, may still require manual processing. Lastly, data storage and access should comply with the General Data Protection Regulation (GDPR).
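Two of the DQ checks mentioned above, outlier detection and consistency checks, can be sketched as follows. The interquartile-range rule, the field names and the sample values are assumptions chosen for illustration, not a prescribed DQ protocol.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def inconsistent_records(records):
    """Return records whose discharge day precedes their admission day."""
    return [r for r in records if r["discharge_day"] < r["admission_day"]]


# Hypothetical body temperature readings with one implausible value.
temperatures = [36.5, 36.7, 36.8, 37.0, 37.1, 36.9, 41.9]
outliers = iqr_outliers(temperatures)

# Hypothetical hospital stays; the second one is temporally inconsistent.
stays = [
    {"stay_id": 1, "admission_day": 3, "discharge_day": 8},
    {"stay_id": 2, "admission_day": 12, "discharge_day": 10},
]
bad_stays = inconsistent_records(stays)
```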
1.4 STREAMING PROCESSING
Stream-processing big data in real time poses a challenge due to limited processing and memory capacities. These issues can be addressed to some degree via algorithms and/or specific hardware, through parallel computing and incremental stream processing.
Parallel or distributed computing technologies for big data such as Hadoop or Spark, which make use of programming models like MapReduce, can help with that task. Graphics Processing Unit (GPU) computing also provides powerful parallel processing for big data methods such as deep learning.
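The MapReduce idea can be sketched on a single machine as a word count. This is only a conceptual illustration with hypothetical function names; it does not use the Hadoop or Spark APIs, where the map and reduce phases would run in parallel across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data analytics", "big data streams", "data quality"]

# Each document could be mapped on a different worker; here they run in turn.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
```

Because every document is mapped independently and every key is reduced independently, both phases can be distributed without changing the result.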
Lastly, algorithms can be written in an incremental manner, so that past data is stored as a summary containing most of the information, rather than requiring all data samples to be kept.
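A classic example of such an incremental algorithm is the running mean and variance computed with Welford's method, where three numbers summarize all past samples. The class name below is hypothetical, and the sketch is one possible implementation rather than a reference one.

```python
class RunningStats:
    """Incremental mean and variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new sample into the summary; the sample is not stored."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for a sensor stream
    stats.update(reading)
```

Memory use stays constant regardless of how many samples arrive, which is the property that makes this style suitable for stream processing.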
1.5 CLUSTERING
Clustering is a branch of unsupervised learning aimed at discovering natural subgroups in data, such that data points show similar patterns within a subgroup and differ from those in other subgroups.
Clustering can be divided into several branches: distance-based, density-based and distribution-based clustering. Distance-based clustering algorithms group data points which are sufficiently close to each other and farther away from the other observations in the dataset.
Within the distance-based family there are two main sub-families: hierarchical clustering and partitional clustering. Hierarchical clustering creates a nested set of partitions of the data, represented as a dendrogram tree of relationships (e.g., agglomerative or divisive). Partitional clustering algorithms minimize a partition criterion by iteratively relocating objects into disjoint clusters until an optimal partition is achieved (e.g., K-means, fuzzy K-means, etc.).
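The iterative relocation performed by partitional methods can be sketched with a basic K-means. This is a minimal standard-library version under simple assumptions (Euclidean distance, random initial centroids taken from the data, a fixed seed); the toy points are invented for the example.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Basic K-means: returns (centroids, clusters) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no relocation: partition is stable
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

The partition criterion minimized here is the within-cluster sum of squared distances; the result is a local optimum that in general depends on the initial centroids.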
Density-based clustering groups data by finding contiguous regions of high density of observations isolated by contiguous regions of low density of observations (e.g. DBSCAN or OPTICS algorithms).
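The density-based idea can be sketched with a compact DBSCAN. This is a simplified version with a brute-force neighbor search and invented toy data; the parameter names `eps` and `min_pts` follow common usage, and production code would use an indexed implementation from an established library.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks low-density points."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force search: every point within eps of point i (itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


points = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # first dense region
    (10, 10), (10, 11), (11, 10), (11, 11),  # second dense region
    (5, 5),                                  # isolated observation
]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Unlike K-means, the number of clusters is not given in advance; it emerges from the density parameters.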
Finally, distribution-based models find the best parameters of an underlying statistical model assumed to describe the structure of the data (e.g., finite mixture models). Normally, the number of resulting clusters requires some manual or semi-automated parameter tuning.
Complementarily, fuzzy clustering approaches assign each point a degree of membership in each cluster. It is worth mentioning that distributed clustering implementations exist for some platforms, such as K-means for Hadoop.
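To make the notion of membership degrees concrete, the sketch below computes the memberships of one point given fixed cluster centers, using the membership formula of fuzzy c-means. The function name is hypothetical, and the fuzzifier `m = 2` is a common default assumed here rather than a value taken from the text.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Degree of membership of `point` in each cluster (the degrees sum to 1)."""
    distances = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center
    # (this sketch assumes the centers are distinct).
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


centers = [(0.0, 0.0), (4.0, 0.0)]
memberships = fuzzy_memberships((1.0, 0.0), centers)
```

A crisp assignment can still be recovered by taking the cluster with the highest membership, while the full vector keeps the information about how ambiguous the point is.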
Even nowadays, BDA requires much handcrafted expert work. Most of that work goes into data preparation and handling, but also into defining proper, realistic goals for the analyses.
BDA is evolving to reduce that manual effort. This will come from standardized procedures, automated data preparation and curation algorithms, and the transfer of knowledge between similar tasks.
To reduce the gap between data Volume and Value, DQ must be ensured from data generation through to the learning stages. As such, the quality of the data itself will be incorporated into learning algorithms. For example, Real World Data collected during routine use is currently being used in clinical research; it might replace dedicated, costly data acquisition processes and provide more realistic target populations, although at a cost in data quality and preparation.
The processing of large amounts of data will benefit from the newest computing paradigms, such as quantum computing or large clusters of GPUs, and algorithms should be adapted to them. In particular, the evolution of BDA is linked to that of Deep Learning algorithms, whose performance increases as the volume of data does.
But possibly one of the major revolutions will come from the information contained in the vast number of analyses and projects already developed. Such big meta-data analytics will have the potential to guide the transfer of knowledge between tasks, for both unsupervised and supervised aims. In fact, current unsupervised learning is starting to get the most out of transfer learning approaches, complemented with techniques such as autoregressive learning, or even the generation of new data from previous knowledge by means of generative adversarial methods.
3. USE CASES AND APPLICATIONS
A wide variety of use cases of Big Data technologies fall within this area. Among them, we could include:
- Data integration from different databases (clients, patients, etc.)
- Intelligent monitoring and quality control of manufacturing through sensors
- Anticipation of component failures
- Data curation for statistical analysis and customer segmentation
- Search for atypical cases (e.g. in security logs or medical records)
- Advanced dashboards with real-time analysis of trends and pattern detection
- Visualization, exploration and management of massive amounts of data (both volume and variety) dynamically