Tuesday, 3 February 2015

Can Big Data help research into language development?

Last Monday I attended an event at Cambridge University organised by the Cambridge Big Data Strategic Research Initiative (CBDRI) entitled "The vocabulary of Big Data". I am aware that I am not a specialist in this area and, just as the title of the event suggested, I went along to dip my toe into this increasingly useful approach to data analysis and to gain at least a basic vocabulary. Could embracing Big Data increase the possibilities for research into language development? It was a fair distance to travel from the South Coast of England, and I wasn't sure whether the event would be appropriate at all for me, an early career researcher with roots in clinical practice and one randomised controlled trial under my belt. I wasn't entirely sure what Big Data even meant. Perhaps it had something to do with astrophysics or the analysis of Twitter use, or perhaps there were other applications, such as language analysis, something I have used in my research. This was a free event shared on Twitter and, not one to miss an opportunity, I went along.

There were eight speakers in total, giving talks on Big Data from a range of perspectives. In this post I am going to focus on just a few of those talks and the key things I learned from the event.

First we were introduced to the concept of Big Data, and why it is relevant today and will be in the future. We live in a time when very large amounts of data are being produced, far more than ever before. Examples of where this data comes from included social networking, genome sequencing, brain imaging, images and text, among many others. It was also highlighted that data generation is predicted to keep growing: in five years' time, the data we have access to now will look like a drop in the ocean compared with what is being generated then. By acquiring a vocabulary of Big Data we can begin the journey of learning how to tap into and benefit from the data that is produced.

The first talk, by Professor Zoubin Ghahramani, was an introduction to machine learning. We were introduced to the vocabulary of machine learning, which stems from the fields of computer science and statistics. Machine learning is a way to make sense of and manage large amounts of data. A model is built from the data that is fed into the system. That model can then be updated as new data arrives, and it is used to make predictions about cases it has not seen before. There is a range of different approaches used in machine learning, including artificial neural networks, clustering and Bayesian networks, and these enable analysis and prediction in different ways. We also learned about different applications for machine learning. A well-known example is the company Netflix, which used machine learning to predict consumer preferences more successfully. Other applications include object or photo recognition, speech recognition and natural language processing. The benefit of machine learning is that, because it learns from data, it does not rely on a fixed, predesigned set of rules to start with.
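To make the idea of a model that is built from data and then used for prediction a little more concrete, here is a minimal sketch in Python. It is not taken from the talk: it fits a straight line to a handful of example points by least squares, which is about the simplest kind of learning there is. The function names `fit_line` and `predict` and all of the numbers are invented purely for illustration.

```python
def fit_line(xs, ys):
    """Learn a straight line y = slope * x + intercept from example points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares estimates of the slope and intercept.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use a learned model to predict y for an x it has not seen."""
    slope, intercept = model
    return slope * x + intercept


# Invented toy data, not real measurements.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

model = fit_line(xs, ys)          # build the model from the data
print(predict(model, 5.0))        # predict for a new input

# When more data arrives, the model is simply refitted on everything seen so far.
xs += [5.0, 6.0]
ys += [11.1, 12.9]
model = fit_line(xs, ys)
print(predict(model, 7.0))
```

Nothing about the straight line is written in by hand: the slope and intercept come entirely from the examples, and they shift when new examples are added. The approaches mentioned in the talk, such as neural networks, are far more flexible, but they follow the same pattern of fitting to data first and predicting afterwards.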

A real-world example of the application of Big Data was presented by Dr Richard Gibbens, who described how Big Data was used for road traffic modelling. He demonstrated how the large amount of information gained from motorway sensors was used to predict and manage traffic flow on Britain's motorways. He highlighted that this data was already being collected for another purpose and was therefore available, but by analysing it his department were able to provide the Highways Agency with valuable information about traffic flow, which is now contributing to road safety. Whilst traffic data isn't something we're likely to be mining in the field of language and communication, the case study showed that a Big Data approach can answer new questions using data that has already been generated for another purpose.

A problem with handling Big Data is just that: it is big! The issue of handling large amounts of data was addressed by Dr Anders Hansen, Dr Eiko Yoneki and Dr Jan Lellmann. Through their talks we were introduced to the storage and processing issues encountered when dealing with Big Data. We were introduced to the concept of data compression. We were shown two images of Earth, one with all the data and one with the data compressed. To the naked eye, it was impossible to see the difference between the two, and this highlighted the fact that most of the information held in a data sample may be gleaned from a small percentage of that data. We were shown how this approach can be used in brain imaging to provide a high level of focus on an area of interest, such as a lesion, without significantly increasing the amount of data processing. The ways in which large amounts of data are stored were also addressed. In part, Big Data can now be stored effectively thanks to the ability to spread it across multiple servers and cloud technology.
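The idea that most of the information can sit in a small percentage of the data can be sketched in a few lines of Python. This is not the method the speakers presented, only a toy illustration under stated assumptions: it re-describes a one-dimensional signal as a sum of cosine waves (a discrete cosine transform), keeps only the largest few wave strengths, and rebuilds the signal from those. The function names are invented, and the signal is deliberately constructed from three cosine waves so that it compresses perfectly; real images are only approximately like this.

```python
import math


def _scale(k, n):
    # Normalising factor that makes the transform exactly reversible.
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct(signal):
    """Describe the signal as strengths of n cosine waves (DCT-II)."""
    n = len(signal)
    return [
        _scale(k, n)
        * sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def inverse_dct(coeffs):
    """Rebuild the signal from the cosine-wave strengths."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi / n * (i + 0.5) * k)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]


def keep_largest(coeffs, fraction):
    """Zero out all but the largest `fraction` of the coefficients."""
    keep = max(1, int(len(coeffs) * fraction))
    ranked = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(ranked[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]


def compress(signal, fraction):
    """Round trip: transform, discard small coefficients, rebuild."""
    return inverse_dct(keep_largest(dct(signal), fraction))


# A toy 64-point signal built from three cosine waves (an assumption made
# so that the example compresses well; it is not real image data).
n = 64
signal = [
    3 + 2 * math.cos(math.pi / n * (i + 0.5) * 3) + math.cos(math.pi / n * (i + 0.5) * 7)
    for i in range(n)
]

approx = compress(signal, 0.10)   # keep only 10% of the coefficients
worst_error = max(abs(a - b) for a, b in zip(signal, approx))
print(worst_error)
```

Because this toy signal really is made of only three waves, throwing away 90% of the coefficients loses essentially nothing. With a photograph the reconstruction is not exact, but the same principle explains why a compressed image can look identical to the naked eye.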

The event ended with a case study of natural language processing presented by Dr Paula Buttery. She showed how algorithms can be used to extract information from large bodies of text, and how the syntax of language can be used to make predictions.
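As a very rough illustration of making predictions from text, the sketch below counts which word most often follows each word in a small set of sentences and uses those counts to guess the next word. This is a much simpler stand-in than the syntax-based methods described in the talk, and the function names and example sentences are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the word seen most often after `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented example sentences, not real transcript data.
sentences = [
    "look at the dog",
    "the dog is running",
    "where is the ball",
]

model = train_bigrams(sentences)
print(predict_next(model, "the"))    # most frequent follower of "the"
print(predict_next(model, "zebra"))  # a word the model has never seen
```

With three sentences the guesses are crude, but the counts become more informative as the amount of text grows, which is where large collections of transcribed or written language come in.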

This event really did give me a basic vocabulary of Big Data and an awareness of how it might be useful in language development research. Undoubtedly, Big Data approaches will already be employed in the field of genetic research and the neuroscience of language development. I believe Big Data may be employed in the same way to study environmental influences on language development. Having spent my last research project transcribing hours upon hours of parental talk to children, I am very interested in how we may embrace both new technologies of data capture and the discipline of Big Data analysis to make progress in this academic field.

