HDSI Feature Article Series: I Am Data Science
with Data Science Librarian Stephanie Labou
by Trista Sobeck & Bobby Gordon
Stephanie Labou is the UC San Diego Data Science Librarian and deals with data and so much more. She works with a vast amount of valuable information and co-manages the UC San Diego Library’s Data & GIS Lab. In addition to running workshops on coding, data management, and software tools, she works with students, helping them understand and navigate datasets.
With a background in academic research, marine resource management, and biology, one could say that her evolution to working with data was just, well, natural.
“My academic background is in ecology and after graduate school, I had a research assistant position with an interdisciplinary research group,” she says. “I wrote code, managed very large datasets, and learned a bit about high-performance computing. What I did wasn’t quite data science, in the private sector definition, but the hands-on experience working with big data prepared me well for the position of Data Science Librarian.”
Stephanie says that she enjoys working with data, but what she enjoys most is helping others discover and learn new skills so they can work efficiently and effectively with their own data.
Her definition of data science is informed by the variety of data science projects she encounters across the UC San Diego campus. “I think of data science as a toolbox of methodologies that can be applied to data from any discipline and in any format,” she says. Because of her current role, she understands how wide-reaching the field of data science can be.
Right now, she’s working on a project about machine learning (ML) from a research data curation perspective. “Research data curators in libraries want to guide researchers on how to structure, document, preserve, and share research outputs,” she says.
“However, the field of ML is moving so quickly that formal research data curation guidelines for ML outputs are scarce. Our project is an investigation into where ML practitioners are sharing components of their ML projects – data, code, etc. – and what sorts of metadata they include, such as link to training data, references to GitHub repository, documentation on hyperparameters, and so on,” she explains.
Her end goal is to identify a set of best practices that enable the reuse of ML outputs. Think of it as making data science just a tad easier for others. Every little bit of info helps.
And for those just getting started with data science, Stephanie advises working through a personal data science project from start to finish. “This is a way to get a good sense of what data science looks like in practice: come up with a question that has a data science methodological solution; search for data; refine your question based on existing data; harmonize, clean, and wrangle data; then analyze, model, or predict; and finally visualize and communicate the results,” she says.
This hands-on diving-in-the-deep-end approach helps beginners discover where their interests lie under the big umbrella of ‘data science.’ Stephanie continues, “It doesn’t have to be something advanced like using AI to detect tumors in medical images – it can be making a recommender system based on your own Spotify data, or predicting baseball statistics, or anything else you’re interested in.”
Stephanie also notes that those new to data science will fail a lot. “Don’t get discouraged,” she encourages. “Your code won’t work, your predictive algorithm will perform poorly, and so on. That happens! Learning to be resilient and developing good problem-solving skills is a part of being successful in data science.”
What’s next on Stephanie’s data science radar? She’s closely following the research and discourse about the intersection of data science, privacy, and bias. “[I’m interested in] the use of facial recognition, which can be extremely biased against certain demographic groups; data surveillance and how it’s really hard to near impossible to truly de-identify data; and a whole host of other issues,” she says.
In addition to those big-ticket issues, Stephanie is also thinking about problems that live closer to home, like in a library. “I’m interested in the developments about copyright and AI, as well as copyright and rights clearances for text and data mining,” she says. There are a lot of topics to stay on top of in data science, Stephanie says. And she’ll be right there, explaining these topics, workshopping them with students, and putting them into practice.
Follow Stephanie on Twitter: @stephlabou