And Now For Something Completely Different. Sort of
The open-source language Python still has plenty of bite after all these years, so says your (data) analyst.
Everyone knows Frederick Winslow Taylor. Who? Frederick Winslow Taylor, a 19th-century American genius, considered the father of scientific management, who used analysis of data to identify productive and efficient ways to “work”.
Then there’s Guido van Rossum.
Van Rossum is a Dutch computer programmer with an apparently impish sense of humor. In the 1990s, he developed an open-source programming language and named it “Python” after the famous British comedy TV show “Monty Python’s Flying Circus.” Python’s easy-to-learn syntax and versatility proved invaluable to all kinds of business applications right from the start. (Heck, the social media company some guy named Zuckerberg still runs uses it, too.)
Fast-forward three decades, and “Guido van Rossum’s ‘Scientific Snake,’” you might call it, is still performing its magic act in areas such as web development, gaming and desktop applications. Many large companies (e.g., Spotify, Netflix, Facebook and Google) use Python to develop apps and platforms that are constantly changing the way we work and live.
But one area where Python’s versatility has recently emerged in a big way is large-scale, complex data analytics. When under the spell of data analytics experts, Python can be trained to snake its way through the overgrown jungle of numbers and stats that have enveloped companies and sort things out in ways that van Rossum could never have imagined.
Or maybe he could. You could ask him. He’s not dead yet.
Fetch the Comfy Chair
Companies today are sitting on more data in various formats relevant to all aspects of their business than ever before. That includes, for instance, operations, employee files, reporting requirements and prospective litigation. Looking on the bright side of life, storing all that data is no longer such an unmanageable expense.
Getting data to perform at its best is key, especially with so many processes and operations now automated. However, loading thousands or millions of disparate data records can get messy. The usual vehicle is a delimited text file, because those are relatively easy to use. In many instances, Excel is the go-to choice.
We all know how that goes. Users often enter data subjectively in different columns. Consistency, thy name is mud.
To make heads or tails of such files, you could create a macro that records the formatting steps for future use. Or perhaps employ Visual Basic coding to perform formatting duties. But one look at Visual Basic and you’ll want to run away! Python, on the other hand, excels at Excel (and at other data sources) because it can quickly read the files and convert them to comma-delimited text files with only a few lines of code.
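To give a feel for "a few lines of code," here is a minimal sketch that uses only Python's standard library. It assumes the worksheet has already been saved from Excel as tab-delimited text; the function name `to_comma_delimited` and the sample rows are made up for illustration. Reading an .xlsx workbook directly usually calls for a third-party package instead (pandas' `read_excel` followed by `to_csv` is one common route).

```python
import csv
import io


def to_comma_delimited(text, delimiter="\t"):
    """Convert delimited text to comma-delimited text, trimming stray whitespace."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip rows that are completely blank
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue()


# Hypothetical export: padded cells, a blank row, and a comma inside a value.
sample = "Name\tAmount\n Acme Corp \t 1,200.50\n\nGlobex\t300\n"
converted = to_comma_delimited(sample)
print(converted)
```

The `csv` writer quotes any value that itself contains a comma (such as `1,200.50`), so the output can be read back without columns shifting.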
Did we mention it’s free?
No One Expects the Spanish Inquisition
Beyond the simple tasks, Python’s cunning makes it a star in many other areas, such as those related to litigation and fact-finding. For instance, Python can draw on a vast sea of open-source code and pre-written libraries to perform a wide variety of actions without your having to write new code from scratch. Just plug and play and find the facts.
One of the most useful of these pre-written libraries is Beautiful Soup. A single serving enables you to parse HTML and XML documents, such as pages gathered via web scraping, and then pull relevant data for analysis.
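For a taste of what parsing HTML and pulling out the relevant data looks like, here is a rough sketch. To keep it runnable without installing anything, it swaps in the standard library's `html.parser` module rather than Beautiful Soup itself, which wraps the same kind of job in a friendlier interface. The `TableExtractor` class and the sample page are invented for illustration.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content standing in for a scraped document.
page = """
<table>
  <tr><th>Vendor</th><th>Invoice total</th></tr>
  <tr><td>Acme Corp</td><td>1200.50</td></tr>
  <tr><td>Globex</td><td>300.00</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(page)
parser.close()
print(parser.rows)
```

Each table row comes back as a plain list of strings, ready to be written to a delimited file or loaded into an analysis tool.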
Other Python usages include:
- Identifying potential fraud by quickly normalizing and cleansing financial data.
- Searching PDF documents combined with other manipulation and identification logic without having to put data through classic e-discovery processes.
- Creating standard and tailored client or regulator deliverables.
- Organizing messy semi-structured procedures to run across disparate data sets.
As van Rossum might say, “My gosh, it’s an attorney’s dream come true.” Or maybe not.
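The first item on that list, normalizing and cleansing financial data, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `normalize_amount` and `repeated_payments`, the amount formats handled and the sample records are all hypothetical, and a repeated payment is merely a prompt for a closer look, not proof of fraud.

```python
from collections import Counter
from decimal import Decimal


def normalize_amount(raw):
    """Turn strings such as '$1,200.50' or '(300.00)' into Decimal values."""
    text = raw.strip()
    # Accounting exports often write negatives in parentheses.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace("$", "").replace(",", "").strip()
    value = Decimal(text)
    return -value if negative else value


def repeated_payments(records):
    """Return (vendor, amount) pairs that appear more than once after cleaning."""
    cleaned = [
        (vendor.strip().lower(), normalize_amount(amount))
        for vendor, amount in records
    ]
    counts = Counter(cleaned)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical ledger rows entered inconsistently by different users.
records = [
    ("Acme Corp", "$1,200.50"),
    (" acme corp ", "1200.50"),
    ("Globex", "(300.00)"),
    ("Initech", " 45 "),
]
flagged = repeated_payments(records)
print(flagged)
```

The two Acme rows only match once the vendor name and the amount have been cleaned, which is the point: the duplicate is invisible in the raw data.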
If all that doesn’t convince you to adopt van Rossum’s pet project, then maybe this will:
It’s free. (Yes, we said that earlier. But “free” sounds lovely, doesn’t it?) It’s also widely supported by an active scientific community that frequently adds new libraries for public use. In fact, so many new modeling techniques are now available via Python libraries that data scientists can often devote more time to cleansing and normalizing data than to creating new libraries.
And, when asked in a 2018 survey to name the language they would recommend an aspiring data scientist learn first, 75 percent of respondents said Python [1]. A staggering 48 percent of data scientists with five years or less of experience rated Python as their preferred programming language [2].
When it comes to making Python your first choice for data scraping, organizing, analysis and the like, you really can’t argue with stats like those. Or maybe you can. Just please, not during tea time.
1: Programming languages most used and recommended by data scientists. Business Over Broadway. Jan. 13, 2019.
2: Why data scientists love Python. CBT Nuggets. Sep. 20, 2018.