Machine learning with Python: An introduction
Machine learning is one of our most important technologies for the future. Self-driving cars, voice-controlled speakers, and face detection software all are built on machine learning technologies and frameworks. As a software developer you may wonder how this will impact your daily work, including the tools and frameworks you should learn. If you're reading this article, my guess is you've already decided to learn more about machine learning.
In my previous article, "Machine Learning for Java developers," I introduced Java developers to setting up a machine learning algorithm and developing a simple prediction function in Java. While Java's ecosystem includes many tools and frameworks for machine learning, Python has emerged as the most popular language for this field. Figure 1 shows the result of a recent Google Trends query combining the search term "machine learning" with "Python," "Java," and "R." Although this graph is not reliable from a statistical point of view, it does allow us to visualize the popularity of Python for machine learning.
In this article, you'll learn why Python is especially successful for machine learning and other uses involving data science. I'll briefly introduce some of the Python-based tools data scientists and software engineers use for machine learning, and suggest a few ways to integrate Python into your machine learning development process--from mixed environments leveraging a Java backend to Python-based solutions in clouds, containers, and more.
A use case for machine learning
To start, let's revisit the use case from my previous introduction to machine learning. Assume you're working for a large, multinational real estate company, Better Home Inc. To support its agents, the company uses third-party software systems as well as a custom-developed core system. This system is built on the top of a massive database containing historical data on sold homes, sale prices, and descriptions of available houses. The database is updated continuously by internal and external sources, and is used to manage sales as well as estimating the market value of properties for sale.
An agent may enter features such as house size, year of construction, location, and so on to receive the estimated sale price. Internally, this function uses a machine learning model--essentially, a mathematical expression of model parameters--to calculate a prediction. (Please see my previous article for a more detailed explanation of machine learning algorithms and how to develop und use them in Java.)
Listing 1. A machine learning model based on linear regression
double predictPrice(double[] houseFeatures) {
// mathematical expression (here linear regression)
double price = this.modelParams[0] * 1 +
this.modelParams[1] * houseFeatures[0] +
this.modelParams[2] * houseFeatures[1] +
...;
return price;
}
In Listing 1, a machine learning model is implemented using a linear regression algorithm, which is very popular in machine learning. The algorithm multiplies model parameters with the feature parameters for a given property and sums them up. As is typical in machine learning, a training process determines the parameter values to be used for the model. This approach is called supervised learning.
Supervised learning consists of feeding a system labeled example records, which are then analyzed for correlations. In this case, the system is fed historical house record features that have been labeled with the sale price. The model looks for correlations between features that have some impact on sale price, as well as the weight of these relationships. Model parameters are then adjusted based on the identified correlations and weights. This is how a machine learning model "learns" to estimate the price for a given house.
Listing 2. Training model parameters
void train(double[] houseFeatures, double[] pricesOfSale) {
//.. find hidden structures and determine the
// proper model parameters
this.modelParams =...
}
Challenges in machine learning
While the code example may appear quite simple, the challenge is to find and train the appropriate algorithm. In contrast to linear regression, which is relatively simple, most algorithms used for machine learning are more complex. Many machine learning algorithms require additional (hyper) parameters, which require a deeper understanding of the mathematics behind the algorithm.
Another challenge is finding and selecting appropriate training data. Data records have to be collected and understood, and collecting the records is not always easy. In order to build and train a price-prediction model, you must first locate a large number of sold house records. In order to be useful, you need not only the sale price but other features that help define the value of each house. In many cases, this means importing and consolidating from external as well as internal data sources. As an example, you might fetch house characteristics as well as the price of sale from an internal database storing sales transactions. For additional characteristics, you might call external partner APIs that provide information regarding the transport infrastructure or income levels for the given neighborhood.
Machine learning as a scientific process
Developing machine learning models is more similar to a scientific process than to traditional computer programming. A scientific process starts with a question, or an observation. For instance, you might observe that senior estate agents at Better Home Inc. are quite good at estimating the market price of a house. By interviewing these agents, you discover that they are able to quickly enumerate the features that determine the market value of a house. Furthermore, they're well versed in market conditions for different cities and regions. From this observation, you theorize that anyone could determine the market price of a house by combining historical sales data with key features of the property. Using this data, you could develop a machine learning model capable of estimating the sale price of a house. This feature would be of value to the company because it would enable inexperienced agents to determine the expected sale price of a new offer.
In order to test your thesis, you will need to acquire and explore the selected data sets. At this point, you are seeking an overview of the data structure. To get this overview, you will likely use tools such as Tableau, KNIME, and Weka, or even simple libraries like Python Data Analysis Library (pandas) or matplotlib. Before attempting to build your machine learning models, you will also need to prepare your data records by handling invalid or missing values. Once you've built your models, you will need to test and validate them in order to know whether your assumptions are true or false. You might, for example, validate whether the Better Home Inc. machine learning model is capable of estimating the proper sale price of a house. In general, data exploration, analysis, cleaning, and validation are the most time-consuming activities of machine learning.
The role of the data scientist
Data scientists are frequently responsible for the major tasks of a machine learning process. Most data scientists have a background in mathematics and statistics, but they are also typically proficient with programming and data modeling skills. Data scientists often have a strong understanding of data-mining techniques, which helps them to understand and select data sources, as well as gaining insight from the data. Careful data analysis helps teams choose the appropriate machine learning algorithms for a given use case.
In contrast to traditional software engineers, including enterprise Java developers, a data scientist is more focused on data and the hidden patterns in data. Data scientists typically develop, train, and process machine learning models using computing environments and data platforms implemented by traditional software engineers.
Understanding the role of data scientists in machine learning helps us understand why Python is the preferred language for this field. Unlike traditional software engineers, most data scientists prefer Python as a programming language. This is because data scientists are generally closer to scientific and research communities, where R and Python are widely used. Moreover, these communities have developed Python-based scientific libraries that make it easier to develop machine learning models. Now there is a growing, Python-based tools ecosystem specifically for machine learning. This ecosystem includes Jupyter Notebook, an interactive web-based Python shell, which is the current, de facto standard in the field of data science.
Jupyter Notebook extends a command-line Python interpreter with a web-based user interface and some enhanced visualization capabilities. It integrates code and output into a single web document that combines code, explanatory text, and visualizations. The inline plotting of the output allows immediate data visualization and iterative development and analysis. A notebook is used to explore data as well as to develop, train, and test machine learning models. As an example, a data scientist working for Better Home Inc. might use a notebook to load and explore available housing data sets, as shown in Figure 3.
Figure 3. Exploring data with Jupyter Notebook
A notebook in Jupyter consists of input cells and output cells. The editable input cells contain common Python code, which will be executed by pressing the key combination Ctrl+Enter. In the notebook shown in Figure 3, the second input cell is used to load a houses.csv file into a pandas dataframe. The dataframe provides utilities to manipulate and visualize data in an intuitive way. The third cell of the notebook contains a dataframe used to plot a histogram of house prices over time.
Data scientists use histograms and other charts and visualizations to understand data, and to identify outliers and inconsistencies in the data. Identifying inconsistencies and outliers is important because it allows you to sort through and resolve them in the data preparation process. This process eventually leads to clean data sets, which you can use to develop reliable machine learning models. You use the data sets to identify the features or house properties that are most relevant to the final sale price. These are the features that will define your machine learning model. Most algorithms aren't intelligent enough to automatically extract meaningful features from the full data set, and most algorithms won't work well if there are too many features to be analyzed.
I explained in my Java-based introduction to machine learning that logistic regression algorithms require numeric values. For such a machine learning model, all of your strings or category values must be converted to numeric values. The process of conversion is done during feature extraction. One way to extract features is to develop a dedicated function that converts the raw input of house records into a vectorized representation that the algorithm can understand.
Below is a simplified extract_core_features() method written in Python. If you are unfamiliar with Python, don't be confused by the self argument. In Python, the first argument of every non-static method definition is always a reference to the current instance of the class. On the caller side, this argument will be passed automatically by executing the method.
A significant portion of the machine learning code data scientists write is for feature extraction. In the field of natural language processing, for instance, several non-trivial conversion steps are required to transform human text into a vectorized form.[Source]-https://www.javaworld.com/article/3322898/machine-learning-with-python-an-introduction.html
Advanced level Python Certification Course with 100% Job Assistance Guarantee Provided. We Have 3 Sessions Per Week And 90 Hours Certified Basic Python Classes In Thane Training Offered By Asterix Solution