Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms
The real-world observations that we provide to a machine learning algorithm must be nominal, ordinal, interval, or ratio. Nominal and ordinal observations are not inherently numeric, so we need to convert them into numbers that are meaningful to the machine learning algorithm. Even if your algorithm does not require values to fall into a specific range, it is often a good idea to ensure that they do. This normalizes the values so they can be compared.
Why is normalization necessary? Consider making two observations, one of which is the daily volume of the NYSE and the other the point movement of an individual stock. The NYSE daily volume is usually in the billions, while the number of points moved by an individual stock is typically quite small. The volume number could easily overwhelm the point movement, making the point movement seem meaningless, or zero.
Normalization is something we see every day. One of its most common forms is the percentage. A given percentage might amount to a few dollars for a new cell phone or several hundred dollars for a car, yet the percentage figure itself stays the same. We could normalize the values in the NYSE example by expressing them as percentages. The volume figure is then comparable to the point figure.
In the next sections we will see how to normalize nominal, ordinal, interval, and ratio observations.

Normalizing Nominal Observations

There are two commonly used methods for normalizing nominal values. In this section, we will look at one-of-n encoding.
This is the simplest means of normalizing nominal values. Later in this chapter, we will discuss equilateral encoding. Equilateral encoding is more complex, but it is often more efficient than one-of-n encoding. One-of-n is a very simple form of normalization. For an example, consider the iris dataset that we saw in the last chapter.
A single line from the iris data set is shown here. The first four values are length and width measurements; the fifth value, the species, is nominal, which is sometimes called categorical: it describes a category. Because the species is non-numeric, we will have to use one-of-n normalization for it. The machine learning algorithm would most likely be trained to accept the four length measurements as inputs and then output three values that predict which species of iris the input data corresponds to. Normalization is used to create the training data.
A machine learning algorithm is trained with training data. The actual process of training will be covered later in this book. For now, you simply need to be aware that a training set is a collection of input vectors that includes the ideal output for each vector. We can use the iris data set to generate a training set. First, to generate the inputs, we will normalize each of the four ratio observations.
Later in the chapter we will see how to normalize ratio observations. For now, we will focus on the species. Generating the ideal output for each of these input vectors is relatively easy. For example, using a normalization range of -1 to 1, the Setosa iris species would be encoded as follows: 1, -1, -1. Likewise, Versicolor would be encoded as follows: -1, 1, -1. Finally, Virginica would be encoded as follows: -1, -1, 1.
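As a rough sketch, the encoding might look like this in Python (the species list and function name are my own; only the -1/1 scheme comes from the text):

```python
# One-of-n encoding for the iris species: the selected category gets the
# high value, every other slot gets the low value.
SPECIES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

def one_of_n(species, low=-1.0, high=1.0):
    encoded = [low] * len(SPECIES)
    encoded[SPECIES.index(species)] = high
    return encoded

print(one_of_n("Iris-setosa"))      # [1.0, -1.0, -1.0]
print(one_of_n("Iris-versicolor"))  # [-1.0, 1.0, -1.0]
print(one_of_n("Iris-virginica"))   # [-1.0, -1.0, 1.0]
```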
Normalizing Ordinal Observations

Ordinal observations are not necessarily numeric. However, they do have an implied ordering. For example, consider the levels of education a student progresses through in the United States. One moves from the preschool level to the senior level before graduating. The levels have no true numerical value, but they do occur in a set order. In order to normalize an ordinal set, one must preserve that order. One-of-n encoding loses the order.
While the outputs form a vector, and it may appear that there is an order, there really is none, because most machine learning algorithms treat the individual outputs as unordered.
Just because output 1 and output 2 are next to each other does not mean they are in any way related. Implying such an ordering might introduce bias. To normalize ordinal observations, there are two options. First, you can simply normalize with one-of-n encoding and discard the order. It could be that the order is not important, in which case you can simply treat the observation as an unordered nominal data set.
However, if you wish to preserve the order, you need to assign a whole number to each category, starting with zero, and convert that number into a percentage of the way through the ordering. There are fourteen total categories in the education example. If we are normalizing to the range -1 to 1, we then apply the percentage to that range. To do so, calculate the width of the range, which is the high value of the range minus the low value, and simply multiply the percentage by the width.
Adding the low value gives the normalized number; if we had a percentage of zero, we would be exactly at the lower bound. Sixth grade thus normalizes to a value near 0. The next two normalization techniques will build on this concept. For example, the next section shows how to reverse this process and denormalize a normalized ordinal observation.

Denormalizing Ordinal Observations

Denormalization is the opposite of normalization: it allows you to take a number that was normalized and convert it back to the original value.
This is very useful when processing the output from a machine learning algorithm. Because the algorithm was trained with normalized data, it will also output normalized data. You need to denormalize this output so that it will have meaning in the context of the knowledge that you hoped to achieve.
I will now show you how to denormalize the grade level from the previous section, which is essentially the process of running the normalization steps in reverse.
We can determine the percentage by calculating how far into that width the normalized value lies. We know the total width, because it is the range between the normalized lower and upper bounds, -1 and 1. In order to denormalize, you must know the range into which the data were originally normalized.
We have thus denormalized the observation. The whole process can be summarized in a single equation: take the normalized value, subtract the normalized low, divide by the normalized width, multiply by the original data width, and add the original data low.
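To make the ordinal steps concrete, here is a small Python sketch. The list of grade levels is assumed for illustration, and dividing by the category count minus one is one common convention; the book's own example may count the categories slightly differently.

```python
GRADES = ["preschool", "kindergarten", "1st", "2nd", "3rd", "4th", "5th",
          "6th", "7th", "8th", "9th", "10th", "11th", "12th"]

def normalize_ordinal(category, categories, low=-1.0, high=1.0):
    index = categories.index(category)        # whole number, starting at zero
    percent = index / (len(categories) - 1)   # how far through the ordering
    return percent * (high - low) + low       # apply the percent to the width

def denormalize_ordinal(value, categories, low=-1.0, high=1.0):
    percent = (value - low) / (high - low)    # how far into the normalized width
    return categories[round(percent * (len(categories) - 1))]

n = normalize_ordinal("6th", GRADES)
print(n, denormalize_ordinal(n, GRADES))      # a value near 0, then "6th"
```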
Normalizing Quantitative Observations

Interval and ratio observations are normalized in the same way. We simply look at the range from which either of the quantitative observation types is drawn and normalize it to the desired range. The qualities that make an interval observation different from a ratio observation are irrelevant for normalization. Quantitative observations are already numeric, so they may not need to be normalized at all. While many machine learning algorithms do require that numeric data be normalized to a specific range, some do not. It is important to know the numeric ranges required by the algorithm you are using in order to receive relevant output.
For the purposes of this example, I will estimate a low and a high value for the weight of a car in kilograms. As mentioned in Chapter 1, the guessed low should be lower than the real low and the guessed high should be higher than the actual high. I would like to normalize this range to between -1 and 1. Plugging in the values gives dataHigh and dataLow equal to the estimated high and low weights, a normalizedHigh of 1, and a normalizedLow of -1. We can now normalize the data.
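A rough sketch of this range normalization, and its reverse, might look like the following. The car-weight bounds are placeholder estimates of my own, not values from the book; the variable names follow the text.

```python
data_low, data_high = 100.0, 4000.0          # assumed estimated weight range (kg)
normalized_low, normalized_high = -1.0, 1.0

def normalize(x):
    return ((x - data_low) / (data_high - data_low)) \
        * (normalized_high - normalized_low) + normalized_low

def denormalize(n):
    return ((n - normalized_low) / (normalized_high - normalized_low)) \
        * (data_high - data_low) + data_low

w = 2000.0
print(normalize(w))                # a mid-range weight lands near the middle of -1..1
print(denormalize(normalize(w)))   # recovers 2000.0
```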
We need to calculate two ranges, or widths: dataHigh to dataLow and normalizedHigh to normalizedLow. First, we determine how far into the dataRange the value to be normalized lies. We then calculate how far into the normalizedRange that same percentage takes us, and add normalizedLow.

Denormalizing Quantitative Observations

To denormalize a quantitative observation, the normalization process is reversed.
First, we calculate how far the normalized value is from the lower bound, as a percentage of the normalized width. Multiply this by the dataRange. To convert this back to an actual weight, just add dataLow. The result is not exactly equal to the original because I rounded on some of the divisions. The above process can be summarized in a single denormalization equation. Other normalization methods exist as well; these will be covered in the next section.

Other Methods of Normalization

There are many other means of normalizing observations.
The means already presented are the most common. This section will introduce some other normalization methods for both quantitative and nominal observations.
Reciprocal Normalization

In this section we will look at a very simple means of normalization: reciprocal normalization. This method supports both normalization and denormalization. However, reciprocal normalization is limited in that you cannot normalize into a range of your choice. Rather, it always normalizes to a number in the range between -1 and 1. Reciprocal normalization is very easy to implement.
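A minimal sketch of the method, assuming my own function names:

```python
def reciprocal_normalize(x):
    return 1.0 / x            # e.g. 5 becomes 0.2

def reciprocal_denormalize(n):
    return 1.0 / n            # the same operation reverses it: 0.2 becomes 5

print(reciprocal_normalize(5.0))                          # 0.2
print(reciprocal_denormalize(reciprocal_normalize(5.0)))  # 5.0
```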
It requires no analysis of the data to determine high and low data values.

Reciprocal Denormalization

It is very easy to denormalize a number that has been normalized reciprocally. Because the operation is a reciprocal, denormalization uses the same equation. We normalized 5 to 0.2; taking the reciprocal of 0.2 returns 5.

Understanding Equilateral Encoding

Equilateral encoding is a potential replacement for one-of-n encoding. I must admit that I am rather fond of this encoding method. Equilateral encoding brings two main features to the table.
The first is that equilateral encoding requires one fewer output than one-of-n encoding. This means that if you have ten categories to encode, one-of-n will require ten outputs while equilateral will require only nine.
This gives you a slight performance boost. The second feature is slightly more difficult to understand. Most training algorithms score the output of a machine learning algorithm based on the incorrectness of each output. Consider a set of categories encoded with one-of-n, one output per category. When a prediction is wrong, the incorrectness is centered primarily on two outputs. Recall that one-of-n specifies the selected category by which output has the highest value.
The two outputs primarily involved in an incorrect answer are the output that mistakenly had the highest value and the output that should have had the highest value. This can cause a small problem for the one-of-n normalization method.
If the algorithm predicted a versicolor iris when it should have predicted a virginica iris, the ideal and actual outputs would be as follows: Ideal output: -1, -1, 1. Actual output: -1, 1, -1. The problem is that only two of the three actual outputs are incorrect. Equilateral encoding is designed so that every output shares in the error; this ensures that any training correction is applied across all outputs. To do this, a unique set of values for each class must be determined.
Each set of values should have an equal Euclidean distance from the others. We will see more about Euclidean distance later in this chapter. The equal distance makes sure that incorrectly choosing Iris setosa for Iris versicolor has the same error weight as choosing Iris setosa for Iris virginica. These encodings also show the decreased number of outputs provided by equilateral encoding relative to one-of-n encoding: equilateral encoding always requires one fewer output than one-of-n encoding would.
Look again at the example, this time with equilateral normalization. Just as before, consider whether the algorithm had predicted a versicolor iris when it should have predicted a virginica iris.
The ideal and actual outputs would each be one of the two-value equilateral encodings, and now all outputs are producing incorrect values. Additionally, there are only two outputs to process, which slightly decreases the amount of data processed by the machine learning algorithm. Algorithms will rarely give output that exactly matches any of their training values. To deal with this in one-of-n encoding, we simply look at which output was the highest.
This method does not work for equilateral encoding. For equilateral encoding, we determine the class whose equilateral values have the shortest distance to the calculated output. What is meant by the assertion that each of the sets is equal in distance from the others? It means that their Euclidean distance is equal. The Euclidean distance between two vectors is the square root of the sum of the squares of the differences between their elements. Distances will be covered in greater detail in the next chapter.
Implementing Equilateral Encoding

I will now show you how the equilateral encodings are calculated (Guiver). The equilateral algorithm can be a bit confusing, so I will demonstrate it in two different ways. First, I will demonstrate it graphically. We will normalize to the range -1 to 1. Consider if we were to simply encode two categories.
Equilateral encoding requires one output less than the number of categories. For two categories, only one output is given. We can think of the single output as a single dimension. We would have a single output of either 1 or -1. If we had three categories, there would be two outputs. Think of this as two dimensions; the three encodings form an equilateral triangle. This is where the name equilateral encoding comes from. This allows the three categories to be represented in two dimensions.
The coordinates of the triangle's corners match the encodings used for three categories. We can also encode four categories. To encode four categories, we need three outputs, or three dimensions. The four encodings form a triangle-based pyramid, a tetrahedron. Just as with the equilateral triangle, the lengths of all of its edges are equal.
We can also encode much larger numbers of categories. I cannot show these graphically, because it is difficult to display a high-dimensional figure in our three-dimensional world, let alone on a two-dimensional book page. Hopefully the graphical representations above show how the categories are arranged equidistantly from one another.
Now I will show you how these numbers are actually calculated. The equilateral encoding algorithm is usually implemented to output a matrix. The matrix is N by N-1, where N is the number of categories. Each row contains the encodings for one category. We will always create this matrix for normalization to the range -1 to 1.
We will ultimately scale the completed matrix to whatever range we desire. To begin creating this matrix, we first seed it with the values for a two-category encoding: a single output of -1 for the first category and 1 for the second.
We begin by seeding these two values into the matrix, then loop from 2 up to, but not including, N; the first cases are skipped because the seed already covers them. We recursively build each successive matrix from the previous matrix: we seed for two categories, then scale that result to build the matrix for three categories, and so on. Finally, we rescale the result using the same range-normalization formula shown earlier, using -1 and 1 as the low and high range of the seed data. If you wish to encode a category to equilateral, simply use the matrix as a lookup table and copy the row that corresponds to the category you are encoding.
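The following Python sketch builds the matrix as described; it follows the commonly published form of the algorithm, so treat it as an illustration rather than the book's exact listing. Decoding by nearest row is included as well.

```python
import math

def equilateral_matrix(n, low=-1.0, high=1.0):
    """N rows (one per category) by N-1 columns (one per output)."""
    result = [[0.0] * (n - 1) for _ in range(n)]
    result[0][0] = -1.0           # seed: the two-category case
    result[1][0] = 1.0
    for k in range(2, n):         # build each successive matrix from the last
        r = float(k)
        f = math.sqrt(r * r - 1.0) / r
        for i in range(k):        # scale what we have so far
            for j in range(k - 1):
                result[i][j] *= f
        for i in range(k):        # push the existing rows away from the new one
            result[i][k - 1] = -1.0 / r
        result[k][k - 1] = 1.0    # the new category's row
    for row in result:            # rescale from -1..1 to the requested range
        for j in range(n - 1):
            row[j] = (row[j] + 1.0) / 2.0 * (high - low) + low
    return result

def equilateral_decode(matrix, output):
    """Pick the category whose row is closest (Euclidean) to the output."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(matrix)), key=lambda i: dist(matrix[i], output))

m = equilateral_matrix(3)                   # three iris species, two outputs each
print(m)
print(equilateral_decode(m, [0.1, 0.9]))    # nearest row wins
```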
To decode, simply find the matrix row that has the lowest Euclidean distance to the output vector from your machine learning algorithm.

Chapter Summary

This chapter described several normalization processes. Normalization is the process by which data are forced to conform to a specific range. The range you choose usually depends on the machine learning algorithm you are using.
This chapter covered several different types of normalization. Reciprocal normalization is a very simple technique that normalizes numbers to the range -1 to 1: it simply divides 1 by the number to be normalized. Range normalization is more complex.
However, range normalization allows you to normalize to any range you like. Additionally, range normalization must know the range of the input data. While this does allow you to make use of the entire normalization range, it also means that the entire data set must be analyzed ahead of time. This chapter also showed how Euclidean distance can be used to determine how similar two vectors are to one another. The next chapter will expand upon the concept of distance and introduce additional distance metrics.
In real life, distance measures the degree of separation between two points. In AI, distance is used to calculate the similarity of two vectors. For AI, think of a vector as a one-dimensional array; the distance between two such arrays is a measure of their similarity.

Understanding Vectors

A vector is essentially a one-dimensional array. Do not confuse the dimensionality of the vector array with the dimensions of your problem.
Even if your problem had ten inputs, you would still have a single vector. Vectors are always one-dimensional arrays.
Your ten inputs would be stored in a vector of length ten. In AI, a vector is usually used to store observations about a particular instance of something. This maps to the real world concept of distance quite well. You can think of a point on a sheet of paper as having two dimensions, which are usually referred to as x and y.
Likewise, a 3D point can be stored in a vector of length three. Time is sometimes treated as a fourth dimension; however, that construct is a manifold, and it does not imply that time is a true dimension in the same sense as the first three.
Because higher dimensions are imperceptible to humans, it is very difficult for us to comprehend spaces of more than three dimensions. Very high dimensional spaces are very common in AI, however. Consider the iris data set again: each of its five features can be thought of as a dimension. However, the species feature must be handled differently than the other four.
Vectors typically contain only numbers. The first four features are inherently numerical, but species is not. As demonstrated in chapter two, there are several ways to encode the species observation to additional dimensions.
Only simple numeric encoding translates the iris species to a single dimension. We must use additional dimensional encodings, such as one-of-n or equilateral, so that the species encodings are equidistant from each other. If we are classifying irises, we do not want our means of encoding to induce any biases. Thinking of the iris features as dimensions in a higher dimensional space makes a great deal of sense.
You can think of the individual samples, the rows in the iris data set, as points in this search space. Points closer together are likely to share similarities. The next few sections will describe several different methods to calculate the distance between two vectors.
Calculating Vector Distance

The distance between two vectors tells us their degree of similarity. There are several different ways to calculate this vector distance.

Euclidean Distance

The Euclidean distance measurement is based on the real, two-dimensional distance between two points. That is, it is the distance between the two points if you drew them on paper and measured with a ruler. This two-dimensional distance is based on the Pythagorean Theorem.
Specifically, if you had two points (x1, y1) and (x2, y2), the distance between them would be sqrt((x2 - x1)^2 + (y2 - y1)^2). However, most vectors are longer than two numbers. To calculate the Euclidean distance for any sized vector, use the general form of the Euclidean distance equation.
The Euclidean distance measurement is used often in machine learning. Consider three vectors, named a, b, and c. If the Euclidean distance between a and b is smaller than the Euclidean distance between a and c, then the contents of a more closely match b than they do c. The general equation (Deza) is d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2). The equation also states that d(p, q) is the same as d(q, p). This simply means that the distance is the same no matter which end you start from.
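A short sketch of this calculation in Python (the three example vectors are made up):

```python
import math

def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

a = [1.0, 2.0, 3.0]
b = [1.5, 2.5, 3.5]
c = [10.0, 20.0, 30.0]
print(euclidean_distance(a, b))   # small: a and b are similar
print(euclidean_distance(a, c))   # large: a and c are not
print(euclidean_distance(a, b) == euclidean_distance(b, a))  # True: d(p,q) == d(q,p)
```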
Calculating the Euclidean distance is no more than summing the squares of the differences of each array element and then taking the square root of that sum. That square root is the Euclidean distance. Another common measure is the Manhattan distance, which sums the absolute values of the element differences rather than their squares (Krause). The two measures can rank pairs of vectors differently. For example, when using Euclidean distance, the distance between two vectors that differ by one unit in each of two dimensions (the square root of two) is less than the distance between two vectors that differ by two units in only one dimension (two), whereas both would be equal to two using Manhattan distance.
Manhattan distance is described by a similar equation, with absolute values replacing the squares and square root. A third measure, the Chebyshev distance, takes the largest single-dimension difference between the two vectors. When all dimensions are either normalized or in approximately the same range and the worst dimension governs the similarity between the two vectors, the Chebyshev distance is the best to use. A common demonstration of vector distance is recognizing hand-drawn characters, and most of these example programs are of the same form.
You draw a series of characters, and a very complex machine learning algorithm (often a neural network) learns your characters and can recognize new ones. The program allows you to draw individual characters and add them to a list of known characters. The characters you draw are images, which are a popular source of input for AI. This section describes how to normalize an image using down sampling.
There are more advanced methods than this, but down sampling is often effective (Lyons). The first step is to take a raw image that you want the program to recognize. A real OCR package would have to process the image and determine its individual characters. For this example, we will simplify the process by only using a single character. Characters can be drawn at different sizes and positions, and this could present a problem: the machine learning algorithm will be looking at a grid of pixels as inputs.
If the same character were drawn in a different position, the algorithm would be unlikely to recognize it, because the pixels would arrive on different inputs. Because of this, it is necessary to crop. To crop, just drop lines from the top, left, right, and bottom until they touch a pixel. While doing so, we will also down sample. This gives us fewer pixels to review and resolves issues related to size. Why do we need to decrease the number of pixels? Consider a full-color image that is 300 x 300 pixels.
We would have 90,000 pixels times the three RGB colors, giving 270,000 total values. Each image is a vector, and a vector of that size would be very large! To make the program more robust, it is necessary to decrease the number of inputs. To do so, we overlay a coarse grid on the cropped image; for each grid cell, we make the entire cell black if even one underlying pixel is black.
Because we have a 7x5 grid, we can create a vector of size 7x5, or 35 inputs. We create a vector for each digit and store its cropped, down sampled form in a table. When the user draws new characters to recognize, we crop and down sample them in the same way. A new vector is created for each image. We then find the digit in our table with the least distance from our new image and determine the new image to be the same digit as the table entry with the lowest distance.
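A sketch of the crop, down sample, and nearest-match steps follows. Images are assumed to be 2-D lists of 0 (white) and 1 (black) with at least one black pixel; the 7x5 grid follows the text, and the rest of the naming is my own.

```python
import math

GRID_ROWS, GRID_COLS = 7, 5

def crop(image):
    """Drop empty rows and columns from the edges until a black pixel is hit."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]

def down_sample(image):
    """Reduce the cropped image to a 7x5 vector: a cell is black if any pixel is."""
    image = crop(image)
    h, w = len(image), len(image[0])
    vector = []
    for gr in range(GRID_ROWS):
        for gc in range(GRID_COLS):
            r0 = gr * h // GRID_ROWS
            r1 = max((gr + 1) * h // GRID_ROWS, r0 + 1)
            c0 = gc * w // GRID_COLS
            c1 = max((gc + 1) * w // GRID_COLS, c0 + 1)
            black = any(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
            vector.append(1.0 if black else 0.0)
    return vector                      # length 35 (7x5)

def recognize(image, known):
    """known maps each stored character to its 7x5 vector; return the closest."""
    v = down_sample(image)
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(known, key=lambda ch: dist(known[ch], v))
```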
Chapter Summary

Vectors are a very important component of machine learning. Both the input and the output of a machine learning algorithm are in the form of vectors, and the memory of the algorithm is typically thought of as a vector as well. Vectors can also be considered as coordinates in an n-dimensional space. This allows us to compare the similarity of two vectors by computing the distance between them.
There are many different ways to calculate the distance between two vectors. One of the most basic is the Euclidean distance, which applies the regular distance formula for two-dimensional points to any number of dimensions. Manhattan distance and Chebyshev distance may also be used. This chapter also presented a simple character-recognition example. We begin by creating a table of known characters and a table of vectors to represent those characters. These vectors are created by cropping and down sampling the character images.
New images are cropped and down sampled as well. We then determine the new character to be the known character whose stored vector has the least distance to the new vector. Random numbers are a very important concept in machine learning. We will often use a random vector as the initial state of a machine learning algorithm. This initial random state is then refined during training.
Random numbers can also be used in Monte Carlo techniques for training. The next chapter will cover random numbers. When performed by computers, random number generation is called pseudorandom number generation (PRNG). True randomness is difficult to achieve with computer generated numbers, because a computer is a completely logical machine that follows instructions and can only simulate randomization. Given exactly the same inputs and internal state, a computer will always produce exactly the same outputs.
Despite these logical limitations, computers can be very effective at pseudorandom number generation (Turing). An algorithm can produce very random-looking numbers without being cryptographically secure. For AI, we care primarily about the randomness of an algorithm; security is more important for an encryption algorithm.
In this context, a period is the number of random numbers that a PRNG can produce before it begins to repeat the sequence. Each period is essentially a string of numbers, and a PRNG simply produces that same string over and over. The larger the period, and the fewer repeating sequences evident inside a period, the more random a generator is. A cryptographically secure generator must also hide its internal state: if you know the internal state of the generator, you will know what the next random number will be.
This distinction is critical for cryptology, but it is not as important for AI. All PRNGs share a few common concepts that define how the PRNG performs and what its expected output will be: the seed, the internal state, and the period. The seed defines the sequence of random numbers you will get, as well as the initial internal state. You should always get the same sequence of random numbers for the same seed, and nearly every seed should produce a different sequence. The internal state comprises the variables that the PRNG uses to produce both the random numbers and the next internal state.
The length of each random number sequence is the period. Once the period is up, the PRNG will repeat itself. That is why the PRNG is considered a periodic function: it repeats its values at regular intervals, or periods.

Random Distribution Types

You will usually want your random numbers to be uniformly distributed. PRNGs typically provide a random number between 0 and 1, with equal probability of getting any particular number in that range.
If you generate a large number of random numbers between 0 and 1 and plot how often each value occurs, the counts come out roughly equal across the range. This is called a uniform distribution, or a uniform random number. It provides equal probability of getting any value within the specified range. Most programming languages provide a means to generate uniformly distributed random numbers between 0 and 1.
You can scale this random number out to whatever range you desire. This works very similarly to normalization.
If the function rand returns a random number between 0 and 1, then the scaled random number is simply low + rand() * (high - low). Some programming languages also provide a means to generate a normally distributed random number. A normal distribution is centered on a mean of 0, and there is really no defined upper or lower bound. Each whole number away from the mean represents a standard deviation. As the standard deviations increase on both the positive and negative sides, the probability of getting a random number decreases greatly. Beyond 4 or -4, it is very rare to get a number.
Normally distributed random numbers are often useful when you want to vary a number by a small random amount. Not all programming languages support normally distributed random numbers, however.
The functions that support these distributions vary by language. In C#, for example, Random.NextDouble provides uniformly distributed random numbers, and a Box-Muller transformation, discussed at length later in this chapter, can be used to obtain normally distributed numbers.

Roulette Wheels

Another popular random number technique is called the roulette wheel (Back). This technique bears only a slight resemblance to an actual roulette wheel, as seen in a casino.
Roulette wheels are useful when you would like to choose among three or more categories. For example, consider creating a robot to randomly explore a grid, where the robot can perform only the following three actions: move forward, turn left, or turn right. Although the robot moves randomly, you might not want an even distribution among these three actions. You must order the choices so that they each take a share of the 0 to 1 range in which a uniform PRNG generates numbers.
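As a sketch, with made-up probabilities (the text only says the actions need not be equally likely):

```python
import random

ACTIONS = [("move forward", 0.8), ("turn left", 0.1), ("turn right", 0.1)]

def roulette(actions):
    x = random.random()           # uniform number between 0 and 1
    cumulative = 0.0
    for name, probability in actions:
        cumulative += probability
        if x < cumulative:        # x fell into this action's slice of the wheel
            return name
    return actions[-1][0]         # guard against floating-point round-off

print(roulette(ACTIONS))
```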
If x is the random number, each action claims a slice of the 0 to 1 range, and we select whichever action's slice x falls into; comparing against a running cumulative probability is needed because each slice begins where the previous one ends. Often, there is a tradeoff between execution speed and randomness, and some algorithms are simply better than others. In this section, we will review some of the PRNG algorithms. Finally, we will review the Box-Muller transformation.
Performance is another very important consideration for PRNG selection. Most AI algorithms require a very large quantity of random numbers. Random number generation can thus be a very important consideration that affects the overall efficiency of the algorithm.
It is not critical that you understand how all of the PRNG algorithms function internally. If your programming language does not use the algorithm you desire, the algorithm can be implemented separately in that language.

Linear Congruential Generator

The linear congruential generator (LCG) is one of the oldest and simplest PRNG algorithms (Knuth). LCGs should not be used for applications where high quality randomness is needed, and an LCG is not typically useful for a Monte Carlo simulation because of its serial correlation.
Serial correlation is the relation of a variable to itself over time. This means LCG random numbers are not high quality, nor are they suitable for cryptographic applications. LCG is straightforward to implement and understand.
It is implemented through a linear function reduced modulo a constant, which gives it a defined period. The LCG equation is: next = (a * seed + c) mod m. For LCG, the current seed is the internal state, and the next seed becomes the next internal state. The generated random numbers are integers; they can be converted into the 0 to 1 range by dividing by the maximum integer that the algorithm can produce, which depends on the settings of m, a, and c. The values chosen for m, a, and c have a great impact on the randomness of the numbers generated by the PRNG.
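A sketch of an LCG in Python, using the widely published ANSI C parameter set; any other well-studied set of constants works the same way:

```python
class LCG:
    def __init__(self, seed=1):
        self.m = 2 ** 31         # modulus: bounds the period
        self.a = 1103515245      # multiplier
        self.c = 12345           # increment
        self.state = seed        # the internal state is simply the last value

    def next_int(self):
        self.state = (self.a * self.state + self.c) % self.m
        return self.state

    def next_double(self):
        # divide by the largest value the generator can produce to get 0..1
        return self.next_int() / (self.m - 1)

gen = LCG(seed=42)
print([gen.next_double() for _ in range(3)])
```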
Typically, you should not choose values of your own for m, a, and c. Much research has gone into finding optimal values, and Wikipedia has a very good summary of the values used by various PRNGs.

Multiply With Carry

The multiply-with-carry (MWC) generator was introduced by George Marsaglia (Marsaglia). It uses an initial seed set of from two to many thousands of randomly chosen values. The main advantages of the MWC method are that it invokes simple computer integer arithmetic and leads to very fast generation of sequences of random numbers.
MWC has immense periods, far larger than those of LCG. Assuming 32-bit registers, LCG uses only the lower 32 bits of the multiplication in its equation.
MWC makes use of the higher bits of the multiplication through a carry. Additionally, multiple seed values are used; we must provide a number of seed values equal to the lag r. Like the LCG equation, MWC multiplies a previous value by a constant, but there is no increment in this case. The random integers are generated as x(n) = (a * x(n - r) + c(n - 1)) mod m, where the additional variable c represents the carry. The carry itself is calculated as c(n) = floor((a * x(n - r) + c(n - 1)) / m). It is important that n always be greater than r. We use the floor of this quotient as the carry.
The floor of a number is the largest integer that is less than or equal to that number. For example, the floor of 7.5 is 7; however, the floor of -7.5 is -8, and the floor of -7 is simply -7. The reason we use the floor operator here is that these operations are typically carried out on integer variables, where integer division discards the fractional part automatically; the floor operator makes the equation exact when written with real numbers. The code examples for MWC do not make use of a call to the floor function.
MWC is an improvement over LCG, but it is not a commonly used generator; it is not the default random number generator for any computer language that I am aware of.

Mersenne Twister

The Mersenne Twister is the most widely used of the generators discussed here. It provides for fast generation of very high quality pseudorandom integers.
The Mersenne Twister was designed specifically to address many of the flaws found in older algorithms. The Mersenne Twister algorithm creates high quality random numbers suitable for Monte Carlo simulations. Mersenne Twister is not a cryptographically secure generator.
However, its execution speed and high quality randomness make it a very attractive generator for AI. The name Mersenne Twister comes from the fact that the period of the Mersenne Twister is always chosen to be a Mersenne prime number. A prime number is any number that can only be divided evenly by itself and one; for example, 5 is a prime number. A Mersenne prime is a prime number of the form M = 2^n - 1. Lists of the known Mersenne prime numbers are readily available online.
The code examples do contain an implementation of the Mersenne Twister based on the original C implementation provided by Matsumoto. One of the more tricky aspects of implementing the Mersenne Twister is that it relies on bit shifts. This can be tricky for languages that do not support unsigned numbers. For other languages, it is important to make sure that the variable sizes match the original C code. Box Muller Transformation Not all programming languages support the generation of random numbers in a normal distribution.
You might also be using a custom PRNG not provided by your programming language. There is an algorithm that can transform uniformly distributed random numbers into normally distributed ones. This algorithm is called the Box-Muller transformation (Box). The procedure below transforms regular uniform random numbers into normal random numbers.
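Here is that procedure sketched in Python, using the polar form of the transformation; the variable names x1, x2, w, y1, y2, and useLast match the description that follows.

```python
import random
import math

_use_last = False
_y2 = 0.0

def random_normal():
    """Return a normally distributed random number via polar Box-Muller."""
    global _use_last, _y2
    if _use_last:                 # a second value from the previous pair is waiting
        _use_last = False
        return _y2
    while True:
        x1 = 2.0 * random.random() - 1.0    # scale 0..1 into -1..1
        x2 = 2.0 * random.random() - 1.0
        w = x1 * x1 + x2 * x2
        if 0.0 < w < 1.0:                   # reject pairs whose w is >= 1
            break
    w = math.sqrt((-2.0 * math.log(w)) / w)
    y1 = x1 * w
    _y2 = x2 * w
    _use_last = True
    return y1

print([round(random_normal(), 3) for _ in range(5)])
```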
The Box-Muller transformation generates numbers two at a time (Ross). To keep track of the pair, we place the two numbers into y1 and y2, and we use the variable useLast to track whether a value is waiting in y2 to be used. The variable y1 is ultimately returned, so if useLast is set to true, y2 is moved into y1. Because rand normally returns 0 to 1, we scale x1 and x2 into the range -1 to 1. We then square and sum x1 and x2, giving w. We keep generating new pairs as long as w is greater than or equal to 1; only a w less than 1 is used.
A scaling factor is then computed from w; it allows x1 and x2 to be converted into normally distributed random numbers. The next call to the function will return y2.

Estimating PI with Monte Carlo

Monte Carlo algorithms estimate a quantity through random sampling when it would be time intensive to calculate the actual value. Monte Carlo makes use of random samples to generate a good estimate.
There are many different Monte Carlo techniques, some of which are hybrids between Monte Carlo and other techniques. One simple example of Monte Carlo is estimating the value of PI. Imagine a circle perfectly inscribed inside a square, with points chosen uniformly at random inside the square; the ratio of points that land inside the circle to the total number of points will tell us the value of PI. The area of the square is its length multiplied by its width. Because a square has the same width as length, the area of the square is essentially the width times itself, or the width squared.
The diameter of the circle is the same as the width of the square, so the ratio of the area of the circle to the area of the square is pi * (width / 2)^2 divided by width^2, which simplifies to pi / 4. We therefore take the ratio of points inside the circle to the total number of points and multiply by 4; this gives us an approximation of PI. A sketch of this estimate is shown below.
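A minimal version of that estimate in Python might look like this:

```python
import random

def estimate_pi(samples=1_000_000):
    inside = 0
    for _ in range(samples):
        x = random.uniform(-1.0, 1.0)   # a random point anywhere in the square
        y = random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:        # the point also falls inside the circle
            inside += 1
    return 4.0 * inside / samples       # inside/total approximates pi/4

print(estimate_pi())                    # roughly 3.14
```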
Chapter Summary

A number of different random number generation algorithms exist; all are called pseudorandom number generators (PRNGs). These random number generators vary in their ability to produce quality random numbers. A random number generator may produce high quality random numbers, yet not be cryptographically secure.
A cryptographically secure PRNG (CSPRNG) is one whose internal state cannot easily be guessed by observing its output. For AI, we are primarily interested in obtaining high quality random numbers and less interested in ensuring that those numbers do not reveal the internal state of the algorithm. While LCG can produce decent random numbers, the quality is not good enough for Monte Carlo simulations. The Mersenne Twister algorithm quickly produces high quality random numbers.
Mersenne Twister is acceptable for a Monte Carlo simulation. A Monte Carlo simulation estimates a large problem by sampling small pieces of it. In this chapter, we saw that we could estimate PI using Monte Carlo. We looked at a circle that was perfectly inscribed inside of a square. We randomly chose points and determined what points were inside of the circle and what were not.
The ratio of points inside the circle to the total number of points, multiplied by 4, gave us an estimate of PI. The last two chapters introduced distance calculation and random numbers. The next chapter will show us an algorithm that combines both techniques.
It will show how observations can be divided into similar groups based on their distances from each other. K-Means clustering is one common technique for performing such divisions.

Chapter 5: K-Means Clustering

Clustering, Centroid, Unsupervised Training, K-Means

In previous chapters we saw that the input to a machine learning algorithm is typically a vector of floating point numbers. Each of these vectors is called an observation, while the individual numbers that make up the vectors are called features.
In this chapter we will learn about clustering. Clustering is the process of placing together observations with similar features, and thereby creating a cluster. Clustering is a very useful means of breaking observations into a specified number of groups. Most clustering algorithms require you to specify the number of groups beforehand.
Some clustering algorithms can automatically determine an optimal number of groups. This chapter focuses on the K- Means algorithm. The process of clustering a finite number of observations into a specified number of clusters is called NP-Hard. NP-Hard is an abbreviation for non-deterministic polynomial-time. Informally, NP- Hard can be defined as problems that cannot be solved via a brute force search, when there are simply too many different combinations to search every potential solution to a problem.
Clustering a non-trivial number of observations is NP-hard. K-Means makes use of random numbers to search for an acceptable clustering arrangement for the observations.
Because the algorithm is based on random numbers, K-Means is said to be nondeterministic. This means that multiple runs of the K-Means algorithm will often result in different assignments of observations to clusters. The opposite of a nondeterministic algorithm is a deterministic algorithm. Deterministic algorithms always produce the same output, given consistent input.
Nearly all of the algorithms presented in this book are nondeterministic. Clustering can be used both independently and as a component to a larger machine-learning algorithm. Independently, clustering can be used to place similar items into groups.
For example, you may have a large number of observations that represent the buying habits of individuals. If each observation represents a customer, the customers can be clustered.
This allows you to make suggestive sales to customers based on what other customers in the same cluster have purchased. Genetic algorithms that use speciation often make use of clustering as a component of the genetic algorithm.
Genetic algorithms find solutions following a process that is loosely modeled after Darwinian evolution (Banzhaf). Potential solutions to a problem compete with each other and reproduce to create potentially better solutions that carry the desirable traits of the parents. Often it is desirable to break the potential solutions into species and only allow breeding within a species. Because the potential solutions are often vectors, K-Means can be used to speciate them.
The K-Means algorithm has been around, in various forms, since the 1950s (Green). It has gone through a number of revisions by various researchers. The actual idea for K-Means goes back to Hugo Steinhaus. The standard algorithm was proposed by Stuart Lloyd at Bell Labs in 1957, although his work was not published outside Bell Labs until 1982. In 1965, E.
Forgy published essentially the same method, which is why it is sometimes referred to as the Lloyd-Forgy algorithm. A more efficient version was proposed and published in Fortran by Hartigan and Wong in 1975 and revised in 1979.

Understanding Training Sets

Sets of observations are usually grouped into large collections called training sets.
This data is used to train the machine-learning algorithm. Training is the process where the machine-learning algorithm is modified so that the output from the algorithm provides the desired information. There are two very broad classes of machine learning algorithm that take very different types of training sets. Training is either supervised or unsupervised. When performing unsupervised training, you provide the algorithm with an input observation as a vector. However, you do not specify an output vector that represents the expected, or ideal, output.
Clustering is a type of unsupervised training.

Unsupervised Training

Recall the iris data set used in previous chapters? The iris data set has been applied to many different machine learning algorithms, and it can be used for both supervised and unsupervised training. This section shows how the iris data set can be applied to both. First, we will take a look at how we would present the iris data set for unsupervised clustering.
Recall that the iris data set consists of four individual ratio observations describing the dimensions of each iris flower. Additionally, the species of the iris is specified. We might tag each observation with its species, perhaps for later comparison.
However, the clustering algorithm does not need to know the species, because this is unsupervised training. The algorithm will not aim to determine what species the iris is, but will rather place the observations into clusters based on similarities among the observations.
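As an illustration of presenting the data this way, the sketch below feeds a few four-measurement observations (no species column) to scikit-learn's KMeans; the library call is only a stand-in for whatever clustering implementation you use.

```python
from sklearn.cluster import KMeans

observations = [                 # four measurements per row, no species label
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
    [4.9, 3.0, 1.4, 0.2],
]
kmeans = KMeans(n_clusters=3, n_init=10).fit(observations)
print(kmeans.labels_)            # the cluster assigned to each observation
```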
It is also important to note that it is not necessary to normalize the four observations. K-Means clustering does not inherently require normalization.