Monday, December 18, 2017

Assignment 6

Part 1
I was given a data set that included the percentage of children who receive free lunch and the crime rate per 100,000 people. A regression analysis was run to see if the two variables were related and, more specifically, if the percentage of children who receive free lunch helps predict the crime rate of an area. The two variables have a very low R Square value. This means that the percentage of children receiving free lunch explains very little of the variation in crime rate. Given the model below, for example, if an area of town was identified as having 30 percent of children receiving free lunch, the predicted crime rate would be 72.369. Given a significance level of .509, which is well above .05, we would fail to reject the null hypothesis, as there is not a statistically significant relationship between the two variables.
Figure 1 Relationship between Kids who receive free lunch and crime rate

Part 2

The city of Portland is interested in its ability to adequately respond to 911 calls. A company is interested in building a new hospital in the city and would like to know where to optimally place it. To begin my analysis I was given statistical data on Portland's census tracts for the following variables: number of 911 calls, jobs, renters, people without a high school degree, alcohol sales, unemployment, foreign born, median income and number of college graduates. I looked for connections between these variables and the number of 911 calls, and chose to look at three variables: unemployment, median income and alcohol sales. Looking at the connection between alcohol sales and 911 calls (Figure 2), the R Square value was .152. This means that alcohol sales are a poor predictor of 911 calls.
Figure 2 Relationship between 911 calls and alcohol sales


After looking at alcohol sales I examined median income (Figure 3). Median income also had a small R Square value of .163, meaning that median income is also a poor predictor of 911 calls. For every alcohol sale there is only a .001 change in the number of 911 calls.
Figure 3 Relationship between 911 and median income

Looking at unemployment and 911 calls (Figure 4), the R Square value is .543. This means that there is a fairly strong connection between the variables, especially when compared to the previous two variables. For every 1.6 unemployed people per census tract, we see on average one 911 call.

Figure 4 Relationship between 911 calls and unemployment

To look at the data spatially, two maps were created. One shows the number of 911 calls per census tract and the other shows the standardized residuals between unemployment and 911 calls. Looking at the number of calls per census tract (Figure 5), census tracts 62 and 79 have the highest number of 911 calls. Areas with the second highest number of 911 calls are found just north of census tract 62. Looking at the standardized residual map (Figure 6), the areas in red are areas with large residuals. This means that there are other variables that are also influencing the number of 911 calls in those census tracts. The beige, salmon and grey areas are areas that can be more accurately explained by unemployment, as they have lower residuals.

Figure 5 Number of calls per census tract

Figure 6 Standardized residuals between 911 calls and unemployment
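
The R Square values and the standardized residuals discussed above can be illustrated with a small sketch. The numbers below are hypothetical and are not the Portland data, and the standardized residual here is simply the residual divided by the standard error of the estimate, which is one common definition. The software used for the maps may scale residuals slightly differently.

```python
from statistics import mean

def regress(x, y):
    """Fit y = intercept + slope * x by least squares and summarize the fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual = observed value minus the value predicted by the line.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_square = 1 - ss_res / ss_tot

    # Standard error of the estimate uses n - 2 degrees of freedom.
    std_error = (ss_res / (len(x) - 2)) ** 0.5
    std_residuals = [r / std_error for r in residuals]
    return slope, intercept, r_square, std_residuals

# Hypothetical values for five census tracts (assumption, for illustration only).
unemployed = [1, 2, 3, 4, 5]
calls = [2, 4, 5, 4, 5]
slope, intercept, r_square, std_residuals = regress(unemployed, calls)
print(slope, intercept, r_square)
print(std_residuals)
```

A tract with a large positive standardized residual has more calls than the line predicts, which is the situation shown in red on a residual map.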


Tuesday, December 5, 2017

Assignment 5

Assignment 5: Correlation

Part 1

A) Correlation: Distance vs Sound Level 
The chart below (Figure 1) shows the relationship between distance in feet and sound level in decibels. A correlation matrix that displays correlation and significance levels was created in SPSS (Figure 2), and the data was also graphed (Figure 3). Looking at the matrix, there is a strong negative correlation between distance and sound level: the correlation is -.896, which is close to -1. This means that as distance increases, sound level decreases. This trend can be seen in the graph below (Figure 3). The significance level is also zero to the reported precision, which means that the correlation is statistically significant and that there is a linear relationship between distance and sound.


Figure 1. Chart displaying distance and sounds levels

Figure 2. Correlation Matrix for Figure 1
Figure 3. Displays the relationship between sound level and distance
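
As a rough illustration of what the correlation value measures, here is a minimal sketch of the Pearson correlation written by hand. The distance and sound values are made up for the example and are not the values from Figure 1.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both variables' spread."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements (assumption): sound drops as distance grows.
distance_ft = [10, 20, 30, 40, 50]
sound_db = [100, 90, 85, 70, 65]

r = pearson_r(distance_ft, sound_db)
print(round(r, 3))  # a value close to -1 indicates a strong negative correlation
```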

B) Detroit Population Statistic Matrix

Given an Excel file showing different social statistics by race, education, employment type, home value and income, I created another correlation matrix (Figure 4). When looking at the table a few different trends are evident. First, when looking at bachelor degrees by race, there is a moderate positive correlation between bachelor degrees obtained and both Whites (.698) and Asians (.559). Looking at employment statistics, there were no significant correlations between the different races and different occupations. There were, however, high correlations between bachelor degrees and median household income as well as median home value, with both correlations being .754. There was also a low negative correlation between Blacks and median household income at -.408.


Figure 4 Detroit Census Tract Matrix

Part 2: Texas Voting Statistics 

Introduction
The Texas Election Commission (TEC) was interested to see if voter patterns have changed between 1980 and 2012. They would also like to see if there are any clustering patterns for these variables. To complete the analysis, TEC provided voter turnout and percent Democratic vote information for both 1980 and 2012. Hispanic population statistics for 2010 from the United States were also included to see if this particular demographic has an effect on voter turnout and Democratic voting percentage.


Methodology
To complete the analysis, Moran's I, LISA maps, statistical correlations and choropleth maps were used. Moran's I is a spatial autocorrelation tool that shows whether a set of data is clustered (areas of similar characteristics, such as high values surrounded by other high values and vice versa), dispersed or neither. LISA maps allow data clusters to be shown on a map by running statistics on bordering regions to see if there are similarities between the data and if there are spatial patterns. Moran's I and LISA maps were produced in GeoDa, correlations were created in SPSS and choropleth maps were created using ArcMap.
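
To make the idea behind Moran's I more concrete, here is a minimal sketch of the global statistic using simple binary neighbor weights. The four regions, their values and the neighbor lists are hypothetical, and GeoDa typically row-standardizes its weights, so its output will not match this calculation exactly.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary weights (1 if two regions border, else 0)."""
    n = len(values)
    x_bar = sum(values) / n
    dev = [v - x_bar for v in values]
    # S0 is the sum of all weights, i.e. the number of neighbor links.
    s0 = sum(len(js) for js in neighbors.values())
    # Cross products of deviations for every pair of neighboring regions.
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    return (n / s0) * cross / sum(d * d for d in dev)

# Hypothetical regions arranged in a row: 0 - 1 - 2 - 3 (assumption).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([10, 8, 3, 1], neighbors)    # similar values sit together
alternating = morans_i([10, 1, 8, 3], neighbors)  # high and low values alternate
print(clustered, alternating)
```

Positive values point to clustering of similar values, values near zero to no spatial pattern, and negative values to a dispersed, checkerboard-like pattern.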

Results 
To begin the analysis, the 1980 voter turnout statistics were run through Moran's I (Figure 5). The Moran's I value was .468, which means that the data was only moderately clustered, with most of the data being found in the middle of the chart and dispersed fairly evenly across the different quadrants. This pattern can also be seen in the LISA map (Figure 6), with most of the values being grey (not clustered). Areas on the map that are blue are areas of low voter turnout surrounded by other areas of low voter turnout. Areas in red are the inverse, being areas of high voter turnout surrounded by other areas of high voter turnout. Looking at the bottom left quadrant (Figure 5), the data points found towards the bottom represent the strongest cluster, being areas that are very low in voter turnout surrounded by others of similar value. On the map these are the blue areas located in south Texas. Again, the inverse can be said for the red areas found in north Texas.



Figure 5 Moran's I for Voter Turnout in 1980

Figure 6 LISA Map showing clusters in voter turnout in 1980
Looking at the 2012 voter turnout data in the same formats, the data presents a small decrease in the amount of clustering. Looking at Figure 7, the Moran's I value was .335. Again this means that the data was not strongly clustered, and similar trends appear. There is a cluster of lower voter turnout in south Texas. There are also fewer clusters of areas with high voter turnout.
Figure 7 Moran's I for voter turnout in 2012
Figure 8 LISA map for voter turnout in 2012

After looking at voter turnout statistics, Democratic voting percentages were analysed. Looking at Democratic votes in 1980, there are clusters of both high Democratic voting areas and low Democratic voting areas, with high Democratic votes being located in the southern and eastern parts of the state and low Democratic votes being located in the northern parts of the state.


Figure 9 Moran's I for Democratic voting percentages in 1980
Figure 10 LISA map for Democratic voting percentages in 1980
Democratic voting numbers in 2012 show an increase in the strength of clustering in areas of high Democratic voting percentages and a decrease in the strength of clustering in areas of low Democratic voting percentages (Figure 11). Geographically there are more clusters of low Democratic voting percentages in the northern part of the state; these areas are represented in blue in Figure 12. Clusters of areas of high Democratic vote, along with becoming stronger, also appear to be spreading in the southern portions of the state (red areas in Figure 12). The eastern portion of the state presents an interesting trend in that there are no clusters. This means that voting trends in this area are more dispersed. These trends also align with the choropleth map below (Figure 13).

Figure 11 Moran's I for Democratic voting percentages in 2012


Figure 12 LISA map for Democratic voting percentages in 2012
Figure 13 Democratic voting percentages in 2012

Looking at Hispanic population percentages in 2010 brings out interesting trends in regard to Democratic voting percentages and voter turnout percentages. In the correlation matrix below (Figure 14), there is a high positive correlation of .718 between Hispanic population percentage and Democratic voting percentage. This is particularly interesting when looking at the distribution of Hispanic populations in 2010 (Figure 15) and the Democratic voting clusters in Figure 12, as both are highly concentrated in the southern regions of the state. Combined with the high correlation between the two variables, it can be assumed that Hispanic populations have a large effect on Democratic voting trends in the state, as areas with low Hispanic populations tend to have low Democratic voting percentages. Voter turnout percentages also have a high correlation with Hispanic population percentages. However, where the correlation between Hispanic percentages and Democratic voting percentages is positive, the correlation between Hispanic percentages and voter turnout is negative. This also matches the trends in Figure 8 and Figure 14, where areas of high Hispanic populations are the same areas that have low voter turnout.

Figure 14 Correlations matrix for Hispanic populations and Democratic voting percentages and voter turnout in 2012


Figure 15 Hispanic population percentages in 2010
Conclusions 
Based upon the results above, I have concluded that voting trends in Texas have become more clustered in regard to Democratic voting percentages. There have been increases in areas of low Democratic voting percentages in the northern and central parts of Texas and increases in areas of high Democratic voting percentages in the southern parts of the state. This is highly connected to Hispanic population percentages, as Hispanic populations have a high positive correlation with Democratic voting percentages. These areas of high Hispanic populations and high Democratic voting percentages are also areas that have low voter turnout. This could mean that Hispanic populations in the southern and western regions of Texas may be of high interest to Democrats in coming elections, as there is a large number of nonparticipating voters in the southern part of Texas who are very likely to vote Democrat and who could significantly help their total percentages.






Tuesday, November 14, 2017

Assignment 4




Row   Interval Type   Confidence Level   n     α (per tail)   z or t?   z or t value
A     Two Tailed      90                 45    5%             Z         +- 1.64
B     Two Tailed      95                 12    2.5%           T         +- 2.20
C     One Tailed      95                 36    5%             Z         1.64
D     Two Tailed      99                 180   .5%            Z         +- 2.58
E     One Tailed      80                 60    20%            Z         -.84
F     One Tailed      99                 23    1%             T         2.51
G     Two Tailed      99                 15    .5%            T         +- 2.98
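
The z rows in the table above can be checked with the standard normal distribution. The sketch below is only an illustration: the function name is my own, and it covers the z cases only, since Python's standard library has no t distribution and the t rows still need a t table with n - 1 degrees of freedom.

```python
from statistics import NormalDist

def z_critical(confidence, two_tailed):
    """Critical z value for a confidence level given as a proportion (e.g. 0.95)."""
    alpha = 1 - confidence
    # A two tailed interval splits alpha evenly between both tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

z_90_two = z_critical(0.90, two_tailed=True)
z_95_one = z_critical(0.95, two_tailed=False)
z_99_two = z_critical(0.99, two_tailed=True)
z_80_one = z_critical(0.80, two_tailed=False)
print(z_90_two, z_95_one, z_99_two, z_80_one)
```

For a two tailed interval the value applies in both directions (plus and minus), and for a left tailed test the sign is negative.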

Scenario 1
The Department of Agriculture and Live Stock Development in Kenya is curious to see how a survey of 23 farmers' crop yields per hectare compares to the country's national averages. Three crops were looked at: ground nuts, cassava, and beans. A two tailed t test was used to compare their yields to the national averages.

  • Null Hypothesis: There is no real statistical difference between the averages for the sampled farmers and the country's national averages.
  • Alternative Hypothesis: There is a real statistical difference between the averages for the sampled farmers and the country's national averages.
  • Statistical Test: A two tailed t-test
  • Significance Level: .025
Crop          Sample mean (x̄)   National mean (μ)   σ (Standard Deviation)   α (Significance level)   T Value    CV      P-value
Ground Nuts   .51               .55                 .3                       .025                     +- 0.639   2.074   .53
Cassava       3.4               3.8                 .74                      .025                     +- 0.259   2.074   .79
Beans         .33               .28                 .13                      .025                     +- 1.844   2.074   .08
(Figure. 1)
When looking at the results (Figure. 1) above we would fail to reject the null hypothesis. This means that there was no significant statistical difference between the sample of farmers and the national average. This is because the t-values are smaller than the critical value and therefore the null hypothesis is not rejected. When looking at the p-values it is also clear that the results are consistent as the p-values are larger than the significance level of .025.

Scenario 2
A researcher suspects that a stream has higher levels of pollution than is permitted (4.4 mg/l). To test this, a sample of 17 was taken. This sample had a mean of 6.8 mg/l and a standard deviation of 4.2. A one tailed t-test was conducted with a significance level of .05.
  • Null Hypothesis: There is no significant difference between the sampled stream's pollution level and the stream's allowable limit.
  • Alternative Hypothesis: The sampled stream's pollution level is significantly higher than the stream's allowable limit.
  • Statistical Test: A one tailed t-test
  • Significance Level: .05
After conducting the t-test, a t-value of 2.356 was determined, and this t-value was larger than the critical value of 1.746. This means that the null hypothesis is rejected: the stream's pollution level is significantly higher than the allowable limit. With a p-value of .0157, a result this extreme would be expected only about 1.57 percent of the time if the stream were actually at the allowable limit.
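
The t-values in Scenarios 1 and 2 come from the one sample t formula, the difference between the sample mean and the hypothesized mean divided by the standard error. A minimal sketch using the numbers given in the text is below; the function name is my own, and the critical values are the ones quoted above rather than computed.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, std_dev, n):
    """t = (sample mean - hypothesized mean) / (standard deviation / sqrt(n))."""
    standard_error = std_dev / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Scenario 2: stream sample of 17 against the 4.4 mg/l limit.
t_stream = one_sample_t(6.8, 4.4, 4.2, 17)

# Scenario 1: ground nuts and beans, each with 23 sampled farmers.
t_ground_nuts = one_sample_t(0.51, 0.55, 0.3, 23)
t_beans = one_sample_t(0.33, 0.28, 0.13, 23)

# Compare against the critical values quoted in the text.
reject_stream = t_stream > 1.746            # one tailed, 16 degrees of freedom
reject_beans = abs(t_beans) > 2.074         # two tailed, 22 degrees of freedom
print(t_stream, t_ground_nuts, t_beans, reject_stream, reject_beans)
```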

Scenario 3
The question was whether the average value of homes in the city of Eau Claire is statistically different from that of the surrounding county. This was done by looking at block group statistics for the city and county. The city's average property value was 151,876.51 with a standard deviation of 49,706.34. The county's average property value was 169,438.13.
  • Null Hypothesis: There is not a significant statistical difference between the average property value in the city of Eau Claire compared to the surrounding County property value averages.
  • Alternative Hypothesis:  There is a significant statistical difference between the average property value in the city of Eau Claire compared to the surrounding County property value averages.
  • Statistical Test: one tailed z-test (left)
  • Significance Level: .05
After conducting the z-test, it was found that there is a significant difference between the average property values in Eau Claire and the surrounding county. The city of Eau Claire's z-score was -2.57, which is smaller than the critical value of -1.64. The p-value was .005, meaning that a difference this large would be expected only about 0.5 percent of the time if the null hypothesis were true. The spatial distribution of the block group averages can be seen in the map below (Figure 2).

Figure. 2 The map above shows the spatial distribution of the average property values for the City of Eau Claire and the surrounding county



Thursday, October 26, 2017

Assignment 3


Introduction
                Dane County is interested in understanding the spatial pattern of the increase of home foreclosures in 2011 and 2012, as well as knowing if the trend will continue through 2013. Addresses in the county have been geocoded into the various census tracts in Dane County so that an analysis can be conducted at the census tract level.

Methodology

                To complete my analysis, ArcMap was used to display the census tracts for Dane County. Then a statistical analysis was completed using mean, standard deviation, z-scores and mean center. Standard deviation is a measure of the dispersion of a set of data (how spread apart it is) and of how accurately the mean (average) depicts the data. In this case, the mean is the average number of foreclosures per census tract. A z-score describes the position of a particular value in terms of how many standard deviations it lands from the mean on the standard deviation curve (figure. 1), and it can be used to find the probability of a particular occurrence taking place. For example, a negative z-score means that the likelihood of exceeding that observation is higher than 50 percent, and if a z-score is positive then the likelihood of exceeding it is less than 50 percent. Also, the larger the absolute value of the z-score, the closer to the tails of the distribution the observation will be. Mean Center shows the spatial center of a set of data and can be weighted by a statistic to help display a specific spatial pattern in the data set.
Figure. 1 Standard Deviation Curve 
Results
                When looking at the foreclosure data for Dane County (figure 2), there was an increase of 97 foreclosures between 2011 and 2012 for the county as a whole, which is nearly one additional foreclosure per census tract on average. For this analysis three specific census tracts, 120.01, 108, and 25, were examined (figure 3). For each of the aforementioned census tracts, z-scores were calculated to see how they compared to the other census tracts in Dane County. When looking at z-scores for 2012, a couple of patterns stick out. Census tract 120.01 has a very high z-score, meaning that it has far more foreclosures than the county average, and census tract 25 has a much smaller z-score and therefore falls toward the bottom of the distribution. To place these results in the context of probability, for 2012 a count of 3.974 foreclosures will be surpassed about 80 percent of the time (z-score of -0.84) and a count of 24.962 foreclosures will only be surpassed 10 percent of the time (z-score of 1.28).
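
The link between a z-score and the chance of being surpassed can be sketched with the standard normal distribution. The z-scores of -0.84 and 1.28 are the ones quoted above, while the tract count, mean and standard deviation passed to `z_score` are hypothetical placeholders, not the Dane County figures.

```python
from statistics import NormalDist

def z_score(value, mean, std_dev):
    """Number of standard deviations a value sits above or below the mean."""
    return (value - mean) / std_dev

def probability_exceeded(z):
    """Share of a normal distribution that lies above the given z-score."""
    return 1 - NormalDist().cdf(z)

p_low = probability_exceeded(-0.84)   # about 80 percent
p_high = probability_exceeded(1.28)   # about 10 percent

# Hypothetical tract: 15 foreclosures, county mean 10, standard deviation 5.
z_example = z_score(15, 10, 5)
print(p_low, p_high, z_example)
```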
                When looking at the map for 2011 (figure 4), there were fewer foreclosures in the center of the county, with the largest number of foreclosures being found in the northern section of the county. In 2012 (figure 5), the census tracts that surround census tract 120.01 had fewer foreclosures than in 2011. Looking at the differences between 2011 and 2012 (figure 6), the most dramatic increases came about on the eastern and western borders of the county. Also, when looking at the Weighted Mean Center across the three maps below (figures 4, 5 and 6), it is evident that it has shifted towards the west, which means that the spatial distribution of foreclosures has moved to the west. Based upon the statistics provided (figures 2 and 3) and the maps below (figures 4, 5 and 6), a trend of increased foreclosures can be expected in 2013, especially along the eastern and western borders, where the greatest increases from 2011 to 2012 occurred.
Figure 2. Shows the statistics for Dane County in 2011 and 2012
Figure 3. The chart above displays the statistics for census tracts 120.01, 108, and 25 for 2011 and 2012
Figure 4. The map above shows the number of foreclosures by census tract in 2011
Figure 5. The map above shows the number of foreclosures by census tract in 2012
Figure 6. The map above shows the difference in foreclosures between 2011 and 2012
Conclusion
                Based upon the information presented above, it can be inferred that an increase in the total number of foreclosures should continue. This is especially true for the census tracts located in the eastern and western sections of the county as the areas in the center of the county have been fairly stable in comparison. It is important to note that without further background information about the economic situation of Dane County, this study is fairly limited in the amount of information that can be interpreted. However there are important spatial trends that can be seen and that can aid county officials in their assessments of the foreclosure issue in Dane County going forward.

Sources
http://www.muelaner.com/wp-content/uploads/2013/07/Standard_deviation_diagram.png





Wednesday, October 11, 2017

Assignment 2

Part 1

Definitions
  • Range: the difference between the highest and lowest values in a set of data
  • Mean: the average of the data, found by dividing the sum of a data set by the total number of observations
  • Mode: the most common observation found in a given data set
  • Kurtosis: how much of a set of data falls in the tails of its statistical distribution
  • Skewness: the tendency for a set of data to fall on either the positive or negative side of the mean
  • Standard Deviation: a way of describing how dispersed a set of data is
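
The definitions above can be sketched in a few lines of code. The scores below are hypothetical and are not the Eau Claire samples, and the skewness and kurtosis here use the simple population moment formulas, so programs such as Excel or SPSS, which apply sample adjustments, will report slightly different values.

```python
from statistics import mean, median, mode, pstdev

def describe(scores):
    """Basic descriptive statistics using population formulas."""
    m = mean(scores)
    sd = pstdev(scores)
    n = len(scores)
    # Third and fourth central moments drive skewness and kurtosis.
    m3 = sum((s - m) ** 3 for s in scores) / n
    m4 = sum((s - m) ** 4 for s in scores) / n
    return {
        "range": max(scores) - min(scores),
        "mean": m,
        "median": median(scores),
        "mode": mode(scores),
        "std_dev": sd,
        "skewness": m3 / sd ** 3,
        "kurtosis": m4 / sd ** 4 - 3,  # excess kurtosis, 0 for a normal curve
    }

# Hypothetical test scores (assumption, for illustration only).
scores = [120, 150, 160, 170, 170, 190]
summary = describe(scores)
print(summary)
```

A negative skewness means the longer tail is on the low side of the mean, and a negative kurtosis means the distribution is flatter than a normal curve.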
Eau Claire North's teaching staff has begun to question their teaching style when comparing their Standardized Test scores to Eau Claire Memorial's test scores. The concern stems from the fact that Eau Claire Memorial has always had the highest single test score. Both schools' Standardized Test scores are shown below (Figure 1). A statistical analysis of the two schools' scores below aims to bring clarity to the question of whether Eau Claire North's test scores are not as good as Eau Claire Memorial's.
Figure 1. The tested sample of Standardized Test Scores for both high schools
Test Score Statistics
  • Eau Claire North
    • Standard Deviation: 23.635
    • Mean 160.923
    • Median: 164.5
    • Mode: 170
    • Kurtosis: -0.557
    • Skewness: -0.579
    • Range: 74
  • Eau Claire Memorial
    • Standard Deviation: 27.157
    • Mean: 158.539
    • Median: 159.5
    • Mode: 120
    • Kurtosis: -1.174
    • Skewness: -0.185
    • Range: 91
Analysis
Based upon the statistics shown above, Eau Claire North's teachers should not be worried about their test scores. When comparing the two schools, Eau Claire North has a smaller standard deviation than Eau Claire Memorial, meaning that Eau Claire North's test scores are more consistent and differ from the mean less than Eau Claire Memorial's. Eau Claire North also has a higher overall test score mean than Eau Claire Memorial. Also, when looking at the two data sets' kurtosis, Eau Claire North's kurtosis (-0.557) is closer to zero than Eau Claire Memorial's (-1.174), meaning North's distribution is less flat and its scores cluster more around the mean. As a whole, the teachers at Eau Claire North should not be concerned about their test scores compared to Eau Claire Memorial. Eau Claire Memorial does have the highest single test score, but their scores are more variable and dispersed than those of Eau Claire North, meaning the teaching methods being employed at Eau Claire North produce better scores as a whole when more statistical analysis is applied to the respective samples.




Figure 2. The figure above shows the calculations used to find the Standard Deviation of the two schools' test scores

Part 2: Mean Centers
Mean Centers are calculated as the average x and y values of a given data set.
The map below (Figure 3) shows three different mean centers calculated for the state of Wisconsin. They are the geographic mean center, the weighted mean center for population in 2000 and the weighted mean center for population in 2015. When looking at the map below (Figure 3), a few patterns stand out. Many of the northern counties have experienced a decline in their populations between 2000 and 2015. Counties in the northeast and southeast have seen their populations increase, with the largest increase in Dane County (blue county located in southern Wisconsin). These shifts in population may be caused by an increase in economic productivity in the northeast and southern regions of the state, leading people from northern counties to begin to move southward.

The geographic mean center is the center of Wisconsin calculated by averaging the x and y locations of the counties. The weighted mean centers are calculated the same way, only with each county weighted by its total population. The weighted mean center for 2015 shifted slightly southwest from its location in 2000. This is mainly caused by the increases in population in the southern counties of Wisconsin, with the largest pull coming from Dane County, located southwest of the 2000 weighted mean center.
Figure 3. Map displaying the Wisconsin County population changes between 2000 and 2015 as well as the Geographic Mean Center and both Weighted Mean Centers for 2000 and 2015
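
The difference between a geographic mean center and a weighted mean center can be sketched as follows. The four points and their populations are hypothetical and stand in for county centroids; they are not the Wisconsin data.

```python
def mean_center(points):
    """Average x and average y of a set of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def weighted_mean_center(points, weights):
    """Mean center where each point is pulled by its weight (e.g. population)."""
    total = sum(weights)
    x = sum(px * w for (px, _), w in zip(points, weights)) / total
    y = sum(py * w for (_, py), w in zip(points, weights)) / total
    return (x, y)

# Hypothetical county centroids and populations (assumption).
centroids = [(0, 0), (4, 0), (0, 4), (4, 4)]
population = [100, 100, 100, 500]

geographic = mean_center(centroids)
weighted = weighted_mean_center(centroids, population)
print(geographic, weighted)
```

The weighted center moves toward the point with the largest population, which is the same effect a fast growing county has on a population weighted mean center.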




Thursday, September 28, 2017

Assignment 1


Part 1: Data Types

Nominal Data

  • Nominal Data is data that is classified into categories defined by characteristics or names rather than numeric quantities. An example of nominal data could include the geographic regions defined by the different Native American languages as shown in the map below (Figure 1). Other examples could include geological regions of an area or even characteristics of a group of people, such as race or gender.

Figure 1
(http://www.emersonkent.com/images/indian_tribes.jpg)

Ordinal

  • Ordinal data is data that can be placed in an order, with each value's rank based upon the values before and after it, creating a ranked order. An example of this could be the productivity of different soils or the water penetration index of an area, as in the map below (Figure 2).

Figure 2
(https://www.toddklassy.com/montana-blog/soil-productivity-index-map)

Interval

  • Interval data is data that can be classified by a common interval, such as meters, so that the difference between values can be accurately calculated; however, there is no true definable zero in the data, meaning that there can be negative values. Two very common examples of interval data are temperature and elevation (Figure 3).

Figure 3
(https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/California_Topography-MEDIUM.png/1200px-California_Topography-MEDIUM.png)

Ratio

  • Ratio data is similar to interval data and shares many of the same characteristics, but ratio data has no negative numbers and has a clear and defined zero point to which all the other data can be accurately compared. An example of ratio data would be the percentage of a population that identifies with a particular race or nationality (Figure 4).

Figure 4 
(https://upload.wikimedia.org/wikipedia/en/3/36/Census_Bureau_2000%2C_Hispanics_in_the_United_States.png)
Part 2: Data Classification Types

Equal Interval

Figure 5


Quantile

Figure 6


Natural Breaks

Figure 7

The three maps above display the distribution of organic farms in the state of Wisconsin using three different data classification methods: equal interval (Figure 5), quantile (Figure 6) and natural breaks (Figure 7). Equal interval is done by taking the range of the data (the maximum minus the minimum) and then dividing that number by the total number of desired classes. The second method is the quantile method, which divides the data so that there is an equal number of data points within each class. The third classification method is the natural breaks method, which divides the data at points where there are larger gaps between data points, showing where the data is clustered.
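
The equal interval and quantile methods described above are simple enough to sketch directly. The farm counts below are hypothetical, with one large outlier added on purpose, and the break rules are a simplified version of what GIS software does. Natural breaks is left out because it needs an optimization step.

```python
import math

def equal_interval_breaks(values, classes):
    """Upper bound of each class when the range is split into equal widths."""
    low, high = min(values), max(values)
    width = (high - low) / classes
    return [low + width * k for k in range(1, classes + 1)]

def quantile_breaks(values, classes):
    """Upper bound of each class when every class holds the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[math.ceil(n * k / classes) - 1] for k in range(1, classes + 1)]

# Hypothetical organic farm counts per county, with one outlier (assumption).
farms = [0, 2, 3, 5, 8, 10, 12, 20, 200]

ei_breaks = equal_interval_breaks(farms, 3)
q_breaks = quantile_breaks(farms, 3)

# Count how many counties land in the first equal interval class.
in_first_class = sum(1 for f in farms if f <= ei_breaks[0])
print(ei_breaks, q_breaks, in_first_class)
```

With an outlier in the data, nearly every county lands in the first equal interval class, while the quantile method spreads the counties evenly but can group very different counts together.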

Based on the maps above, the map that most accurately depicts the data is Figure 7, which uses the natural breaks method. The equal interval method (Figure 5) is much too distorted due to the outlier in Vernon County, which has 168 more organic farms than the next closest county. Because of that outlier the other counties are all grouped into the first class, leading the map to show very little information. The quantile method is more accurate than the equal interval method but also fails to show the distribution of organic farms, especially in the western sections of the state where there is greater variability between counties. This is because it groups counties that have a large difference in their total number of farms into the same class. The most effective classification method for this data set is the natural breaks method because it more accurately displays how the data is distributed amongst the counties.

The most effective place to concentrate resources to attract more organic farms would be in the eastern portions of the state. This is because of the larger population centers, such as the Fox Cities and Green Bay in the northeastern region of the state along with Milwaukee and Racine in the southeastern section of the state. Along with having higher populations, these counties lack large numbers of organic farms, making them an ideal place to create a market for more organic farms. While the northern section of the state lacks organic farms, it also lacks good soils and any major population centers to buy these organic goods.