Big Data and Analytics by Seema Acharya

Contents: Introduction to Big Data; Big Data Analytics; The Big Data Technology Landscape; Introduction to Hadoop; Introduction to MongoDB; Introduction to Cassandra; Introduction to Hive; Introduction to Pig; Jasper Report using Jaspersoft.

Review question: Name the most popular GUIs for R.

Business intelligence (BI) systems, or BI tools, handle the analytical processing of a database and work with different types of database systems. The tools support relational database (RDBMS) processing: accessing part of a large database, summarising it, accessing it concurrently, and managing security, constraints, server connectivity and other functionality.

Loading and Handling Data in R

At present, many types of databases are available for processing. They ship with inbuilt tools, GUIs and other functions that make database processing easy. With the help of the R database packages, users can easily access a database, since all the packages follow the same steps for accessing data from the database.

In this section, you will go through a brief introduction to Jaspersoft and Pentaho with R.

Michael Lapsley and Brian Ripley developed the RODBC package. The package has many inbuilt functions for performing operations on an ODBC database: one function takes a query, sends it to the ODBC database and returns the result; one writes or updates a data frame as a table in the database; one removes a table from the database; and one closes the open connection.

MySQL is a small-sized, popular database that is available for free download.

The MySQL database can be downloaded and installed from its official website. One function runs SQL queries on the open connection, another lists the tables of the database on the open connection, and another creates a table or, alternatively, writes or updates a data frame in the database.

Please note that it requires a server. Users can rent a server, or download and install the MySQL database from its official website. SQLite is an embedded SQL database engine that does not require any server, which is why it is called a serverless database.

The database also supports all business analytical data processing. The RSQLite package has many inbuilt functions for working with the database. As with the other database packages explained in the previous sections, users call dbConnect and dbDisconnect to open and close the connection to the SQLite database, respectively. The only difference is that users have to pass the SQLite database driver object to the dbConnect function.
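The connect, query and disconnect steps described above can be sketched as follows. This is a minimal illustration, not the book's own listing: it assumes the DBI and RSQLite packages are installed (the code returns NULL otherwise), and the function name run_sqlite_demo, the table name employees and its contents are hypothetical.

```r
# Sketch: open, use and close a SQLite connection through DBI.
# run_sqlite_demo, the "employees" table and its rows are made up for illustration.
run_sqlite_demo <- function() {
  # Skip quietly when the (assumed) packages are not installed.
  if (!requireNamespace("DBI", quietly = TRUE) ||
      !requireNamespace("RSQLite", quietly = TRUE)) {
    return(NULL)
  }
  # The SQLite driver object is passed to dbConnect;
  # ":memory:" keeps the database in RAM, so no server or file is needed.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con))

  employees <- data.frame(id = 1:3, name = c("Asha", "Ben", "Chen"))
  DBI::dbWriteTable(con, "employees", employees)   # write a data frame as a table
  tables <- DBI::dbListTables(con)                 # list tables on the connection
  rows <- DBI::dbGetQuery(con, "SELECT name FROM employees WHERE id > 1")
  list(tables = tables, rows = rows)
}

result <- run_sqlite_demo()
print(result)
```

Closing the connection in on.exit means it is released even if one of the intermediate calls fails.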

Jaspersoft was developed by the Jaspersoft community. It provides many business intelligence tools for analytical business processing. RevoDeploy R provides a set of web services with security features, scripts, APIs and libraries in a single server. It makes it easy to integrate dynamic R-based computations into web applications.

Pentaho provides different open source-based and enterprise-class platforms. Pentaho Data Integration (PDI) is one of Pentaho's products, used for accessing databases and for analytical data processing. It prepares and integrates data to build a complete picture of a business.

The tool provides accurate, analytics-ready data reports to end users, eliminates coding complexity and brings big data together in one place. Through the R Script Executor, users can access data and perform analytical data operations.

If users already have R on their system, they just need to install PDI from its official website. For executing SQL queries, users can use the same functions for all three databases.

What is MySQL? It is an Oracle product. MySQL is a popular small-sized database that is available for free download.
What is PostgreSQL?
What is RSQLite?
What is RevoDeploy R?
What is the R Script Executor?

Case Study

Log files keep logs to be read in future, if required.

A transaction log is a file that records communication between a server and the users of that system, or a data collection method that automatically captures the type, content or time of transactions made by a person from a terminal within that system. In web search, a transaction log file is an electronic record of the interactions that occurred during a search session between the web search engine and the users searching for information on the web.

Case Study (continued). Data Analytics using R

Many operating systems, software frameworks and programs include a logging system. Compared to other software, R makes it easy for readers to generate their own customised reports that analyse Apache log files automatically.

Nowadays, R has become one of the most popular and powerful tools for building a model with which the requirements of the user can be tracked and searched.

Types of Log Files

Event Logs
Event logs record the events taking place during the execution of a system, providing an audit trail that can be used to understand the activities of the system and to diagnose problems or errors in the system or servers. They are essential for analysing the activities of complex systems, particularly applications with little user interaction.

Transaction Logs
Every database system maintains some kind of transaction log, which is not mainly intended as an audit trail for later analysis and is not intended to be human-readable.

Message Logs
These log files cover several kinds of programs: Internet Relay Chat (IRC) clients, messaging programs, peer-to-peer file sharing clients with chat functions and multiplayer games commonly have the ability to automatically log textual communication.

Message logs may also refer to third-party log storage for different channels. These are used to set up a profile for accessing their details. However, such a log is not comparable to a true IRC server event log file, as it only records user-visible events for the period the user spent connected to a certain channel.

In this log file, users can set priorities in the server files according to their needs and preferences. These logs require a password to be decrypted and viewed. They are often handled by the respective user-friendly application, which is used in mobile applications to get information from users and to gauge their interests.

Transaction Log Analysis
Data stored in the transaction logs of web search engines, intranets and websites can provide valuable insight into the information-searching process of online searchers.

This understanding can inform information system design, interface development and the information architecture for content collections. The main role of these log files is to record the data provided by users, in order to learn more about them and to identify the roles and interests of different users. These are the main log files with which we can track user preferences and visits, based on the transactions users made in the past. R can also process large amounts of data with advanced statistical capabilities and connect to a database, making it one of the most powerful programming languages.

Getting the Data
Before we can read the log file data, we must first import it into R. The good thing is that R can parse a log file without requiring any additional work from the user. So, reading a Log file named log. The head(logs) command shows the first few lines of the logs variable, to give an idea of how this kind of data is stored in R. However, the most important part is analysing the data.
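As a rough sketch of the import step, the code below writes three made-up Apache-style log lines to a temporary file and reads them with read.table. The log lines, the file and the variable names are assumptions for illustration only; they are not the case study's data.

```r
# Hypothetical Apache-style access log lines, written to a temporary file.
log_lines <- c(
  '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
  '192.168.1.11 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 404 512',
  '192.168.1.10 - - [10/Oct/2023:13:57:12 +0000] "POST /login HTTP/1.1" 200 1043'
)
log_file <- tempfile(fileext = ".log")
writeLines(log_lines, log_file)

# read.table splits on whitespace and keeps double-quoted fields together,
# so the request string stays in one column while the timestamp spans two.
logs <- read.table(log_file, stringsAsFactors = FALSE)

# Inspect the first few rows to see how the data is stored.
head(logs)
```

With this layout, the status code ends up in column V7 and the response size in V8, which are the natural candidates for numeric summaries.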

The most useful command we can run on a dataset with numeric values is the summary command, which gives a better understanding of the data. By running the summary command, we will get:
- Min: This is the minimum value of the whole dataset.

If the dataset has an odd number of elements, the median is itself one of the elements of the dataset (the middle one). If the dataset has an even number of elements, the median is the mean of the two centre elements of the dataset. The pairs command is especially useful since it gives a general overview of the data.
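The summary output and the two median cases can be checked with a few lines of base R. The numbers are arbitrary illustrative values.

```r
# Arbitrary numeric values for illustration.
values <- c(4, 8, 15, 16, 23, 42)

# summary() reports Min., 1st Qu., Median, Mean, 3rd Qu. and Max.
stats <- summary(values)
print(stats)

# Odd number of elements: sorted values are 1, 5, 7, so the median is the middle one.
odd_median <- median(c(7, 1, 5))       # 5

# Even number of elements: sorted values are 1, 3, 5, 7,
# so the median is the mean of the two centre elements, 3 and 5.
even_median <- median(c(7, 1, 5, 3))   # 4
```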

A column of a data frame can be accessed in two ways: (i) by providing the index number in square brackets, or (ii) by providing the column name as a string in double brackets. The double brackets are not a typo. Example 4: Let us define row names for the rows in the data frame. The dim function returns the number of rows and columns.

Exploring Data in R

nrow Function
The nrow function returns the number of rows in a data frame.
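The two access styles and the dimension functions can be tried on a small data frame. The employees data frame and its values are hypothetical.

```r
# A small, made-up data frame.
employees <- data.frame(
  name = c("Asha", "Ben", "Chen"),
  age = c(34, 45, 29),
  stringsAsFactors = FALSE
)

# Index number in single square brackets: returns a one-column data frame.
first_col <- employees[1]

# Column name as a string in double brackets: returns the underlying vector.
ages <- employees[["age"]]

# Define row names for the rows of the data frame.
rownames(employees) <- c("e1", "e2", "e3")

dims <- dim(employees)      # number of rows, then number of columns
n_rows <- nrow(employees)   # number of rows only
```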

The str function returns the basic structure of the dataset, providing an overall view of it. Examples: 1. In this example, the value of n is set to 3; hence, the resulting output contains the first 3 observations of the dataset.
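A short sketch of these inspection functions, using the built-in mtcars dataset as a stand-in for the book's data (an assumption made here so the example is self-contained):

```r
# mtcars ships with R's datasets package and stands in for the book's dataset.
str(mtcars)                          # structure: types and first values per column

first_three <- head(mtcars, n = 3)   # first 3 observations
default_head <- head(mtcars)         # n defaults to 6
last_rows <- tail(mtcars)            # last 6 observations by default
```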

Consider x as the total number of observations. The file has the following content. We will read the content from the file but will not store it in a data frame. By default, the first line is not automatically treated as a column header. Let us modify the syntax so that the first line is treated as a column header. The cells inside the table are separated by blank characters.
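The effect of the header setting can be seen with read.table on a small blank-separated file. The file contents and variable names below are invented for illustration.

```r
# Write a small blank-separated table to a temporary file (made-up contents).
path <- tempfile(fileext = ".txt")
writeLines(c("id name score", "1 Asha 82", "2 Ben 75"), path)

# Without a header, the first line is read as an ordinary data row
# and the columns get the default names V1, V2, V3.
no_header <- read.table(path, header = FALSE)

# With header = TRUE, the first line supplies the column names.
with_header <- read.table(path, header = TRUE)

print(no_header)
print(with_header)
```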

The merge function takes a data frame x as an argument. The statistical data type is the most common in R: a set of observations in which values for the variables are recorded. These input variables are used in measuring, controlling or manipulating the results of a program. Each variable differs in size and type. Based on the specific characteristics of the data, it can be explored in R in different ways.

You will learn about these methods in the following section. Exploratory data analysis (EDA) using R is an approach used to summarise and visualise the main characteristics of a dataset, and it differs from initial data analysis.

It focuses on:
- Exploring data by understanding its structure and variables
- Developing an intuition about the dataset
- Considering how the dataset came into existence
- Deciding how to investigate by providing a formal statistical method
- Extending better insights about the dataset
- Formulating a hypothesis that leads to new data collection
- Handling any missing values
- Investigating with more formal statistical methods.

The diagrams used in R are simple and can represent a large amount of data.

Which function in R is used to obtain the values of dimension?
Ans: The dim function is used to obtain the dimensions of the dataset.
Which function in R is used to open the data editor?
Ans: The edit(x) function opens the data editor in R.
What is the default value of n in the head(mydata) and tail(mydata) functions?
Ans: The default value of n is 6.

State a few graphical techniques used by EDA in R.

Table 4.

With the use of this function, mean, var, min, max, sd, quantile and range can be determined. The mean of the input data is found using: sapply sampledata, mean, na.
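One way such a call can look is sketched below. The source line is cut off after "na.", so the na.rm = TRUE argument used here is an assumption (it is the standard way to make mean ignore missing values), and the sampledata values are made up.

```r
# Made-up data with some missing values.
sampledata <- data.frame(
  height = c(150, 160, NA, 170),
  weight = c(55, NA, 65, 75)
)

# Apply mean to every column; na.rm = TRUE (assumed here) drops NA before averaging.
col_means <- sapply(sampledata, mean, na.rm = TRUE)

# The same pattern works for other statistics, for example range.
col_ranges <- sapply(sampledata, range, na.rm = TRUE)

print(col_means)
print(col_ranges)
```

Without na.rm = TRUE, both means would come back as NA because each column contains a missing value.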

Consider the same data frame, Employee. One function returns the position of the maximum for each row in the matrix. For summarising data, there are three other ways to group the data based on specific conditions or variables, after which the summary function can be applied. These are explained below.

A simple code to explain the ddply function uses the following data:

  V1 V2
1  1  4
2  2  5
3  3 NA

A slight difference can be found in some residual and prediction functions. It returns an object only when there is no missing value.
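The behaviour of "returns an object only when there is no missing value" matches base R's na.fail, and dropping incomplete rows matches na.omit. The sketch below shows both on the small V1/V2 data; pairing these two functions with the passage is an assumption, since the function names are cut off in the source.

```r
# The small data frame with one missing value in V2.
d <- data.frame(V1 = c(1, 2, 3), V2 = c(4, 5, NA))

# na.omit drops every row that contains a missing value.
cleaned <- na.omit(d)

# na.fail returns the object only if it has no missing values, otherwise it stops
# with an error; tryCatch turns that error into a marker string here.
failed <- tryCatch(na.fail(d), error = function(e) "error")
passed <- na.fail(cleaned)

print(cleaned)
print(failed)
```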

Functions for these invalid values include anyNA(x), anyInvalid(x) and is. This function is equivalent to any is. Unlike the other two functions, is.
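A minimal base R sketch of checking for invalid values. It uses only anyNA, is.na and is.nan, which exist in base R; the vector is made up, and the exact equivalences the passage had in mind are truncated in the source, so the pairing below is an assumption.

```r
# A made-up vector holding a missing value (NA) and a not-a-number value (NaN).
x <- c(1, NA, 3, NaN)

has_missing <- anyNA(x)        # TRUE if any element is missing
same_answer <- any(is.na(x))   # gives the same answer as anyNA(x)

na_flags <- is.na(x)           # TRUE for both NA and NaN
nan_flags <- is.nan(x)         # TRUE only for NaN
```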

This function is also equivalent to is. Obtain the min, max, median, mean, 1st quartile and 3rd quartile values using the summary function.

[Figure 4: summary output showing the median and outliers]

Practically, a person cannot have negative income. Negative income is an indicator of debt.

Hence, the income is given in negative values. However, such negative values need to be treated carefully. A check is required on how to handle these types of inputs. Here, the values fall outside the expected data range.

Outliers are considered to be incorrect values or errors in the input data. A negative value in the age field could be a sentinel value, and an outlier could be erroneous data, unusual data or a sentinel value. When a field is missing a proper input, an action is required to handle the scenario. The data range of the observation variable is the difference between the largest and the smallest data value in a dataset.

The data range can be calculated by subtracting the smallest value from the largest value, that is, range = largest value - smallest value. For example, the range, or duration, of rainfall can be computed in this way.
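The subtraction can be written directly in base R. The rainfall durations below are made-up values in minutes.

```r
# Made-up rainfall durations, in minutes.
rainfall_duration <- c(12, 45, 30, 8, 60)

# Data range: largest value minus smallest value.
data_range <- max(rainfall_duration) - min(rainfall_duration)   # 60 - 8 = 52

# range() returns the smallest and largest values; diff() subtracts them.
same_range <- diff(range(rainfall_duration))
```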

In the example above, the time duration of rainfall is helpful in predicting the probability of the duration of rainfall. Hence, there should be enough variation in the amount and the duration of the rainfall. In R, the freq function can be used to find the frequency distribution of vector inputs. In the example given, consider sellers as the dataset; the frequency distribution of the shop variable is the summary of the number of sellers in each shop. Mode can take both numeric and character input data.
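As a sketch of a frequency distribution, the code below uses base R's table function rather than freq, which comes from an add-on package; this substitution is deliberate so the example runs without extra installs. The sellers data is invented.

```r
# Made-up sellers dataset: one row per seller, with the shop they work in.
sellers <- data.frame(
  shop = c("A", "B", "A", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# table() counts how many sellers fall in each shop (a frequency distribution).
shop_counts <- table(sellers$shop)
print(shop_counts)
```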

R does not have a standard inbuilt function to calculate the mode of the given inputs. Hence, a user-defined function is required to calculate the mode in R. This function takes a vector as the input and returns the mode as the output value.
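One common way to write such a user-defined function is sketched below. The name get_mode is hypothetical, and on ties this version returns the value that appears first; the book's own function may differ.

```r
# get_mode is a hypothetical helper name; it works for numeric and character vectors.
get_mode <- function(v) {
  uniq <- unique(v)                    # distinct values, in order of first appearance
  counts <- tabulate(match(v, uniq))   # occurrences of each distinct value
  uniq[which.max(counts)]              # most frequent value (first one on ties)
}

numeric_mode <- get_mode(c(2, 3, 3, 5, 2, 3))   # 3
character_mode <- get_mode(c("a", "b", "b"))    # "b"
```

Note that base R does have a function called mode, but it reports the storage mode of an object (for example "numeric"), not the statistical mode.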

But in the case of finding the mode, a user-defined function is needed to obtain its value.

What are the possible na.
Ans: The possible na.
How are the missing values in the input vector removed?

Ans: na.
How is the data range obtained from a given input?

Visualisation engages the audience well; numerical values, by comparison, cannot represent a big dataset in an engaging manner. From Figure 4. The use of graphical representation to examine a given set of data is called visualisation. With visualisation, it is easier to:
- determine the peak value of the age of the customers (maximum value)
- estimate the existence of a sub-population
- determine the outlier values.

The graphical representation displays the maximum available information, from the lowest to the highest value. It also presents users with greater data clarity. For better use of visualisation, the right aspect ratio and scaling of data are needed. The basic questions are: Is the data bimodal or unimodal? Is it normal data or lognormal data? How does the given data vary? Generally, a visual representation of data helps in grasping the shape of the data distribution.

Summary statistics assume that the data is more or less close to a normal distribution. A visual representation also presents the values in a more understandable way.

It returns the mean customer age. With this statistical output, it can be concluded that the customer is a middle-aged person in the age range of 38 to 64 years. The additional black curve in Figure 4.

Usually, if a distribution contains more than two peaks, it is considered multimodal. The second black curve has the same mean age as the grey curve. However, here the curve concentrates on two sets of populations: a younger group aged between 20 and 30, and an older group. These two sets of populations have different patterns in their behaviour, and the probability of customers having health insurance also differs.

In such a case, a logistic regression or linear regression fails to represent the current scenario. Moving forward, the histogram makes the representation simpler as compared to density plots and is the preferred method for presenting findings from quantitative analysis. It looks similar to a bar graph. However, values are grouped into continuous ranges in a histogram. The height of a histogram bar represents the number of values occurring in a particular range.

R uses the hist(x) function to create simple histograms, where x is a numeric vector of values to be plotted. Example 1: A simple histogram can be created by just providing the input vector; the other parameters are optional.

In a density plot, by contrast, the area under the curve is equal to 1. Therefore, a point on the density plot corresponds to the fraction (or percentage) of the data that takes a particular value.
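A minimal sketch of the hist function follows. The ages vector, the breaks and the labels are invented, and the null graphics device is used only so that the example also runs in a non-interactive session; remove those two lines to see the plot on screen.

```r
pdf(NULL)   # null graphics device: nothing is written to disk

# Made-up customer ages.
ages <- c(22, 25, 27, 31, 34, 38, 41, 45, 52, 58, 63, 67)

# Group the values into 10-year ranges; bar height is the number of values per range.
h <- hist(ages,
          breaks = seq(20, 70, by = 10),
          main = "Customer ages",
          xlab = "Age",
          col = "grey")

# hist() also returns the counts behind the bars.
print(h$counts)

invisible(dev.off())
```

Only the input vector is required; breaks, main, xlab and col are optional refinements.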

The resulting value of the fraction is very small. A density plot is an effective way to assess the distribution of a variable. It provides a better reference for finding a parametric distribution. The basic syntax to create a plot is plot(density(x)), where x is a numeric vector.

Example 1: A simple density plot can be created by just passing the values to the plot function (Figure 4.). The plot function creates the density diagram. When the data range is wide, the distribution is concentrated on one side of the curve, and it is very difficult to determine the exact value at the peak. Example 2: For non-negative data, another way to plot the curve is to draw the distribution on a logarithmic scale, which is equivalent to plotting the density of the log10 of the input values.

For Figure 4. Hence, in order to simplify the visual representation, a log10 scale is used. In Figure 4. For widely spread data, this logarithmic approach can give a much clearer result.
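Both density plots can be sketched as follows. The income values are simulated from a lognormal distribution purely for illustration, the variable names are hypothetical, and the null graphics device only keeps the example runnable without a display.

```r
pdf(NULL)   # null graphics device; remove to see the plots on screen

# Simulated, non-negative, widely spread data (illustration only).
set.seed(42)
income <- rlnorm(500, meanlog = 10, sdlog = 1)

# Plain density plot: most of the mass is squeezed to one side of the curve.
d <- density(income)
plot(d, main = "Density of income")

# The area under a density curve is (approximately) 1.
area <- sum(d$y) * diff(d$x[1:2])

# Density of log10(income): the wide range becomes much easier to read.
d_log <- density(log10(income))
plot(d_log, main = "Density of log10(income)", xlab = "log10(income)")

invisible(dev.off())
```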

Here, the logarithmic scale is given along the X-axis, while the Y-axis denotes the density values.

Both vertical and horizontal bars can be drawn using R. It also provides an option to colour the bars in different colours. The length of a bar is directly proportional to the value it represents.

R uses the barplot function to create a bar chart. The basic syntax for creating a bar chart using R is barplot H, xlab, ylab, main, names.

Some basic bar charts commonly used in R are:
- Simple bar chart
- Grouped bar chart
- Stacked bar chart

1. Simple Bar Chart
A simple bar chart is created by just providing the input values and a name to the bar chart. The following code creates and saves a bar chart using the barplot function in R.
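Since the original listing is not reproduced here, the following is a stand-in sketch of creating and saving a simple bar chart. The sales figures, labels and output file are assumptions; names.arg is barplot's parameter for the bar labels.

```r
# Made-up input values.
sales <- c(12, 18, 9)
months <- c("Jan", "Feb", "Mar")

# Open a file-backed graphics device so the chart is saved (here, to a temporary PDF).
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Draw the bar chart; barplot() returns the x positions of the bar midpoints.
mids <- barplot(sales,
                names.arg = months,
                xlab = "Month",
                ylab = "Units sold",
                main = "Monthly sales",
                col = "steelblue")

# Closing the device writes the file to disk.
invisible(dev.off())
```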

This can be shown with a sample program (Figure 4., with Categories on the axis). Here the legend is included on the top right side of the bar chart.

Stacked Bar Chart
A stacked bar chart is similar to a grouped bar chart, where multiple inputs can take different graphical representations. Instead of grouping the values side by side, the stacked bar chart stacks the segments of each bar one after the other, based on the input values.

Just Remember: Bar charts are an efficient way of presenting a huge collection of data.
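The difference between grouped and stacked bar charts comes down to barplot's beside argument, as the sketch below shows. The matrix of counts, its labels and the colours are made up, and the null graphics device is there only so the example runs without a display.

```r
pdf(NULL)   # null graphics device; remove to see the charts on screen

# Made-up counts: rows are sales channels, columns are quarters.
counts <- matrix(c(5, 3, 4, 6, 2, 7), nrow = 2,
                 dimnames = list(c("Online", "Store"), c("Q1", "Q2", "Q3")))
bar_cols <- c("grey30", "grey70")

# Grouped bar chart: beside = TRUE places the bars of each column next to each other.
grouped <- barplot(counts, beside = TRUE, col = bar_cols, main = "Grouped bar chart")
legend("topright", legend = rownames(counts), fill = bar_cols)

# Stacked bar chart: beside = FALSE (the default) stacks the values of each column.
stacked <- barplot(counts, beside = FALSE, col = bar_cols, main = "Stacked bar chart")
legend("topright", legend = rownames(counts), fill = bar_cols)

invisible(dev.off())
```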

Here the values are represented on the X and Y axes, and the legend function is used to summarise the data used in the chart; the legend can be positioned anywhere in the chart.

What are the three types of bar charts used in R?
Ans: Simple bar chart, grouped bar chart and stacked bar chart are the three types of bar charts used in R.

What are the advantages of using data visualisation?
Ans: The advantages of using data visualisation are:
- To determine the peak value of the age of the customer (maximum value)
- To estimate the existence of the subpopulation
- To determine the outliers easily.
Which function is used to create a bar chart?
Ans: The barplot function is used to create a bar chart. Syntax of barplot is: barplot H, xlab, ylab, main, names.

Data in R are sets of organised information. We deal mostly with the statistical data type in R. Exploratory Data Analysis (EDA) involves analysing datasets in order to summarise their main characteristics in the form of visual representations. Some of the graphical techniques used by EDA are the box plot, histogram, scatter plot, Pareto chart, etc.

Outliers are considered to be incorrect or erroneous input data. NA is neither a string nor a numeric value, but is used to specify missing data. Frequency is a summary of data occurrences in a collection of non-overlapping types. Mode is similar to frequency, except that the mode returns the value with the highest number of occurrences in the dataset. Mean is generally defined as summing up the input values and dividing the sum by the number of inputs. Median is the middle value of the given inputs.

A histogram is a graphical illustration of the distribution of numerical data in successive numerical intervals of equal size. A bar chart is a pictorial representation of statistical data. A simple bar chart is created by just providing the input values and the name of the bar chart. A stacked bar chart is similar to a grouped bar chart, where multiple inputs can take different graphical representations.

Key Terms
Bar chart: A bar chart is a pictorial representation of statistical data.
Data range: Data range is the difference between the largest and smallest data values in a dataset.
Data visualisation: The use of graphical representation to examine a given set of data is called data visualisation.
Histogram: A histogram is a graphical illustration of the distribution of numerical data in successive numerical intervals of equal size.

Mean: Mean is generally defined as summing up the input values and dividing the sum by the number of inputs.
Median: Median is the middle value of the given inputs.
Mode: Mode is similar to frequency, except that the mode returns the value with the highest number of occurrences in a dataset.
Outliers: Outliers are considered to be incorrect or erroneous input data.

How many columns are there in the given output?

What will be the output of the following code?
Which function is used to open a data editor?
Which is not an invalid value in R?
Which one of the following is used to drop missing values?
Which parameter is used to mention the width of each bar in a histogram?

She has designed and delivered several large-scale competency development programs across the globe involving organizational competency need analysis, conceptualization, design, development and deployment of competency development programs.

She is an educator by choice and vocation, and has rich experience in both academia and the software industry. Document-oriented data store. It can store up to 4 MB of data. Metadata: file Data chunks chunks collection scalability. Published by McGraw Hill.


