Common mistakes in data analysis
Big data has emerged as society as a whole goes digital, driven in particular by social networks and the spread of sensing devices. Advances in cloud computing and search engines have made it feasible to analyse big data efficiently. The core challenge is how to extract valuable information quickly from large volumes of varied data. Driving corporate strategy through data analysis has become the norm, so what are the common errors in the data analysis process?
Common errors in the process of data analysis:
1. The analysis goal is not clear
“Massive data does not automatically produce massive wealth.” Many data analysts lack a clear analysis goal and get lost in the sea of data: either the wrong data is collected, or the collected data is incomplete, and the results of the analysis end up being inaccurate.
If, however, you lock in the target at the start and ask exactly what you want to analyse, then working backwards from the desired result tells you what data you need to support the analysis, which in turn determines the data sources, collection methods, and analysis metrics.
2. Errors occur when collecting data
When the software or hardware that captures data malfunctions, errors creep in. For example, if a usage log is not synchronised with the server, user behaviour data from a mobile application may be lost. Likewise, if we record with hardware sensors such as microphones, the recordings may pick up background noise or electrical interference.
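One simple sanity check for lost log data is to scan the timestamps of consecutive entries for suspiciously long silences. The sketch below uses a hypothetical in-memory log format (timestamp, event); the entries, the 10-minute threshold, and the `find_gaps` helper are illustrative assumptions, not a real logging API.

```python
from datetime import datetime, timedelta

# Hypothetical usage-log entries from a mobile app: (timestamp, event) pairs.
log = [
    (datetime(2024, 1, 1, 10, 0), "open"),
    (datetime(2024, 1, 1, 10, 1), "click"),
    (datetime(2024, 1, 1, 10, 45), "click"),  # long silence before this entry
    (datetime(2024, 1, 1, 10, 46), "close"),
]

def find_gaps(entries, max_gap=timedelta(minutes=10)):
    """Return pairs of consecutive timestamps separated by more than max_gap.

    Gaps like these can indicate that entries were lost, e.g. because
    the client log was never synchronised with the server.
    """
    gaps = []
    for (t1, _), (t2, _) in zip(entries, entries[1:]):
        if t2 - t1 > max_gap:
            gaps.append((t1, t2))
    return gaps

print(find_gaps(log))  # flags the 44-minute silence between 10:01 and 10:45
```

A check like this does not recover the lost data, but it tells you where the record is untrustworthy before you analyse it.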
3. The sample is not representative
When performing data analysis, there must be a credible data sample; this is the key to a reliable result. If the sample is not representative, the final analysis is of no value. A sample therefore also needs to be complete and comprehensive: substituting a single, unrepresentative slice of the data for the whole can make the analysis completely wrong.
For example, Twitter users tend to be better educated, have higher incomes, and be somewhat older. Using such a biased sample to predict the box office of a movie whose target audience is young people may lead to an unreasonable conclusion. Make sure the sample you obtain represents the research population; otherwise your analysis and conclusions lack a solid foundation.
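The effect of a biased sample can be seen with a few lines of simulation. The sketch below uses entirely made-up numbers, assuming a population where 70% of moviegoers are under 30, and compares a simple random sample against a sample drawn from a platform whose users skew older.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed population for illustration: 70% of moviegoers are under 30.
population = ["under_30"] * 7000 + ["30_plus"] * 3000

# A random sample drawn from the whole population.
random_sample = random.sample(population, 200)

# A biased sample: drawn only from a platform whose users skew older (30% under 30).
biased_pool = ["under_30"] * 3000 + ["30_plus"] * 7000
biased_sample = random.sample(biased_pool, 200)

def share_under_30(group):
    """Fraction of the group that is under 30."""
    return sum(1 for g in group if g == "under_30") / len(group)

print(round(share_under_30(population), 2))     # 0.70 by construction
print(round(share_under_30(random_sample), 2))  # close to 0.70
print(round(share_under_30(biased_sample), 2))  # close to 0.30: unrepresentative
```

Any conclusion about young audiences drawn from the biased sample would be off by a wide margin, no matter how carefully the later analysis is done.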
4. Confusing correlation with causality
Many data analysts assume, when dealing with big data, that correlation directly implies causality. Using big data to uncover the correlation between two variables is usually good practice, but habitually reading correlations as causal leads to false predictions and invalid decisions. To achieve good results, we must understand the fundamental difference between the two: correlation means that X and Y are observed to change together, while causality means that X leads to Y. In data analysis these are two completely different things, yet many analysts overlook the distinction.
In data science, correlation is not causation: if two variables move together, it does not mean that one caused the other.
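A classic illustration is a confounder: two variables that never influence each other can still correlate strongly because a third variable drives both. The sketch below simulates the textbook example of ice-cream sales and drowning incidents, both driven by temperature; all the coefficients and noise levels are invented for the demonstration, and Pearson correlation is computed by hand to stay dependency-free.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Confounder: temperature drives both series. Neither series causes the other.
temperature = [random.uniform(10, 35) for _ in range(200)]
ice_cream_sales = [2.0 * t + random.gauss(0, 3) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 1) for t in temperature]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(ice_cream_sales, drownings)
print(round(r, 2))  # strong positive correlation, despite no causal link
```

An analyst who stopped at the correlation might conclude that banning ice cream would prevent drownings; controlling for temperature makes the spurious relationship disappear.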
5. Divorce from business reality
A professional data analyst must be thoroughly familiar with the industry, the business processes, and the domain knowledge of the project being analysed, because the result of the analysis is meant to solve problems in the project or provide reference opinions for decision makers. If business knowledge cannot be combined with the analysis work, and the analyst divorces the work from business reality and focuses only on the data, the results will have no reference value.
6. Excessive passion for advanced analysis
Some data analysts excessively pursue so-called cutting-edge, advanced, and fashionable analysis techniques. When facing a project, the first thing they reach for is a cutting-edge technique, rather than starting from the real needs of the problem and choosing a reasonable, cost-effective method. If a simple approach can deliver the same result, there is no need to invoke a complex data analysis model.