The 8 best open source tools for data mining

Data mining, also known as data exploration, is a step in Knowledge Discovery in Databases (KDD): the process of mining and analyzing large amounts of data to extract information from it. Its applications include market segmentation, such as identifying the characteristics of customers who buy a specific product from a specific brand, and fraud detection, such as identifying transaction patterns that may indicate online fraud. In this article, we have compiled the 8 best open source tools for data mining.


1. WEKA

As an open data mining platform, WEKA brings together a large number of machine learning algorithms for data mining tasks, including data preprocessing, classification, regression, clustering, association rules, and visualization in an interactive interface.

2. RapidMiner

RapidMiner is a leading data mining solution built on advanced technology. It covers a wide range of data mining tasks and simplifies the design and evaluation of the data mining process.

3. Orange

Orange is a component-based data mining and machine learning software package. It offers a friendly, powerful, fast, and versatile visual programming front end for exploratory data analysis and visualization, with Python bindings for scripting. It contains a complete set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration. It is written in C++ and Python, and its graphical interface is built on the cross-platform Qt framework.

4. KNIME

KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open source platform for data integration, data processing, data analysis and data exploration.

5. jHepWork

jHepWork is a complete object-oriented framework for scientific data analysis. Jython macros are used to display one-dimensional and two-dimensional histograms. The program includes many tools for interacting with two-dimensional and three-dimensional scientific graphics.

6. Apache Mahout

Apache Mahout is a brand new open source project developed by the Apache Software Foundation (ASF). Its main goal is to create scalable machine learning algorithms that developers can use for free under the Apache license. The project has reached its second year and currently has only one public release. Mahout contains implementations of clustering, classification, collaborative filtering (CF), and evolutionary programming. In addition, by using the Apache Hadoop library, Mahout can scale effectively in the cloud.


7. ELKI

ELKI (Environment for Developing KDD-Applications Supported by Index-Structures) is a data mining platform similar to WEKA, written in Java, with a graphical user interface. It is mainly used for clustering and finding outliers.
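To make "finding outliers" concrete, here is a minimal Python sketch of one simple technique, the modified z-score based on the median absolute deviation. It is not ELKI's API (ELKI is a Java tool with its own algorithms); the function name, the sample readings, and the 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    values = list(values)
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the values equal the median; no usable scale
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical sensor readings with one obviously unusual value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
outliers = mad_outliers(readings)
print(outliers)
```

The median-based score is used here instead of the mean and standard deviation because a single extreme value would otherwise inflate the spread estimate and hide itself.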

8. Rattle

Rattle (the R Analytical Tool To Learn Easily) provides statistical and visual summaries of data, transforms data into forms that are easy to model, builds unsupervised and supervised models from the data, presents model performance graphically, and scores new data sets.

Common mistakes in data analysis

Big data has emerged because society as a whole is going digital, driven especially by the growth of social networks and sensing devices of all kinds. The development of cloud computing and search engines has made it possible to analyse big data efficiently. The core issue is how to quickly obtain valuable information from a large variety of data. Steering corporate strategy and operations through data analysis has become the norm, so what are the common errors in the data analysis process?

 Common errors in the process of data analysis:

  1. The analysis goal is not clear

  “Massive data does not actually produce massive wealth.” Many data analysts have no clear analysis goal, so they get lost in the mass of data. Either the wrong data is collected or the collected data is incomplete, and as a result the analysis is not accurate enough.

  If the goal is fixed at the outset, however, the question becomes: what exactly do you want to analyze? Thinking in a results-oriented way tells you what kind of data you need to support the analysis, which in turn determines the data sources, collection methods, and analysis indicators.

  2. Errors occur when collecting data

  When the software or hardware that captures the data malfunctions, errors are introduced. For example, if the usage log is not synchronized with the server, user behavior information from the mobile application may be lost. Likewise, if we use hardware sensors such as microphones, our recordings may pick up background noise or interference from other electrical signals.
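One way to notice lost log events is to number them on the client and look for gaps on the server. The sketch below assumes each event carries a per-session sequence number; that field, the function name, and the sample data are hypothetical and not taken from any particular logging system.

```python
def missing_sequence_numbers(received):
    """Return the sequence numbers absent between the lowest and highest seen."""
    if not received:
        return []
    seen = set(received)
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Sequence numbers of the events that actually reached the server (made-up data).
server_log = [1, 2, 3, 6, 7, 10]
lost = missing_sequence_numbers(server_log)
print(lost)
```

A check like this only finds gaps inside the received range. Events lost at the very end of a session need a separate signal, such as a final "session closed" event that reports the last number sent.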

  3. The sample is not representative

  Data analysis needs a credible data sample. This is the key to a reliable result. If the sample is not representative, the final analysis will be of no value, so the sample must also be complete and comprehensive. When a single, non-representative subset is used in place of the full data, the conclusions drawn from such one-sided data may be completely wrong.

  For example, Twitter users may be more educated, have higher incomes, and be somewhat older. If such a biased sample is used to predict the box office of a movie whose target audience is young people, the conclusion may not be reasonable. So make sure that the sample data you get is representative of the research population. Otherwise, your analysis and conclusions lack a solid foundation.
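A quick sanity check is to compare how each group is represented in the sample with its share of the target population. The following is a minimal sketch in Python; the age groups, the population shares, and the sample counts are invented numbers used only to show the comparison.

```python
from collections import Counter


def group_shares(labels):
    """Fraction of the sample that falls in each group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def representation_gaps(sample_labels, population_shares):
    """Sample share minus population share for every population group."""
    sample = group_shares(sample_labels)
    return {
        group: sample.get(group, 0.0) - share
        for group, share in population_shares.items()
    }


# Hypothetical target audience of the movie, by age group.
population_shares = {"18-24": 0.50, "25-34": 0.30, "35+": 0.20}
# Hypothetical age groups of the 100 social media users in the sample.
sample_labels = ["18-24"] * 10 + ["25-34"] * 40 + ["35+"] * 50

gaps = representation_gaps(sample_labels, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

Large negative gaps mark groups that are under-represented in the sample. In this made-up case the youngest group, which is the movie's main audience, is the one that is missing most.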

  4. Correlation and causality confusion

  Many data analysts assume that correlation implies causation when dealing with big data. Using big data to understand the correlation between two variables is usually good practice, but always reading that correlation as cause and effect can lead to false predictions and invalid decisions. To achieve good results in data analysis, we must understand the fundamental difference between correlation and causality. Correlation means that changes in X and Y are observed together, while causality means that X leads to Y. In data analysis, these are two completely different things, but many data analysts overlook the difference.

  In data science, correlation is not causation. If two variables are related to each other, it does not mean that one caused the other.
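The difference can be seen in a small simulation. In the Python sketch below, two variables x and y never influence each other, yet they are strongly correlated because both depend on a hidden third variable z. The data is synthetic, and the variable names and noise levels are assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
z = [rng.gauss(0, 1) for _ in range(2000)]   # hidden common cause
x = [zi + rng.gauss(0, 0.5) for zi in z]     # depends on z only
y = [zi + rng.gauss(0, 0.5) for zi in z]     # depends on z only

raw = pearson(x, y)
# Subtract the common cause: what remains of x and y is independent noise.
controlled = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)
print(f"correlation of x and y: {raw:.2f}")
print(f"after removing z:       {controlled:.2f}")
```

The first correlation is high even though changing x would do nothing to y. Once the shared cause is removed, the relationship largely disappears, which is why a correlation alone cannot justify a causal decision.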

  5. Divorce from business reality

  A professional data analyst must be very familiar with the industry, the business processes, and the background of the project being analysed, because the purpose of data analysis is to solve problems in the project or to give industry decision makers something to act on. If business knowledge and analysis work are not combined well, and the analyst looks only at the data while ignoring business reality, the results will have no reference value.

  6. Overenthusiasm for advanced analysis

  Some data analysts chase so-called cutting-edge, advanced, and fashionable analysis techniques. When facing an analysis project, the first thing they think of is which cutting-edge technique to apply, rather than starting from the real needs of the problem and choosing a reasonable, cost-effective technique. If you can get the same result in a simple way, there is no need to use a complex data analysis model.
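A practical habit that follows from this is to evaluate a simple baseline on held-out data before reaching for anything more elaborate. The Python sketch below compares a straight-line fit with a more flexible model that memorizes the training data (1-nearest-neighbour, used here only as a stand-in for a "complex" model). The data is synthetic and all names are illustrative assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest_neighbour(xs, ys):
    """Predict the y of the closest training x (memorizes the training set)."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)  # fixed seed so the run is repeatable


def make_data(n):
    """Synthetic data: a linear trend plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


train_x, train_y = make_data(200)
test_x, test_y = make_data(100)

line_mae = mean_abs_error(fit_line(train_x, train_y), test_x, test_y)
knn_mae = mean_abs_error(fit_nearest_neighbour(train_x, train_y), test_x, test_y)
print(f"simple line:        {line_mae:.2f}")
print(f"nearest neighbour:  {knn_mae:.2f}")
```

On data like this, where the underlying relationship really is simple, the flexible model mostly memorizes noise and does no better on new data. Real projects will differ, but the comparison itself is cheap and shows whether extra complexity is paying for itself.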

Big news! A collection of journal blacklists and early-warning lists from institutions across China!

Recently, the First Hospital of Jilin University compiled a list of early-warning journals:

  1. Medicine
  2. International journal of clinical and experimental medicine
  3. PLoS one
  4. Scientific reports
  5. Oncology letters
  6. Experimental and therapeutic medicine
  7. Biochemical and biophysical research communications
  8. British journal of biomedical science
  9. Cancer radiotherapies
  10. International journal of molecular medicine
  11. International journal of osteopathic medicine
  12. Journal of genetic counseling
  13. Material science in semiconductor processing
  14. Journal of cellular Biochemistry
  15. Biomedicine and pharmacotherapy
  16. Journal of cellular physiology
  17. Life Sciences
  18. European review for medical and pharmacological sciences
  19. Cancer biomarkers
  20. International journal of clinical and experimental pathology
  21. Cancer management and research
  22. American journal of cancer research
  23. American journal of translational research
  24. Biomed research international
  25. Bioscience reports
  26. International journal of biochemistry & cell biology
  27. International journal of oncology
  28. Journal of cancer
  29. Journal of cellular and molecular medicine
  30. Journal of clinical medicine
  31. Journal of experimental & clinical Cancer research
  32. Journal of international medical research
  33. Molecular medicine reports
  34. Oncology research
  35. Oncotargets and therapy
  36. Theranostics
  37. World journal of Gastroenterology
  38. Artificial cells nanomedicine and biotechnology
  39. Experimental and molecular pathology
  40. Biofactors
  41. Brazilian journal of medical and biological research
  42. International journal of immunopathology and pharmacology
  43. Medical science monitor
  44. Bio-medical research Tokyo

On December 31, 2020, the Chinese Academy of Sciences officially released the “International Journal Early Warning List (Trial)”:

  1. Metals
  2. Coatings
  3. Materials
  4. Journal of nanoscience and nanotechnology
  5. Minerals
  6. Atmosphere
  7. Artificial cells nanomedicine and biotechnology
  8. Advances in civil engineering
  9. International journal of energy research
  10. Mathematical problems in engineering
  11. Sensors
  12. Energies
  13. Applied sciences-Basel
  14. Polymers
  15. Electronics
  16. Processes
  17. Complexity
  18. Desalination and water treatment
  19. International journal of electrochemical science
  20. Catalysts
  21. Molecules
  22. Natural product research
  23. Sustainability
  24. Water
  25. IEEE access
  26. Agronomy-Basel
  27. Journal of cellular biochemistry
  28. Journal of cellular physiology
  29. Bioscience reports
  30. Biomed research international
  31. Plants-Basel
  32. Cells
  33. Boundary value problems
  34. Advances in difference equations
  35. Mathematics
  36. European review for medical and pharmacological sciences
  37. International journal of clinical and experimental pathology
  38. Medicine
  39. International journal of clinical and experimental medicine
  40. Biomedicine and pharmacotherapy
  41. Experimental and molecular pathology
  42. Brazilian journal of medical and biological research
  43. International journal of immunopathology and pharmacology
  44. Medical science monitor
  45. American journal of translational research
  46. Journal of biomaterials and tissue engineering
  47. Aging-US
  48. Life sciences
  49. Journal of clinical medicine
  50. International journal of environmental research and public health
  51. Acta medica mediterranea

List of journals not supported by the First Affiliated Hospital of Sun Yat-sen University:

1. European review for medical and pharmacological sciences

2. Cancer management and research

3. Bioscience reports

4. Cancer biomarkers

5. Journal of international medical research

6. Journal of cellular biochemistry

7. Biochemical and biophysical research communications

8. Biomedicine and pharmacotherapy

9. American journal of cancer research

10. Journal of cellular physiology

11. Life sciences

12. Journal of cellular and molecular medicine

13. Theranostics

14. Journal of experimental and clinical cancer research

15. Journal of cancer

16. International journal of molecular medicine

17. American journal of translational research

18. Biomed research international

19. Journal of clinical medicine

20. Oncotarget

21. Medicine

22. Scientific reports

23. Tumor biology

24. International journal of biochemistry and cell biology

25. Biomedical research-INDIA

26. Cellular physiology and biochemistry

27. International journal of clinical and experimental medicine

28. International journal of clinical and experimental pathology

29. Experimental and therapeutic medicine

30. Molecular medicine reports

31. Medical science monitor

32. Oncology letters

33. International journal of oncology

34. World journal of gastroenterology

35. Oncology research

36. Oncotargets and therapy

37. PLoS one