Data Science
Data Science - Data Wrangling
Extracting and cleaning data
Data is transforming the world every day.
The ability to extract and clean data is called Data Wrangling or Data Munging.
It is said that Data Scientists spend nearly 70% of their time on data wrangling.
Let's take up a scenario to understand this better.
Suppose you need to analyse
why life in city A is better than life in city B.
All the data is present on a website.
First, you collect the data from the source.
Next, you need to store the data, so you need a database.
After arranging the data, you find some values missing.
You need a technique to complete the collected information.
That technique is data wrangling.
The Goals of Data Wrangling
It should provide precise and actionable data to Business Analysts in a timely manner
Reduce the time spent on collecting and arranging data
Enable Data Scientists to focus mainly on analysis rather than on wrangling data
Drive better decisions based on data in a short time span
You now have a basic idea of what Data Wrangling is,
so let's review what we learned.
Rearrange the following in proper order
1. data wrangling
2. acquiring data
3. missing values
4. storing data
Answer :
2 - 4 - 3 - 1
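The ordering above can be sketched in plain Python. Everything below — the city names, column names, and readings — is hypothetical, invented purely to illustrate the four steps:

```python
# Step 2: acquiring data (hard-coded rows standing in for data
# collected from a website).
raw_rows = [
    {"city": "A", "air_quality": 42, "avg_commute_min": 25},
    {"city": "B", "air_quality": None, "avg_commute_min": 40},  # has a gap
]

# Step 4: storing data (a list acting as a stand-in for a database table).
database = list(raw_rows)

# Step 3: finding missing values.
missing = [row["city"] for row in database if row["air_quality"] is None]

# Step 1: data wrangling -- completing the collected info, here by
# filling the gap with the mean of the known readings.
known = [row["air_quality"] for row in database if row["air_quality"] is not None]
mean_aq = sum(known) / len(known)
for row in database:
    if row["air_quality"] is None:
        row["air_quality"] = mean_aq

print(missing)                      # cities that had a missing reading
print(database[1]["air_quality"])   # the filled-in value
```

Filling a gap with the mean is only one of many possible strategies; the point here is the order of the steps, not the fill rule.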
Process of data wrangling
Now let's dive into the key steps of the Data Wrangling process, with basic examples to get you started.
1 - Acquiring Data
The first and most important step is, of course, acquiring and sorting data.
However, before finding data, you must know the following properties.
Not All Data Is Created Equal
When first exploring data, you must ask yourself a small set of questions:
Does the data appear to be regularly updated?
Is there any other source where you can verify the data?
If your answers to these questions are yes, then you are on the right track,
whereas if the answers are no, you will have to dig a little deeper.
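As a minimal sketch, the first question — does the data appear to be regularly updated? — can even be checked programmatically. The CSV snippet and its `last_updated` column below are invented for illustration; a real source will have its own layout:

```python
import csv
import io

# A hypothetical catalogue export listing datasets and when each
# was last updated.
csv_text = """dataset,last_updated
city_air_quality,2024-05-01
city_commute_times,2019-03-15
"""

# Flag datasets whose last update is older than an arbitrary
# freshness cutoff chosen for this example.
reader = csv.DictReader(io.StringIO(csv_text))
stale = []
for row in reader:
    year = int(row["last_updated"].split("-")[0])
    if year < 2023:
        stale.append(row["dataset"])

print(stale)  # datasets that fail the "regularly updated" check
```

A stale entry does not disqualify a source on its own, but it tells you where to start digging.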
Fact Checking
Fact checking your data is paramount to the validity of your reporting.
Once you have validated and fact-checked your data,
it will be easier to determine its validity in the future.
Where to Find Data
You are not going to ring everyone’s telephone to collect data.
There are many sources from which you can collect your data, including Government data,
data from NGOs,
educational or university data,
medical or scientific data, crowdsourced data, and so on.
Now let’s jump to our main step, which is Data Cleaning.
2 - Data Cleaning
Cleaning up data is not a glamorous task,
but it is an essential part of Data Wrangling.
To become a Data Cleaning expert you must have precise knowledge of the particular field and, on top of that, patience. Yes, patience.
Data Cleaning refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data
and then replacing, modifying, or deleting the dirty or coarse data.
After cleansing, a data set should be consistent with other similar data sets in the system.
The data is audited with the use of statistical and database methods to detect contradictions.
This is performed by a sequence of operations on the data known as the workflow.
After executing the cleansing workflow, the results are inspected to verify correctness.
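A toy version of such a cleansing workflow, in plain Python, might look like this. The records, field names, and cleaning rules are all illustrative assumptions, not a prescription:

```python
# Hypothetical survey records for the city study, containing
# incomplete, incorrect, and irrelevant entries.
records = [
    {"name": "alice", "age": 34, "city": "A"},
    {"name": "", "age": 29, "city": "A"},        # incomplete: no name
    {"name": "carol", "age": -5, "city": "B"},   # incorrect: negative age
    {"name": "dave", "age": 41, "city": "Z"},    # irrelevant: city not studied
]

# The workflow: delete irrelevant rows, replace missing values,
# and modify obvious entry errors.
cleaned = []
for row in records:
    if row["city"] not in {"A", "B"}:
        continue                      # delete: outside the study
    if not row["name"]:
        row["name"] = "unknown"       # replace: fill the missing name
    if row["age"] < 0:
        row["age"] = abs(row["age"])  # modify: assume a sign-entry error
    cleaned.append(row)

# Audit step: inspect the result to verify correctness.
assert all(r["age"] >= 0 and r["name"] for r in cleaned)
print(len(cleaned))  # rows surviving the workflow
```

The final assertion plays the role of the inspection step: after the workflow runs, the result is checked for consistency before it is used.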
By now, you must know why Data wrangling is the most important task. Without clean and robust data, there is no Data Science.
Let's have a quick recap of what we learned
Which properties do we need to know before acquiring data?
Select the right answer
A. not all data is created equal
B. fact checking
C. where to find data
D. all of the above
Answer : D
Data cleaning is:
Select the right answer
A. Large collection of data mostly stored in a computer system
B. The removal of noise errors and incorrect input from a database
C. The systematic description of the syntactic structure of a specific database. It describes the structure of the attributes of the tables and foreign key relationships
D. None of these
Answer : B
Categories of Data
What types of data are we talking about?
Data can mean many different things, and there are many ways to classify it.
Two of the most common are:
1. Primary and Secondary:
Primary data is the data that you collect or generate.
Secondary data is created by other researchers.
2. Qualitative and Quantitative:
Qualitative refers to text, images, video, sound recordings, observations, etc.
Quantitative refers to numerical data.
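As a small illustration, qualitative and quantitative fields in a record can often be separated by value type. The record below is hypothetical:

```python
# A made-up record about city A, mixing qualitative (text) and
# quantitative (numeric) fields.
record = {
    "city": "A",                # qualitative: text
    "population": 1_200_000,    # quantitative: numeric
    "review": "clean streets",  # qualitative: text
    "rating": 4.5,              # quantitative: numeric
}

# Split the fields by whether their values are numeric.
quantitative = {k for k, v in record.items() if isinstance(v, (int, float))}
qualitative = set(record) - quantitative

print(sorted(quantitative))
print(sorted(qualitative))
```

Type alone is only a first approximation — a numeric postal code, for instance, is really qualitative — but it is a useful starting point when profiling a new dataset.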
There are typically five main categories that Data can be sorted into for management purposes.
Observational
- Captured in real-time
- Cannot be reproduced or recaptured. Sometimes called ‘unique data’.
- Examples include sensor readings, telemetry, survey results, images, and human observation.
Experimental
- Data from lab equipment and under controlled conditions
- Often reproducible, but can be expensive to do so
- Examples include gene sequences, chromatograms, magnetic field readings, and spectroscopy.
Simulation
- Data generated from test models studying actual or theoretical systems
- Models and metadata where the input is more important than the output data
- Examples include climate models, economic models, and systems engineering.
Derived or compiled
- The results of data analysis, or aggregated from multiple sources
- Reproducible (but very expensive)
- Examples include text and data mining, compiled databases, and 3D models
Reference or canonical
- Fixed or organic collection datasets, usually peer-reviewed, and often published and curated
- Examples include gene sequence databanks, census data, chemical structures.
Data can come in many forms.
Some common ones are text, numeric, multimedia, models, audio, code, software, discipline-specific (i.e., FITS in astronomy, CIF in chemistry), video, and instrument.
Telemetry and survey results fall under:
Select the right answer
A. experimental data
B. sequential data
C. observational data
D. canonical data
Answer : C.
Which category of data is completely reproducible?
Select the right answer
A. experimental data
B. compiled data
C. sequential data
D. canonical data
Answer : B.