Reading tabulated data with numpy

Numpy excels at matrix-type calculations (see the next section); however, it can also be used to read in tabulated datasets. Remember, though, that a numpy array must hold a single data type (e.g. float), so this won't work if your data is a mix of int, float and string types.

Right click this link and save the file to a location on your machine.

To read in text data with numpy, we need to use the numpy.loadtxt function:

data=numpy.loadtxt('path/to/file/radar_data.csv', skiprows=1, delimiter=',', dtype='f')

It is important that you specify the delimiter (as the file is a .csv, we know this to be a ','). You will also notice that we have used the skiprows option of numpy.loadtxt, enabling us to skip the header matter included in radar_data.csv. This is important because the header is string data, which differs from the numeric columns, whose format we have set to float using the dtype option (have a look here for more information on data types in numpy). Remember that all of the documentation, including details on additional option flags, can be found by typing help(numpy.loadtxt).
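
To experiment with these options without the radar file to hand, you can point numpy.loadtxt at an in-memory stand-in. This is a minimal sketch with made-up values and only three of the eleven columns, but the skiprows, delimiter and dtype flags behave exactly as they do with the real file:

```python
import io
import numpy

# A tiny stand-in for radar_data.csv (hypothetical values; the real file
# has 11 columns and a one-line header).
csv_text = """UTC_Seconds_Of_Day,Lat(deg),Long(deg)
44310.25,82.3,-60.1
44311.25,82.4,-60.2
"""

# Skip the header line and read the numeric columns as 32-bit floats.
data = numpy.loadtxt(io.StringIO(csv_text), skiprows=1, delimiter=',', dtype='f')
print(data.shape)  # (2, 3)
```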

Note that numpy.loadtxt() doesn't handle header information well - hence skipping that first line. For now, just remember the column layout:

Column Data
1 UTC_Seconds_Of_Day
2 Lat(deg)
3 Long(deg)
4 WGS84_Ellipsoid_Height(m)
5 S-to-N_Slp
6 W-to-E_Slp
7 RMS_Fit(cm)
8 Num_ATM_obs_Used
9 Num_Of_ATM_obs_Removed
10 Dist_Block_To_Right_aircraft(m)
11 Track_ID

We will touch on accessing data using column names later on (although this kind of approach is better handled using a package like pandas).

Getting details about the array

By assigning the result of the function call to data, we have created a new array object. We can now use the array object's methods and attributes to get some information about the data we have just read in:

>>> data.shape
(6080L, 11L)

You can also find out details about specific columns within the data array object - for example, the maximum value in the second column can be found using:

>>> data[:,1].max()
82.329346000000001

and the mean value of the 4th column can be found using:

>>> data[:,3].mean()
15.891525690789475

Accessing specific parts of your data

The data stored in the array can be indexed and sliced using the methods covered here. This allows you to pull out specific data from the data object - keep in mind the dimensions as calculated above.

To get the first column (which holds the time data):

time = data[:,0]

to access the 2nd, 3rd and 4th columns:

xyz = data[:,1:4]

and to access the first 10 rows:

first_ten = data[0:10,:]

In each of the examples above, we assigned the slice to a new variable, creating a set of new arrays. Again, you can use the array methods to pull out some information. To calculate the maximum value in the time array:

>>> time.max()
44310.25

Manipulate the values

What we can also do is alter the values held within the array. For example, say we want to reduce the values in the time array (which have units of UTC seconds according to the header matter of radar_data.csv) by 100 - all we have to do is assign the result to a new array (called, say, time_red):

time_red = time - 100

Notice that we assign this to a new array to ensure that the changes are stored - for this simple 1-dimensional case (have a look at time.shape), we could also have typed:

time = time - 100

Indexing and subsampling using conditions

A handy trick to be aware of is the ability to pull out/manipulate data meeting specific conditions. Using the time array, we can find the mean:

mean = time.mean()

If we only want to get values from the time array greater than the mean time value, we can use:

time_gt_avg = time[time > mean]

If we want to set all values greater than the mean to zero (which may not be advisable, but it serves as an example!):

time[time > mean] = 0

Notice that this last command changes the values within the actual time array.
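
Because boolean-mask assignment changes the array in place, it can be worth taking a copy first if you still need the original values. A minimal sketch (the array values are made up):

```python
import numpy

t = numpy.array([1.0, 5.0, 3.0, 9.0])
backup = t.copy()  # .copy() gives an independent array; plain slicing can share memory

m = t.mean()       # 4.5
t[t > m] = 0       # boolean-mask assignment modifies t in place

print(t)           # values above the mean are now zero: [1. 0. 3. 0.]
print(backup)      # the copy is untouched: [1. 5. 3. 9.]
```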

Write out data to a new file

To write out an array to a new file, you can use numpy.savetxt(). Let’s save the xyz subset to a file called xyz_subsample.csv:

numpy.savetxt("/path/to/save/to/xyz_subsample.csv", xyz, delimiter=",")
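
numpy.savetxt also accepts fmt and header options to control the number formatting and write a (commented) header line. A sketch, using a made-up stand-in for the xyz subset and saving to the current directory:

```python
import numpy

# Hypothetical stand-in for the xyz subset.
xyz_demo = numpy.array([[82.3, -60.1, 15.9],
                        [82.4, -60.2, 16.1]])

# fmt controls the number format; header writes a comment line
# (prefixed with '#' by default, so numpy.loadtxt will skip it).
numpy.savetxt("xyz_subsample.csv", xyz_demo, delimiter=",",
              fmt="%.3f", header="lat,long,height")
```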

Other things to know and consider

Alternative packages for data wrangling

If you are going to be spending a lot of time working with tabulated datasets, I would advise you to spend some time familiarising yourself with pandas. Pandas offers an extremely efficient (and arguably more readable) approach to dealing with these kinds of data sets - a handy reference for the efficient use of pandas can be found here.
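
As a taste of the pandas approach (assuming pandas is installed; the in-memory text below is a made-up stand-in for radar_data.csv):

```python
import io
import pandas

# Stand-in for radar_data.csv (hypothetical values); with the real file
# you would pass its path to pandas.read_csv instead.
csv_text = """UTC_Seconds_Of_Day,Lat(deg),Long(deg)
44310.25,82.3,-60.1
44311.25,82.4,-60.2
"""

df = pandas.read_csv(io.StringIO(csv_text))
lat_max = df["Lat(deg)"].max()  # columns are addressed by their header names
print(lat_max)  # 82.4
```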

Accessing data using column names in numpy (if you don’t want to use pandas)

If you know the column names, it is possible to integrate these by creating a structured array - have a look at the documentation.

For example, let’s create an array:

new_data = numpy.array([(4,3,3,'some'),(5,4,3,'other'),(6,3,2,'useful'), \
                        (3,9,7,'info'),(8,4,6,'to'),(8,3,3,'use')])

This will result in everything being cast to a string data type (with the width set by the longest entry):

>>> new_data.dtype
dtype('S6')

What we need to do is assign a format to each column - and at the same time, we can give each column a label:

new_data = numpy.array([(4,3,3,'some'),(5,4,3,'other'),(6,3,2,'useful'), \
                        (3,9,7,'info'),(8,4,6,'to'),(8,3,3,'use')], \
                        dtype=[('x','f'),('y','f'),('z','f'), \
                               ('text','S11')])

Note that the dtype S11 represents an 11 character string (see here for more data type information).

What you can now do is access elements of new_data by the column names defined when assigning the data type of each column, such as:

>>> new_data['x']
array([ 4.,  5.,  6.,  3.,  8.,  8.], dtype=float32)

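If your file already has a header line, numpy.genfromtxt with names=True can build such a structured array for you, taking the field names from the header. A sketch with made-up data standing in for a file:

```python
import io
import numpy

# Hypothetical stand-in for a headed CSV file.
csv_text = """time,lat,long
44310.25,82.3,-60.1
44311.25,82.4,-60.2
"""

# names=True reads the field names from the header line, giving a
# structured array without spelling out the dtype by hand.
named = numpy.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True)

lat_max = named['lat'].max()
print(lat_max)  # 82.4
```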