Create a presence absence matrix from long data (R)

Using R, I needed to convert some long format data to wide format in the form of a presence/absence dataset. The why doesn’t really matter :)

Where the initial dataset is, for example, a vector of categories, the intended output will have one row per element and as many columns as there are unique values in the original. For example, where the original dataset looks like this:

"TILLD-DMTN"
"GFDUD-XSV"
"TILLD-DMTN"

We want to turn it into a matrix like this (one column per unique value, with a 1 marking the value present in each row):

  GFDUD-XSV TILLD-DMTN
1         0          1
2         1          0
3         0          1

This is achieved using the package reshape2 and the function dcast.

First we’ll create two example input datasets - one factor of numeric codes and one character vector:

landclass=factor(c(1,2,3,3,3,4,5,6,7,8,8,8,9))
rocks=c("TILLD-DMTN", "GFDUD-XSV", 
    "TILLD-DMTN",
    "TILLD-DMTN",
    "NONE_RECORDED ",
    "TILLD-DMTN",
    "TILLD-DMTN",
    "TILLD-DMTN",
    "NONE_RECORDED ",
    "TILLD-DMTN",
    "TILLD-DMTN",
    "TILLD-DMTN",
    "PEAT-P",
    "NONE_RECORDED",
    "TILLD-DMTN",
    "TILLD-DMTN",
    "TILLD-DMTN",
    "GFDUD-XSV",
    "PEAT-P",
    "ALV-XCZSV")

If we look at rocks for example, we have:

> head(rocks)
[1] "TILLD-DMTN"     "GFDUD-XSV"      "TILLD-DMTN"     "TILLD-DMTN"    
[5] "NONE_RECORDED " "TILLD-DMTN" 
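
One thing worth noticing in this output is that "NONE_RECORDED " carries a trailing space, so R treats it as a different value from "NONE_RECORDED" and it gets its own column later on. If that is not what you want, a minimal sketch of a cleanup step could look like the following. It assumes base R's trimws(); the names rocks_example and rocks_clean are hypothetical and not part of the original workflow.

```r
# Small stand-in for the rocks vector, including the trailing-space variant
rocks_example <- c("TILLD-DMTN", "NONE_RECORDED ", "NONE_RECORDED", "PEAT-P")

# Without trimming, the two spellings count as separate categories
n_before <- length(unique(rocks_example))

# trimws() strips leading and trailing whitespace from each element
rocks_clean <- trimws(rocks_example)
n_after <- length(unique(rocks_clean))

print(c(before = n_before, after = n_after))
```

The rest of this post keeps the untrimmed values, which is why both variants show up as columns in the final output.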

Now we load reshape2 and construct a function using dcast to do the work:

library(reshape2)

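presence_absence <- function(data) {
    tnt <- as.data.frame(data)
    # Create an ID column to hold the row index
    tnt$ID <- seq.int(nrow(tnt))
    # One row per ID, one column per unique value; length() counts the
    # matches in each cell (0 or 1). Naming value.var avoids dcast guessing it.
    out <- dcast(tnt, ID ~ data, fun.aggregate = length, value.var = "ID")

    # Get rid of the ID column; drop = FALSE keeps a data frame
    # even when only one column remains
    out <- out[, names(out) != "ID", drop = FALSE]

    return(out)
}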

This now works for inputs of varying type (character, factor, numeric, etc.) and returns a data frame (wrap it in as.matrix() if you need a true matrix):

rocks_mat=presence_absence(rocks)
landclass_mat=presence_absence(landclass)

If we look at rocks_mat, we now have:

> head(rocks_mat)
  ALV-XCZSV GFDUD-XSV NONE_RECORDED NONE_RECORDED  PEAT-P TILLD-DMTN
1         0         0             0              0      0          1
2         0         1             0              0      0          0
3         0         0             0              0      0          1
4         0         0             0              0      0          1
5         0         0             0              1      0          0
6         0         0             0              0      0          1
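
As a cross-check that does not need reshape2, the same kind of presence/absence layout can be sketched with base R's table(). This is an alternative technique rather than the dcast approach used above, and the names x, tab and pa are hypothetical. It reuses the three-value example from the top of the post.

```r
# Stand-in input vector (the three-value example from the top of the post)
x <- c("TILLD-DMTN", "GFDUD-XSV", "TILLD-DMTN")

# Cross-tabulate row index against value: each cell counts how often
# a value occurs in that row, which is 0 or 1 for a plain vector
tab <- table(ID = seq_along(x), value = x)

# Convert the table to a plain integer matrix, keeping the column names
pa <- matrix(as.integer(tab), nrow = nrow(tab),
             dimnames = list(NULL, colnames(tab)))

print(pa)

# Every row should contain exactly one 1
stopifnot(all(rowSums(pa) == 1))
```

The row-sum check is a cheap sanity test that works on the output of presence_absence() too, since each input element belongs to exactly one category.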

Thanks to Franziska Schrodt for helping with this!

Written on May 25, 2018