This R Notebook reproduces the Curse of Dimensionality in nearest neighbor regression, defined here as the growth of the interval size needed to capture 10% of the data as the number of dimensions increases. As more dimensions are added to the domain, the range required to capture the same proportion of data points gets larger, and as the range grows we lose the “locality” of the information, losing the ability to summarize it with a simple average.
Nearest Neighbor Moving Average
In this article, the curse of dimensionality comes from trying to use nearest neighbors as predictors of a function. We select the 10% of data set points nearest to a specific point X~0~ and compute their average, using it as the prediction of the function at X~0~.
With low dimensionality (up to around 4) and a large number of samples this is a great technique: easy to use, to compute, and to interpret. But as the number of dimensions increases, we lose the locality of the information, because to capture 10% of the points we must widen the interval around X~0~. Let’s see this effect on simulated data.
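A minimal sketch of the estimator just described, for a one-dimensional input (knn_moving_average is an illustrative helper of mine, not part of this notebook’s code):

knn_moving_average <- function(x, y, x0, frac = 0.1) {
  k <- ceiling(frac * length(x))   # number of neighbors: 10% of the sample
  idx <- order(abs(x - x0))[1:k]   # indices of the k points nearest to x0
  mean(y[idx])                     # local average used as the prediction
}
# usage: estimate a noisy sin(x) at x0 = 5
set.seed(42)
x <- runif(500, 0, 10)
y <- sin(x) + rnorm(500, sd = 0.2)
knn_moving_average(x, y, x0 = 5)   # should be close to sin(5)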
Dataset
Let’s build a five-dimensional data set.
library(ggplot2)
library(tidyverse)
library(reshape2)
# five-dimensional data points, uniform in [0, 100]
total_points <- 1000
dt <- tibble(
  x = runif(total_points, 0, 100),
  y = runif(total_points, 0, 100),
  z = runif(total_points, 0, 100),
  w = runif(total_points, 0, 100),
  v = runif(total_points, 0, 100)
)
# let's see them
ggplot(dt, aes(x = x, y = y)) + geom_point() + coord_fixed(ratio = 1) + theme_bw()
Let’s suppose that we need to capture the 10% of points around x=50 (100 points). What interval size is necessary to get them?
# the 0.1 quantile of x, used as the interval width (~10 for Uniform(0, 100))
x_dist <- quantile(x = dt$x, probs = .1)
x_dist
We can see that, in one dimension, the range needed to capture 10% of the data points is around 10% of the dimension’s extent (the data is uniformly distributed).
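As a quick theoretical check (my addition, not in the original notebook), the 0.1 quantile of a Uniform(0, 100) variable is exactly 10, so the sample value above should be close to it:

# theoretical 0.1 quantile of Uniform(0, 100)
qunif(0.1, min = 0, max = 100)   # 10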
# mark the points inside the interval centered at x = 50
dt %>%
  mutate(
    in_interval = case_when(
      x <= 50 + (x_dist/2) & x >= 50 - (x_dist/2) ~ TRUE,
      TRUE ~ FALSE
    )
  ) -> dt
# plot them
ggplot(dt, aes(x = x, y = y, colour = in_interval)) +
  geom_point() +
  geom_vline(xintercept = 50 - (x_dist/2), linetype = "dotted") +
  geom_vline(xintercept = 50 + (x_dist/2), linetype = "dotted") +
  coord_fixed(ratio = 1) + theme_bw()
# how many points fall in the interval?
dt %>%
  filter(in_interval == TRUE) %>%
  nrow()
Two Dimensions
Now let’s see the two-dimensional case. First we’ll use the same interval found in the one-dimensional case, 11.1197807, but now we need to look for points around (x=50, y=50).
dt %>%
  mutate(
    in_interval = case_when(
      sqrt((x-50)^2 + (y-50)^2) < x_dist ~ TRUE,
      TRUE ~ FALSE
    )
  ) -> dt
ggplot(dt, aes(x = x, y = y, colour = in_interval)) +
  geom_point() + coord_fixed(ratio = 1) + theme_bw()
Now we see that the number of points captured in this range is significantly lower than in the previous case.
# how many points?
dt %>%
  filter(in_interval == TRUE) %>%
  nrow() -> sel_points
sel_points_pct <- sel_points / total_points
sel_points
In fact, after adding one dimension, the same distance reaches only 3.6% of the data points. Can we find the radius needed to capture 10% of the points?
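This drop is in line with what the geometry predicts (my addition, assuming uniform data and a disc that fits inside the square): a disc of radius r covers a fraction πr²/100² of the 100 × 100 square.

# expected fraction of the square covered by a disc of radius x_dist
pi * unname(x_dist)^2 / 100^2   # ~0.039, close to the observed ~3.6%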
# fraction of points captured at each radius
dists <- 1:100
dists %>%
  map(function(dist){
    dt %>%
      filter(sqrt((x-50)^2 + (y-50)^2) < dist) %>%
      nrow() -> sel_points
    tibble(
      dist = dist,
      points = sel_points,
      pct = sel_points / total_points
    )
  }) %>% bind_rows() -> distCases
ggplot(distCases, aes(x = dist, y = pct)) +
  geom_line() +
  geom_hline(yintercept = 0.1, linetype = "dotted") +
  theme_bw()
distCases %>%
  filter( pct <= .1 ) %>%
  filter( pct == max(pct) ) -> twoDim10pct
print(twoDim10pct)
## # A tibble: 1 x 3
## dist points pct
## <int> <int> <dbl>
## 1 17 89 0.089
The radius needed to capture 10% of the data points is 17, about 50% larger than the interval in the one-dimensional case.
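We can sanity-check this against the geometry (my addition, assuming uniform data and ignoring boundary effects): solving πr²/100² = 0.1 for r gives roughly 17.8, consistent with the empirical 17.

# radius at which a disc covers 10% of the 100 x 100 square
100 * sqrt(0.1 / pi)   # ~17.84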
dt %>%
  mutate(
    in_interval = case_when(
      sqrt((x-50)^2 + (y-50)^2) < twoDim10pct$dist ~ TRUE,
      TRUE ~ FALSE
    )
  ) -> dt
ggplot(dt, aes(x = x, y = y, colour = in_interval)) +
  geom_point() + coord_fixed(ratio = 1) +
  theme_bw()
More dimensions
Can we generalize this and see the impact on the size of the range as dimensions are added?
# fraction of points captured at each radius, from 1 to 5 dimensions
dists <- 1:100
dists %>%
  map(function(dist){
    dt %>%
      summarise(
        d1 = sum( abs(50-x) <= dist ),
        d2 = sum( sqrt( (x-50)^2 + (y-50)^2 ) < dist ),
        d3 = sum( sqrt( (x-50)^2 + (y-50)^2 + (z-50)^2 ) < dist ),
        d4 = sum( sqrt( (x-50)^2 + (y-50)^2 + (z-50)^2 + (w-50)^2 ) < dist ),
        d5 = sum( sqrt( (x-50)^2 + (y-50)^2 + (z-50)^2 + (w-50)^2 + (v-50)^2 ) < dist )
      ) %>% mutate_all(function(x) x/total_points) %>% mutate( dist = dist )
  }) %>%
  bind_rows() %>%
  melt(id.vars = c("dist")) %>%
  rename(
    cover_pct = value,
    dimensions = variable
  ) %>%
  mutate( dimensions = as.factor(dimensions) ) -> multDimCases
multDimCases %>%
  ggplot(aes(x = dist, y = cover_pct, colour = dimensions)) +
  geom_line() +
  geom_hline(yintercept = 0.1, linetype = "dotted") +
  theme_bw()
The distances needed to capture 10% of the data points in each of the 5 dimensions:
multDimCases %>%
  group_by(dimensions) %>%
  filter( cover_pct <= 0.1 ) %>%
  filter( cover_pct == max(cover_pct) ) %>%
  select( dimensions, dist ) %>% print()
## # A tibble: 5 x 2
## # Groups: dimensions [5]
## dimensions dist
## <fct> <int>
## 1 d1 4
## 2 d2 17
## 3 d3 28
## 4 d4 37
## 5 d5 45
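These empirical radii track the volume of a d-dimensional ball quite well. As a sketch (my addition, assuming uniform data and ignoring clipping at the cube boundary), solving V_d(r) / 100^d = 0.1 with V_d(r) = π^(d/2) · r^d / Γ(d/2 + 1) gives:

# theoretical radius for a d-ball to cover 10% of the [0, 100]^d hypercube
theoretical_radius <- function(d, p = 0.1) {
  100 * (p * gamma(d/2 + 1) / pi^(d/2))^(1/d)
}
round(sapply(1:5, theoretical_radius), 1)   # 5.0 17.8 28.8 37.7 45.3

The values in the table are the largest integer radii that stay at or below 10%, so they sit just under these theoretical ones.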
Conclusion
We can see clearly how the increase in data dimensionality affects the size of the interval: at 5 dimensions, a radius of nearly half the total range of each dimension is necessary to capture just 10% of the data set, destroying the “locality” of the information.